
PREPRINT. UNDER REVIEW.

Behavior Foundation Model: Towards Next-Generation Whole-Body Control System of Humanoid Robots
Mingqi Yuan^{1,2,3,*}, Tao Yu^{2,*}, Wenqi Ge^{2,4,*}, Xiuyong Yao^2, Dapeng Li^2, Huijiang Wang^5, Jiayu Chen^4,
Xin Jin^{3,†}, Bo Li^1, Hua Chen^2, Wei Zhang^2 (Senior Member, IEEE), Wenjun Zeng^3 (Fellow, IEEE)


arXiv:2506.20487v1 [cs.RO] 25 Jun 2025

Abstract—Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for complex tasks, their reliance on labor-intensive and costly retraining for new scenarios limits real-world applicability. To address these limitations, behavior(al) foundation models (BFMs) have emerged as a new paradigm that leverages large-scale pretraining to learn reusable primitive skills and behavioral priors, enabling zero-shot or rapid adaptation to a wide range of downstream tasks. In this paper, we present a comprehensive overview of BFMs for humanoid WBC, tracing their development across diverse pre-training pipelines. Furthermore, we discuss real-world applications, current limitations, urgent challenges, and future opportunities, positioning BFMs as a key approach toward scalable and general-purpose humanoid intelligence. Finally, we provide a curated and long-term list of BFM papers and projects to facilitate subsequent research, available at https://github.com/yuanmingqi/awesome-bfm-papers.

Index Terms—Humanoid robot, whole-body control, behavior foundation model, pre-training, preview.

1 INTRODUCTION

HUMANOID robots are increasingly being developed and deployed in a variety of real-world scenarios due to their human-like morphology and high degrees of freedom (DoF). These qualities enable them to operate seamlessly in environments originally designed for humans, allowing them to perform locomotion, manipulation, and interaction tasks with versatility and agility. However, humanoid robots must coordinate full-body motions under complex conditions, such as underactuation, frequent contact changes, and dynamically shifting task goals, all while maintaining balance and safety. This presents significant challenges in realizing the full potential of humanoid robots. Consequently, the development of robust and generalizable whole-body control (WBC) systems has become an urgent priority. In the following sections, we begin by reviewing previous humanoid WBC methods comprehensively, from traditional model-based to task-specific and learning-based controllers, before introducing the transformative approach: the behavior foundation model. This evolutionary trajectory not only reflects the field's advancement toward enhanced intelligence and generalizability but also paves the way for next-generation humanoid robot control systems.

[Fig. 1 schematic: behavior data (motion data, demonstrations, interactions, simulated trajectories) is used to pre-train a behavior foundation model with broad behavior coverage, which is then adapted to downstream tasks such as motion tracking, goal reaching, command following, text-to-motion, and scene interaction.]
Fig. 1. A behavior foundation model learns broad behavior priors from large-scale and diverse behavior data, which can then be conveniently adapted to a wide range of downstream tasks.

1 Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China. 2 LimX Dynamics, Shenzhen, China. 3 Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China. 4 Department of Data and Systems Engineering, University of Hong Kong, Hong Kong SAR, China. 5 CREATE Lab, EPFL, Lausanne, Switzerland.
* Work done at LimX Dynamics; these authors contributed equally.
† Corresponding author: Xin Jin (jinxin@eitech.edu.cn).

1.1 Traditional Model-based Controller

Traditional WBC methods have served as the cornerstone for locomotion and manipulation in early humanoid robots [1, 2]. These methods rely heavily on physics-based models and are typically structured into a predictive-reactive hierarchy: high-level planners like centroidal model predictive control (MPC) [3, 4] generate reference trajectories, while low-level task-space whole-body controllers solve optimal control problems (OCPs) [5] to track these objectives under dynamic constraints. For instance, operational space control [6] and hierarchical task control [7] establish the theoretical foundation, while [8] advances hierarchical quadratic programming (QP) solvers and enables real-time performance in multi-task scenarios such as balance, walking, and manipulation. Such frameworks have been widely applied in humanoids like Atlas [9], HRP-2 [10], and DLR's torque-controlled robots [11], achieving robust locomotion and multi-contact interactions.
[Fig. 2: three evolutionary stages of humanoid whole-body controllers, plotted by task-solving capacity over time — model-based controllers (2006–; e.g., MPC and WBOSC; physics-based models; labor-intensive configuration, low robustness; towards primitive humanoid WBC tasks), learning-based and task-specific controllers (2018–; e.g., DeepMimic and AMP; reinforcement learning; poor cross-task generalization; towards specific and complex humanoid WBC tasks), and behavior foundation models (2024–; e.g., Motivo, HOVER, MaskedMimic; large-scale data pre-training (RL, IL, etc.); broad behavior coverage, fast adaptation capability, flexible task design; towards diverse humanoid WBC tasks).]
Fig. 2. Evolution map of the whole-body controller for humanoid robots.

Despite their success, traditional WBC systems face critical limitations: (i) task design, gain tuning, and heuristic adjustments for complex behaviors (e.g., uneven terrain or dynamic transitions) remain labor-intensive and brittle; (ii) real-time MPC struggles with high-dimensional systems, often requiring simplifications that sacrifice dynamic fidelity [12]; (iii) these systems lack the flexibility to execute highly dynamic skills (e.g., backflips or rapid contact switches) or adapt to unforeseen disturbances [13]; and (iv) they exhibit weak robustness, as even a light push may topple a robot running a model-based walking controller. These challenges are especially significant in humanoids, where tasks often require rich coordination, contact reasoning, and situational awareness [14, 15]. As a result, recent research increasingly shifts toward data-driven approaches, aiming to learn motor skills, coordination policies, and behavioral priors from demonstrations or reinforcement learning [16, 17].

1.2 Learning-based and Task-specific Controller

Learning-based methods, particularly reinforcement learning (RL) and imitation learning (IL), have emerged as promising alternatives to traditional WBC methods, enabling robots to acquire complex skills through environmental interaction or human demonstrations [18–23]. For example, [16] presents a framework entitled DeepMimic that combines deep RL with motion capture data to enable physically simulated characters to learn dynamic skills while maintaining natural motion quality. [24] further extends DeepMimic by introducing adversarial motion priors (AMP) to enable more stylized and diverse character control while maintaining physical realism. In contrast, [25] proposes HoST, an RL-based framework that learns humanoid standing-up control from scratch and achieves adaptive and stable standing-up motions across diverse laboratory and outdoor environments, highlighting the outstanding learning capability and robustness of RL in specific tasks. For IL-based methods, [26] proposes TRILL, which combines virtual reality (VR) teleoperation with WBC for humanoid loco-manipulation, demonstrating an 85% success rate in real-world bimanual tasks [27]. In addition, [28] develops an expressive WBC framework that decouples upper-body IL (for stylistic motions) from robust lower-body locomotion, enabling humanoid robots to dynamically adapt their gait while performing diverse movements. This approach overcomes the instability of full-body imitation caused by morphological mismatches between humans and robots.

While learning-based methods have demonstrated remarkable success in diverse humanoid WBC tasks, they face fundamental challenges that limit their broader applicability. RL-based approaches suffer from sample inefficiency, often requiring millions of environment interactions to converge, while remaining highly sensitive to reward function design—poorly shaped rewards can lead to unintended behaviors or local optima [29]. Furthermore, the simulation-to-reality (Sim2Real) gap exacerbates these limitations, as policies trained in simulation frequently degrade when confronted with real-world dynamics, sensor noise, and hardware imperfections [30–33]. In contrast, IL-based methods are more sample-efficient yet pose a significant challenge to data collection, and learned policies often inherit the biases and limitations of the demonstrator [34–38]. Moreover, both paradigms struggle with generalization: learned policies typically excel only at narrow tasks and fail to adapt to new scenarios without extensive retraining. These challenges collectively underscore the need for approaches that combine the flexibility of learning with structured priors for robustness and generalizability—a gap that behavior foundation models aim to bridge.

1.3 Behavior Foundation Model

The term "behavior(al) foundation model (BFM)" is first introduced in [39], which proposes a successor measure-based framework for training generalist policies capable of instantly imitating diverse behaviors from minimal demonstrations. It demonstrates that BFMs pretrained on unsupervised interaction data are promising to eliminate task-specific RL fine-tuning by solving imitation tasks through forward-backward state feature matching, while simultaneously supporting multiple IL paradigms via a unified representation, such as behavioral cloning, reward inference, and distribution matching. Subsequent work [40–44] has established BFMs as a class of RL agents capable of unsupervised training on reward-free transitions while yielding approximately optimal policies for broad classes of reward functions at test time without additional learning or planning.
In this paper, we extend the definition of BFMs as a specialized class of foundation models [45] designed to control agent behaviors in dynamical environments. Rooted in the principles of general foundation models (e.g., GPT-4 [46], CLIP [47], and SAM [48]) that leverage broad, self-supervised pre-training on large-scale static data, BFMs are often trained on extensive behavior data (e.g., trajectories, human demonstrations, or agent-environment interactions), encoding a comprehensive spectrum of behaviors rather than specializing narrowly in single-task scenarios. This property ensures that the model can readily generalize across different tasks, contexts, or environments, demonstrating versatile and adaptive behavior generation capabilities. Recent advancements in vision-language-action (VLA) models [47, 49–51] have focused on integrating vision, language, and action to handle multi-modal tasks, excelling in dynamic settings where they generate context-aware responses based on visual and linguistic inputs. In contrast, BFMs are primarily designed for directly controlling agent behaviors such as locomotion, manipulation, and interaction. Moreover, most existing VLA models apply to relatively stable platforms like mechanical arms or wheeled humanoid robots [52], while BFMs are developed toward handling the sophisticated WBC of true humanoid robots.

Inspired by the discussions above, it is worthwhile to conduct a systematic and comprehensive preview to provide a holistic perspective for subsequent research. To the best of our knowledge, this is the first study focused on the development of BFMs, particularly their applications in humanoid robots. The structure of this paper is organized as follows: Section 2 introduces the essential background information of this paper. Section 3 discusses the application of BFMs in humanoid WBC, including diverse pre-training and adaptation strategies. Section 4 explores the potential applications of BFMs across multiple industries, while summarizing the limitations of current BFMs. Section 5 highlights the opportunities for future advancements in BFMs, as well as the risks and ethical concerns associated with their development and deployment. Finally, Section 6 summarizes the key findings and contributions of this paper.

2 BACKGROUND

In this section, we introduce basic background to support the subsequent analysis of BFM approaches. More specifically, we begin with the definition and an evolution overview of humanoid WBC systems. Then, we introduce the formulation of RL that is currently widely employed to build learning-based and task-specific controllers.

2.1 Humanoid Whole-body Control

Humanoid robots are expected to operate in unstructured environments that are dynamic and unpredictable, demanding control systems that are highly versatile, robust, reconfigurable, dexterous, and mobile compared to less agile robotic platforms [53]. To that end, humanoid WBC is proposed to coordinate the motion of a robot's multiple appendages to execute multiple tasks simultaneously and reliably. It considers the entire robot body as a single and integrated system, managing locomotion, manipulation, and interaction with the environment using a unified set of control algorithms [54]. As illustrated in Fig. 2, humanoid WBC has evolved from traditional model-based approaches to flexible learning-based approaches, moving toward a generalist that solves broad tasks in diverse scenarios [28, 52, 53, 55–57]. In line with this trend, BFMs have emerged as a promising approach to achieve general-purpose WBC through large-scale pre-training on diverse motion data, which will be comprehensively discussed in the following sections.

2.2 Reinforcement Learning

We study the RL problem considering a Markov decision process (MDP) defined by a tuple M = (S, A, P, r, γ) [58], where S is the state space, A is the action space, P(s′|s, a) is the probability measure on S defining the stochastic transition to the next state s′ obtained by taking action a in state s, r : S → R is the reward function, and γ ∈ [0, 1] is a discount factor. A policy π is defined as the probability measure π(a|s) that maps each state to a distribution over actions. Furthermore, we denote by Pr(·|s0, a0, π) and E(·|s0, a0, π) the probability and expectation operators under state-action sequences (s_t, a_t)_{t≥0} starting at (s0, a0) and following the policy π with s_t ∼ P(s_t|s_{t−1}, a_{t−1}) and a_t ∼ π(a_t|s_t). The goal of RL is to learn a policy that maximizes the expected discounted return:

J^π = E_π [ Σ_{t=0}^∞ γ^t r(s_t) ].   (1)

Finally, we list all the main notations and their semantics used in this paper in Table 1.

TABLE 1
Main notations and their semantics used in this article.

Notation | Semantics
M | Markov decision process
S | State space
A | Action space
r(s) | Reward function
P(ds′|s, a) | Transition probability measure
γ | Discount factor
π(da|s) | Policy
Q^π_r(s, a) | Action-value function of policy π and reward r
M^π(X|s, a) | Successor measure of policy π
F(s, a) | Forward embedding
B(s′) | Backward embedding
L | Loss function
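For concreteness, the return in Eq. (1) can be estimated from a sampled trajectory by direct summation. The following is a minimal illustrative sketch (not from the paper); the function name and the toy reward sequence are our own choices.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    # Monte-Carlo estimate of Eq. (1) for a single sampled trajectory:
    # J = sum_t gamma^t * r(s_t)
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: three steps of unit reward with gamma = 0.9
# gives 1 + 0.9 + 0.81 = 2.71.
assert abs(discounted_return([1.0, 1.0, 1.0], gamma=0.9) - 2.71) < 1e-12
```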
3 BFM FOR HUMANOID WHOLE-BODY CONTROL

In this section, we introduce representative methods for constructing and adapting BFMs towards humanoid WBC, analyzing extensively their main motivations, typical implementations, and empirical properties. Note that some approaches were proposed before the emergence of the concept of BFMs, yet we still discuss them as long as they adhere to the properties of BFMs or their physical meaning is analogous to BFMs.
[Fig. 3 taxonomy:
Pre-training — Goal-Conditioned Learning (TeamPlay [61], ASE [63], PHC [59], CALM [64], CASE [65], InterMimic [69], MoConVQ [67], MTM [60], MaskedMimic [68], HOVER [56], ModSkill [66]); Intrinsic Reward-Driven Learning (DIAYN [75], RND [71], APS [77], ProtoRL [81], RE3 [80]); Forward-Backward Representation Learning (FB-IL [39], FB-AWARE [40], FB-CPR [41]).
Adaptation — Fine-tuning Techniques (Belief-FB [44], Rotation-FB [44], Task Tokens [43], ReLA [42], LoLA [42]); Towards Hierarchical Control (UniHSI [93], TokenHSI [94], UniPhys [96], CLoSD [95], LangWBC [91], LeVERB [92]).]
Fig. 3. Taxonomy of the pre-training and adaptation approaches of the BFMs.

3.1 Pre-training

Pre-training of BFMs seeks to learn reusable primitive skills and behavioral priors from large-scale data sources, creating a foundation for efficient downstream adaptation. Current approaches can be broadly categorized into three types (as depicted in Figure 3): goal-conditioned learning, intrinsic reward-driven learning, and the forward-backward framework.

3.1.1 Goal-conditioned Learning

[Fig. 4 schematic: diverse goals (location, text, pose, e.g., "Keep running forward") are processed by an encoder into goal embeddings that condition the agent, which interacts with the environment through states, actions, and rewards.]
Fig. 4. Workflow of the goal-conditioned learning, which enables versatile skill acquisition by training policies to achieve diverse target states specified through goal embeddings.

As shown in Figure 4, goal-conditioned learning is a framework in RL where an agent's behavior is conditioned on a specific goal or objective, typically provided as input. Unlike traditional RL, where the agent learns from raw state-action pairs without explicit task-specific guidance, goal-conditioned learning integrates the goal into the agent's policy, enabling it to adapt its actions toward achieving that specific goal. The goal can be specified in various forms, such as a target state, an objective function, or an external task description [59, 60]. This approach allows the agent to generalize across different goals by learning a shared policy that can effectively handle diverse objectives. The key advantage of goal-conditioned learning lies in its ability to learn a more flexible and transferable policy that can be applied to a wide range of tasks, as it directly incorporates the task's goal during training rather than requiring retraining for each specific task. This makes it particularly useful in environments where the agent needs to solve multiple tasks or interact with changing environments.

Skill learning from motion tracking. Among the diverse approaches to goal-conditioned learning, tracking-based learning represents a specialized form where the target behavior is explicitly defined by dense reference supervision or guidance, typically derived from motion capture data or expert demonstrations. At each time step, the agent is often trained to track the given reference motion's joint angles or kinematic pose at the next time step [16]. The primary motivation behind tracking-based learning is that learning to track a single pose is more achievable and general than directly imitating a whole motion, especially a complex one.

For example, [61] trains an agent to imitate a large amount of football motion capture data via a DeepMimic-like [16] approach, aiming to realize complete behavior coverage for the football game. The agent is then leveraged to sample substantial state-action pairs to train a neural probabilistic motor primitive (NPMP) model [62] and derive a low-level latent-conditioned controller. Finally, the controller is applied for further drill learning by conducting RL with a drill-specific (e.g., follow, dribble, shoot, and kick-to-target) reward function. Here, the learned low-level controller can be viewed as a BFM, as it learns realistic human-like movement based on motion capture data and can be rapidly adapted to diverse higher-level drill learning.
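To illustrate the pose-tracking objective described above, the following is a minimal DeepMimic-style sketch (our simplification, not the exact reward of [16]): the agent is rewarded for matching the reference joint angles at the next time step, with the error squashed through an exponential.

```python
import numpy as np

def pose_tracking_reward(q, q_ref, scale=2.0):
    # q:     the agent's joint angles at the current step, shape (n_joints,)
    # q_ref: the reference motion's joint angles for the same step
    # Reward is 1 when the pose matches exactly and decays with the error.
    err = np.sum((np.asarray(q) - np.asarray(q_ref)) ** 2)
    return float(np.exp(-scale * err))

# Example: a 0.1 rad error on two of three joints.
r = pose_tracking_reward([0.1, 0.2, 0.3], [0.0, 0.2, 0.4])
```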
Similarly, [63] introduces adversarial skill embeddings (ASE), a framework that learns a reusable latent space of motor skills by combining adversarial IL with unsupervised RL. Trained on unstructured motion data, ASE produces a latent-conditioned low-level controller capable of generating diverse and physically plausible behaviors, serving as a general-purpose motor prior for downstream tasks. Building on ASE, [64] proposes conditional adversarial latent models (CALM), which incorporate a conditional discriminator to enable fine-grained control over generated motions via latent manipulation. [65] further extends this line with CASE, introducing skill-conditioned IL with training techniques such as focal skill sampling and skeletal residual forces to enhance agility and motion diversity.

While the methods above achieve efficient skill acquisition from large behavior datasets, [55] proposes HugWBC, which explores learning versatile locomotion skills without relying on pre-collected motion data. The framework automatically generates adaptive behaviors through a structured RL process, where a general command space dynamically produces feasible velocity, gait, and posture targets during training. By reformulating WBC as a self-supervised command-tracking problem, the work establishes a new direction for developing general-purpose humanoid controllers that learn robust skills through environmental interaction rather than data imitation.

Moving beyond body-level skill learning, ModSkill [66] introduces a modular framework that decouples full-body motion into part-specific skills for individual body parts. This modularization allows for efficient and scalable learning, as each body part is controlled independently by a low-level controller driven by part-specific skill embeddings. ModSkill's focus on body-part-level skills makes it a powerful system for controlling complex motions and adapting learned behaviors across different tasks. By utilizing a skill modularization attention layer, ModSkill enhances the generalization of motor skills across various tasks like reaching or striking, further improving task-specific adaptation.

From primitive skills to high-level goal execution. The success of BFMs in learning diverse primitive skills has propelled the development of more advanced BFMs capable of interpreting and executing high-level goals, including language instructions and multi-task objectives. A notable example is MoConVQ [67], which introduces a unified motion control framework based on discrete latent codes learned via a vector quantized variational autoencoder (VQ-VAE). The model supports a wide range of downstream tasks—including motion tracking, interactive control, and text-to-motion generation—by offering a compact and modular representation. MoConVQ also integrates with large language models (LLMs) and enables simulated agents to be directed via in-context language prompts, thereby bridging symbolic reasoning and physical control.

Meanwhile, MaskedMimic [68] addresses physics-based character control as a general motion inpainting problem, producing full-body motions from partial descriptions like masked keyframes, objects, or text instructions. MaskedMimic involves a two-phase training process: first, a fully-constrained motion tracking controller learns to imitate diverse reference motions; then, a partially-constrained VAE-based policy distills this knowledge through masked goal conditioning. As a result, MaskedMimic can dynamically adapt to complex scenes and support applications ranging from VR control to complex human-object interaction (HOI). In addition, InterMimic [69] focuses on the HOI scenario and designs a two-stage teacher-student framework that distills imperfect motion capture interaction data into robust and physics-based controllers. Teacher policies are trained on subsets of noisy data and refined through simulation, then distilled into a student policy with RL-based fine-tuning. This curriculum strategy enables generalization across diverse interactions with high physical fidelity.

For real-world robotic applications, HOVER [56] introduces a multi-mode policy distillation framework that allows humanoid robots to switch seamlessly between tasks like locomotion, manipulation, and navigation using a single unified policy distilled from an oracle. This eliminates the need for task-specific controllers, demonstrating general-purpose control in real-world environments, akin to the versatility seen in MaskedMimic for virtual characters.

All the above methods follow the idea of BFM, which is revealed from two aspects: (i) their ability to learn broad behavior coverage from diverse data sources, and (ii) their fast adaptation to downstream tasks. These models are trained on large-scale datasets and can generalize across a wide range of motor skills, such as locomotion, HOI, and task-specific behaviors, without being limited to a single task. For instance, InterMimic is capable of handling various HOI tasks, while MoConVQ adapts to different tasks, such as goal-reaching and text-conditioned motion generation. Additionally, these models exhibit fast adaptation to new tasks with minimal retraining, demonstrating their capacity to apply learned behaviors to new and unseen scenarios. Thus, the broad behavior coverage combined with the adaptation ability collectively characterizes these methods as BFMs.

3.1.2 Intrinsic Reward-driven Learning

In tracking-based learning, the agent is consistently provided with an explicit objective (e.g., joint angles or velocities) and trained via a well-specified reward function to achieve targeted skill acquisition. In contrast, intrinsic reward-driven learning presents a distinct approach, where the agent is motivated to explore the environment without relying on explicit task-specific rewards. Instead, the agent is guided by intrinsic rewards, which are self-generated signals that encourage exploration, skill acquisition, or novelty detection. Extensive strategies for intrinsic reward-driven learning have been developed, including curiosity-driven exploration [70–73], skill discovery [74–77], and maximizing data coverage [78–81], each encouraging the agent to explore different aspects of the environment.

For example, [70] introduces an intrinsic curiosity module (ICM) that encourages the agent to explore unfamiliar states by providing an intrinsic reward based on the prediction error between the predicted and actual next state. The intrinsic reward is proportional to this discrepancy, motivating the agent to interact with parts of the environment that it cannot yet predict, thereby driving exploration and learning. ICM has been shown to significantly improve the agent's ability to explore complex environments with sparse or no external rewards.
In contrast, DIAYN [75] uses latent variable discovery to guide the agent's exploration by encouraging the agent to maximize the diversity of the behaviors it exhibits. DIAYN introduces an intrinsic reward based on the mutual information between the agent's latent skill variable and its environment state, encouraging the agent to discover and explore a wide range of distinct behaviors. By learning a set of diverse, reusable skills, DIAYN enables agents to tackle complex tasks without requiring domain-specific rewards, making it a valuable approach for unsupervised skill discovery.

[Fig. 5 schematic: the agent pursues self-supervised tasks (curiosity-driven exploration, data coverage, skill discovery) in the environment, driven by intrinsic rewards computed from its own states and actions.]
Fig. 5. Workflow of the intrinsic reward-driven learning. The agent is trained to explore and comprehend the environment via self-supervised reward signals, thereby achieving non-directional skill acquisition.

In addition, RE3 [80] focuses on state coverage maximization, where the intrinsic reward is driven by the state visitation frequency. The goal is to encourage the agent to explore states that are infrequent or underrepresented in its past experience. By learning from these underexplored regions, RE3 enhances the agent's understanding of the environment and helps it avoid getting stuck in local optima. RE3's approach allows the agent to explore a broader range of states, improving the diversity of the learned representations and making it suitable for environments with sparse external rewards. Following DIAYN and RE3, ODPP [82] is proposed as a unified framework to discover skills that are both diverse and have superior state coverage, based on a novel use of determinantal point processes. Specifically, the unsupervised objective is to maximize (i) the number of modes covered within each individual trajectory, to enhance state coverage, and (ii) the number of modes across the trajectory space, to promote skill diversity.

These unsupervised RL agents are considered BFMs based on two key observations: (i) they can effectively explore the environment and action space and discover generalizable behaviors under the motivation of intrinsic rewards, and (ii) they exhibit the capability to effectively learn and adapt to various tasks, as demonstrated in [81] and [83]. However, BFMs trained solely with intrinsic rewards face significant limitations. The agent often requires a huge amount of training to achieve broad behavior coverage under the guidance of intrinsic rewards, while consistently producing unreliable motion priors (e.g., unsafe or impractical motions), especially for humanoid robots with extremely complex dynamics [39]. Despite the convenience of this paradigm, future work may address these fundamental challenges by developing hybrid approaches that combine the exploratory benefits of intrinsic rewards with the task-relevant utility guarantees of goal-conditioned learning, achieving reliable BFMs for humanoid robots.

3.1.3 Forward-backward Representation Learning

Recent advances in BFMs are propelled by a novel framework entitled forward-backward (FB) representation learning [84], which disentangles policy learning from task-specific objectives, demonstrating a fundamentally different approach from goal-conditioned learning and intrinsic reward-driven learning. By learning a universal policy representation, it can be rapidly adapted to new tasks through reward inference or demonstration alignment, without additional environment interaction or policy optimization.

[Fig. 6 schematic: the agent interacts with the environment and stores transitions in a replay buffer; a forward embedding network (FEN) and a backward embedding network (BEN) form the FB representation and FB critic, conditioned on a latent vector.]
Fig. 6. Workflow of the forward-backward representation learning, where a forward embedding network (FEN) and a backward embedding network (BEN) are employed to learn the approximation of the successor measure, thereby achieving the universal policy representation.

At its core, FB representation learning seeks to learn a finite-rank approximation of the successor measure, which is an extension of the successor representation [85, 86]. It depicts the discounted future state visitation distribution as a measure over states. For each policy π, its successor measure is defined as

M^π(X|s, a) := Σ_{t=0}^∞ γ^t Pr(s_{t+1} ∈ X | s, a, π), ∀X ⊂ S.   (2)

The successor measure satisfies a measure-valued Bellman equation [84]:

M^π(X|s, a) = P(X|s, a) + γ E_{s′∼P(·|s,a), a′∼π(·|s′)} [M^π(X|s′, a′)], X ⊂ S.   (3)

Equipped with the successor measure, the action-value function Q^π_r(s, a) of π for any reward function r : S → R satisfies

Q^π_r(s, a) := E[ Σ_{t=0}^∞ γ^t r(s_{t+1}) | s, a, π ] = ∫_{s′∈S} M^π(ds′|s, a) r(s′).   (4)

Eq. (4) decouples the action-value function into two separate terms: (i) the successor measure, which models the evolution of the policy in the environment, and (ii) the reward function, which captures task-relevant information. This factorization suggests that learning the successor measure for π allows for the zero-shot evaluation of Q^π_r on any reward without further training.
Notably, [87] proposes an estimation of the successor measure as

M^π(X|s, a) ≈ ∫_{s′∈X} F^π(s, a)^⊤ B(s′) ρ(ds′),   (5)

where ρ is an arbitrary distribution over states, and F^π : S × A → R^d and B : S → R^d are the forward and backward embeddings, respectively. Denoting z = E_{s∼ρ}[B(s) r(s)], the action-value function is rewritten as

Q^π_r = F^π(s, a)^⊤ z.   (6)

To learn a family of policies, [87] suggests that both the forward embedding F^π and the policy π can be parameterized by the same task encoding vector z, such that

M^{π_z}(X|s, a) ≈ ∫_{s′∈X} F(s, a, z)^⊤ B(s′) ρ(ds′),   (7)

where z ∈ R^d, and the policy π_z is defined as

π_z = argmax_a F(s, a, z)^⊤ z.   (8)

Then, the forward-backward embedding network is trained to minimize the temporal difference (TD) loss derived from the Bellman residual:

L_FB = E_{z∼ν, (s,a,s′)∼ρ, s+∼ρ, a′∼π_z(s′)} [ ( F(s, a, z)^⊤ B(s+) − γ · sg(F)(s′, a′, z)^⊤ sg(B)(s+) )² ] − 2 E_{z∼ν, (s,a,s′)∼ρ} [ F(s, a, z)^⊤ B(s′) ],   (9)

where s+ denotes a future state and sg(·) denotes the stop-gradient operation. Considering continuous action spaces, the policies can be obtained by training an actor network to minimize

L_actor = −E_{z∼ν, s∼ρ, a∼π_z(s)} [ F(s, a, z)^⊤ z ].   (10)

Once the FB model is trained, it can be utilized to solve diverse tasks in a zero-shot manner without performing additional task-specific learning, planning, or fine-tuning. For example, given a task reward function r, the policy can be inferred by computing

z_r = (1/n) Σ_{i=1}^n r(s_i) B(s_i),   (11)

where {s_i}_{i=1}^n is a set of sample states. Similarly, for a goal-reaching problem, it suffices to compute the encoder vector by z_s = B(s), s ∈ S.
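The zero-shot recipe of Eqs. (8) and (11) is straightforward to express in code. The sketch below assumes pretrained networks forward_net (approximating F(s, a, z)) and backward_net (approximating B(s)); both names, and the finite candidate-action approximation of the argmax, are our illustrative choices rather than the papers' implementation.

```python
import torch

def infer_task_vector(backward_net, states, rewards):
    # Eq. (11): z_r = (1/n) * sum_i r(s_i) * B(s_i), over n sampled states.
    B = backward_net(states)                        # (n, d)
    return (rewards.unsqueeze(-1) * B).mean(dim=0)  # (d,)

def act_zero_shot(forward_net, state, z, candidate_actions):
    # Eq. (8), approximated over a finite set of candidate actions:
    # pick argmax_a F(s, a, z)^T z without any further training.
    n = candidate_actions.shape[0]
    s = state.unsqueeze(0).expand(n, -1)
    zs = z.unsqueeze(0).expand(n, -1)
    scores = (forward_net(s, candidate_actions, zs) * zs).sum(dim=-1)
    return candidate_actions[scores.argmax()]
```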
As introduced in Section 1.3, [39] first introduced the term "BFM" literally and proposed FB-IL for fast IL based on BFMs, which supports multiple IL principles, such as behavioral cloning, feature matching, and goal-based reduction, without needing separate RL routines for each new task. [40] further enhances the FB framework by incorporating auto-regressive features for more precise task encoding and better task representations in BFMs. While the standard FB method uses a linear task projection that can blur rewards and reduce spatial precision, auto-regressive features improve expressivity and performance, especially for tasks requiring spatial accuracy or generalization. Additionally, [40] introduces advantage-weighted regression (AWR) to address challenges with offline learning from complex datasets. The modified FB approach, FB-AWARE, combines auto-regressive features with advantage weighting, performs well across new environments, and even matches the performance of standard offline RL agents in benchmarks like D4RL.

The FB framework provides a general and flexible approach for training BFMs by learning successor measure representations and applying pre-trained policies to new tasks. However, it suffers from several limitations: (i) when the latent dimension d is finite, it relies on a low-rank dynamics assumption, leading to limited inductive bias for policy selection; (ii) poor coverage in the training dataset causes offline learning to fail in reliably optimizing policies, often collapsing to a few suboptimal behaviors with weak performance on downstream tasks. These limitations greatly hinder the application of the FB framework to humanoid robots. To address them, [41] proposes FB with conditional policy regularization (FB-CPR) and introduces Motivo, the first BFM for humanoid WBC in the true sense, which solves diverse tasks in a zero-shot manner, including motion tracking, goal reaching, and reward optimization. Specifically, FB-CPR learns the FB representations with a discriminator-based regularization scheme, whose loss function is defined as

L_FB-CPR = −E_{z∼ν, s∼D_online, a∼π_z(·|s)} [ F(s, a, z)^⊤ z ] + α KL(p_π, p_E),   (12)

where D_online is the associated replay buffer of unsupervised transitions, p_π(s, z) is the joint distribution of (s, z) induced by FB, and p_E is the joint distribution of the dataset. However, it is intractable to optimize the divergence term directly via an RL procedure. To tackle this problem, FB-CPR interprets the divergence as an expected return under the policies and defines a divergence-based reward r_div via

KL(p_π, p_E) = E_{z∼ν, s∼ρ_π} [ log (p_π(s, z) / p_E(s, z)) ] = −E_{z∼ν} E [ Σ_{t=0}^∞ γ^t log (p_E(s_{t+1}, z) / p_π(s_{t+1}, z)) | s_0 ∼ μ, π_z ].   (13)

Then, a discriminator network D : S × Z → [0, 1] is trained to estimate r_div:

r_div = log (p_π(s, z) / p_E(s, z)) ≈ log (D / (1 − D)).   (14)

The estimation holds because the optimal discriminator satisfies D* = p_E / (p_E + p_π) [88]. Finally, the divergence term can be estimated by training another critic network via off-policy TD learning, and the actor loss for FB-CPR is rewritten as

L_FB-CPR = −E_{z∼ν, s∼D_online, a∼π_z(·|s)} [ F(s, a, z)^⊤ z ] + α Q_div(s, a, z).   (15)
[Fig. 7 panels: motion imitation; pose reaching; composite reward optimization, with prompts "Move forward.", "Move forward + crouching.", and "Move forward + crouching + left hand up.".]
Fig. 7. Motivo [41] learns broad behavior coverage and demonstrates outstanding zero-shot adaptation capability to diverse downstream tasks, including complex motion imitation, pose reaching, and composite reward optimization. Moreover, Motivo achieves real-time motor control while ensuring motion naturalness.

It is natural to find that FB-CPR is not a rigorously unsupervised method, as it leverages unlabeled demonstration data to assist motion prior learning. By aligning unsupervised RL with human-like behavioral priors from unlabeled data, FB-CPR enhances policy diversity and dataset coverage, enabling the agent to learn a rich latent space of behaviors (e.g., walking, jumping, handstands) and achieve robust zero-shot performance across diverse tasks. Experimental results demonstrate that Motivo achieves an 83% success rate in motion tracking tasks and reaches 61% of the top-line performance in reward optimization tasks, surpassing DIFFUSER in computational efficiency by requiring only 12 seconds per 300-step episode. Additionally, it outperforms ASE and CALM in motion diversity, achieving a score of 4.70 (±0.66), reflecting its ability to capture a broader range of behaviors.

3.2 Adaptation

Equipped with BFMs derived through the aforementioned pre-training frameworks, we further introduce recent advancements in the adaptation techniques of BFMs. These works can be broadly categorized into two types: fine-tuning and hierarchical control.

3.2.1 Fine-tuning

Fine-tuning of BFMs seeks to bridge the gap between general-purpose motion priors and task-specific requirements. While pre-trained BFMs capture broad motion distributions, they often lack precision for specialized tasks or novel environments. For example, [42] introduces fast adaptation techniques like residual latent adaptation (ReLA) and lookahead latent adaptation (LoLA), which allow BFMs to adapt rapidly using minimal online interactions after pre-training. These methods improve task performance by up to 40%, demonstrating the effectiveness of fast adaptation strategies that enable efficient task switching without requiring extensive retraining. In contrast, [43] introduces "Task Tokens", which enhance goal-conditioned BFMs by generating task-specific tokens through a task encoder, enabling the BFM to perform complex tasks like motion tracking and goal-reaching with high success rates; their method shows up to 99.75% success in tasks such as the long jump. Additionally, [44] addresses a limitation of FB representations by introducing belief-FB, which uses a transformer encoder to infer environmental dynamics, improving adaptation to unseen changes. Their approach achieves up to 2× improvement in performance under dynamic variations, demonstrating enhanced zero-shot capabilities. Future work could investigate additional post-training techniques, such as test-time scaling [89] or RL from human feedback [90], to further improve the adaptability and efficiency of BFMs and ensure alignment with human preferences in real-world applications.

3.2.2 Towards Hierarchical Control

While fine-tuning techniques enhance the performance of BFMs for specific tasks through minimal modifications at test time, several pioneering works have attempted to establish a hierarchical control architecture based on BFMs, which decouples high-level planning from low-level motion execution to achieve more scalable and flexible control [91, 92].
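Before turning to concrete systems, the following schematic sketch shows the division of labor such architectures share; every interface here (plan_with_llm, bfm.encode_task, bfm.act, env) is hypothetical and stands in for whatever planner and BFM a given work actually uses.

```python
def run_hierarchical_episode(plan_with_llm, bfm, env, instruction, max_steps=500):
    # High level: translate a language instruction into a sequence of subgoals.
    subgoals = plan_with_llm(instruction)
    obs = env.reset()
    for goal in subgoals:
        # Map each subgoal to a task latent understood by the pretrained BFM.
        z = bfm.encode_task(goal)
        for _ in range(max_steps):
            # Low level: the BFM produces whole-body actions in real time.
            action = bfm.act(obs, z)
            obs, done = env.step(action)
            if done:  # subgoal reached or episode terminated
                break
    return obs
```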
For example, UniHSI [93] introduces a unified framework for human-scene interaction by using language commands to guide a chain-of-contacts, which represents the sequence of human-object contact pairs. The system translates language inputs into structured task plans, which are then executed by a unified controller based on the AMP architecture. This framework achieves semantic alignment between language commands and physical motions, supporting diverse interactions with single or multiple objects. TokenHSI [94] further extends this line by proposing a transformer-based unified policy that tokenizes human proprioception and task states. By separating shared motor knowledge (the proprioception token) from task-specific parameters (task tokens), it enables seamless multi-skill unification and flexible adaptation to novel tasks, such as skill composition (e.g., carrying while sitting), object or terrain shape variation, and long-horizon task completion.

In contrast, CLoSD [95] introduces a text-driven RL controller that combines motion diffusion models with physics-based simulations for robust multi-task human character control. By utilizing a real-time diffusion planner and a motion tracking controller in a closed-loop feedback system, CLoSD can handle complex tasks like goal-reaching, striking, and human-object interactions, all controlled through text prompts and target locations. Similarly, UniPhys [96] introduces a diffusion-based behavior cloning framework unifying planning and control with diffusion forcing to handle prediction errors, enabling flexible control via text, velocity, and goal guidance for applications like dynamic obstacle avoidance and long-horizon planning. These applications demonstrate that BFMs serve as a pivotal bridge between high-level semantic instructions and low-level physical execution, leveraging pre-trained behavioral priors to enable zero-shot adaptation, multi-task generalization, and physics-aware motion synthesis. By integrating various control paradigms (e.g., transformer-based tokenization and closed-loop diffusion planning), these works highlight BFMs' potential to democratize humanoid control across complex, real-world scenarios, from interactive robotics to dynamic environment adaptation.

4 APPLICATIONS AND LIMITATIONS

BFMs are foreseen to significantly enhance humanoid robotics by providing a universal pre-trained controller capable of generalizing across diverse tasks. In this section, we explore the potential applications of BFMs in diverse industries macroscopically, such as healthcare robotics and gaming. Furthermore, we identify the key limitations of current BFMs, including the Sim2Real gap, the data bottleneck, and embodiment generalization. Our analysis is inspired by the current development and applications of other foundation models, such as LLMs [97] and large vision models [98].

4.1 Applications

4.1.1 General Accelerator for Humanoid Robotics

BFMs will act as a transformative general accelerator for humanoid robotics, speeding the development and deployment of advanced WBC systems. Unlike previous pipelines that require resource-intensive and task-specific training, BFMs eliminate training from scratch by pre-training on vast and diverse behavior datasets, embedding a rich behavioral prior that dramatically accelerates downstream adaptation [42]. Moreover, advanced BFMs with zero-shot adaptation capabilities, such as Motivo [41], can directly map high-level task specifications (e.g., goal states and reward functions) to low-level control actions, bypassing traditional RL loops entirely. This capability not only solves basic control tasks efficiently but also facilitates rapid prototyping, allowing developers to evaluate robot behaviors in both simulation and real-world environments within minutes, dramatically shortening the development cycle.

4.1.2 Virtual Agents and Gaming

Generative AI has significantly revolutionized digital content creation, particularly in art, animation, and game design [99–105]. Among these applications, controlling dynamic and interactive behaviors for non-player characters (NPCs) in games remains a significant challenge. Traditional rule-based or scripted NPCs often exhibit limited diversity, unnatural movements, and poor adaptability to player actions [106–108]. BFMs offer a groundbreaking solution by enabling lifelike, context-aware NPC behaviors without extensive manual scripting. Pre-trained on diverse human behavior datasets, BFMs generate adaptive actions, such as tactical combat, social engagement, or exploration, seamlessly responding to dynamic player inputs. By integrating BFMs with LLMs, NPCs can interpret complex player instructions (e.g., dialogue-driven commands in role-playing games) and foster immersive and responsive interactions. This capability positions BFMs as a pivotal technology for revolutionizing virtual agents, enabling next-generation gaming experiences with unprecedented behavioral realism and interactivity.

4.1.3 Towards Industry 5.0

While Industry 4.0 introduced smart factories with cyber-physical systems, IoT, and AI-driven automation, Industry 5.0 shifts toward human-centric, resilient, and sustainable manufacturing, emphasizing collaborative robotics, adaptive intelligence, and personalized production [109–113]. To that end, robots must move beyond rigid automation and instead exhibit generalizable, adaptive, and explainable behaviors [114–118]. BFMs are expected to empower this landscape by enabling humanoid robots to seamlessly blend pre-trained motor skills with real-time adaptability, effortlessly switching between tasks like precision welding and adaptive part handling. By integrating large multimodal models with BFMs, robots can process diverse inputs like gestures, voice commands (e.g., "handle gently"), or environmental cues, fostering intuitive human-robot collaboration in shared workspaces. BFMs also ensure resilience, autonomously recovering from disturbances like unbalanced loads in logistics, and support personalized production through zero-shot or few-shot learning.

4.1.4 Healthcare and Assistive Robotics

The global population aging presents unprecedented challenges for healthcare systems [119–122], increasing demand for assistive technologies that support independent living and rehabilitation. Extensive and diverse robots have been developed for robot-assisted dressing, rehabilitation therapy, medical treatment, and caregiving for the elderly and children [123–131].
Humanoid robots are ideally suited for these tasks due to their anthropomorphic design, navigating human-centric environments and performing precise, natural, and intuitive interactions. BFMs offer a promising solution by enabling robots to adapt to diverse user needs and unstructured environments. For instance, BFMs can empower assistive robots to perform tasks like mobility support (e.g., fall prevention, gait assistance) or daily tasks (e.g., object retrieval, meal preparation) with minimal user-specific tuning. In rehabilitation, BFMs trained on clinician-guided demonstrations can personalize therapy protocols by dynamically adjusting task difficulty or providing real-time feedback based on patient progress.

[Fig. 8 outline:
Applications — General Accelerator for Humanoid Robotics (universal motor controller, rapid task adaptation); Virtual Agents and Gaming (lifelike NPCs, immersive and responsive interactions); Towards Industry 5.0 (flexible manufacturing control, intuitive HRI); Healthcare and Assistive Robotics (adaptive rehabilitation support, personalized care protocols).
Limitations — Sim2Real Gap (improbable behavior, simulation dynamics mismatch); Data Bottleneck (scarce demonstrations, low data quality, real-world data); Embodiment Generalization (embodiment-dependent policies, actuator compatibility issues).
Opportunities — Multimodal BFMs (exteroceptive signals such as vision, acoustics, and tactile feedback); High-level ML System (LLMs-based planning and BFMs-based control); Scaling Law (model architecture, parameter size, data scale); Post-training (human feedback alignment, test-time policy refinement); Multi-agent System (shift the focus to high-level design, interaction-enhanced BFMs); Evaluation Mechanism (quantitative metrics, qualitative human evaluation).
Risks — Ethical Issues (privacy-sensitive training data, unintended harmful behaviors); Safety Mechanism (fail-safe motion constraints, real-time anomaly detection).]
Fig. 8. An overview of the applications, limitations, research opportunities, and potential risks of BFMs.

4.2 Limitations

4.2.1 Sim2Real Gap

The Sim2Real gap is a persistent challenge in robotics, representing the performance discrepancy between policies trained in simulators and their real-world deployment [30–33]. Traditional model-based controllers address this through explicit physics modeling and robust optimization, while data-driven approaches employ domain randomization and system identification [132–135]. Recent advancements like ASAP [136] mitigate dynamics mismatch via residual action learning yet face policy-specific limitations, as their residuals are trained on trajectories from a single pre-trained policy, restricting generalization. BFMs exacerbate this challenge by encoding a vast spectrum of behaviors, from locomotion to multi-contact interactions, introducing high-dimensional transfer risks. For example, a BFM trained for diverse humanoid motions may fail to adapt to real-world actuator delays, friction variations, or sensor noise, resulting in unstable or unsafe execution. The integration of visual signals into control systems further compounds the challenge, as perceptual domain shifts (e.g., lighting, texture, or camera calibration mismatches) and generalization gaps in visual features can destabilize motion policies trained on simulated visual inputs. Current BFMs remain largely confined to simulation, with no documented large-scale real-world deployments. While several pioneering works have been devoted to developing BFM-like controllers on real humanoid robots, like HugWBC [55] and CLONE [137], their motion skills remain narrow, highlighting a significant challenge to achieving Sim2Real feasibility while maintaining behavior richness. This gap stems from behavioral overgeneralization, dynamics mismatches, and latent space instability, which hinder scaling targeted Sim2Real successes (e.g., quadruped locomotion or grasping) to BFMs' complexity.

4.2.2 Data Bottleneck

The data bottleneck poses a fundamental constraint in developing BFMs for humanoid robots. While the datasets listed in Table 2 have been successfully employed to train current BFMs, their scale remains significantly smaller than the datasets used to train LLMs or large vision models. This scarcity is exacerbated when retargeting motions to specific robotic platforms [138], where subtle morphological differences can incur severe policy performance loss. Real-world robot data is even more constrained due to hardware limitations and safety concerns. The challenge compounds when considering multimodal data requirements.

Current BFMs predominantly rely on proprioceptive inputs, and integrating exteroceptive sensing for real-world deployment introduces new bottlenecks.
PREPRINT. UNDER REVIEW. 11

TABLE 2 such as ethical concerns in human-robot interaction and


Available humanoid motion datasets for training BFMs. safety mechanisms for real-world deployment. By address-
ing these opportunities and risks, the robotics community
Dataset Clip Hour Date can ensure that BFMs evolve into robust, adaptable, and so-
KIT-ML [139] 3911 11.2 2016 cially responsible controllers for next-generation humanoid
AMASS [140] 11265 40.0 2019 systems.
LAFAN [141] 77 4.6 2020
BABEL [142] 13220 43.5 2021
Posescript [143] - - 2022 5.1 Opportunities
HumanML3D [144] 14616 28.6 2022
Motion-X [145] 81084 144.2 2023 5.1.1 Multimodal BFMs
Motion-X++ [146] 120462 180.9 2025 A promising direction for BFMs lies in expanding beyond
proprioceptive inputs to multimodal sensory integration,
incorporating exteroceptive signals such as vision, acous-
deployment introduces new bottlenecks. Pioneering work tic signals, and tactile feedback [148–150]. While current
like OmniH2O [147] explores multimodal data collection BFMs primarily rely on proprioceptive information, richer
through teleoperation and pairing head-mounted RGBD perceptual inputs could enable more robust and adaptive
cameras with whole-body proprioception and motor com- behaviors in unstructured environments. For example, inte-
mands. While this enables tasks like vision-based autonomy grating real-time visual perception could enable humanoids
and imitation learning, its 40-minute dataset (OmniH2O- to dynamically adjust their movements based on object
6) remains limited to lab environments and lacks diverse positions, terrain conditions, or human interactions, enhanc-
contact dynamics. Currently, there is no existing dataset ing both safety and task performance. However, achieving
that provides large-scale and temporally aligned record- multimodal BFMs requires the advancement of both large-
ings of proprioception, vision, and contact dynamics across scale datasets and scalable training paradigms, which are
diverse environments. Therefore, larger-scale, high-quality, key challenges for the field.
and more curated datasets are urgently needed to enhance
the effectiveness of current BFMs and to aid in the develop- 5.1.2 High-level ML System
ment of future models. Recent advances in foundation models like LLMs and VLAs
4.2.3 Embodiment Generalization
Despite the use of the term "BFM," current BFMs are typically trained on a specific humanoid robot embodiment with fixed morphology, actuator dynamics, and sensor configuration. While they excel at controlling the robot they were trained on, generalizing to novel embodiments with different body shapes, actuator types, or degrees of freedom is currently intractable. This limitation stems from several key challenges. One major issue is morphological mismatch: policies designed for a specific kinematic or dynamic structure struggle to adapt to robots with differing link lengths, joint types, or mass distributions. Additionally, differences in actuator dynamics, such as variations in torque limits, latency, or control modes (e.g., position vs. torque control), can destabilize policies when transferred between systems. Sensor diversity adds another layer of complexity, as BFMs often assume consistent sensor inputs and have difficulty handling missing or additional sensing modalities. Furthermore, reward functions created for one type of embodiment (e.g., humanoid walking) may not translate effectively to others (e.g., hexapod gait patterns), confounding task adaptation. These challenges underscore the need for more adaptable BFM architectures that can abstract skills across diverse robotic platforms, as sketched below.
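One commonly discussed way to pursue such abstraction is to condition the policy on an explicit morphology descriptor rather than baking a single embodiment into the weights. The following is a hedged sketch of that idea; the network sizes, descriptor contents, and interfaces are all illustrative assumptions, not a description of any published BFM.

```python
import torch
import torch.nn as nn

class EmbodimentConditionedPolicy(nn.Module):
    """Sketch: abstract skills across embodiments by conditioning the policy
    on a learned embedding of the robot's morphology (link lengths, joint
    types, actuator limits), so one set of weights can serve several robots."""

    def __init__(self, obs_dim=64, morph_dim=32, act_dim=23, hidden=256):
        super().__init__()
        # Encodes a flat morphology descriptor into a compact embedding.
        self.morph_encoder = nn.Sequential(
            nn.Linear(morph_dim, hidden), nn.ReLU(), nn.Linear(hidden, 64))
        # The policy consumes proprioception concatenated with the embedding.
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + 64, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, obs, morph_descriptor):
        z = self.morph_encoder(morph_descriptor)
        return self.policy(torch.cat([obs, z], dim=-1))

# Usage: the same weights serve two robots with different descriptors.
policy = EmbodimentConditionedPolicy()
action_a = policy(torch.randn(1, 64), torch.randn(1, 32))  # robot A
action_b = policy(torch.randn(1, 64), torch.randn(1, 32))  # robot B
```

A descriptor-conditioned policy by itself does not resolve actuator or sensor mismatch, but it makes the embodiment an explicit input that training can vary, rather than an implicit constant.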
5 OPPORTUNITIES AND RISKS

In this section, we explore key research opportunities that could enhance the broader adoption of BFMs, including their integration into high-level ML systems, scaling laws, post-training techniques, and multi-agent coordination. We also critically examine the risks associated with BFMs, such as ethical concerns in human-robot interaction and safety mechanisms for real-world deployment. By addressing these opportunities and risks, the robotics community can ensure that BFMs evolve into robust, adaptable, and socially responsible controllers for next-generation humanoid systems.

5.1 Opportunities

5.1.1 Multimodal BFMs
A promising direction for BFMs lies in expanding beyond proprioceptive inputs to multimodal sensory integration, incorporating exteroceptive signals such as vision, acoustic signals, and tactile feedback [148–150]. While current BFMs primarily rely on proprioceptive information, richer perceptual inputs could enable more robust and adaptive behaviors in unstructured environments. For example, integrating real-time visual perception could enable humanoids to dynamically adjust their movements based on object positions, terrain conditions, or human interactions, enhancing both safety and task performance. However, achieving multimodal BFMs requires advances in both large-scale datasets and scalable training paradigms, which remain key challenges for the field.
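To illustrate what "multimodal integration" could mean architecturally, the sketch below shows a simple late-fusion encoder whose output would replace a purely proprioceptive observation. All modality dimensions and the tiny vision network are illustrative assumptions; this is one possible design among many, not a reference implementation.

```python
import torch
import torch.nn as nn

class MultimodalBFMEncoder(nn.Module):
    """Sketch: per-modality encoders followed by late fusion. The fused
    feature would condition the BFM in place of raw proprioception."""

    def __init__(self, proprio_dim=64, tactile_dim=24, feat=128):
        super().__init__()
        self.proprio = nn.Linear(proprio_dim, feat)
        self.tactile = nn.Linear(tactile_dim, feat)
        self.vision = nn.Sequential(          # tiny CNN for a depth image
            nn.Conv2d(1, 16, 5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(feat))
        self.fuse = nn.Sequential(nn.Linear(3 * feat, feat), nn.ReLU())

    def forward(self, proprio, depth, tactile):
        parts = [self.proprio(proprio), self.vision(depth), self.tactile(tactile)]
        return self.fuse(torch.cat(parts, dim=-1))

# Usage with illustrative shapes: 64-D proprioception, 64x64 depth, 24 taxels.
enc = MultimodalBFMEncoder()
z = enc(torch.randn(1, 64), torch.randn(1, 1, 64, 64), torch.randn(1, 24))
```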
5.1.2 High-level ML System
Recent advances in foundation models like LLMs and VLAs have demonstrated their potential as controllers for coordinating specialized AI systems [29, 151–153]. For example, HuggingGPT [151] leverages LLMs to orchestrate task planning across diverse models, while Eureka [29] employs LLMs to automate reward function design in RL. These projects suggest a natural pathway for integrating BFMs into high-level ML systems, where they could serve as universal low-level controllers for humanoid robots. By combining the reasoning capabilities of LLMs (for task planning and adaptation) with the motor priors encoded in BFMs (for real-time execution), such systems could achieve unprecedented flexibility in handling complex, multi-step physical tasks while minimizing task-specific engineering. The ultimate vision is a unified cognitive-physical architecture based on LLMs and BFMs, mirroring the seamless integration of human cognition and motor control; a minimal sketch of this planner-controller loop is given below.
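The loop can be summarized in a few lines of Python. Both `llm_plan` and `DummyBFM` are hypothetical stand-ins we introduce purely for illustration; no real LLM or BFM API is being shown here.

```python
# Sketch of the LLM-as-planner / BFM-as-controller loop described above.

def llm_plan(instruction: str) -> list[str]:
    """Stand-in for an LLM that decomposes a task into skill prompts."""
    return ["walk to the table", "squat", "reach forward"]

class DummyBFM:
    """Stand-in for a pretrained BFM exposing prompt-conditioned control."""
    def infer_skill(self, prompt: str):
        return f"z({prompt})"          # latent skill/task embedding
    def act(self, observation, skill):
        return [0.0] * 23              # whole-body joint targets

bfm = DummyBFM()
for subtask in llm_plan("set the cup on the table"):
    skill = bfm.infer_skill(subtask)   # slow reasoning grounds a motor prior
    for _ in range(100):               # fast loop: BFM runs real-time control
        obs = None                     # read robot sensors here
        action = bfm.act(obs, skill)   # send to actuators at, e.g., 50 Hz
```

The key design point is the separation of timescales: the LLM is queried once per subtask, while the BFM closes the high-frequency control loop, so language-model latency never sits on the real-time path.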
5.1.3 Scaling Law
The concept of scaling laws describes how the performance of neural language models improves with increases in model size, data, and compute resources. While these laws are well established in domains like language and vision, their applicability to BFMs remains an open problem [154–158]. Preliminary evidence suggests that scaling BFMs through larger architectures, diverse training datasets, and expanded computational resources can enhance their generalization and zero-shot adaptation capabilities. For instance, FB-CPR [41] demonstrated improved performance in humanoid control tasks as parameters increased from 25M to 288M, achieving more robust zero-shot motion tracking and reward optimization. On the other hand, data scaling appears even more critical for BFMs, as it directly determines the quality and physical plausibility of the learned behavior priors, a foundational requirement that model scaling alone cannot address. However, the effect of data scaling remains underexplored, with open questions as discussed in Section 4.2.2. Furthermore, unlike LLMs, BFM scaling must strike an appropriate balance between behavior coverage (diverse motor skills) and control efficiency (precision and real-time stability). Future work should rigorously quantify these scaling dynamics to unlock BFMs' full potential as general-purpose controllers.
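For concreteness, the canonical power-law form established for language models [154] is restated below. Whether BFM control performance follows the same functional form is precisely the open question raised above, so the equations should be read as a template to be tested, not as an established law for BFMs.

```latex
% Power-law scaling template from neural language models [154].
% L: pretraining loss; N: (non-embedding) parameter count; D: dataset size;
% N_c, D_c, alpha_N, alpha_D: empirically fitted constants.
\begin{equation}
  L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
  \qquad
  L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}.
\end{equation}
```

A BFM-specific study would additionally need to replace the scalar loss L with control-relevant metrics (e.g., tracking error or task success rate), since pretraining loss alone does not capture the coverage-versus-precision trade-off noted above.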
5.1.4 Post-training
Post-training techniques have emerged as critical tools for the continued success and refinement of foundation models, especially LLMs [159]; they refine models to improve reasoning, address limitations, and better align outputs with user intents and ethical considerations. Among these methods, fine-tuning [160–165], integration of RL [90, 166–172], and test-time scaling [89, 173–175] have been the most prominent strategies for optimizing LLMs' performance. Integrating these post-training strategies presents unique research opportunities for BFMs. For instance, leveraging RL techniques like RLHF and RLAIF can be crucial for refining the alignment between an agent's behavior and real-world human expectations, especially in human-centric task environments. This opens avenues for developing more robust models that adapt dynamically to user feedback. Additionally, test-time scaling for BFMs could optimize computational efficiency during deployment, especially for real-time robot control or decision-making systems. Research could focus on improving the scalability of BFMs while ensuring that model outputs remain accurate and contextually appropriate across varying operational conditions.
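As a sketch of what RLHF-style post-training could look like for a BFM, the snippet below assumes we already have a pretrained policy and a reward model trained on human preferences over motion clips; both, along with the environment interface, are placeholders. The update shown is plain REINFORCE for brevity, whereas a practical system would use PPO or a comparable algorithm.

```python
import torch

def post_train_step(bfm_policy, reward_model, env, optimizer, horizon=200):
    """One hedged RLHF-style update: roll out the BFM, score the motion
    with a human-preference reward model, and nudge the policy toward
    higher-preference behavior (REINFORCE objective)."""
    obs = env.reset()
    log_probs, rewards = [], []
    for _ in range(horizon):
        dist = bfm_policy(obs)                # action distribution
        action = dist.sample()
        log_probs.append(dist.log_prob(action).sum())
        obs, _, done, _ = env.step(action)
        rewards.append(reward_model(obs))     # human-preference reward
        if done:
            break
    ret = torch.stack(rewards).sum().detach() # episode return, no gradient
    loss = -ret * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

For a physical robot, such updates would more plausibly run in simulation or on logged data, with only the vetted policy deployed to hardware.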
5.1.5 Multi-agent System
A multi-agent system (MAS) consists of multiple interacting agents that can be cooperative, competitive, or a mix of both, aiming to address complex tasks that require collaboration, decision-making, and behavior coordination among agents [176–181]. BFMs can fundamentally accelerate the construction of MAS composed of humanoid robots by eliminating the need to laboriously teach each robot basic survival skills like balance and locomotion before they can collaborate. Instead, researchers can focus directly on higher-level coordination challenges like role allocation and team strategy. However, current BFMs trained on single-robot data lack the specialized interaction capabilities needed for optimal collaboration. This presents a promising research direction: developing next-generation BFMs trained explicitly on multi-robot interaction scenarios. Such models could better handle physical coordination challenges like object handovers, formation maintenance, and collision avoidance while preserving their generalizability.

5.1.6 Evaluation Mechanism
While foundation models like LLMs benefit from well-established benchmarks (e.g., GPQA [182] for broad knowledge recall, MATH [183] for mathematical problem solving, or MUSR [184] for multi-step reasoning), there is no specific and comprehensive evaluation mechanism for assessing BFMs' capabilities and guiding their evolution. A robust assessment of BFMs must consider multiple interdependent factors, including task generalization across unseen scenarios, adaptability to new skills with minimal data, robustness against physical perturbations, and alignment with human safety and interpretability standards. For example, Motivo [41] combines quantitative metrics like task success rates with qualitative human evaluations of motion naturalness, yet critical gaps remain in assessing compositional skill combinations, hardware-specific constraints, and long-term behavioral stability. Future benchmarks should focus on progressive difficulty levels and cross-domain transfer tests to effectively assess BFMs' potential as general-purpose physical controllers in both simulated and real-world scenarios. This approach will ultimately guide the field towards developing more capable and reliable humanoid systems.

5.2 Risks

5.2.1 Ethical Issues
Ethical issues consistently accompany the development of diverse foundation models [185–187], involving biased or unlicensed data, racial discrimination, uncontrollable behaviors, and more. For BFMs, training on non-diverse motion datasets may encode demographic biases, such as favoring movements natural to specific age groups or body types, which then propagate into robotic behaviors and create embodied forms of discrimination. Meanwhile, privacy risks escalate beyond data memorization to movement analytics, where rehabilitation or performance data could leak sensitive health information through generated motions. The physical instantiation of BFMs introduces unprecedented risks: unlike purely digital models, misaligned BFMs might reproduce unsafe or socially harmful behaviors (e.g., aggressive gestures or exclusionary motions) with real-world consequences. While techniques like differential privacy and federated learning offer partial solutions, they struggle with the temporal nature of continuous motion data. In summary, BFMs demand novel governance frameworks that address both data provenance and the normativity of real-time behavior.
5.2.2 Safety Mechanism
As BFMs are increasingly deployed in real-world robotic systems, they introduce critical safety requirements beyond those of digital foundation models [188–190]. A key issue is maintaining model behavior integrity, particularly in safety-critical scenarios such as human-robot interaction and autonomous navigation. When trained on large-scale yet weakly curated motion datasets, BFMs may unintentionally learn unsafe or undesirable behaviors. Even minor changes in sensory input, whether caused by adversarial attacks or sensor noise, can lead to control failures. This highlights the need for robustness against shifts in data distribution and protection against malicious input manipulation.

While most current BFMs focus primarily on proprioceptive inputs, integrating multimodal information (e.g., visual, linguistic, and auditory cues) has emerged as a promising direction for more generalizable and situationally aware control. However, multimodality introduces new vulnerabilities. Adversaries can exploit inconsistencies across modalities, as seen in the well-known CLIP case where an apple image was misclassified as an "iPod" due to an overlaid text label [191]. Such cross-modal confusion can be especially dangerous when a BFM is adapted for a unimodal task but retains sensitivity to irrelevant signals from other modalities. These challenges underscore the need to develop robust safety mechanisms for BFMs. Future work could prioritize adversarial robustness, cross-modal consistency checks, and disentangling modality-specific information to ensure predictable, trustworthy robot behavior in open-world environments.
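One lightweight instantiation of such a mechanism is a runtime filter that (i) checks agreement between modality-specific confidence estimates and (ii) clamps or overrides the BFM's command when they disagree. The sketch below is purely illustrative; the thresholds, scores, and fallback controller are assumptions, not components of any published system.

```python
import numpy as np

def consistent(vision_score: float, proprio_score: float, tol: float = 0.3) -> bool:
    """Flag disagreement between modality-specific confidence estimates."""
    return abs(vision_score - proprio_score) < tol

def safe_action(bfm_action: np.ndarray,
                fallback: np.ndarray,
                vision_score: float,
                proprio_score: float,
                torque_limit: float = 80.0) -> np.ndarray:
    """Clamp the BFM command and fall back to a conservative controller
    (e.g., balance-in-place) when the modalities disagree."""
    if not consistent(vision_score, proprio_score):
        return fallback
    return np.clip(bfm_action, -torque_limit, torque_limit)
```

Such filters do not make a BFM safe by themselves, but they provide a verifiable outer layer whose behavior can be audited independently of the learned policy.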
6 CONCLUSION

In this paper, we present a systematic overview of the behavior foundation model (BFM), an emerging yet transformative paradigm for humanoid whole-body control systems. By pre-training on large-scale and diverse humanoid behavior data, BFMs learn broad behavior coverage that enables few-shot or zero-shot adaptation to a wide range of downstream tasks, eliminating the need for resource-intensive task-specific training. We establish a comprehensive taxonomy categorizing BFM approaches into supervised and unsupervised frameworks, while demonstrating their real-world applicability across healthcare, gaming, and industrial domains. Furthermore, we identify key research opportunities in high-level ML system integration, post-training optimization, and standardized evaluation mechanisms that could accelerate BFM development.

Despite their unprecedented capabilities, BFMs face significant challenges, including the Sim2Real gap, embodiment dependence, and data scarcity. The physical instantiation of BFMs introduces unique safety risks requiring robust verification mechanisms, while their training on human motion data raises ethical concerns regarding privacy and bias mitigation. Addressing these limitations in future work will lead to more reliable and generalizable BFMs. We expect this work to inspire further research on BFMs.
S. Yuan, C. Wen, B. Huang, A. Nguyen, and
Y. Fang, “Embodied chain of action reasoning with
R EFERENCES multi-modal foundation model for humanoid loco-
[1] D. Kulić, G. Venture, K. Yamane, E. Demircan, I. Mizu- manipulation,” arXiv preprint arXiv:2504.09532, 2025.
uchi, and K. Mombaur, “Anthropomorphic movement [15] M. Murooka, K. Fukumitsu, M. Hamze, M. Mori-
analysis and synthesis: A survey of methods and sawa, H. Kaminaga, F. Kanehiro, and E. Yoshida,
applications,” IEEE Transactions on Robotics, vol. 32, “Whole-body multi-contact motion control for hu-
no. 4, pp. 776–795, 2016. manoid robots based on distributed tactile sensors,”
[2] A. Goswami and P. Vadakkepat, Humanoid robotics: a IEEE Robotics and Automation Letters, 2024.
reference. Springer Dordrecht, 2019. [16] X. B. Peng, P. Abbeel, S. Levine, and M. Van de
[3] M. Schwenzer, M. Ay, T. Bergs, and D. Abel, “Review Panne, “Deepmimic: Example-guided deep reinforce-
on model predictive control: An engineering perspec- ment learning of physics-based character skills,” ACM
tive,” The International Journal of Advanced Manufactur- Transactions On Graphics (TOG), vol. 37, no. 4, pp. 1–14,
ing Technology, vol. 117, no. 5, pp. 1327–1349, 2021. 2018.
[4] G. Romualdi, S. Dafarra, G. L’Erario, I. Sorrentino, [17] A. Singh, H. Liu, G. Zhou, A. Yu, N. Rhine-
S. Traversaro, and D. Pucci, “Online non-linear cen- hart, and S. Levine, “Parrot: Data-driven behav-
troidal mpc for humanoid robot locomotion with ioral priors for reinforcement learning,” arXiv preprint
step adjustment,” in 2022 International Conference on arXiv:2011.10024, 2020.
Robotics and Automation (ICRA), pp. 10412–10419, [18] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement
IEEE, 2022. learning in robotics: A survey,” The International Jour-
[5] S. P. Sethi and S. P. Sethi, What is optimal control theory? nal of Robotics Research, vol. 32, no. 11, pp. 1238–1274,
Springer, 2021. 2013.
[6] O. Khatib, “A unified approach for motion and force [19] N. Sünderhauf, O. Brock, W. Scheirer, R. Hadsell,
control of robot manipulators: The operational space D. Fox, J. Leitner, B. Upcroft, P. Abbeel, W. Bur-
[20] H. Nguyen and H. La, “Review of deep reinforcement learning for robot manipulation,” in 2019 Third IEEE International Conference on Robotic Computing (IRC), pp. 590–595, IEEE, 2019.
[21] A. I. Károly, P. Galambos, J. Kuti, and I. J. Rudas, “Deep learning in robotics: Survey on model structures and training strategies,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 51, no. 1, pp. 266–279, 2020.
[22] R. Liu, F. Nageotte, P. Zanne, M. de Mathelin, and B. Dresp-Langley, “Deep reinforcement learning for the control of robotic manipulation: a focussed mini-review,” Robotics, vol. 10, no. 1, p. 22, 2021.
[23] J. Chen, D. Tamboli, T. Lan, and V. Aggarwal, “Multi-task hierarchical adversarial inverse reinforcement learning,” in International Conference on Machine Learning, vol. 202 of Proceedings of Machine Learning Research, pp. 4895–4920, PMLR, 2023.
[24] X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa, “Amp: Adversarial motion priors for stylized physics-based character control,” ACM Transactions on Graphics (TOG), vol. 40, no. 4, pp. 1–20, 2021.
[25] T. Huang, J. Ren, H. Wang, Z. Wang, Q. Ben, M. Wen, X. Chen, J. Li, and J. Pang, “Learning humanoid standing-up control across diverse postures,” arXiv preprint arXiv:2502.08378, 2025.
[26] M. Seo, S. Han, K. Sim, S. Bang, C. Gonzalez, L. Sentis, and Y. Zhu, “Deep imitation learning for humanoid loco-manipulation through human teleoperation,” in 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids), 2023.
[27] S. H. Bang, C. Gonzalez, J. Ahn, N. Paine, and L. Sentis, “Control and evaluation of a humanoid robot with rolling contact joints on its lower body,” Frontiers in Robotics and AI, vol. 10, p. 1164660, 2023.
[28] X. Cheng, Y. Ji, J. Chen, R. Yang, G. Yang, and X. Wang, “Expressive whole-body control for humanoid robots,” arXiv preprint arXiv:2402.16796, 2024.
[29] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” in The Twelfth International Conference on Learning Representations, 2024.
[30] A. Kadian, J. Truong, A. Gokaslan, A. Clegg, E. Wijmans, S. Lee, M. Savva, S. Chernova, and D. Batra, “Sim2real predictivity: Does evaluation in simulation predict real-world performance?,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6670–6677, 2020.
[31] S. Höfer, K. Bekris, A. Handa, J. C. Gamboa, M. Mozifian, F. Golemo, C. Atkeson, D. Fox, K. Goldberg, J. Leonard, et al., “Sim2real in robotics and automation: Applications and challenges,” IEEE Transactions on Automation Science and Engineering, vol. 18, no. 2, pp. 398–400, 2021.
[32] K. Iyengar, S. H. Sadati, C. Bergeles, S. Spurgeon, and D. Stoyanov, “Sim2real transfer of reinforcement learning for concentric tube robots,” IEEE Robotics and Automation Letters, vol. 8, no. 10, pp. 6147–6154, 2023.
[33] E. Su, C. Jia, Y. Qin, W. Zhou, A. Macaluso, B. Huang, and X. Wang, “Sim2real manipulation on unknown objects with tactile-based reinforcement learning,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 9234–9241, IEEE, 2024.
[34] S. Schaal, “Is imitation learning the route to humanoid robots?,” Trends in Cognitive Sciences, vol. 3, no. 6, pp. 233–242, 1999.
[35] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne, “Imitation learning: A survey of learning methods,” ACM Computing Surveys (CSUR), vol. 50, no. 2, pp. 1–35, 2017.
[36] M. Laskey, J. Lee, R. Fox, A. Dragan, and K. Goldberg, “Dart: Noise injection for robust imitation learning,” in Conference on Robot Learning, pp. 143–156, PMLR, 2017.
[37] B. Fang, S. Jia, D. Guo, M. Xu, S. Wen, and F. Sun, “Survey of imitation learning for robotic manipulation,” International Journal of Intelligent Robotics and Applications, vol. 3, no. 4, pp. 362–369, 2019.
[38] J. Hua, L. Zeng, G. Li, and Z. Ju, “Learning for a robot: Deep reinforcement learning, imitation learning, transfer learning,” Sensors, vol. 21, no. 4, 2021.
[39] M. Pirotta, A. Tirinzoni, A. Touati, A. Lazaric, and Y. Ollivier, “Fast imitation via behavior foundation models,” in The Twelfth International Conference on Learning Representations, 2024.
[40] E. Cetin, A. Touati, and Y. Ollivier, “Finer behavioral foundation models via auto-regressive features and advantage weighting,” arXiv preprint arXiv:2412.04368, 2024.
[41] A. Tirinzoni, A. Touati, J. Farebrother, M. Guzek, A. Kanervisto, Y. Xu, A. Lazaric, and M. Pirotta, “Zero-shot whole-body humanoid control via behavioral foundation models,” in The Thirteenth International Conference on Learning Representations, 2025.
[42] H. Sikchi, A. Tirinzoni, A. Touati, Y. Xu, A. Kanervisto, S. Niekum, A. Zhang, A. Lazaric, and M. Pirotta, “Fast adaptation with behavioral foundation models,” in Reinforcement Learning Conference, 2025.
[43] R. Vainshtein, Z. Rimon, S. Mannor, and C. Tessler, “Task tokens: A flexible approach to adapting behavior foundation models,” arXiv preprint arXiv:2503.22886, 2025.
[44] M. Bobrin, I. Zisman, A. Nikulin, V. Kurenkov, and D. Dylov, “Zero-shot adaptation of behavioral foundation models to unseen dynamics,” arXiv preprint arXiv:2505.13150, 2025.
[45] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021.
[46] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[47] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, pp. 8748–8763, PMLR, 2021.
[48] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015–4026, 2023.
[49] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning, pp. 2165–2183, PMLR, 2023.
[50] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, et al., “Openvla: An open-source vision-language-action model,” in Conference on Robot Learning, PMLR, 2024.
[51] J. Wen, Y. Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shen, et al., “Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation,” IEEE Robotics and Automation Letters, 2025.
[52] G. Zambella, G. Lentini, M. Garabini, G. Grioli, M. G. Catalano, A. Palleschi, L. Pallottino, A. Bicchi, A. Settimi, and D. Caporale, “Dynamic whole-body control of unstable wheeled humanoid robots,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3489–3496, 2019.
[53] F. L. Moro and L. Sentis, “Whole-body control of humanoid robots,” Humanoid Robotics: A Reference, pp. 1161–1183, 2019.
[54] L. Sentis and O. Khatib, “A whole-body control framework for humanoids operating in human environments,” in Proceedings 2006 IEEE International Conference on Robotics and Automation (ICRA 2006), pp. 2641–2648, IEEE, 2006.
[55] Y. Xue, W. Dong, M. Liu, W. Zhang, and J. Pang, “A unified and general humanoid whole-body controller for versatile locomotion,” in Robotics: Science and Systems (RSS), 2025.
[56] T. He, W. Xiao, T. Lin, Z. Luo, Z. Xu, Z. Jiang, J. Kautz, C. Liu, G. Shi, X. Wang, et al., “Hover: Versatile neural whole-body controller for humanoid robots,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2025.
[57] Y. Wang, M. Yang, W. Zeng, Y. Zhang, X. Xu, H. Jiang, Z. Ding, and Z. Lu, “From experts to a generalist: Toward general whole-body control for humanoid robots,” arXiv preprint arXiv:2506.12779, 2025.
[58] R. S. Sutton, A. G. Barto, et al., Reinforcement Learning: An Introduction, vol. 1. MIT Press, Cambridge, 1998.
[59] Z. Luo, J. Cao, K. Kitani, W. Xu, et al., “Perpetual humanoid control for real-time simulated avatars,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10895–10904, 2023.
[60] P. Wu, A. Majumdar, K. Stone, Y. Lin, I. Mordatch, P. Abbeel, and A. Rajeswaran, “Masked trajectory models for prediction, representation, and control,” in International Conference on Machine Learning, pp. 37607–37623, PMLR, 2023.
[61] S. Liu, G. Lever, Z. Wang, J. Merel, S. A. Eslami, D. Hennes, W. M. Czarnecki, Y. Tassa, S. Omidshafiei, A. Abdolmaleki, et al., “From motor control to team play in simulated humanoid football,” Science Robotics, vol. 7, no. 69, p. eabo0235, 2022.
[62] J. Merel, L. Hasenclever, A. Galashov, A. Ahuja, V. Pham, G. Wayne, Y. W. Teh, and N. Heess, “Neural probabilistic motor primitives for humanoid control,” in International Conference on Learning Representations, 2019.
[63] X. B. Peng, Y. Guo, L. Halper, S. Levine, and S. Fidler, “Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters,” ACM Transactions on Graphics (TOG), vol. 41, no. 4, pp. 1–17, 2022.
[64] C. Tessler, Y. Kasten, Y. Guo, S. Mannor, G. Chechik, and X. B. Peng, “Calm: Conditional adversarial latent models for directable virtual characters,” in ACM SIGGRAPH 2023 Conference Proceedings, pp. 1–9, 2023.
[65] Z. Dou, X. Chen, Q. Fan, T. Komura, and W. Wang, “C·ase: Learning conditional adversarial skill embeddings for physics-based characters,” in SIGGRAPH Asia 2023 Conference Papers, pp. 1–11, 2023.
[66] Y. Huang, Z. Dou, and L. Liu, “Modskill: Physical character skill modularization,” arXiv preprint arXiv:2502.14140, 2025.
[67] H. Yao, Z. Song, Y. Zhou, T. Ao, B. Chen, and L. Liu, “Moconvq: Unified physics-based motion control via scalable discrete representations,” ACM Transactions on Graphics (TOG), vol. 43, no. 4, pp. 1–21, 2024.
[68] C. Tessler, Y. Guo, O. Nabati, G. Chechik, and X. B. Peng, “Maskedmimic: Unified physics-based character control through masked motion inpainting,” ACM Transactions on Graphics (TOG), vol. 43, no. 6, pp. 1–21, 2024.
[69] S. Xu, H. Y. Ling, Y.-X. Wang, and L.-Y. Gui, “Intermimic: Towards universal whole-body control for physics-based human-object interactions,” in Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, 2025.
[70] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in International Conference on Machine Learning, pp. 2778–2787, PMLR, 2017.
[71] Y. Burda, H. Edwards, A. Storkey, and O. Klimov, “Exploration by random network distillation,” in International Conference on Learning Representations, 2019.
[72] D. Pathak, D. Gandhi, and A. Gupta, “Self-supervised exploration via disagreement,” in International Conference on Machine Learning, pp. 5062–5071, PMLR, 2019.
[73] R. Sekar, O. Rybkin, K. Daniilidis, P. Abbeel, D. Hafner, and D. Pathak, “Planning to explore via self-supervised world models,” in International Conference on Machine Learning, pp. 8583–8592, PMLR, 2020.
[74] K. Gregor, D. J. Rezende, and D. Wierstra, “Variational intrinsic control,” in International Conference on Learning Representations, 2017.
[75] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, “Diversity is all you need: Learning skills without a reward function,” in International Conference on Learning Representations, 2019.
[76] S. Hansen, W. Dabney, A. Barreto, D. Warde-Farley, T. Van de Wiele, and V. Mnih, “Fast task inference with variational intrinsic successor features,” in International Conference on Learning Representations, 2020.
[77] H. Liu and P. Abbeel, “Aps: Active pretraining with successor features,” in International Conference on Machine Learning, pp. 6736–6747, PMLR, 2021.
[78] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying count-based exploration and intrinsic motivation,” Advances in Neural Information Processing Systems, vol. 29, 2016.
[79] G. Ostrovski, M. G. Bellemare, A. Oord, and R. Munos, “Count-based exploration with neural density models,” in International Conference on Machine Learning, pp. 2721–2730, PMLR, 2017.
[80] Y. Seo, L. Chen, J. Shin, H. Lee, P. Abbeel, and K. Lee, “State entropy maximization with random encoders for efficient exploration,” in International Conference on Machine Learning, pp. 9443–9454, PMLR, 2021.
[81] D. Yarats, R. Fergus, A. Lazaric, and L. Pinto, “Reinforcement learning with prototypical representations,” in International Conference on Machine Learning, pp. 11920–11931, PMLR, 2021.
[82] J. Chen, V. Aggarwal, and T. Lan, “A unified algorithm framework for unsupervised discovery of skills based on determinantal point process,” Advances in Neural Information Processing Systems, vol. 36, pp. 67925–67947, 2023.
[83] M. Laskin, D. Yarats, H. Liu, K. Lee, A. Zhan, K. Lu, C. Cang, L. Pinto, and P. Abbeel, “Urlb: Unsupervised reinforcement learning benchmark,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
[84] L. Blier, C. Tallec, and Y. Ollivier, “Learning successor states and goal-dependent values: A mathematical viewpoint,” arXiv preprint arXiv:2101.07123, 2021.
[85] P. Dayan, “Improving generalization for temporal difference learning: The successor representation,” Neural Computation, vol. 5, no. 4, pp. 613–624, 1993.
[86] T. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gershman, “Deep successor reinforcement learning,” arXiv preprint arXiv:1606.02396, 2016.
[87] A. Touati and Y. Ollivier, “Learning one representation to optimize all rewards,” Advances in Neural Information Processing Systems, vol. 34, pp. 13–23, 2021.
[88] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, eds.), vol. 27, Curran Associates, Inc., 2014.
[89] K. Gandhi, D. Lee, G. Grand, M. Liu, W. Cheng, A. Sharma, and N. D. Goodman, “Stream of search (sos): Learning to search in language,” arXiv preprint arXiv:2404.03683, 2024.
[90] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[91] Y. Shao, X. Huang, B. Zhang, Q. Liao, Y. Gao, Y. Chi, Z. Li, S. Shao, and K. Sreenath, “Langwbc: Language-directed humanoid whole-body control via end-to-end learning,” arXiv preprint arXiv:2504.21738, 2025.
[92] H. Xue, X. Huang, D. Niu, Q. Liao, T. Kragerud, J. T. Gravdahl, X. B. Peng, G. Shi, T. Darrell, K. Sreenath, et al., “Leverb: Humanoid whole-body control with latent vision-language instruction,” arXiv preprint arXiv:2506.13751, 2025.
[93] Z. Xiao, T. Wang, J. Wang, J. Cao, W. Zhang, B. Dai, D. Lin, and J. Pang, “Unified human-scene interaction via prompted chain-of-contacts,” in The Twelfth International Conference on Learning Representations, 2024.
[94] L. Pan, Z. Yang, Z. Dou, W. Wang, B. Huang, B. Dai, T. Komura, and J. Wang, “Tokenhsi: Unified synthesis of physical human-scene interactions through task tokenization,” arXiv preprint arXiv:2503.19901, 2025.
[95] G. Tevet, S. Raab, S. Cohan, D. Reda, Z. Luo, X. B. Peng, A. H. Bermano, and M. van de Panne, “Closd: Closing the loop between simulation and diffusion for multi-task character control,” in The Thirteenth International Conference on Learning Representations, 2025.
[96] Y. Wu, K. Karunratanakul, Z. Luo, and S. Tang, “Uniphys: Unified planner and controller with diffusion for flexible physics-based character control,” arXiv preprint arXiv:2504.12540, 2025.
[97] H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,” arXiv preprint arXiv:2307.06435, 2023.
[98] Y. Liu, K. Zhang, Y. Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y. Huang, H. Sun, J. Gao, et al., “Sora: A review on background, technology, limitations, and opportunities of large vision models,” arXiv preprint arXiv:2402.17177, 2024.
[99] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10850–10869, 2023.
[100] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,” ACM Computing Surveys, vol. 56, no. 4, pp. 1–39, 2023.
[101] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, “A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt,” arXiv preprint arXiv:2303.04226, 2023.
[102] P. Pilaniwala, “Integrating genai in advancing game product management and development,” in 2024 Eighth International Conference on Parallel, Distributed and Grid Computing (PDGC), pp. 558–563, IEEE, 2024.
[103] P. Pilaniwala, G. Chhabra, and P. Kaur, “The future of game development in the era of gen ai,” in 2024 Artificial Intelligence for Business (AIxB), pp. 39–42, IEEE, 2024.
[104] D. Jiang, M. Ku, T. Li, Y. Ni, S. Sun, R. Fan, and W. Chen, “Genai arena: An open evaluation platform for generative models,” Advances in Neural Information Processing Systems, vol. 37, pp. 79889–79908, 2024.
[105] Z. Wu, Z. Chen, D. Zhu, C. Mousas, and D. Kao, “A systematic review of generative ai on game character creation: Applications, challenges, and future trends,” IEEE Transactions on Games, 2025.
[106] M. Kopel and T. Hajas, “Implementing ai for non-player characters in 3d video games,” in Intelligent Information and Database Systems: 10th Asian Conference, ACIIDS 2018, Dong Hoi City, Vietnam, March 19-21, 2018, Proceedings, Part I 10, pp. 610–619, Springer, 2018.
[107] A. Mehta, Y. Kunjadiya, A. Kulkarni, and M. Nagar, “Exploring the viability of conversational ai for non-playable characters: A comprehensive survey,” in 2021 4th International Conference on Recent Trends in Computer Science and Technology (ICRTCST), pp. 96–102, IEEE, 2022.
[108] M. Ç. Uludağlı and K. Oğuz, “Non-player character decision-making in computer games,” Artificial Intelligence Review, vol. 56, no. 12, pp. 14159–14191, 2023.
[109] X. Xu, Y. Lu, B. Vogel-Heuser, and L. Wang, “Industry 4.0 and industry 5.0—inception, conception and perception,” Journal of Manufacturing Systems, vol. 61, pp. 530–535, 2021.
[110] J. Leng, W. Sha, B. Wang, P. Zheng, C. Zhuang, Q. Liu, T. Wuest, D. Mourtzis, and L. Wang, “Industry 5.0: Prospect and retrospect,” Journal of Manufacturing Systems, vol. 65, pp. 279–295, 2022.
[111] S. Huang, B. Wang, X. Li, P. Zheng, D. Mourtzis, and L. Wang, “Industry 5.0 and society 5.0—comparison, complementation and co-evolution,” Journal of Manufacturing Systems, vol. 64, pp. 424–428, 2022.
[112] A. Akundi, D. Euresti, S. Luna, W. Ankobiah, A. Lopes, and I. Edinbarough, “State of industry 5.0—analysis and identification of current research trends,” Applied System Innovation, vol. 5, no. 1, p. 27, 2022.
[113] M. A. Hassan, S. Zardari, M. U. Farooq, M. M. Alansari, and S. A. Nagro, “Systematic analysis of risks in industry 5.0 architecture,” Applied Sciences, vol. 14, no. 4, p. 1466, 2024.
[114] A. Dzedzickis, J. Subačiūtė-Žemaitienė, E. Šutinys, U. Samukaitė-Bubnienė, and V. Bučinskas, “Advanced applications of industrial robotics: New trends and possibilities,” Applied Sciences, vol. 12, no. 1, p. 135, 2021.
[115] M. Bartoš, V. Bulej, M. Bohušík, J. Stanček, V. Ivanov, and P. Macek, “An overview of robot applications in automotive industry,” Transportation Research Procedia, vol. 55, pp. 837–844, 2021.
[116] J. Arents and M. Greitans, “Smart industrial robot control trends, challenges and opportunities within manufacturing,” Applied Sciences, vol. 12, no. 2, p. 937, 2022.
[117] M. Soori, R. Dastres, B. Arezoo, and F. K. G. Jough, “Intelligent robotic systems in industry 4.0: A review,” Journal of Advanced Manufacturing Science and Technology, p. 2024007, 2024.
[118] C.-C. Lee, S. Qin, and Y. Li, “Does industrial robot application promote green technology innovation in the manufacturing industry?,” Technological Forecasting and Social Change, vol. 183, p. 121893, 2022.
[119] D. T. Rowland, “Global population aging: History and prospects,” in International Handbook of Population Aging, pp. 37–65, Springer, 2009.
[120] D. E. Bloom, D. Canning, and A. Lubet, “Global population aging: Facts, challenges, solutions & perspectives,” Daedalus, vol. 144, no. 2, pp. 80–92, 2015.
[121] K. Navaneetham and D. Arunachalam, “Global population aging, 1950–2050,” in Handbook of Aging, Health and Public Policy: Perspectives from Asia, pp. 1–18, Springer, 2023.
[122] P. Zhang, H. Yang, C. Chen, T. Wang, and X. Jia, “The impact of population aging on corporate digital transformation: Evidence from china,” Technological Forecasting and Social Change, vol. 214, p. 124070, 2025.
[123] D. Feil-Seifer and M. J. Mataric, “Defining socially assistive robotics,” in 9th International Conference on Rehabilitation Robotics (ICORR 2005), pp. 465–468, IEEE, 2005.
[124] D. P. Miller, “Assistive robotics: an overview,” Assistive Technology and Artificial Intelligence: Applications in Robotics, User Interfaces and Natural Language Processing, pp. 126–136, 2006.
[125] A. M. Okamura, M. J. Matarić, and H. I. Christensen, “Medical and health-care robotics,” IEEE Robotics & Automation Magazine, vol. 17, no. 3, pp. 26–37, 2010.
[126] L. D. Riek, “Healthcare robotics,” Communications of the ACM, vol. 60, no. 11, pp. 68–78, 2017.
[127] J. Holland, L. Kingston, C. McCarthy, E. Armstrong, P. O’Dwyer, F. Merz, and M. McConnell, “Service robots in the healthcare sector,” Robotics, vol. 10, no. 1, p. 47, 2021.
[128] M. Kyrarini, F. Lygerakis, A. Rajavenkatanarayanan, C. Sevastopoulos, H. R. Nambiappan, K. K. Chaitanya, A. R. Babu, J. Mathew, and F. Makedon, “A survey of robots in healthcare,” Technologies, vol. 9, no. 1, p. 8, 2021.
[129] V. Sanchez, C. J. Walsh, and R. J. Wood, “Textile technology for soft robotic and autonomous garments,” Advanced Functional Materials, vol. 31, no. 6, p. 2008278, 2021.
[130] F. Zhang and Y. Demiris, “Learning garment manipulation policies toward robot-assisted dressing,” Science Robotics, vol. 7, no. 65, p. eabm6010, 2022.
[131] M. Javaid, A. Haleem, R. Pratap Singh, S. Rab, R. Suman, and L. Kumar, “Utilization of robotics for healthcare: a scoping review,” Journal of Industrial Integration and Management, vol. 10, no. 1, pp. 43–65, 2025.
[132] M. Kaspar, J. D. M. Osorio, and J. Bock, “Sim2real transfer for reinforcement learning without dynamics randomization,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4383–4388, IEEE, 2020.
[133] D. Horváth, G. Erdős, Z. Istenes, T. Horváth, and S. Földi, “Object detection using sim2real domain randomization for robotic applications,” IEEE Transactions on Robotics, vol. 39, no. 2, pp. 1225–1243, 2022.
[134] J. Huber, F. Hélénon, H. Watrelot, F. B. Amar, and S. Doncieux, “Domain randomization for sim2real transfer of automatically generated grasping datasets,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 4112–4118, IEEE, 2024.
[135] T. Yao, H. Wang, B. Lu, J. Ge, Z. Pei, M. Kowarschik, L. Sun, L. Seneviratne, and P. Qi, “Sim2real learning with domain randomization for autonomous guidewire navigation in robotic-assisted endovascular procedures,” IEEE Transactions on Automation Science and Engineering, 2025.
[136] T. He, J. Gao, W. Xiao, Y. Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbab, C. Pan, et al., “Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills,” in Robotics: Science and Systems (RSS), 2025.
[137] Y. Li, Y. Lin, J. Cui, T. Liu, W. Liang, Y. Zhu, and S. Huang, “Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks,” arXiv preprint arXiv:2506.08931, 2025.
[138] M. Gleicher, “Retargetting motion to new characters,” in Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pp. 33–42, 1998.
[139] M. Plappert, C. Mandery, and T. Asfour, “The kit motion-language dataset,” arXiv preprint arXiv:1607.03827, 2016.
[140] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black, “Amass: Archive of motion capture as surface shapes,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5442–5451, 2019.
[141] F. G. Harvey, M. Yurick, D. Nowrouzezahrai, and C. Pal, “Robust motion in-betweening,” ACM Transactions on Graphics (TOG), vol. 39, no. 4, pp. 60–1, 2020.
[142] A. R. Punnakkal, A. Chandrasekaran, N. Athanasiou, A. Quiros-Ramirez, and M. J. Black, “Babel: Bodies, action and behavior with english labels,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 722–731, 2021.
[143] G. Delmas, P. Weinzaepfel, T. Lucas, F. Moreno-Noguer, and G. Rogez, “Posescript: 3d human poses from natural language,” in European Conference on Computer Vision, pp. 346–362, Springer, 2022.
[144] C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5152–5161, 2022.
[145] J. Lin, A. Zeng, S. Lu, Y. Cai, R. Zhang, H. Wang, and L. Zhang, “Motion-x: A large-scale 3d expressive whole-body human motion dataset,” Advances in Neural Information Processing Systems, vol. 36, pp. 25268–25280, 2023.
[146] Y. Zhang, J. Lin, A. Zeng, G. Wu, S. Lu, Y. Fu, Y. Cai, R. Zhang, H. Wang, and L. Zhang, “Motion-x++: A large-scale multimodal 3d whole-body human motion dataset,” arXiv preprint arXiv:2501.05098, 2025.
[147] T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. M. Kitani, C. Liu, and G. Shi, “Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning,” in 8th Annual Conference on Robot Learning, 2024.
[148] S. Liu, L. Wang, and X. Vincent Wang, “Multimodal data-driven robot control for human–robot collaborative assembly,” Journal of Manufacturing Science and Engineering, vol. 144, no. 5, p. 051012, 2022.
[149] D. R. Yao, I. Kim, S. Yin, and W. Gao, “Multimodal soft robotic actuation and locomotion,” Advanced Materials, vol. 36, no. 19, p. 2308829, 2024.
[150] T. Wang, P. Zheng, S. Li, and L. Wang, “Multimodal human–robot interaction for human-centric smart manufacturing: a survey,” Advanced Intelligent Systems, vol. 6, no. 3, p. 2300359, 2024.
[151] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face,” Advances in Neural Information Processing Systems, vol. 36, pp. 38154–38180, 2023.
[152] S.-C. Dai, A. Xiong, and L.-W. Ku, “LLM-in-the-loop: Leveraging large language model for thematic analysis,” in The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
[153] C.-M. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu, and Z. Liu, “Chateval: Towards better LLM-based evaluators through multi-agent debate,” in The Twelfth International Conference on Learning Representations, 2024.
[154] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020.
[155] A. Clark, D. de Las Casas, A. Guy, A. Mensch, M. Paganini, J. Hoffmann, B. Damoc, B. Hechtman, T. Cai, S. Borgeaud, et al., “Unified scaling laws for routed language models,” in International Conference on Machine Learning, pp. 4057–4086, PMLR, 2022.
[156] A. Aghajanyan, L. Yu, A. Conneau, W.-N. Hsu, K. Hambardzumyan, S. Zhang, S. Roller, N. Goyal, O. Levy, and L. Zettlemoyer, “Scaling laws for generative mixed-modal language models,” in International Conference on Machine Learning, pp. 265–279, PMLR, 2023.
[157] B. Isik, N. Ponomareva, H. Hazimeh, D. Paparas, S. Vassilvitskii, and S. Koyejo, “Scaling laws for downstream task performance of large language models,” in ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024.
[158] H. Que, J. Liu, G. Zhang, C. Zhang, X. Qu, Y. Ma, F. Duan, Z. Bai, J. Wang, Y. Zhang, et al., “D-cpt law: Domain-specific continual pre-training scaling law for large language models,” Advances in Neural Information Processing Systems, vol. 37, pp. 90318–90354, 2024.
[159] K. Kumar, T. Ashraf, O. Thawakar, R. M. Anwer, H. Cholakkal, M. Shah, M.-H. Yang, P. H. Torr, F. S. Khan, and S. Khan, “Llm post-training: A deep dive into reasoning large language models,” arXiv preprint arXiv:2502.21321, 2025.
[160] S. Yue, W. Chen, S. Wang, B. Li, C. Shen, S. Liu, Y. Zhou, Y. Xiao, S. Yun, X. Huang, et al., “Disc-lawllm: Fine-tuning large language models for intelligent legal services,” arXiv preprint arXiv:2309.11325, 2023.
[161] Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia, “Longlora: Efficient fine-tuning of long-context large language models,” in The Twelfth International Conference on Learning Representations, 2024.
[162] T. Q. Luong, X. Zhang, Z. Jie, P. Sun, X. Jin, and H. Li, “Reft: Reasoning with reinforced fine-tuning,” arXiv preprint arXiv:2401.08967, 2024.
[163] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen, et al., “Parameter-efficient fine-tuning of large-scale pre-trained language models,” Nature Machine Intelligence, vol. 5, no. 3, pp. 220–235, 2023.
[164] S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora, “Fine-tuning language models with just forward passes,” Advances in Neural Information Processing Systems, vol. 36, pp. 53038–53075, 2023.
[165] L. Xu, H. Xie, S.-Z. J. Qin, X. Tao, and F. L. Wang, “Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment,” arXiv preprint arXiv:2312.12148, 2023.
[166] H. Yuan, Z. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang, “Rrhf: Rank responses to align language models with human feedback,” Advances in Neural Information Processing Systems, vol. 36, pp. 10935–10950, 2023.
[167] C. Y. Liu, L. Zeng, J. Liu, R. Yan, J. He, C. Wang, S. Yan, Y. Liu, and Y. Zhou, “Skywork-reward: Bag of tricks for reward modeling in llms,” arXiv preprint arXiv:2410.18451, 2024.
[168] H. Dong, W. Xiong, B. Pang, H. Wang, H. Zhao, Y. Zhou, N. Jiang, D. Sahoo, C. Xiong, and T. Zhang, “Rlhf workflow: From reward modeling to online rlhf,” Transactions on Machine Learning Research, 2024.
[169] G. Cui, L. Yuan, N. Ding, G. Yao, W. Zhu, Y. Ni, G. Xie, Z. Liu, and M. Sun, “Ultrafeedback: Boosting language models with high-quality feedback,” 2023.
[170] Y. Du, O. Watkins, Z. Wang, C. Colas, T. Darrell, P. Abbeel, A. Gupta, and J. Andreas, “Guiding pretraining in reinforcement learning with large language models,” in International Conference on Machine Learning, pp. 8657–8677, PMLR, 2023.
[171] J. Song, Z. Zhou, J. Liu, C. Fang, Z. Shu, and L. Ma, “Self-refined large language model as automated reward function designer for deep reinforcement learning in robotics,” arXiv preprint arXiv:2309.06687, 2023.
[172] Z. Xu, F. Jiang, L. Niu, Y. Deng, R. Poovendran, Y. Choi, and B. Y. Lin, “Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing,” arXiv preprint arXiv:2406.08464, 2024.
[173] J. Jiang, D. He, and J. Allan, “Searching, browsing, and clicking in a search session: changes in user behavior by task and over time,” in Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 607–616, 2014.
[174] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 11809–11822, 2023.
[175] Y. Tian, B. Peng, L. Song, L. Jin, D. Yu, L. Han, H. Mi, and D. Yu, “Toward self-improvement of llms via imagination, searching, and criticizing,” Advances in Neural Information Processing Systems, vol. 37, pp. 52723–52748, 2024.
[176] T. Arai, E. Pagello, L. E. Parker, et al., “Advances in multi-robot systems,” IEEE Transactions on Robotics and Automation, vol. 18, no. 5, pp. 655–661, 2002.
[177] A. Gautam and S. Mohan, “A review of research in multi-robot systems,” in 2012 IEEE 7th International Conference on Industrial and Information Systems (ICIIS), pp. 1–5, IEEE, 2012.
[178] A. Dorri, S. S. Kanhere, and R. Jurdak, “Multi-agent systems: A survey,” IEEE Access, vol. 6, pp. 28573–28593, 2018.
[179] Á. Madridano, A. Al-Kaff, D. Martín, and A. De La Escalera, “Trajectory planning for multi-robot systems: Methods and applications,” Expert Systems with Applications, vol. 173, p. 114660, 2021.
[180] Z. Zhou, J. Liu, and J. Yu, “A survey of underwater multi-robot systems,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 1, pp. 1–18, 2021.
[181] C. Ju, J. Kim, J. Seol, and H. I. Son, “A review on multirobot systems in agriculture,” Computers and Electronics in Agriculture, vol. 202, p. 107336, 2022.
[182] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman, “Gpqa: A graduate-level google-proof q&a benchmark,” in First Conference on Language Modeling, 2024.
[183] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solving with the math dataset,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021.
[184] Z. R. Sprague, X. Ye, K. Bostrom, S. Chaudhuri, and G. Durrett, “Musr: Testing the limits of chain-of-thought with multistep soft reasoning,” in The Twelfth International Conference on Learning Representations, 2024.
[185] L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, et al., “Ethical and social risks of harm from language models,” arXiv preprint arXiv:2112.04359, 2021.
[186] L. Yan, L. Sha, L. Zhao, Y. Li, R. Martinez-Maldonado, G. Chen, X. Li, Y. Jin, and D. Gašević, “Practical and ethical challenges of large language models in education: A systematic scoping review,” British Journal of Educational Technology, vol. 55, no. 1, pp. 90–112, 2024.
[187] R. Bommasani, S. Kapoor, K. Klyman, S. Longpre, A. Ramaswami, D. Zhang, M. Schaake, D. E. Ho, A. Narayanan, and P. Liang, “Considerations for governing open foundation models,” Science, vol. 386, no. 6718, pp. 151–153, 2024.
[188] M. Vasic and A. Billard, “Safety issues in human-robot interactions,” in 2013 IEEE International Conference on Robotics and Automation, pp. 197–204, IEEE, 2013.
[189] P. A. Lasota, T. Fong, J. A. Shah, et al., “A survey of methods for safe human-robot interaction,” Foundations and Trends® in Robotics, vol. 5, no. 4, pp. 261–349, 2017.
[190] I. Kumagai, M. Morisawa, T. Sakaguchi, S. Nakaoka, K. Kaneko, H. Kaminaga, S. Kajita, M. Benallegue, R. Cisneros, and F. Kanehiro, “Toward industrialization of humanoid robots: Autonomous plasterboard installation to improve safety and efficiency,” IEEE Robotics & Automation Magazine, vol. 26, no. 4, pp. 20–29, 2019.
[191] G. Goh, N. Cammarata, C. Voss, S. Carter, M. Petrov, L. Schubert, A. Radford, and C. Olah, “Multimodal neurons in artificial neural networks,” Distill, vol. 6, no. 3, p. e30, 2021.
