Behavior Foundation Model Research Paper
arXiv:2506.20487v1 [cs.RO] 25 Jun 2025
Abstract: Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for complex tasks, their reliance on labor-intensive and costly retraining for new scenarios limits real-world applicability. To address these limitations, behavior(al) foundation models (BFMs) have emerged as a new paradigm that leverages large-scale pretraining to learn reusable primitive skills and behavioral priors, enabling zero-shot or rapid adaptation to a wide range of downstream tasks. In this paper, we present a comprehensive overview of BFMs for humanoid WBC, tracing their development across diverse pre-training pipelines. Furthermore, we [...] more subsequent research, which is available at https://github.com/yuanmingqi/awesome-bfm-papers.
[...]tion model, pre-training, preview.
1 INTRODUCTION
Consequently, the development of robust and generalizable whole-body control (WBC) systems has become an urgent priority. In the following sections, we begin by reviewing the previous humanoid WBC methods comprehensively, from traditional model-based to task-specific and learning-based controllers, before introducing the transformative approach: the behavior foundation model. This evolutionary trajectory not only reflects the field's advancement toward enhanced intelligence and generalizability but also paves the way for the next-generation humanoid robot control systems.
[Figure: recoverable labels are Behavior, Tasks, Reaching, Adaptation, Interaction, Training, Command Following, Text to Motion, Trajectory, Scene Interaction.]
[Figure: evolution of humanoid WBC methods along a "Task Solving Capacity" axis.
Model-based Controller: MPC, WBOSC, etc.; physics-based models; labor-intensive configuration; low robustness.
Learning-based and Task-specific Controller: DeepMimic, AMP, etc.; reinforcement learning; flexible task design; poor cross-task generalization.
Behavior Foundation Model: Motivo, Hover, MaskedMimic, etc.; large-scale data pre-training (RL, IL, etc.); broad behavior coverage; fast adaptation capability.]
in multi-task scenarios such as balance, walking, and manipulation. Such frameworks have been widely applied in humanoids like Atlas [9], HRP-2 [10], and DLR's torque-controlled robots [11], achieving robust locomotion and multi-contact interactions.
Despite their success, traditional WBC systems face critical limitations: (i) task design, gain tuning, and heuristic adjustments for complex behaviors (e.g., uneven terrain or dynamic transitions) remain labor-intensive and brittle, (ii) real-time MPC struggles with high-dimensional systems, often requiring simplifications that sacrifice dynamic fidelity [12], (iii) the lack of flexibility to execute highly dynamic skills (e.g., backflips or rapid contact switches) or adapt to unforeseen disturbances [13], and (iv) weak robustness, as even a small push may topple a robot with a model-based walking controller. These challenges are especially significant in humanoids, where tasks often require rich coordination, contact reasoning, and situational awareness [14, 15]. As a result, recent research increasingly shifts toward data-driven approaches, aiming to learn motor skills, coordination policies, and behavioral priors from demonstrations or reinforcement learning [16, 17].
1.2 Learning-based and Task-specific Controller
Learning-based methods, particularly reinforcement learning (RL) and imitation learning (IL), have emerged as promising alternatives to traditional WBC methods, enabling robots to acquire complex skills through environmental interaction or human demonstrations [18–23]. For example, [16] presents a framework entitled DeepMimic that combines deep RL with motion capture data to enable physically simulated characters to learn dynamic skills while maintaining natural motion quality. [24] further extends DeepMimic by introducing adversarial motion priors (AMP) to enable more stylized and diverse character control while maintaining physical realism. In contrast, [25] proposes HoST, an RL-based framework that learns humanoid standing-up control from scratch and achieves adaptive and stable standing-up motions across diverse laboratory and outdoor environments, highlighting the outstanding learning capability and robustness of RL in specific tasks. For IL-based methods, [26] proposes TRILL that combines virtual reality (VR) teleoperation with WBC for humanoid loco-manipulation, demonstrating an 85% success rate in real-world bimanual tasks [27]. In addition, [28] develops an expressive WBC framework that decouples upper-body IL (for stylistic motions) from robust lower-body locomotion, enabling humanoid robots to dynamically adapt their gait while performing diverse movements. This approach overcomes the instability of full-body imitation caused by morphological mismatches between humans and robots.
While learning-based methods have demonstrated remarkable success in diverse humanoid WBC tasks, they face fundamental challenges that limit their broader applicability. RL-based approaches suffer from sample inefficiency, often requiring millions of environment interactions to converge, while remaining highly sensitive to reward function design: poorly shaped rewards can lead to unintended behaviors or local optima [29]. Furthermore, the simulation-to-reality (Sim2Real) gap exacerbates these limitations, as policies trained in simulation frequently degrade when confronted with real-world dynamics, sensor noise, and hardware imperfections [30–33]. In contrast, IL-based methods are more sample-efficient yet make data collection a significant challenge, and learned policies often inherit the biases and limitations of the demonstrator [34–38]. Moreover, both paradigms struggle with generalization: learned policies typically excel only at narrow tasks and fail to adapt to new scenarios without extensive retraining. These challenges collectively underscore the need for approaches that combine the flexibility of learning with structured priors for robustness and generalizability, a gap that behavior foundation models aim to bridge.
1.3 Behavior Foundation Model
The term "behavior(al) foundation model (BFM)" is first introduced in [39], which proposes a successor measure-based framework for training generalist policies capable of instantly imitating diverse behaviors from minimal demonstrations. It demonstrates that BFMs pretrained on unsupervised interaction data show promise for eliminating task-specific RL fine-tuning by solving imitation tasks through forward-backward state feature matching, while simultaneously supporting multiple IL paradigms via a unified
PREPRINT. UNDER REVIEW. 3
[Figure: taxonomy of behavior foundation models (partially recovered).
Forward-Backward Representation Learning: FB-IL [39], FB-AWARE [40], FB-CPR [41].
Adaptation, Fine-tuning Techniques: Belief-FB [44], Rotation-FB [44], Task Tokens [43], ReLA [42], LoLA [42].
Adaptation, Towards Hierarchical Control: UniHSI [93], TokenHSI [94], UniPhys [96], CLoSD [95], LangWBC [91], LeVERB [92].]
approaches might be proposed before the emergence of the concept of BFMs, yet we still discuss them as long as they adhere to the properties of BFMs, or their physical meaning is analogous to BFMs.
3.1 Pre-training
Pre-training of BFMs seeks to learn reusable primitive skills and behavioral priors from large-scale data sources, creating a foundation for efficient downstream adaptation. Current approaches can be broadly categorized into three types (as depicted in Figure 3): goal-conditioned learning, intrinsic reward-driven learning, and the forward-backward framework.
3.1.1 Goal-conditioned Learning
[Figure 4 labels: Diverse Goals (Location, Text such as "Keep running forward", Pose), Encoder, Goal Embedding, Agent, State, Action, Reward, Environment.]
Fig. 4. Workflow of goal-conditioned learning, which enables versatile skill acquisition by training policies to achieve diverse target states specified through goal embeddings.
As shown in Figure 4, goal-conditioned learning is a framework in RL where an agent's behavior is conditioned on a specific goal or objective, typically provided as input. Unlike traditional RL, where the agent learns from raw state-action pairs without explicit task-specific guidance, goal-conditioned learning integrates the goal into the agent's policy, enabling it to adapt its actions toward achieving that specific goal. The goal can be specified in various forms, such as a target state, an objective function, or an external task description [59, 60]. This approach allows the agent to generalize across different goals by learning a shared policy that can effectively handle diverse objectives. The key advantage of goal-conditioned learning lies in its ability to learn a more flexible and transferable policy that can be applied to a wide range of tasks, as it directly incorporates the task's goal during training, rather than requiring retraining for each specific task. This makes it particularly useful in environments where the agent needs to solve multiple tasks or interact with changing environments.
Skill learning from motion tracking. Among the diverse approaches to goal-conditioned learning, tracking-based learning represents a specialized form where the target behavior is explicitly defined by dense reference supervision or guidance, typically derived from motion capture data or expert demonstrations. At each time step, the agent is often trained to track the joint angles or kinematic pose that the reference motion prescribes for the next time step [16]. The primary motivation behind tracking-based learning is that learning to track a single pose is more achievable and general than directly imitating a whole motion, especially a complex motion.
For example, [61] trains an agent to imitate a large amount of football motion capture data via a DeepMimic-like [16] approach, aiming to achieve complete behavior coverage for the football game. The agent is then used to sample a large number of state-action pairs to train a neural probabilistic motor primitive (NPMP) model [62] and derive a low-level latent-conditioned controller. Finally, the controller is applied for further drill learning by conducting RL with a drill-specific (e.g., follow, dribble, shoot, and kick-to-target) reward function. Here, the learned low-level controller can be viewed as a BFM as it learns realistic human-like movement based on motion capture data and can be rapidly adapted to diverse higher-level drill learning.
Similarly, [63] introduces adversarial skill embeddings (ASE), a framework that learns a reusable latent space of
motor skills by combining adversarial IL with unsupervised RL. Trained on unstructured motion data, ASE produces a latent-conditioned low-level controller capable of generating diverse and physically plausible behaviors, serving as a general-purpose motor prior for downstream tasks. Building on ASE, [64] proposes conditional adversarial latent models (CALM), which incorporate a conditional discriminator to enable fine-grained control over generated motions via latent manipulation. [65] further extends this line with CASE, introducing skill-conditioned IL with training techniques such as focal skill sampling and skeletal residual forces to enhance agility and motion diversity.
While the methods above achieve efficient skill acquisition from large behavior datasets, [55] proposes HugWBC that explores learning versatile locomotion skills without relying on pre-collected motion data. The framework automatically generates adaptive behaviors through a structured RL process, where a general command space dynamically produces feasible velocity, gait, and posture targets during training. By reformulating WBC as a self-supervised command-tracking problem, the work establishes a new direction for developing general-purpose humanoid controllers that learn robust skills through environmental interaction rather than data imitation.
Moving beyond body-level skill learning, ModSkill [66] introduces a modular framework that decouples full-body motion into part-specific skills for individual body parts. This modularization allows for efficient and scalable learning, as each body part is controlled independently by a low-level controller driven by part-specific skill embeddings. ModSkill's ability to focus on body-part-level skills makes it a powerful system for controlling complex motions and adapting learned behaviors across different tasks. By utilizing a skill modularization attention layer, ModSkill enhances the generalization of motor skills across various tasks like reaching or striking, further improving task-specific adaptation.
From primitive skills to high-level goal execution. The success of BFMs in learning diverse primitive skills has propelled the development of more advanced BFMs capable of interpreting and executing high-level goals, including language instructions and multi-task objectives. A notable example is MoConVQ [67], which introduces a unified motion control framework based on discrete latent codes learned via a vector quantized variational autoencoder (VQ-VAE). By offering a compact and modular representation, the model supports a wide range of downstream tasks, including motion tracking, interactive control, and text-to-motion generation. MoConVQ also integrates with large language models (LLMs) and enables the simulated agents to be directed via in-context language prompts, thereby bridging symbolic reasoning and physical control.
Meanwhile, MaskedMimic [68] addresses physics-based character control as a general motion inpainting problem, producing full-body motions from partial descriptions like masked keyframes, objects, or text instructions. MaskedMimic involves a two-phase training process: first, a fully-constrained motion tracking controller learns to imitate diverse reference motions, then a partially-constrained VAE-based policy distills this knowledge through masked goal conditioning. As a result, MaskedMimic can dynamically adapt to complex scenes and support applications ranging from VR control to complex human-object interaction (HOI). In addition, InterMimic [69] focuses on the HOI scenario and designs a two-stage teacher-student framework that distills imperfect motion capture interaction data into robust and physics-based controllers. Teacher policies are trained on subsets of noisy data and refined through simulation, then distilled into a student policy with RL-based fine-tuning. This curriculum strategy enables generalization across diverse interactions with high physical fidelity.
For real-world robotic applications, HOVER [56] introduces a multi-mode policy distillation framework that allows humanoid robots to switch seamlessly between tasks like locomotion, manipulation, and navigation using a single unified policy distilled from an oracle. This eliminates the need for task-specific controllers, demonstrating general-purpose control in real-world environments, akin to the versatility seen in MaskedMimic for virtual characters.
All the above methods follow the idea of BFM, which is revealed from two aspects: (i) their ability to learn a broad behavior coverage from diverse data sources, and (ii) their fast adaptation ability to downstream tasks. These models are trained on large-scale datasets and can generalize across a wide range of motor skills, such as locomotion, HOI, and task-specific behaviors, without being limited to a single task. For instance, InterMimic is capable of handling various HOI tasks, while MoConVQ adapts to different tasks, such as goal-reaching and text-conditioned motion generation. Additionally, these models exhibit fast adaptation to new tasks, with minimal retraining, demonstrating their capacity to apply learned behaviors to new and unseen scenarios. Thus, the broad behavior coverage combined with the adaptation ability collectively characterizes these methods as BFMs.
3.1.2 Intrinsic Reward-driven Learning
In tracking-based learning, the agent is consistently provided with an explicit objective (e.g., joint angles or velocities) and trained via a well-specified reward function to achieve targeted skill acquisition. In contrast, intrinsic reward-driven learning presents a distinct approach, where the agent is motivated to explore the environment without relying on explicit task-specific rewards. Instead, the agent is guided by intrinsic rewards, which are self-generated signals that encourage exploration, skill acquisition, or novelty detection. Extensive strategies for intrinsic reward-driven learning have been developed, including curiosity-driven exploration [70–73], skill discovery [74–77], and maximizing data coverage [78–81], each encouraging the agent to explore different aspects of the environment.
For example, [70] introduces an intrinsic curiosity module (ICM) that encourages the agent to explore unfamiliar states by providing an intrinsic reward based on the prediction error between the actual and predicted next state. The intrinsic reward is proportional to the discrepancy between the predicted state and the actual state, motivating the agent to visit parts of the environment that it cannot yet predict, thereby driving exploration and learning. ICM has been shown to significantly improve the agent's ability to explore complex environments with sparse or no external rewards. In contrast, DIAYN [75] uses latent variable discovery to guide the
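To make the tracking-based, goal-conditioned setup of Section 3.1.1 concrete, the following is a minimal sketch rather than the formulation of any cited method. It assumes that the goal is the reference joint angles of the next time step, that the policy input is simply the concatenation of state and goal, and that the reward is an exponentiated squared tracking error in the spirit of DeepMimic [16]. The function names and the `scale` constant are illustrative assumptions.

```python
import math


def goal_conditioned_observation(state, goal):
    """Concatenate state and goal so a single shared policy can serve many goals."""
    return list(state) + list(goal)


def tracking_reward(joint_angles, ref_joint_angles, scale=2.0):
    """Exponentiated squared tracking error; equals 1.0 for a perfect match.

    `scale` is an assumed sensitivity constant, not a value from the survey.
    """
    squared_error = sum(
        (q - q_ref) ** 2 for q, q_ref in zip(joint_angles, ref_joint_angles)
    )
    return math.exp(-scale * squared_error)


# Toy reference motion: one list of joint angles (radians) per time step.
reference_motion = [[0.0, 0.1], [0.1, 0.2], [0.2, 0.3]]

t = 0
state = reference_motion[t]      # current joint angles
goal = reference_motion[t + 1]   # goal = reference pose of the next time step
observation = goal_conditioned_observation(state, goal)

reward_perfect = tracking_reward(goal, goal)    # agent reached the reference pose
reward_off = tracking_reward([0.5, 0.5], goal)  # agent drifted away from it
print(observation, reward_perfect, reward_off)
```

Because the goal changes at every time step while the policy stays the same, the per-step problem reduces to reaching a single pose, which is the motivation stated above for tracking a pose instead of imitating a whole motion at once.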
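The curiosity signal described for ICM [70] in Section 3.1.2 can be illustrated with a small sketch. It is a simplification under stated assumptions: the forward model here is a hypothetical hand-written linear predictor acting directly on raw states, whereas ICM learns its forward model and measures the error in a learned feature space. The names `weights` and `eta` are illustrative.

```python
def forward_model(state, action, weights):
    """Hypothetical forward model: predicts next_state[i] = state[i] + weights[i] * action."""
    return [s + w * action for s, w in zip(state, weights)]


def intrinsic_reward(state, action, next_state, weights, eta=0.5):
    """Curiosity reward: scaled squared error between predicted and actual next state."""
    predicted = forward_model(state, action, weights)
    return eta * sum((p - n) ** 2 for p, n in zip(predicted, next_state))


weights = [1.0, 0.0]   # the model believes the action only moves the first coordinate
state, action = [0.0, 0.0], 1.0

# A transition the model predicts perfectly yields no curiosity reward.
reward_familiar = intrinsic_reward(state, action, [1.0, 0.0], weights)
# A transition the model fails to predict yields a positive reward.
reward_surprising = intrinsic_reward(state, action, [1.0, 2.0], weights)
print(reward_familiar, reward_surprising)
```

In a learning loop the forward model would be trained on the visited transitions, so the reward for a region shrinks as it becomes predictable and the agent is pushed toward less familiar states.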
complex dynamics [39]. Despite the convenience of this paradigm, future work may address these fundamental [...]
Eq. (4) decouples the action-value function as two separate terms: (i) the successor measure that models the evolution of the policy in the environment, and (ii) the reward function that captures task-relevant information. This factorization suggests that learning the successor measure for $\pi$ allows for the zero-shot evaluation of $Q^{\pi}_{r}$ on any reward without further training.
Notably, [87] proposes an estimation of the successor measure as
$$M^{\pi}(X \mid s, a) \approx \int_{s' \in X} F^{\pi}(s, a)^{\top} B(s')\, \rho(ds'), \tag{5}$$
where $\rho$ is an arbitrary distribution over states, $F^{\pi} : S \times A \to \mathbb{R}^{d}$ is the forward embedding, and $B : S \to \mathbb{R}^{d}$ is the backward embedding. Denoting $z = \mathbb{E}_{s \sim \rho}[B(s) r(s)]$, the action-value function can be rewritten as
$$Q^{\pi}_{r}(s, a) = F^{\pi}(s, a)^{\top} z. \tag{6}$$
To learn a family of policies, [87] suggests that both the forward embedding $F^{\pi}$ and the policy $\pi$ can be parameterized by the same task encoding vector $z$, such that
$$M^{\pi_z}(X \mid s, a) \approx \int_{s' \in X} F^{\pi}(s, a, z)^{\top} B(s')\, \rho(ds'), \tag{7}$$
where $z \in \mathbb{R}^{d}$, and the policy $\pi_z$ is defined as
$$\pi_z(s) = \arg\max_{a} F^{\pi}(s, a, z)^{\top} z. \tag{8}$$
Then, the forward-backward embedding network is trained to minimize the temporal difference (TD) loss derived as the Bellman residual:
$$\mathcal{L}_{\mathrm{FB}} = \mathbb{E}_{\substack{z \sim \nu,\, (s, a, s') \sim \rho, \\ s^{+} \sim \rho,\, a' \sim \pi_z(s')}} \Big[ \big( F(s, a, z)^{\top} B(s^{+}) - \gamma \cdot \mathrm{sg}(F)(s', a', z)^{\top} \mathrm{sg}(B)(s^{+}) \big)^{2} \Big] - 2\, \mathbb{E}_{z \sim \nu,\, (s, a, s') \sim \rho} \Big[ F(s, a, z)^{\top} B(s') \Big], \tag{9}$$
where $s^{+}$ denotes a future state and $\mathrm{sg}(\cdot)$ denotes the stop-gradient operation. For continuous action spaces, the policies can be obtained by training an actor network to minimize
$$\mathcal{L}_{\mathrm{actor}} = -\mathbb{E}_{z \sim \nu,\, s \sim \rho,\, a \sim \pi_z(s)} \Big[ F(s, a, z)^{\top} z \Big]. \tag{10}$$
Once the FB model is trained, it can be utilized to solve diverse tasks in a zero-shot manner without performing additional task-specific learning, planning, or fine-tuning. For example, given a task reward function $r$, the policy can be inferred by computing
$$z_r = \frac{1}{n} \sum_{i=1}^{n} r(s_i) B(s_i), \tag{11}$$
where $\{s_i\}_{i=1}^{n}$ is a set of sample states. Similarly, for a goal-reaching problem, it suffices to compute the encoder vector by $z_s = B(s)$, $s \in S$.
As introduced in Section 1.3, [39] first introduced the term "BFM" and proposed FB-IL for fast IL based on BFMs, which supports multiple IL principles, such as behavioral cloning, feature matching, and goal-based reduction, without needing separate RL routines for each new task. [40] further enhances the FB framework by incorporating auto-regressive features for more precise task encoding and better task representations in BFMs. While the standard FB method uses a linear task projection that can blur rewards and reduce spatial precision, auto-regressive features improve expressivity and performance, especially for tasks requiring spatial accuracy or generalization. Additionally, [40] introduces advantage-weighted regression (AWR) to address challenges with offline learning from complex datasets. The modified FB approach, FB-AWARE, combines auto-regressive features with advantage weighting, which performs well across new environments and even matches the performance of standard offline RL agents in benchmarks like D4RL.
The FB framework provides a general and flexible approach for training BFMs by learning successor measure representations and applying pre-trained policies to new tasks. However, it suffers from several limitations: (i) when the latent dimension $d$ is finite, it relies on a low-rank dynamics assumption, leading to limited inductive bias for policy selection; (ii) poor coverage in the training dataset causes offline learning to fail in reliably optimizing policies, often collapsing to a few suboptimal behaviors with weak performance on downstream tasks. These limitations greatly hinder the application of the FB framework for humanoid robots. To address these limitations, [41] proposes FB with conditional policy regularization (FB-CPR) and introduces Motivo, the first BFM for humanoid WBC in the true sense that solves diverse tasks in a zero-shot manner, including motion tracking, goal reaching, and reward optimization. Specifically, FB-CPR learns the FB representations with a discriminator-based regularization scheme, whose loss function is defined as
$$\mathcal{L}_{\mathrm{FB\text{-}CPR}} = -\mathbb{E}_{z \sim \nu,\, s \sim \mathcal{D}_{\mathrm{online}},\, a \sim \pi_z(\cdot \mid s)} \Big[ F(s, a, z)^{\top} z \Big] + \alpha\, \mathrm{KL}(p_{\pi}, p_{E}), \tag{12}$$
where $\mathcal{D}_{\mathrm{online}}$ is the associated replay buffer of unsupervised transitions, $p_{\pi}(s, z)$ is the joint distribution of $(s, z)$ induced by FB, and $p_{E}$ is the joint distribution of the dataset. However, it is intractable to optimize the divergence term directly via an RL procedure. To tackle the problem, FB-CPR interprets the divergence as an expected return under the policies and defines a divergence-based reward $r_{\mathrm{div}}$ as
$$\mathrm{KL}(p_{\pi}, p_{E}) = \mathbb{E}_{z \sim \nu,\, s \sim \rho^{\pi_z}} \left[ \log \frac{p_{\pi}(s, z)}{p_{E}(s, z)} \right] = -\mathbb{E}_{z \sim \nu}\, \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^{t} \log \frac{p_{E}(s_{t+1}, z)}{p_{\pi}(s_{t+1}, z)} \,\middle|\, s_0 \sim \mu, \pi_z \right]. \tag{13}$$
Then, a discriminator network $D : S \times A \to [0, 1]$ is trained to estimate $r_{\mathrm{div}}$:
$$r_{\mathrm{div}} = \log \frac{p_{E}(s, z)}{p_{\pi}(s, z)} \approx \log \frac{D}{1 - D}. \tag{14}$$
The estimate holds because the optimal discriminator satisfies $D^{*} = \frac{p_{E}}{p_{E} + p_{\pi}}$ [88]. Finally, the divergence term can be estimated by training another critic network via off-policy TD learning, and the actor loss for FB-CPR is rewritten as
$$\mathcal{L}_{\mathrm{FB\text{-}CPR}} = -\mathbb{E}_{z \sim \nu,\, s \sim \mathcal{D}_{\mathrm{online}},\, a \sim \pi_z(\cdot \mid s)} \Big[ F(s, a, z)^{\top} z + \alpha\, Q_{\mathrm{div}}(s, a, z) \Big]. \tag{15}$$
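As a hedged illustration of the zero-shot inference in Eqs. (6), (8), and (11), the sketch below shows that a new task enters only through the vector z: a reward function is turned into z_r by averaging r(s_i) B(s_i), a goal state is turned into z_s = B(s), and the action is chosen greedily with respect to F(s, a, z)^T z. The embeddings are hypothetical hand-written stand-ins on a one-dimensional state space, not learned networks, and the action set {-1, +1} and step size 0.25 are assumptions made for the example.

```python
def backward_embedding(state):
    """Hypothetical stand-in for the learned B(s), with d = 2."""
    return [state, 1.0 - state]


def forward_embedding(state, action, z):
    """Hypothetical stand-in for the learned F(s, a, z).

    It embeds the state reached after one step of size 0.25; a learned F
    would also depend on z, which this toy version ignores.
    """
    next_state = min(1.0, max(0.0, state + 0.25 * action))
    return backward_embedding(next_state)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def infer_task_vector(reward_fn, sample_states):
    """Eq. (11): z_r = (1/n) * sum_i r(s_i) * B(s_i)."""
    n = len(sample_states)
    z = [0.0, 0.0]
    for s in sample_states:
        b = backward_embedding(s)
        r = reward_fn(s)
        z = [z_k + r * b_k / n for z_k, b_k in zip(z, b)]
    return z


def greedy_action(state, z, actions=(-1, 1)):
    """Eq. (8): pick the action maximizing Q(s, a) = F(s, a, z)^T z from Eq. (6)."""
    return max(actions, key=lambda a: dot(forward_embedding(state, a, z), z))


# Reward task: be on the right half of the unit interval.
z_reward = infer_task_vector(lambda s: 1.0 if s > 0.5 else 0.0, [0.0, 0.25, 0.75, 1.0])
# Goal-reaching tasks: z is simply the backward embedding of the goal state.
z_goal_right = backward_embedding(1.0)
z_goal_left = backward_embedding(0.0)

print(z_reward, greedy_action(0.5, z_reward))
print(greedy_action(0.5, z_goal_right), greedy_action(0.5, z_goal_left))
```

No parameter is updated when the task changes; only z is recomputed, which is what makes the procedure zero-shot once F and B have been pre-trained.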
[Figure 7 panels: Motion Imitation, Pose Reaching.]
Fig. 7. Motivo [41] learns broad behavior coverage and demonstrates outstanding zero-shot adaptation capability to diverse downstream tasks, including complex motion imitation, pose reaching, and composite reward optimization. Moreover, Motivo achieves real-time motor control while ensuring motion naturalness.
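The discriminator-based reward used by FB-CPR can be sketched in a few lines. This is an illustrative computation under stated assumptions, not the authors' implementation: the discriminator output `d` is taken as a given number in [0, 1], and the clipping constant `eps` is an added numerical safeguard that does not appear in the survey. The check at the end uses the optimal discriminator D* = p_E / (p_E + p_pi) to confirm that the log-odds log(D / (1 - D)) recover the log density ratio log(p_E / p_pi).

```python
import math


def divergence_reward(d, eps=1e-6):
    """Log-odds of the discriminator output, log(d / (1 - d)).

    `eps` clips d away from 0 and 1 to keep the logarithm finite
    (an assumption for numerical safety).
    """
    d = min(max(d, eps), 1.0 - eps)
    return math.log(d / (1.0 - d))


# An undecided discriminator gives zero reward; outputs above 0.5
# (sample looks like the dataset) are rewarded, outputs below 0.5 penalized.
r_neutral = divergence_reward(0.5)
r_dataset_like = divergence_reward(0.9)
r_policy_like = divergence_reward(0.1)

# Sanity check with hypothetical density values p_E and p_pi.
p_e, p_pi = 0.3, 0.1
d_star = p_e / (p_e + p_pi)
ratio_gap = abs(divergence_reward(d_star) - math.log(p_e / p_pi))
print(r_neutral, r_dataset_like, r_policy_like, ratio_gap)
```

A critic trained on this reward then plays the role of the Q_div term in the regularized actor loss.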
Note that FB-CPR is not a strictly unsupervised method, as it leverages unlabeled demonstration data to assist motion prior learning. By aligning unsupervised RL with human-like behavioral priors from unlabeled data, FB-CPR enhances policy diversity and dataset coverage, enabling the agent to learn a rich latent space of behaviors (e.g., walking, jumping, handstands) and achieve robust zero-shot performance across diverse tasks. Experimental results demonstrate that Motivo achieves an 83% success rate in motion tracking tasks and reaches 61% of the top-line performance in reward optimization tasks, surpassing DIFFUSER in computational efficiency by requiring only 12 seconds per 300-step episode. Additionally, it outperforms ASE and CALM in motion diversity, achieving a score of 4.70 (±0.66), reflecting its ability to capture a broader range of behaviors.
3.2 Adaptation
Equipped with the BFMs derived through the aforementioned pre-training frameworks, we further introduce recent advancements in the adaptation techniques of BFMs. These works can be broadly categorized into two types: fine-tuning and towards hierarchical control.
3.2.1 Fine-tuning
Fine-tuning of BFMs seeks to bridge the gap between general-purpose motion priors and task-specific requirements. While pre-trained BFMs capture broad motion distributions, they often lack precision for specialized tasks or novel environments. For example, [42] introduces fast adaptation techniques like residual latent adaptation (ReLA) and lookahead latent adaptation (LoLA), which allow BFMs to adapt rapidly using minimal online interactions after pre-training. These methods improve task performance by up to 40%, demonstrating the effectiveness of fast adaptation strategies that enable efficient task switching without requiring extensive retraining. In contrast, [43] introduces "Task Tokens", which enhance goal-conditioned BFMs by generating task-specific tokens through a task encoder, enabling the BFM to perform complex tasks like motion tracking and goal-reaching with high success rates. Their method shows up to 99.75% success in tasks such as the long jump. Additionally, [44] addresses the limitation of FB representations by introducing belief-FB, which uses a transformer encoder to infer environmental dynamics, improving adaptation to unseen changes. Their approach achieves up to 2× improvement in performance under dynamic variations, demonstrating enhanced zero-shot capabilities. Future work could investigate additional post-training techniques, such as test-time scaling [89] or RL from human feedback [90], to further improve the adaptability and efficiency of BFMs and ensure alignment with human preferences in real-world applications.
3.2.2 Towards Hierarchical Control
While fine-tuning techniques enhance the performance of BFMs for specific tasks through minimal modifications at test time, several pioneering works have attempted to establish a hierarchical control architecture based on BFMs, which decouples high-level planning from low-level motion execution to achieve more scalable and flexible control [91, 92]. For example, UniHSI [93] introduces a unified framework for human-scene interaction by using
language commands to guide a chain-of-contacts, which represents the sequence of human-object contact pairs. The system translates language inputs into structured task plans, which are then executed by a unified controller based on the AMP architecture. This framework achieves semantic alignment between language commands and physical motions, supporting diverse interactions with single or multiple objects. TokenHSI [94] further extends this line by proposing a transformer-based unified policy that tokenizes human proprioception and task states. By separating shared motor knowledge (proprioception token) from task-specific parameters (task tokens), it enables seamless multi-skill unification and flexible adaptation to novel tasks, such as skill composition (e.g., carrying while sitting), object or terrain shape variation, and long-horizon task completion.
In contrast, CLoSD [95] introduces a text-driven RL controller that combines motion diffusion models with physics-based simulations for robust multi-task human character control. By utilizing a real-time diffusion planner and a motion tracking controller in a closed-loop feedback system, CLoSD can handle complex tasks like goal-reaching, striking, and human-object interactions, all controlled through text prompts and target locations. Similarly, UniPhys [96] introduces a diffusion-based behavior cloning framework unifying planning and control with diffusion forcing to handle prediction errors, enabling flexible control via text, velocity, and goal guidance for applications like dynamic obstacle avoidance and long-horizon planning. These applications demonstrate that BFMs serve as a pivotal bridge between high-level semantic instructions and low-level physical execution, leveraging pre-trained behavioral priors to enable zero-shot adaptation, multi-task generalization, and physics-aware motion synthesis. By integrating various control paradigms (e.g., transformer-based tokenization and closed-loop diffusion planning), these works highlight BFMs' potential to democratize humanoid control across complex, real-world scenarios, from interactive robotics to dynamic environment adaptation.
4 APPLICATIONS AND LIMITATIONS
BFMs are foreseen to significantly enhance humanoid robotics by providing a universal pre-trained controller capable of generalizing across diverse tasks. In this section, we explore the potential applications of BFMs at a high level across diverse industries, such as healthcare robotics and gaming. Furthermore, we identify the key limitations of current BFMs, including the Sim2Real gap, data bottleneck, and embodiment generalization. Our analysis is inspired by the current development and applications of other foundation models, such as LLMs [97] and large vision models [98].
4.1 Applications
4.1.1 General Accelerator for Humanoid Robotics
BFMs will act as a transformative general accelerator for humanoid robotics, speeding up the development and deployment of advanced WBC systems. Unlike previous pipelines that require resource-intensive and task-specific training, BFMs eliminate training from scratch by pre-training on vast and diverse behavior datasets, embedding a rich behavioral prior that substantially accelerates downstream adaptation [42]. Moreover, advanced BFMs with zero-shot adaptation capabilities, such as Motivo [41], can directly map high-level task specifications (e.g., goal states and reward functions) to low-level control actions, bypassing traditional RL loops entirely. This capability not only solves basic control tasks efficiently but also facilitates rapid prototyping, allowing developers to evaluate robot behaviors in both simulation and real-world environments within minutes, dramatically shortening the development cycle.
4.1.2 Virtual Agents and Gaming
Generative AI has significantly revolutionized digital content creation, particularly in art, animation, and game design [99–105]. Among these, controlling dynamic and interactive behaviors for non-player characters (NPCs) in games remains a significant challenge. Traditional rule-based or scripted NPCs often exhibit limited diversity, unnatural movements, and poor adaptability to player actions [106–108]. BFMs offer a groundbreaking solution by enabling lifelike, context-aware NPC behaviors without extensive manual scripting. Pre-trained on diverse human behavior datasets, BFMs generate adaptive actions, such as tactical combat, social engagement, or exploration, seamlessly responding to dynamic player inputs. By integrating BFMs with LLMs, NPCs can interpret complex player instructions (e.g., dialogue-driven commands in role-playing games) and foster immersive and responsive interactions. This capability positions BFMs as a pivotal technology for revolutionizing virtual agents, enabling next-generation gaming experiences with unprecedented behavioral realism and interactivity.
4.1.3 Towards Industry 5.0
While Industry 4.0 introduced smart factories with cyber-physical systems, IoT, and AI-driven automation, Industry 5.0 shifts toward human-centric, resilient, and sustainable manufacturing, emphasizing collaborative robotics, adaptive intelligence, and personalized production [109–113]. To that end, robots must move beyond rigid automation and instead exhibit generalizable, adaptive, and explainable behaviors [114–118]. BFMs are expected to empower this landscape by enabling humanoid robots to seamlessly blend pre-trained motor skills with real-time adaptability, effortlessly switching between tasks like precision welding and adaptive part handling. By integrating large multimodal models with BFMs, robots can process diverse inputs like gestures, voice commands (e.g., "handle gently"), or environmental cues, fostering intuitive human-robot collaboration in shared workspaces. BFMs also ensure resilience, autonomously recovering from disturbances like unbalanced loads in logistics, and support personalized production through zero-shot or few-shot learning.
4.1.4 Healthcare and Assistive Robotics
Global population aging presents unprecedented challenges for healthcare systems [119–122], increasing demand for assistive technologies that support independent living and rehabilitation. A wide variety of robots have been developed for robot-assisted dressing, rehabilitation therapy, medical treatment, and caregiving for the elderly and
[Figure 8: diagram centered on “Behavior Foundation Model”. Recoverable entries: General Accelerator for Humanoid Robotics (universal motor controller, rapid task adaptation); Virtual Agents and Gaming (lifelike NPCs, immersive and responsive interactions); Healthcare and Assistive Robotics (adaptive rehabilitation support, personalized care protocols); Multimodal BFMs (exteroceptive signals such as vision, acoustics, and tactile feedback); High-level ML System (LLMs-based planning and BFMs-based control); Scaling Law (model architecture, parameter size, data scale).]
Fig. 8. An overview of the applications, limitations, research opportunities, and potential risks of BFMs.
children [123–131]. Humanoid robots are ideally suited for these tasks due to their anthropomorphic design, which lets them navigate human-centric environments and perform precise, natural, and intuitive interactions. BFMs offer a promising solution by enabling robots to adapt to diverse user needs and unstructured environments. For instance, BFMs can empower assistive robots to perform tasks like mobility support (e.g., fall prevention, gait assistance) or daily tasks (e.g., object retrieval, meal preparation) with minimal user-specific tuning. In rehabilitation, BFMs trained on clinician-guided demonstrations can personalize therapy protocols by dynamically adjusting task difficulty or providing real-time feedback based on patient progress.

4.2 Limitations
4.2.1 Sim2Real Gap
The Sim2Real gap is a persistent challenge in robotics, representing the performance discrepancy between policies trained in simulators and their real-world deployment [30–33]. Traditional model-based controllers address this through explicit physics modeling and robust optimization, while data-driven approaches employ domain randomization and system identification [132–135]. Recent advancements like ASAP [136] mitigate dynamics mismatch via residual action learning yet face policy-specific limitations, as their residuals are trained on trajectories from a single pre-trained policy, restricting generalization. BFMs exacerbate this challenge by encoding a vast spectrum of behaviors, from locomotion to multi-contact interactions, introducing high-dimensional transfer risks. For example, a BFM trained for diverse humanoid motions may fail to adapt to real-world actuator delays, friction variations, or sensor noise, resulting in unstable or unsafe execution. The integration of visual signals into control systems further compounds the challenge: perceptual domain shifts (e.g., lighting, texture, or camera calibration mismatches) and generalization gaps in visual features can destabilize motion policies trained on simulated visual inputs. Current BFMs remain largely confined to simulation, with no documented large-scale real-world deployments. While several pioneering works have been devoted to developing BFM-like controllers on real humanoid robots, such as HugWBC [55] and CLONE [137], their motion skills remain narrow, highlighting how difficult it is to preserve behavior richness through Sim2Real transfer. This gap stems from behavioral overgeneralization, dynamics mismatches, and latent space instability, which hinder scaling targeted Sim2Real successes (e.g., quadruped locomotion or grasping) to the complexity of BFMs.

4.2.2 Data Bottleneck
The data bottleneck poses a fundamental constraint in developing BFMs for humanoid robots. While the datasets listed in Table 2 have been successfully employed to train the current BFMs, their scale remains significantly smaller than the datasets used to train LLMs or large vision models. This scarcity is exacerbated when retargeting motions to specific robotic platforms [138], where subtle morphological differences can incur severe policy performance loss. Real-world robot data is even more constrained due to hardware limitations and safety concerns. The challenge compounds when considering multimodal data requirements.
Current BFMs predominantly rely on proprioceptive inputs, integrating exteroceptive sensing for real-world
PREPRINT. UNDER REVIEW. 11
alone cannot address. However, the effect of data scaling remains underexplored, with open questions as discussed in Section 4.2.2. Furthermore, unlike LLMs, BFM scaling must strike an appropriate balance between behavior coverage (diverse motor skills) and control efficiency (precision and real-time stability). Future work should rigorously quantify these scaling dynamics to unlock BFMs’ full potential as general-purpose controllers.

5.1.4 Post-training
Post-training techniques have emerged as critical tools for the continued success and refinement of foundation models, especially LLMs [159]: they refine models to improve reasoning, address limitations, and better align with user intents and ethical considerations. Among these methods, fine-tuning [160–165], integration of RL [90, 166–172], and test-time scaling [89, 173–175] have been the most prominent strategies for optimizing LLMs’ performance. Integrating these post-training strategies presents unique research opportunities for BFMs. For instance, leveraging RL techniques like RLHF and RLAIF for BFMs can be crucial for refining the alignment between the agent’s behavior and real-world human expectations, especially in human-centric task environments. This opens avenues for developing more robust models that can adapt dynamically to user feedback. Additionally, test-time scaling for BFMs can optimize the computational efficiency during deployment, especially for real-time robot control or decision-making systems. Research could focus on improving the scalability of BFMs while ensuring that model outputs remain accurate and contextually appropriate across varying operational conditions.

5.1.5 Multi-agent System
A multi-agent system (MAS) consists of multiple interacting agents that can be cooperative, competitive, or a mix of both, aiming to address complex tasks that require collaboration, decision-making, and behavior coordination among agents [176–181]. BFMs can fundamentally accelerate the construction of MAS consisting of humanoid robots, eliminating the need to teach each robot basic survival skills like balance and locomotion before they can collaborate. Instead, researchers can focus directly on higher-level coordination challenges like role allocation and team strategy. However, current BFMs trained on single-robot data lack the specialized interaction capabilities needed for optimal collaboration. This presents a promising research direction: developing next-generation BFMs trained explicitly on multi-robot interaction scenarios. Such models could better handle physical coordination challenges like object handovers, formation maintenance, and collision avoidance while preserving their generalizability.

5.1.6 Evaluation Mechanism
While foundation models like LLMs benefit from well-established benchmarks (e.g., GPQA [182] for broad knowledge recall, MATH [183] for mathematical problem solving, or MUSR [184] for multi-step reasoning), there is no specific and comprehensive mechanism for assessing the capabilities of BFMs and guiding their evolution. A robust assessment of BFMs must consider multiple interdependent factors, including task generalization across unseen scenarios, adaptability to new skills with minimal data, robustness against physical perturbations, and alignment with human safety and interpretability standards. For example, Motivo [41] combines quantitative metrics like task success rates with qualitative human evaluations of motion naturalness, yet critical gaps remain in assessing compositional skill combinations, hardware-specific constraints, and long-term behavioral stability. Future benchmarks should focus on progressive difficulty levels and cross-domain transfer tests to effectively assess BFMs’ potential as general-purpose physical controllers in both simulated and real-world scenarios. This approach will ultimately guide the field towards developing more capable and reliable humanoid systems.

5.2 Risks
5.2.1 Ethical Issues
Ethical issues consistently accompany the development of diverse foundation models [185–187], including biased or unlicensed data, racial discrimination, and uncontrollable behaviors. For BFMs, training on non-diverse motion datasets may encode demographic biases such as favoring movements natural to specific age groups or body types, which then propagate into robotic behaviors and create embodied forms of discrimination. Meanwhile, privacy risks escalate beyond data memorization to movement analytics, where rehabilitation or performance data could leak sensitive health information through generated motions. The physical instantiation of BFMs introduces unprecedented risks: unlike purely digital models, misaligned BFMs might reproduce unsafe or socially harmful behaviors (e.g., aggressive gestures or exclusionary motions) with real-world consequences. While techniques like differential privacy and federated learning offer partial solutions, they struggle with the temporal nature of continuous motion data. In summary, BFMs demand novel governance frameworks that address both data provenance and the normativity of real-time behavior.

5.2.2 Safety Mechanism
As BFMs will be increasingly deployed in real-world robotic systems, they introduce critical safety requirements beyond those of digital foundation models [188–190]. A key issue is maintaining model behavior integrity, particularly in safety-critical scenarios such as human-robot interaction and autonomous navigation. When trained on large-scale yet weakly curated motion datasets, BFMs may unintentionally learn unsafe or undesirable behaviors. Even minor changes in sensory input, whether caused by adversarial attacks or sensor noise, can lead to control failures. This highlights the need for robustness against shifts in data distribution and protection against malicious input manipulation.
While most current BFMs focus primarily on proprioceptive inputs, integrating multimodal information (e.g., visual, linguistic, and auditory cues) has emerged as a promising direction for more generalizable and situationally aware control. However, multimodality introduces new vulnerabilities. Adversaries can exploit inconsistencies across modalities, as seen in the well-known CLIP case where
an apple image was misclassified as an “iPod” due to an overlaid label [191]. Such cross-modal confusion can be especially dangerous when a BFM is adapted for a unimodal task but retains sensitivity to irrelevant signals from other modalities. These challenges underscore the need to develop robust safety mechanisms for BFMs. Future work could prioritize adversarial robustness, cross-modal consistency checks, and disentangling modality-specific information to ensure predictable, trustworthy robot behavior in open-world environments.

6 CONCLUSION
In this paper, we present a systematic preview of the behavior foundation model, an emerging yet transformative paradigm for humanoid whole-body control systems. By pre-training on large-scale and diverse humanoid behavior data, BFMs learn broad behavior coverage that enables few-shot or zero-shot adaptation to a wide range of downstream tasks, eliminating the need for resource-intensive task-specific training. We establish a comprehensive taxonomy categorizing BFM approaches into supervised and unsupervised frameworks, while outlining their potential applications across healthcare, gaming, and industrial domains. Furthermore, we identify key research opportunities in high-level ML system integration, post-training optimization, and standardized evaluation mechanisms that could accelerate BFM development.
Despite their unprecedented capabilities, BFMs face significant challenges, including the Sim2Real gap, embodiment dependence, and data scarcity. The physical instantiation of BFMs introduces unique safety risks requiring robust verification mechanisms, while their training on human motion data raises ethical concerns regarding privacy and bias. Addressing these limitations in future work will lead to more reliable and generalizable BFMs. Our work is expected to inspire more subsequent research on BFMs.

REFERENCES
[1] D. Kulić, G. Venture, K. Yamane, E. Demircan, I. Mizuuchi, and K. Mombaur, “Anthropomorphic movement analysis and synthesis: A survey of methods and applications,” IEEE Transactions on Robotics, vol. 32, no. 4, pp. 776–795, 2016.
[2] A. Goswami and P. Vadakkepat, Humanoid robotics: a reference. Springer Dordrecht, 2019.
[3] M. Schwenzer, M. Ay, T. Bergs, and D. Abel, “Review on model predictive control: An engineering perspective,” The International Journal of Advanced Manufacturing Technology, vol. 117, no. 5, pp. 1327–1349, 2021.
[4] G. Romualdi, S. Dafarra, G. L’Erario, I. Sorrentino, S. Traversaro, and D. Pucci, “Online non-linear centroidal mpc for humanoid robot locomotion with step adjustment,” in 2022 International Conference on Robotics and Automation (ICRA), pp. 10412–10419, IEEE, 2022.
[5] S. P. Sethi and S. P. Sethi, What is optimal control theory? Springer, 2021.
[6] O. Khatib, “A unified approach for motion and force control of robot manipulators: The operational space formulation,” IEEE Journal on Robotics and Automation, vol. 3, no. 1, pp. 43–53, 1987.
[7] L. Sentis and O. Khatib, “Synthesis of whole-body behaviors through hierarchical control of behavioral primitives,” International Journal of Humanoid Robotics, vol. 2, no. 04, pp. 505–518, 2005.
[8] A. Escande, N. Mansard, and P.-B. Wieber, “Hierarchical quadratic programming: Fast online humanoid-robot motion generation,” The International Journal of Robotics Research, vol. 33, no. 7, pp. 1006–1028, 2014.
[9] S. Kuindersma, R. Deits, M. Fallon, A. Valenzuela, H. Dai, F. Permenter, T. Koolen, P. Marion, and R. Tedrake, “Optimization-based locomotion planning, estimation, and control design for the atlas humanoid robot,” Autonomous robots, vol. 40, pp. 429–455, 2016.
[10] E. Dantec, M. Naveau, P. Fernbach, N. Villa, G. Saurel, O. Stasse, M. Taix, and N. Mansard, “Whole-body model predictive control for biped locomotion on a torque-controlled humanoid robot,” in 2022 IEEE-RAS 21st International Conference on Humanoid Robots (Humanoids), pp. 638–644, 2022.
[11] B. Henze, A. Dietrich, and C. Ott, “An approach to combine balancing with hierarchical whole-body control for legged humanoid robots,” IEEE Robotics and Automation Letters, vol. 1, no. 2, pp. 700–707, 2015.
[12] S. Sovukluk, J. Englsberger, and C. Ott, “Whole body control formulation for humanoid robots with closed/parallel kinematic chains: Kangaroo case study,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10390–10396, IEEE, 2023.
[13] K. Ishihara, T. D. Itoh, and J. Morimoto, “Full-body optimal control toward versatile and agile behaviors in a humanoid robot,” IEEE Robotics and Automation Letters, vol. 5, no. 1, pp. 119–126, 2019.
[14] Y. Hao, G. C. R. Bethala, N. Pudasaini, H. Huang, S. Yuan, C. Wen, B. Huang, A. Nguyen, and Y. Fang, “Embodied chain of action reasoning with multi-modal foundation model for humanoid loco-manipulation,” arXiv preprint arXiv:2504.09532, 2025.
[15] M. Murooka, K. Fukumitsu, M. Hamze, M. Morisawa, H. Kaminaga, F. Kanehiro, and E. Yoshida, “Whole-body multi-contact motion control for humanoid robots based on distributed tactile sensors,” IEEE Robotics and Automation Letters, 2024.
[16] X. B. Peng, P. Abbeel, S. Levine, and M. Van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” ACM Transactions On Graphics (TOG), vol. 37, no. 4, pp. 1–14, 2018.
[17] A. Singh, H. Liu, G. Zhou, A. Yu, N. Rhinehart, and S. Levine, “Parrot: Data-driven behavioral priors for reinforcement learning,” arXiv preprint arXiv:2011.10024, 2020.
[18] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[19] N. Sünderhauf, O. Brock, W. Scheirer, R. Hadsell, D. Fox, J. Leitner, B. Upcroft, P. Abbeel, W. Burgard, M. Milford, et al., “The limits and potentials of deep learning for robotics,” The International journal of robotics research, vol. 37, no. 4-5, pp. 405–420, 2018.
[20] H. Nguyen and H. La, “Review of deep reinforcement learning for robot manipulation,” in 2019 Third IEEE international conference on robotic computing (IRC), pp. 590–595, IEEE, 2019.
[21] A. I. Károly, P. Galambos, J. Kuti, and I. J. Rudas, “Deep learning in robotics: Survey on model structures and training strategies,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 51, no. 1, pp. 266–279, 2020.
[22] R. Liu, F. Nageotte, P. Zanne, M. de Mathelin, and B. Dresp-Langley, “Deep reinforcement learning for the control of robotic manipulation: a focussed mini-review,” Robotics, vol. 10, no. 1, p. 22, 2021.
[23] J. Chen, D. Tamboli, T. Lan, and V. Aggarwal, “Multi-task hierarchical adversarial inverse reinforcement learning,” in International Conference on Machine Learning, vol. 202 of Proceedings of Machine Learning Research, pp. 4895–4920, PMLR, 2023.
[24] X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa, “Amp: Adversarial motion priors for stylized physics-based character control,” ACM Transactions on Graphics (ToG), vol. 40, no. 4, pp. 1–20, 2021.
[25] T. Huang, J. Ren, H. Wang, Z. Wang, Q. Ben, M. Wen, X. Chen, J. Li, and J. Pang, “Learning humanoid standing-up control across diverse postures,” arXiv preprint arXiv:2502.08378, 2025.
[26] M. Seo, S. Han, K. Sim, S. Bang, C. Gonzalez, L. Sentis, and Y. Zhu, “Deep imitation learning for humanoid loco-manipulation through human teleoperation,” in 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids), 2023.
[27] S. H. Bang, C. Gonzalez, J. Ahn, N. Paine, and L. Sentis, “Control and evaluation of a humanoid robot with rolling contact joints on its lower body,” Frontiers in Robotics and AI, vol. 10, p. 1164660, 2023.
[28] X. Cheng, Y. Ji, J. Chen, R. Yang, G. Yang, and X. Wang, “Expressive whole-body control for humanoid robots,” arXiv preprint arXiv:2402.16796, 2024.
[29] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” in The Twelfth International Conference on Learning Representations, 2024.
[30] A. Kadian, J. Truong, A. Gokaslan, A. Clegg, E. Wijmans, S. Lee, M. Savva, S. Chernova, and D. Batra, “Sim2real predictivity: Does evaluation in simulation predict real-world performance?,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6670–6677, 2020.
[31] S. Höfer, K. Bekris, A. Handa, J. C. Gamboa, M. Mozifian, F. Golemo, C. Atkeson, D. Fox, K. Goldberg, J. Leonard, et al., “Sim2real in robotics and automation: Applications and challenges,” IEEE transactions on automation science and engineering, vol. 18, no. 2, pp. 398–400, 2021.
[32] K. Iyengar, S. H. Sadati, C. Bergeles, S. Spurgeon, and D. Stoyanov, “Sim2real transfer of reinforcement learning for concentric tube robots,” IEEE Robotics and Automation Letters, vol. 8, no. 10, pp. 6147–6154, 2023.
[33] E. Su, C. Jia, Y. Qin, W. Zhou, A. Macaluso, B. Huang, and X. Wang, “Sim2real manipulation on unknown objects with tactile-based reinforcement learning,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 9234–9241, IEEE, 2024.
[34] S. Schaal, “Is imitation learning the route to humanoid robots?,” Trends in cognitive sciences, vol. 3, no. 6, pp. 233–242, 1999.
[35] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne, “Imitation learning: A survey of learning methods,” ACM Computing Surveys (CSUR), vol. 50, no. 2, pp. 1–35, 2017.
[36] M. Laskey, J. Lee, R. Fox, A. Dragan, and K. Goldberg, “Dart: Noise injection for robust imitation learning,” in Conference on robot learning, pp. 143–156, PMLR, 2017.
[37] B. Fang, S. Jia, D. Guo, M. Xu, S. Wen, and F. Sun, “Survey of imitation learning for robotic manipulation,” International Journal of Intelligent Robotics and Applications, vol. 3, no. 4, pp. 362–369, 2019.
[38] J. Hua, L. Zeng, G. Li, and Z. Ju, “Learning for a robot: Deep reinforcement learning, imitation learning, transfer learning,” Sensors, vol. 21, no. 4, 2021.
[39] M. Pirotta, A. Tirinzoni, A. Touati, A. Lazaric, and Y. Ollivier, “Fast imitation via behavior foundation models,” in The Twelfth International Conference on Learning Representations, 2024.
[40] E. Cetin, A. Touati, and Y. Ollivier, “Finer behavioral foundation models via auto-regressive features and advantage weighting,” arXiv preprint arXiv:2412.04368, 2024.
[41] A. Tirinzoni, A. Touati, J. Farebrother, M. Guzek, A. Kanervisto, Y. Xu, A. Lazaric, and M. Pirotta, “Zero-shot whole-body humanoid control via behavioral foundation models,” in The Thirteenth International Conference on Learning Representations, 2025.
[42] H. Sikchi, A. Tirinzoni, A. Touati, Y. Xu, A. Kanervisto, S. Niekum, A. Zhang, A. Lazaric, and M. Pirotta, “Fast adaptation with behavioral foundation models,” in Reinforcement Learning Conference, 2025.
[43] R. Vainshtein, Z. Rimon, S. Mannor, and C. Tessler, “Task tokens: A flexible approach to adapting behavior foundation models,” arXiv preprint arXiv:2503.22886, 2025.
[44] M. Bobrin, I. Zisman, A. Nikulin, V. Kurenkov, and D. Dylov, “Zero-shot adaptation of behavioral foundation models to unseen dynamics,” arXiv preprint arXiv:2505.13150, 2025.
[45] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021.
[46] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[47] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference
on machine learning, pp. 8748–8763, PMLR, 2021.
[48] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 4015–4026, 2023.
[49] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning, pp. 2165–2183, PMLR, 2023.
[50] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, et al., “Openvla: An open-source vision-language-action model,” in Conference on Robot Learning, PMLR, 2024.
[51] J. Wen, Y. Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shen, et al., “Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation,” IEEE Robotics and Automation Letters, 2025.
[52] G. Zambella, G. Lentini, M. Garabini, G. Grioli, M. G. Catalano, A. Palleschi, L. Pallottino, A. Bicchi, A. Settimi, and D. Caporale, “Dynamic whole-body control of unstable wheeled humanoid robots,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3489–3496, 2019.
[53] F. L. Moro and L. Sentis, “Whole-body control of humanoid robots,” Humanoid robotics: a reference, pp. 1161–1183, 2019.
[54] L. Sentis and O. Khatib, “A whole-body control framework for humanoids operating in human environments,” in Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006., pp. 2641–2648, IEEE, 2006.
[55] Y. Xue, W. Dong, M. Liu, W. Zhang, and J. Pang, “A unified and general humanoid whole-body controller for versatile locomotion,” in Robotics: Science and Systems (RSS), 2025.
[56] T. He, W. Xiao, T. Lin, Z. Luo, Z. Xu, Z. Jiang, J. Kautz, C. Liu, G. Shi, X. Wang, et al., “Hover: Versatile neural whole-body controller for humanoid robots,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2025.
[57] Y. Wang, M. Yang, W. Zeng, Y. Zhang, X. Xu, H. Jiang, Z. Ding, and Z. Lu, “From experts to a generalist: Toward general whole-body control for humanoid robots,” arXiv preprint arXiv:2506.12779, 2025.
[58] R. S. Sutton, A. G. Barto, et al., Reinforcement learning: An introduction, vol. 1. MIT press Cambridge, 1998.
[59] Z. Luo, J. Cao, K. Kitani, W. Xu, et al., “Perpetual humanoid control for real-time simulated avatars,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10895–10904, 2023.
[60] P. Wu, A. Majumdar, K. Stone, Y. Lin, I. Mordatch, P. Abbeel, and A. Rajeswaran, “Masked trajectory models for prediction, representation, and control,” in International Conference on Machine Learning, pp. 37607–37623, PMLR, 2023.
[61] S. Liu, G. Lever, Z. Wang, J. Merel, S. A. Eslami, D. Hennes, W. M. Czarnecki, Y. Tassa, S. Omidshafiei, A. Abdolmaleki, et al., “From motor control to team play in simulated humanoid football,” Science Robotics, vol. 7, no. 69, p. eabo0235, 2022.
[62] J. Merel, L. Hasenclever, A. Galashov, A. Ahuja, V. Pham, G. Wayne, Y. W. Teh, and N. Heess, “Neural probabilistic motor primitives for humanoid control,” in International Conference on Learning Representations, 2019.
[63] X. B. Peng, Y. Guo, L. Halper, S. Levine, and S. Fidler, “Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters,” ACM Transactions On Graphics (TOG), vol. 41, no. 4, pp. 1–17, 2022.
[64] C. Tessler, Y. Kasten, Y. Guo, S. Mannor, G. Chechik, and X. B. Peng, “Calm: Conditional adversarial latent models for directable virtual characters,” in ACM SIGGRAPH 2023 Conference Proceedings, pp. 1–9, 2023.
[65] Z. Dou, X. Chen, Q. Fan, T. Komura, and W. Wang, “C·ase: Learning conditional adversarial skill embeddings for physics-based characters,” in SIGGRAPH Asia 2023 Conference Papers, pp. 1–11, 2023.
[66] Y. Huang, Z. Dou, and L. Liu, “Modskill: Physical character skill modularization,” arXiv preprint arXiv:2502.14140, 2025.
[67] H. Yao, Z. Song, Y. Zhou, T. Ao, B. Chen, and L. Liu, “Moconvq: Unified physics-based motion control via scalable discrete representations,” ACM Transactions on Graphics (TOG), vol. 43, no. 4, pp. 1–21, 2024.
[68] C. Tessler, Y. Guo, O. Nabati, G. Chechik, and X. B. Peng, “Maskedmimic: Unified physics-based character control through masked motion inpainting,” ACM Transactions on Graphics (TOG), vol. 43, no. 6, pp. 1–21, 2024.
[69] S. Xu, H. Y. Ling, Y.-X. Wang, and L.-Y. Gui, “Intermimic: Towards universal whole-body control for physics-based human-object interactions,” in Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, 2025.
[70] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in International conference on machine learning, pp. 2778–2787, PMLR, 2017.
[71] Y. Burda, H. Edwards, A. Storkey, and O. Klimov, “Exploration by random network distillation,” in International Conference on Learning Representations, 2019.
[72] D. Pathak, D. Gandhi, and A. Gupta, “Self-supervised exploration via disagreement,” in International conference on machine learning, pp. 5062–5071, PMLR, 2019.
[73] R. Sekar, O. Rybkin, K. Daniilidis, P. Abbeel, D. Hafner, and D. Pathak, “Planning to explore via self-supervised world models,” in International conference on machine learning, pp. 8583–8592, PMLR, 2020.
[74] K. Gregor, D. J. Rezende, and D. Wierstra, “Variational intrinsic control,” in International Conference on Learning Representations, 2017.
[75] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, “Diversity is all you need: Learning skills without a reward function,” in International Conference on Learning Representations, 2019.
[76] S. Hansen, W. Dabney, A. Barreto, D. Warde-Farley, T. Van de Wiele, and V. Mnih, “Fast task inference
with variational intrinsic successor features,” in International Conference on Learning Representations, 2020.
[77] H. Liu and P. Abbeel, “Aps: Active pretraining with successor features,” in International Conference on Machine Learning, pp. 6736–6747, PMLR, 2021.
[78] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying count-based exploration and intrinsic motivation,” Advances in neural information processing systems, vol. 29, 2016.
[79] G. Ostrovski, M. G. Bellemare, A. Oord, and R. Munos, “Count-based exploration with neural density models,” in International conference on machine learning, pp. 2721–2730, PMLR, 2017.
[80] Y. Seo, L. Chen, J. Shin, H. Lee, P. Abbeel, and K. Lee, “State entropy maximization with random encoders for efficient exploration,” in International Conference on Machine Learning, pp. 9443–9454, PMLR, 2021.
[81] D. Yarats, R. Fergus, A. Lazaric, and L. Pinto, “Reinforcement learning with prototypical representations,” in International Conference on Machine Learning, pp. 11920–11931, PMLR, 2021.
[82] J. Chen, V. Aggarwal, and T. Lan, “A unified algorithm framework for unsupervised discovery of skills based on determinantal point process,” Advances in Neural Information Processing Systems, vol. 36, pp. 67925–67947, 2023.
[83] M. Laskin, D. Yarats, H. Liu, K. Lee, A. Zhan, K. Lu, C. Cang, L. Pinto, and P. Abbeel, “Urlb: Unsupervised reinforcement learning benchmark,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[84] L. Blier, C. Tallec, and Y. Ollivier, “Learning successor states and goal-dependent values: A mathematical viewpoint,” arXiv preprint arXiv:2101.07123, 2021.
[85] P. Dayan, “Improving generalization for temporal difference learning: The successor representation,” Neural computation, vol. 5, no. 4, pp. 613–624, 1993.
[86] T. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gershman, “Deep successor reinforcement learning,” arXiv preprint arXiv:1606.02396, 2016.
[87] A. Touati and Y. Ollivier, “Learning one representation to optimize all rewards,” Advances in Neural Information Processing Systems, vol. 34, pp. 13–23, 2021.
directed humanoid whole-body control via end-to-end learning,” arXiv preprint arXiv:2504.21738, 2025.
[92] H. Xue, X. Huang, D. Niu, Q. Liao, T. Kragerud, J. T. Gravdahl, X. B. Peng, G. Shi, T. Darrell, K. Sreenath, et al., “Leverb: Humanoid whole-body control with latent vision-language instruction,” arXiv preprint arXiv:2506.13751, 2025.
[93] Z. Xiao, T. Wang, J. Wang, J. Cao, W. Zhang, B. Dai, D. Lin, and J. Pang, “Unified human-scene interaction via prompted chain-of-contacts,” in The Twelfth International Conference on Learning Representations, 2024.
[94] L. Pan, Z. Yang, Z. Dou, W. Wang, B. Huang, B. Dai, T. Komura, and J. Wang, “Tokenhsi: Unified synthesis of physical human-scene interactions through task tokenization,” arXiv preprint arXiv:2503.19901, 2025.
[95] G. Tevet, S. Raab, S. Cohan, D. Reda, Z. Luo, X. B. Peng, A. H. Bermano, and M. van de Panne, “Closd: Closing the loop between simulation and diffusion for multi-task character control,” in The Thirteenth International Conference on Learning Representations, 2025.
[96] Y. Wu, K. Karunratanakul, Z. Luo, and S. Tang, “Uniphys: Unified planner and controller with diffusion for flexible physics-based character control,” arXiv preprint arXiv:2504.12540, 2025.
[97] H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,” arXiv preprint arXiv:2307.06435, 2023.
[98] Y. Liu, K. Zhang, Y. Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y. Huang, H. Sun, J. Gao, et al., “Sora: A review on background, technology, limitations, and opportunities of large vision models,” arXiv preprint arXiv:2402.17177, 2024.
[99] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10850–10869, 2023.
[100] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,” ACM Computing Surveys, vol. 56, no. 4, pp. 1–39, 2023.
[101] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and
[88] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, L. Sun, “A comprehensive survey of ai-generated
D. Warde-Farley, S. Ozair, A. Courville, and Y. Ben- content (aigc): A history of generative ai from gan to
gio, “Generative adversarial nets,” in Advances in chatgpt,” arXiv preprint arXiv:2303.04226, 2023.
Neural Information Processing Systems (Z. Ghahramani, [102] P. Pilaniwala, “Integrating genai in advancing game
M. Welling, C. Cortes, N. Lawrence, and K. Wein- product management and development,” in 2024
berger, eds.), vol. 27, Curran Associates, Inc., 2014. Eighth International Conference on Parallel, Distributed
[89] K. Gandhi, D. Lee, G. Grand, M. Liu, W. Cheng, and Grid Computing (PDGC), pp. 558–563, IEEE, 2024.
A. Sharma, and N. D. Goodman, “Stream of search [103] P. Pilaniwala, G. Chhabra, and P. Kaur, “The future
(sos): Learning to search in language,” arXiv preprint of game development in the era of gen ai,” in 2024
arXiv:2404.03683, 2024. Artificial Intelligence for Business (AIxB), pp. 39–42,
[90] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wain- IEEE, 2024.
wright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, [104] D. Jiang, M. Ku, T. Li, Y. Ni, S. Sun, R. Fan, and
A. Ray, et al., “Training language models to follow W. Chen, “Genai arena: An open evaluation platform
instructions with human feedback,” Advances in neural for generative models,” Advances in Neural Information
information processing systems, vol. 35, pp. 27730–27744, Processing Systems, vol. 37, pp. 79889–79908, 2024.
2022. [105] Z. Wu, Z. Chen, D. Zhu, C. Mousas, and D. Kao, “A
[91] Y. Shao, X. Huang, B. Zhang, Q. Liao, Y. Gao, Y. Chi, systematic review of generative ai on game character
Z. Li, S. Shao, and K. Sreenath, “Langwbc: Language- creation: Applications, challenges, and future trends,”
PREPRINT. UNDER REVIEW. 17
IEEE Transactions on Games, 2025. population aging: Facts, challenges, solutions & per-
[106] M. Kopel and T. Hajas, “Implementing ai for non- spectives,” Daedalus, vol. 144, no. 2, pp. 80–92, 2015.
player characters in 3d video games,” in Intelligent [121] K. Navaneetham and D. Arunachalam, “Global popu-
Information and Database Systems: 10th Asian Conference, lation aging, 1950–2050,” in Handbook of Aging, Health
ACIIDS 2018, Dong Hoi City, Vietnam, March 19-21, and Public Policy: Perspectives from Asia, pp. 1–18,
2018, Proceedings, Part I 10, pp. 610–619, Springer, 2018. Springer, 2023.
[107] A. Mehta, Y. Kunjadiya, A. Kulkarni, and M. Nagar, [122] P. Zhang, H. Yang, C. Chen, T. Wang, and X. Jia,
“Exploring the viability of conversational ai for non- “The impact of population aging on corporate digital
playable characters: A comprehensive survey,” in 2021 transformation: Evidence from china,” Technological
4th International Conference on Recent Trends in Com- Forecasting and Social Change, vol. 214, p. 124070, 2025.
puter Science and Technology (ICRTCST), pp. 96–102, [123] D. Feil-Seifer and M. J. Mataric, “Defining socially
IEEE, 2022. assistive robotics,” in 9th International Conference on
[108] M. Ç. Uludağlı and K. Oğuz, “Non-player character Rehabilitation Robotics, 2005. ICORR 2005., pp. 465–468,
decision-making in computer games,” Artificial Intelli- IEEE, 2005.
gence Review, vol. 56, no. 12, pp. 14159–14191, 2023. [124] D. P. Miller, “Assistive robotics: an overview,” Assis-
[109] X. Xu, Y. Lu, B. Vogel-Heuser, and L. Wang, “Indus- tive Technology and Artificial Intelligence: Applications in
try 4.0 and industry 5.0—inception, conception and Robotics, User Interfaces and Natural Language Process-
perception,” Journal of manufacturing systems, vol. 61, ing, pp. 126–136, 2006.
pp. 530–535, 2021. [125] A. M. Okamura, M. J. Matarić, and H. I. Christensen,
[110] J. Leng, W. Sha, B. Wang, P. Zheng, C. Zhuang, Q. Liu, “Medical and health-care robotics,” IEEE Robotics &
T. Wuest, D. Mourtzis, and L. Wang, “Industry 5.0: Automation Magazine, vol. 17, no. 3, pp. 26–37, 2010.
Prospect and retrospect,” Journal of Manufacturing Sys- [126] L. D. Riek, “Healthcare robotics,” Communications of
tems, vol. 65, pp. 279–295, 2022. the ACM, vol. 60, no. 11, pp. 68–78, 2017.
[111] S. Huang, B. Wang, X. Li, P. Zheng, D. Mourtzis, and [127] J. Holland, L. Kingston, C. McCarthy, E. Armstrong,
L. Wang, “Industry 5.0 and society 5.0—comparison, P. O’Dwyer, F. Merz, and M. McConnell, “Service
complementation and co-evolution,” Journal of manu- robots in the healthcare sector,” Robotics, vol. 10, no. 1,
facturing systems, vol. 64, pp. 424–428, 2022. p. 47, 2021.
[112] A. Akundi, D. Euresti, S. Luna, W. Ankobiah, [128] M. Kyrarini, F. Lygerakis, A. Rajavenkatanarayanan,
A. Lopes, and I. Edinbarough, “State of industry C. Sevastopoulos, H. R. Nambiappan, K. K. Chai-
5.0—analysis and identification of current research tanya, A. R. Babu, J. Mathew, and F. Makedon, “A
trends,” Applied System Innovation, vol. 5, no. 1, p. 27, survey of robots in healthcare,” Technologies, vol. 9,
2022. no. 1, p. 8, 2021.
[113] M. A. Hassan, S. Zardari, M. U. Farooq, M. M. [129] V. Sanchez, C. J. Walsh, and R. J. Wood, “Textile tech-
Alansari, and S. A. Nagro, “Systematic analysis of nology for soft robotic and autonomous garments,”
risks in industry 5.0 architecture,” Applied Sciences, Advanced functional materials, vol. 31, no. 6, p. 2008278,
vol. 14, no. 4, p. 1466, 2024. 2021.
[114] A. Dzedzickis, J. Subačiūtė-Žemaitienė, E. Šutinys, [130] F. Zhang and Y. Demiris, “Learning garment manipu-
U. Samukaitė-Bubnienė, and V. Bučinskas, “Advanced lation policies toward robot-assisted dressing,” Science
applications of industrial robotics: New trends and robotics, vol. 7, no. 65, p. eabm6010, 2022.
possibilities,” Applied Sciences, vol. 12, no. 1, p. 135, [131] M. Javaid, A. Haleem, R. Pratap Singh, S. Rab,
2021. R. Suman, and L. Kumar, “Utilization of robotics
[115] M. Bartoš, V. Bulej, M. Bohušı́k, J. Stanček, V. Ivanov, for healthcare: a scoping review,” Journal of Industrial
and P. Macek, “An overview of robot applications in Integration and Management, vol. 10, no. 01, pp. 43–65,
automotive industry,” Transportation Research Procedia, 2025.
vol. 55, pp. 837–844, 2021. [132] M. Kaspar, J. D. M. Osorio, and J. Bock, “Sim2real
[116] J. Arents and M. Greitans, “Smart industrial robot transfer for reinforcement learning without dynamics
control trends, challenges and opportunities within randomization,” in 2020 IEEE/RSJ International Confer-
manufacturing,” Applied Sciences, vol. 12, no. 2, p. 937, ence on Intelligent Robots and Systems (IROS), pp. 4383–
2022. 4388, IEEE, 2020.
[117] M. Soori, R. Dastres, B. Arezoo, and F. K. G. Jough, [133] D. Horváth, G. Erdős, Z. Istenes, T. Horváth, and
“Intelligent robotic systems in industry 4.0: A review,” S. Földi, “Object detection using sim2real domain
Journal of Advanced Manufacturing Science and Technol- randomization for robotic applications,” IEEE Trans-
ogy, pp. 2024007–0, 2024. actions on Robotics, vol. 39, no. 2, pp. 1225–1243, 2022.
[118] C.-C. Lee, S. Qin, and Y. Li, “Does industrial robot [134] J. Huber, F. Hélénon, H. Watrelot, F. B. Amar,
application promote green technology innovation in and S. Doncieux, “Domain randomization for
the manufacturing industry?,” Technological Forecast- sim2real transfer of automatically generated grasp-
ing and Social Change, vol. 183, p. 121893, 2022. ing datasets,” in 2024 IEEE International Conference on
[119] D. T. Rowland, “Global population aging: History and Robotics and Automation (ICRA), pp. 4112–4118, IEEE,
prospects,” in International handbook of population aging, 2024.
pp. 37–65, Springer, 2009. [135] T. Yao, H. Wang, B. Lu, J. Ge, Z. Pei, M. Kowarschik,
[120] D. E. Bloom, D. Canning, and A. Lubet, “Global L. Sun, L. Seneviratne, and P. Qi, “Sim2real learn-
PREPRINT. UNDER REVIEW. 18
ing with domain randomization for autonomous vol. 36, no. 19, p. 2308829, 2024.
guidewire navigation in robotic-assisted endovascular [150] T. Wang, P. Zheng, S. Li, and L. Wang, “Multi-
procedures,” IEEE Transactions on Automation Science modal human–robot interaction for human-centric
and Engineering, 2025. smart manufacturing: a survey,” Advanced Intelligent
[136] T. He, J. Gao, W. Xiao, Y. Zhang, Z. Wang, J. Wang, Systems, vol. 6, no. 3, p. 2300359, 2024.
Z. Luo, G. He, N. Sobanbab, C. Pan, et al., “Asap: [151] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang,
Aligning simulation and real-world physics for learn- “Hugginggpt: Solving ai tasks with chatgpt and its
ing agile humanoid whole-body skills,” in Robotics: friends in hugging face,” Advances in Neural Informa-
Science and Systems (RSS), 2025. tion Processing Systems, vol. 36, pp. 38154–38180, 2023.
[137] Y. Li, Y. Lin, J. Cui, T. Liu, W. Liang, Y. Zhu, and [152] S.-C. Dai, A. Xiong, and L.-W. Ku, “LLM-in-the-loop:
S. Huang, “Clone: Closed-loop whole-body humanoid Leveraging large language model for thematic anal-
teleoperation for long-horizon tasks,” arXiv preprint ysis,” in The 2023 Conference on Empirical Methods in
arXiv:2506.08931, 2025. Natural Language Processing, 2023.
[138] M. Gleicher, “Retargetting motion to new characters,” [153] C.-M. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang,
in Proceedings of the 25th annual conference on Computer J. Fu, and Z. Liu, “Chateval: Towards better LLM-
graphics and interactive techniques, pp. 33–42, 1998. based evaluators through multi-agent debate,” in The
[139] M. Plappert, C. Mandery, and T. Asfour, “The Twelfth International Conference on Learning Representa-
kit motion-language dataset,” arXiv preprint tions, 2024.
arXiv:1607.03827, 2016. [154] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown,
[140] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and
and M. J. Black, “Amass: Archive of motion capture D. Amodei, “Scaling laws for neural language mod-
as surface shapes,” in Proceedings of the IEEE/CVF els,” arXiv preprint arXiv:2001.08361, 2020.
international conference on computer vision, pp. 5442– [155] A. Clark, D. de Las Casas, A. Guy, A. Mensch, M. Pa-
5451, 2019. ganini, J. Hoffmann, B. Damoc, B. Hechtman, T. Cai,
[141] F. G. Harvey, M. Yurick, D. Nowrouzezahrai, and S. Borgeaud, et al., “Unified scaling laws for routed
C. Pal, “Robust motion in-betweening,” ACM Transac- language models,” in International conference on ma-
tions on Graphics (TOG), vol. 39, no. 4, pp. 60–1, 2020. chine learning, pp. 4057–4086, PMLR, 2022.
[142] A. R. Punnakkal, A. Chandrasekaran, N. Athanasiou, [156] A. Aghajanyan, L. Yu, A. Conneau, W.-N. Hsu,
A. Quiros-Ramirez, and M. J. Black, “Babel: Bodies, K. Hambardzumyan, S. Zhang, S. Roller, N. Goyal,
action and behavior with english labels,” in Proceed- O. Levy, and L. Zettlemoyer, “Scaling laws for gener-
ings of the IEEE/CVF Conference on Computer Vision and ative mixed-modal language models,” in International
Pattern Recognition, pp. 722–731, 2021. Conference on Machine Learning, pp. 265–279, PMLR,
[143] G. Delmas, P. Weinzaepfel, T. Lucas, F. Moreno- 2023.
Noguer, and G. Rogez, “Posescript: 3d human poses [157] B. Isik, N. Ponomareva, H. Hazimeh, D. Paparas,
from natural language,” in European Conference on S. Vassilvitskii, and S. Koyejo, “Scaling laws for down-
Computer Vision, pp. 346–362, Springer, 2022. stream task performance of large language models,”
[144] C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and in ICLR 2024 Workshop on Mathematical and Empirical
L. Cheng, “Generating diverse and natural 3d human Understanding of Foundation Models, 2024.
motions from text,” in Proceedings of the IEEE/CVF [158] H. Que, J. Liu, G. Zhang, C. Zhang, X. Qu, Y. Ma,
Conference on Computer Vision and Pattern Recognition F. Duan, Z. Bai, J. Wang, Y. Zhang, et al., “D-cpt law:
(CVPR), pp. 5152–5161, June 2022. Domain-specific continual pre-training scaling law for
[145] J. Lin, A. Zeng, S. Lu, Y. Cai, R. Zhang, H. Wang, large language models,” Advances in Neural Informa-
and L. Zhang, “Motion-x: A large-scale 3d expressive tion Processing Systems, vol. 37, pp. 90318–90354, 2024.
whole-body human motion dataset,” Advances in Neu- [159] K. Kumar, T. Ashraf, O. Thawakar, R. M. Anwer,
ral Information Processing Systems, vol. 36, pp. 25268– H. Cholakkal, M. Shah, M.-H. Yang, P. H. Torr, F. S.
25280, 2023. Khan, and S. Khan, “Llm post-training: A deep dive
[146] Y. Zhang, J. Lin, A. Zeng, G. Wu, S. Lu, Y. Fu, Y. Cai, into reasoning large language models,” arXiv preprint
R. Zhang, H. Wang, and L. Zhang, “Motion-x++: A arXiv:2502.21321, 2025.
large-scale multimodal 3d whole-body human motion [160] S. Yue, W. Chen, S. Wang, B. Li, C. Shen, S. Liu,
dataset,” arXiv preprint arXiv:2501.05098, 2025. Y. Zhou, Y. Xiao, S. Yun, X. Huang, et al., “Disc-lawllm:
[147] T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, Fine-tuning large language models for intelligent legal
K. M. Kitani, C. Liu, and G. Shi, “Omnih2o: Universal services,” arXiv preprint arXiv:2309.11325, 2023.
and dexterous human-to-humanoid whole-body tele- [161] Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and
operation and learning,” in 8th Annual Conference on J. Jia, “Longlora: Efficient fine-tuning of long-context
Robot Learning, 2024. large language models,” in The Twelfth International
[148] S. Liu, L. Wang, and X. Vincent Wang, “Multimodal Conference on Learning Representations, 2024.
data-driven robot control for human–robot collabo- [162] T. Q. Luong, X. Zhang, Z. Jie, P. Sun, X. Jin, and H. Li,
rative assembly,” Journal of Manufacturing Science and “Reft: Reasoning with reinforced fine-tuning,” arXiv
Engineering, vol. 144, no. 5, p. 051012, 2022. preprint arXiv:2401.08967, vol. 3, 2024.
[149] D. R. Yao, I. Kim, S. Yin, and W. Gao, “Multimodal soft [163] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu,
robotic actuation and locomotion,” Advanced Materials, Y. Chen, C.-M. Chan, W. Chen, et al., “Parameter-
PREPRINT. UNDER REVIEW. 19