Paper: https://arxiv.org/abs/2509.09372
Project page: https://vla-adapter.github.io/
HuggingFace: https://huggingface.co/VLA-Adapter
GitHub: https://github.com/OpenHelix-Team/VLA-Adapter
- [2025/09/22] We released our code! An enhanced Pro version is also released (it follows the pipeline in the original paper but with an optimized implementation). Everyone is welcome to use it!
- [2025/09/13] Our paper won first place on the HF daily papers list, second place on the weekly list, and third place on the monthly list!
- [2025/09/13] Our paper was listed as a Trending Paper on HF!
- [2025/09/12] We released the original version of the VLA-Adapter for four LIBERO models on HuggingFace.
- [2025/09/11] We released our paper on arXiv.
- Release checkpoints for reproduction.
- Release VLA-Adapter v2 paper.
- A more powerful version, VLA-Adapter++, and a detailed technical report will be released soon.
- Continue to update the code to support various real-world system deployments, including the configuration used in our paper, Franka, UR-5, and AGILE Piper.
- It will soon be compatible with various foundation models, including but not limited to VPP and π0.5.
- We will add diffusion-transformer and flow-matching policy networks in the future, and the results will be reported in the subsequent VLA-Adapter++ technical report.
- We will also provide more experiments with a frozen backbone.
- We will further expand its generalization in the future. Work is in progress, so please stay tuned!
- RL post-training is also in progress. Interested researchers are welcome to join us in building this foundation!
- The dual-system compatibility of VLA-Adapter is under exploration!
- Quick Start
- Data Preparation
- VLM backbone
- Training for Different Configurations => Provides training configurations for GPUs ranging from 10GB to 80GB of VRAM.
- Related Files for Training
- How to Train on Extremely Limited VRAM GPUs => A card with 10GB-12GB (e.g. NVIDIA GeForce RTX 2080Ti, 3060, 3080, 4070, 4080, and 5070)
- How to Train on Low VRAM GPUs => A card with 24GB (e.g. NVIDIA GeForce RTX 3090 and 4090)
- How to Train on Larger VRAM GPUs => A Consumer GPU with 32GB (e.g. NVIDIA GeForce RTX 5090) or a Professional-Grade GPU with 40GB-48GB (e.g. NVIDIA A100-40GB, A800-40GB, L20, and RTX A6000)
- How to Train on Sufficient VRAM GPUs => Professional-Grade GPUs with ≥80GB (e.g. NVIDIA A100-80GB, A800-80GB, H100, H800, H20-NVLink, and GB200)
- Inference
- Success Rate Comparison
- Citation
- Acknowledgment
# Create and activate conda environment
conda create -n vla-adapter python=3.10.16 -y
conda activate vla-adapter
# Install PyTorch
# Use a command specific to your machine: https://pytorch.org/get-started/locally/
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0
# Clone vla-adapter repo and pip install to download dependencies
git clone https://github.com/OpenHelix-Team/VLA-Adapter.git
cd VLA-Adapter
pip install -e .
pip install packaging ninja
ninja --version; echo $? # Verify Ninja --> should return exit code "0"
# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
pip install "flash-attn==2.5.5" --no-build-isolation
# If you run into difficulty, try `pip cache remove flash_attn` first, or visit the
# website to download it. (https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.5)
# You can download the corresponding `.whl` file according to the cuda version of `nvidia-smi`,
# and then run `pip install flash_attn-2.5.5+cuXX...whl` to install it.
# We use the `flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl` file.
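To quickly sanity-check the installation before moving on, you can run a minimal script like the one below (our own convenience check, not part of the released code):

# Minimal environment check: verifies the packages installed above are importable.
import torch

print("torch:", torch.__version__)              # expected 2.2.0
print("cuda available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)  # expected 2.5.5
except ImportError:
    print("flash-attn not installed; see the notes above")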
- (Optional) LIBERO
Clone and install the LIBERO repo and required packages:
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO
pip install -r experiments/robot/libero/libero_requirements.txt # From vla-adapter base dir
To download the LIBERO datasets that we used in our fine-tuning experiments, run the command below. This will download the Spatial, Object, Goal, and Long datasets in RLDS format, i.e., libero_spatial_no_noops, libero_object_no_noops, libero_goal_no_noops, and libero_10_no_noops ("_no_noops" stands for no no-op actions, i.e., training samples with near-zero actions are filtered out). These datasets require ~10GB of disk space in total. If needed, see details on how to download the original non-RLDS datasets here. You can use these to fine-tune Prismatic-VLMs (built on Qwen2.5-0.5B) or other VLMs.
git clone git@hf.co:datasets/openvla/modified_libero_rlds
Attention! The dataset downloaded this way needs the modified_ prefix removed from its folder name(s) so that it matches the paths described in the Benchmark Location section below!
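One way to do this is a small rename script such as the following (a sketch only; it assumes the folder(s) carrying the modified_ prefix sit in the current directory, so adjust root to wherever you cloned):

# Strip the "modified_" prefix from downloaded dataset folder names (illustrative sketch).
from pathlib import Path

root = Path(".")  # directory containing the cloned dataset folder(s)
for p in root.glob("modified_*"):
    if p.is_dir():
        new_name = p.name.removeprefix("modified_")
        p.rename(p.with_name(new_name))
        print(f"renamed {p.name} -> {new_name}")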
When using LIBERO, you may get an error message like AttributeError: 'NoneType' object has no attribute 'eglQueryString'. You can use:
sudo apt-get update
sudo apt-get install libgl1-mesa-dev libegl1-mesa-dev libgles2-mesa-dev libglew-dev
- (Optional) CALVIN
git clone --recurse-submodules https://github.com/mees/calvin.git
export CALVIN_ROOT=$(pwd)/calvin
cd $CALVIN_ROOT
# Installation of `pyhash` may fail on some machines. If it fails, you can solve it by lowering the `setuptools` version: `pip install setuptools==57.5.0`
sh install.sh
To download the CALVIN ABC→D datasets that we used in our fine-tuning experiments, run the command below.
cd $CALVIN_ROOT/dataset
sh download_data.sh ABC
If you want to download the RLDS format, you can visit here to download it. This dataset requires ~50GB of disk space.
When using CALVIN, you may get an error message like AttributeError: 'NoneType' object has no attribute 'eglQueryString'. You can use:
sudo apt-get update
sudo apt-get install libgl1-mesa-dev libegl1-mesa-dev libgles2-mesa-dev libglew-dev
- Benchmark Location (including LIBERO and CALVIN)

At this point, the environment is fully installed. If you want to confirm that your environment is correct, you can compare it against the our_envs.txt file we released.

The downloaded datasets should be placed in the /data folder. The overall directory structure is as follows:
data
├── libero
│   ├── libero_10_no_noops
│   │   └── 1.0.0 (contains some JSON files and 32 tfrecord files)
│   ├── libero_goal_no_noops
│   │   └── 1.0.0 (contains some JSON files and 16 tfrecord files)
│   ├── libero_object_no_noops
│   │   └── 1.0.0 (contains some JSON files and 32 tfrecord files)
│   └── libero_spatial_no_noops
│       └── 1.0.0 (contains some JSON files and 16 tfrecord files)
├── calvin_abc
│   └── 1.0.0 (contains some JSON files, 512 train tfrecord files, and 32 valid tfrecord files)
└── other benchmarks ...
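To confirm that the datasets ended up where the training scripts expect them, a quick check like this can help (our own convenience snippet, not part of the repo):

# Verify the expected RLDS dataset layout under data/.
from pathlib import Path

expected = [
    "data/libero/libero_spatial_no_noops/1.0.0",
    "data/libero/libero_object_no_noops/1.0.0",
    "data/libero/libero_goal_no_noops/1.0.0",
    "data/libero/libero_10_no_noops/1.0.0",
    "data/calvin_abc/1.0.0",
]
for d in expected:
    print(d, "OK" if Path(d).is_dir() else "MISSING")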
We use the Prismatic-VLMs architecture. Since the file is large, please download it from here and put it in the /pretrained_models folder. The file structure is:

pretrained_models
├── configs
└── prism-qwen25-extra-dinosiglip-224px-0_5b
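One way to fetch it programmatically is via huggingface_hub (a sketch; the repo id comes from the link given in the training notes below):

# Download the Prismatic-VLMs (Qwen2.5-0.5B) backbone into pretrained_models/.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b",
    local_dir="pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b",
)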
We provide different training configurations for different users. You can choose the configuration suitable for training based on your GPU card type.
vla-scripts/finetune.py: VLA fine-tuning script
=> Extremely Limited VRAM (A card with 10GB-12GB) (e.g. NVIDIA GeForce RTX 2080Ti, 3060, 3080, 4070, 4080, and 5070).
About batch_size, lora_rank, grad_accumulation_steps, and max_steps.
If your resources are extremely limited, you can set --batch_size 1 and --lora_rank 64; this requires only 9.6GB of VRAM. Of course, with batch size = 1, gradient updates are strongly affected by extreme values and loss convergence is unstable. In this case, you can adjust the grad_accumulation_steps parameter to simulate a larger batch. For example, --batch_size 1 with --grad_accumulation_steps 8 has a similar effect to --batch_size 8, although training will be slower. Note that you cannot train the OpenVLA-OFT model on a 10GB card, because even with batch size = 1 it requires 25GB of VRAM. Fortunately, you can use VLA-Adapter. Since the batch size is still small, you can increase --max_steps to achieve the performance reported in the paper.
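For intuition, the effective batch size under gradient accumulation is simply the product of the per-step batch size, the number of accumulation steps, and the number of GPUs, as the small illustrative calculation below shows (the variable names mirror the script flags, but the snippet itself is not part of the repo):

# Effective batch size when simulating larger batches with gradient accumulation.
batch_size = 1               # --batch_size
grad_accumulation_steps = 8  # --grad_accumulation_steps
num_gpus = 1                 # processes launched by torchrun (--nproc-per-node)
effective_batch_size = batch_size * grad_accumulation_steps * num_gpus
print(effective_batch_size)  # 8 -> comparable to --batch_size 8 without accumulation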
About vlm_path.

The VLM in VLA-Adapter uses the Prismatic-VLMs architecture, with Qwen2.5-0.5B as the LLM backbone. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in /pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b.
About data_name.

Launch the fine-tuning script with the VLA-Adapter configuration below. It can run in the background, and its progress can be seen in the /logs folder. You can replace libero_spatial_no_noops with libero_object_no_noops, libero_goal_no_noops, or libero_10_no_noops. If you are using the CALVIN benchmark, remove /libero from --data_root_dir and replace libero_spatial_no_noops with calvin_abc.
About use_pro_version.

In addition, we recently released an enhanced Pro version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance; we therefore strongly recommend using the Pro version! The Pro version's policy is 207MB, and training speed is virtually unchanged. The original version uses nearly 1GB less VRAM than the Pro version, requiring only 8.6GB. You can choose whether to use the Pro version with the use_pro_version parameter, i.e., --use_pro_version True selects the Pro version.
current_time=$(date +"%Y%m%d_%H%M%S")  # timestamp used in the run id and log file name
data_name=libero_spatial_no_noops
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--config_file_path pretrained_models/configs \
--data_root_dir data/libero \
--dataset_name $data_name \
--run_root_dir outputs \
--use_film False \
--num_images_in_input 2 \
--use_proprio True \
--use_lora True \
--use_fz False \
--use_minivlm True \
--image_aug True \
--num_steps_before_decay 400000 \
--max_steps 400005 \
--save_freq 5000 \
--save_latest_checkpoint_only False \
--merge_lora_during_training True \
--batch_size 1 \
--grad_accumulation_steps 8 \
--learning_rate 2e-4 \
--lora_rank 64 \
--use_pro_version True \
--wandb_entity "YOUR_WANDB_ENTITY" \
--wandb_project "$data_name" \
--run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \
> logs/VLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &
Please note that the trained models will be stored in the /outputs folder. Each model takes up nearly 3GB of disk space, so reserve enough space. We strongly recommend downloading our trained models from the VLA-Adapter HuggingFace page and placing them in this folder for inference.
=> Low VRAM (A card with 24GB) (e.g. NVIDIA GeForce RTX 3090 and 4090).
About batch_size, lora_rank, grad_accumulation_steps, and max_steps.
If you have such a device, you can increase the batch size and LoRA rank: --batch_size 4 and --lora_rank 64, which takes only about 20GB of VRAM. This rank is consistent with the one in our paper. Note that you cannot train the OpenVLA-OFT model on a 24GB card, because even with batch size = 1 it requires 25GB of VRAM. Fortunately, you can use VLA-Adapter. Since the batch size is still small, you can increase --max_steps to achieve the performance reported in the paper.
About vlm_path.

The VLM in VLA-Adapter uses the Prismatic-VLMs architecture, with Qwen2.5-0.5B as the LLM backbone. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in /pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b.
About data_name.

Launch the fine-tuning script with the VLA-Adapter configuration below. It can run in the background, and its progress can be seen in the /logs folder. You can replace libero_spatial_no_noops with libero_object_no_noops, libero_goal_no_noops, or libero_10_no_noops. If you are using the CALVIN benchmark, remove /libero from --data_root_dir and replace libero_spatial_no_noops with calvin_abc.
About use_pro_version.

In addition, we recently released an enhanced Pro version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance; we therefore strongly recommend using the Pro version! The Pro version's policy is 207MB, and training speed is virtually unchanged. The original version uses nearly 1GB less VRAM than the Pro version (1 batch), requiring only 17.6GB. You can choose whether to use the Pro version with the use_pro_version parameter, i.e., --use_pro_version True selects the Pro version.
current_time=$(date +"%Y%m%d_%H%M%S")  # timestamp used in the run id and log file name
data_name=libero_spatial_no_noops
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--config_file_path pretrained_models/configs \
--data_root_dir data/libero \
--dataset_name $data_name \
--run_root_dir outputs \
--use_film False \
--num_images_in_input 2 \
--use_proprio True \
--use_lora True \
--use_fz False \
--use_minivlm True \
--image_aug True \
--num_steps_before_decay 200000 \
--max_steps 200005 \
--save_freq 5000 \
--save_latest_checkpoint_only False \
--merge_lora_during_training True \
--batch_size 4 \
--grad_accumulation_steps 4 \
--learning_rate 2e-4 \
--lora_rank 64 \
--use_pro_version True \
--wandb_entity "YOUR_WANDB_ENTITY" \
--wandb_project "$data_name" \
--run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \
> logs/VLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &
Please note that the trained models will be stored in the /outputs folder. Each model takes up nearly 3GB of disk space, so reserve enough space. We strongly recommend downloading our trained models from the VLA-Adapter HuggingFace page and placing them in this folder for inference.
=> A Consumer GPU with 32GB (e.g. NVIDIA GeForce RTX 5090)
=> A Professional-Grade GPU with 40GB-48GB (e.g. NVIDIA A100-40GB, A800-40GB, L20, and RTX A6000).
About batch_size, lora_rank, grad_accumulation_steps, and max_steps.
If you have such a device, you can increase the batch size and LoRA rank: --batch_size 8 and --lora_rank 64, which takes only about 29GB of VRAM.
About vlm_path.

The VLM in VLA-Adapter uses the Prismatic-VLMs architecture, with Qwen2.5-0.5B as the LLM backbone. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in /pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b.
About data_name.

Launch the fine-tuning script with the VLA-Adapter configuration below. It can run in the background, and its progress can be seen in the /logs folder. You can replace libero_spatial_no_noops with libero_object_no_noops, libero_goal_no_noops, or libero_10_no_noops. If you are using the CALVIN benchmark, remove /libero from --data_root_dir and replace libero_spatial_no_noops with calvin_abc.
With this configuration, you can achieve the same results as in our paper on the LIBERO-Object benchmark, reaching a 99.2% success rate in just 8 hours. The LIBERO-Spatial benchmark requires approximately 10 hours of training. The LIBERO-Long benchmark takes longer because its tasks are longer and more difficult, requiring more training steps to achieve superior performance.
About use_pro_version.

In addition, we recently released an enhanced Pro version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance; we therefore strongly recommend using the Pro version! The Pro version's policy is 207MB, and training speed is virtually unchanged. The original version uses nearly 1GB less VRAM than the Pro version (1 batch). You can choose whether to use the Pro version with the use_pro_version parameter, i.e., --use_pro_version True selects the Pro version.
current_time=$(date +"%Y%m%d_%H%M%S")  # timestamp used in the run id and log file name
data_name=libero_spatial_no_noops
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--config_file_path pretrained_models/configs \
--data_root_dir data/libero \
--dataset_name $data_name \
--run_root_dir outputs \
--use_film False \
--num_images_in_input 2 \
--use_proprio True \
--use_lora True \
--use_fz False \
--use_minivlm True \
--image_aug True \
--num_steps_before_decay 200000 \
--max_steps 200005 \
--save_freq 5000 \
--save_latest_checkpoint_only False \
--merge_lora_during_training True \
--batch_size 8 \
--grad_accumulation_steps 2 \
--learning_rate 2e-4 \
--lora_rank 64 \
--use_pro_version True \
--wandb_entity "YOUR_WANDB_ENTITY" \
--wandb_project "$data_name" \
--run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \
> logs/VLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &
Please note that the trained models will be stored in the /outputs folder. Each model takes up nearly 3GB of disk space, so reserve enough space. We strongly recommend downloading our trained models from the VLA-Adapter HuggingFace page and placing them in this folder for inference.
=> Professional-Grade GPUs with ≥80GB (e.g. NVIDIA A100-80GB, A800-80GB, H100, H800, H20-NVLink, and GB200).
About batch_size, lora_rank, grad_accumulation_steps, and max_steps.
You can use 1 to 8 GPUs for training by setting CUDA_VISIBLE_DEVICES to the GPU ids and adjusting the number of GPUs after --nproc-per-node. In our paper, we used 4×H100 GPUs for training. With this configuration, training on the four LIBERO suites takes: Spatial only five hours, Object less than one hour, Goal three hours, and Long half a day; the CALVIN benchmark takes eight hours.
About vlm_path.

The VLM in VLA-Adapter uses the Prismatic-VLMs architecture, with Qwen2.5-0.5B as the LLM backbone. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in /pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b.
About data_name.

Launch the fine-tuning script with the VLA-Adapter configuration below. It can run in the background, and its progress can be seen in the /logs folder. You can replace libero_spatial_no_noops with libero_object_no_noops, libero_goal_no_noops, or libero_10_no_noops. If you are using the CALVIN benchmark, remove /libero from --data_root_dir and replace libero_spatial_no_noops with calvin_abc.
About use_pro_version.

In addition, we recently released an enhanced Pro version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance; we therefore strongly recommend using the Pro version! The Pro version's policy is 207MB, and training speed is virtually unchanged. The original version uses nearly 1GB less VRAM than the Pro version (1 batch). You can choose whether to use the Pro version with the use_pro_version parameter, i.e., --use_pro_version True selects the Pro version.
current_time=$(date +"%Y%m%d_%H%M%S")  # timestamp used in the run id and log file name
data_name=libero_spatial_no_noops
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nnodes 1 --nproc-per-node 4 vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--config_file_path pretrained_models/configs \
--data_root_dir data/libero \
--dataset_name $data_name \
--run_root_dir outputs \
--use_film False \
--num_images_in_input 2 \
--use_proprio True \
--use_lora True \
--use_fz False \
--use_minivlm True \
--image_aug True \
--num_steps_before_decay 150000 \
--max_steps 150005 \
--save_freq 5000 \
--save_latest_checkpoint_only False \
--merge_lora_during_training True \
--batch_size 16 \
--grad_accumulation_steps 1 \
--learning_rate 2e-4 \
--lora_rank 64 \
--use_pro_version True \
--wandb_entity "YOUR_WANDB_ENTITY" \
--wandb_project "$data_name" \
--run_id_note VLA-Adapter--spatial--$current_time \
> logs/VLA-Adapter--spatial--$current_time.log 2>&1 &
Please note that the trained models will be stored in the /outputs folder. Each model takes up nearly 3GB of disk space, so reserve enough space. We strongly recommend downloading our trained models from the VLA-Adapter HuggingFace page and placing them in this folder for inference.
- experiments/robot/libero/: LIBERO eval files
  - run_libero_eval.py: LIBERO eval script
  - libero_utils.py: LIBERO eval utils
- experiments/robot/: general eval utils files
  - openvla_utils.py: VLA-specific eval utils
  - robot_utils.py: other eval utils
We fine-tuned Qwen2.5-0.5B with our adapter bridge paradigm on four LIBERO task suites independently: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long.
The four VLA-Adapter checkpoints for LIBERO are available on Hugging Face:
- VLA-Adapter/LIBERO-Spatial
- VLA-Adapter/LIBERO-Object
- VLA-Adapter/LIBERO-Goal
- VLA-Adapter/LIBERO-Long
In addition, we also provide Pro versions, trained with 4×H100 GPUs, --batch_size 16, --lora_rank 64, and --max_steps 100000. The Pro checkpoints are:
- VLA-Adapter/LIBERO-Spatial-Pro (97.8 -> 99.6)
- VLA-Adapter/LIBERO-Object-Pro (99.2 -> 99.6)
- VLA-Adapter/LIBERO-Goal-Pro (97.2 -> 98.2)
- VLA-Adapter/LIBERO-Long-Pro (95.0 -> 96.4)
- VLA-Adapter/CALVIN-ABC-Pro (4.42 -> 4.50)
These files need to be placed in the /outputs folder. If you train your own models, they will also be stored there. The subsequent eval code loads models from this folder for inference.
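If you prefer to fetch a released checkpoint programmatically rather than through the website, a sketch like the following works (uses huggingface_hub; the repo id below is the LIBERO-Spatial Pro checkpoint listed above):

# Download the LIBERO-Spatial Pro checkpoint into the folder the eval commands read from.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="VLA-Adapter/LIBERO-Spatial-Pro",
    local_dir="outputs/LIBERO-Spatial-Pro",  # matches --pretrained_checkpoint below
)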
We strongly recommend using our open-sourced Pro version of the model, which has stronger performance. To start evaluations with one of these checkpoints, run one of the commands below. Each will automatically download the appropriate checkpoint listed above. If you want to use the original version of the model, set --use_pro_version to False and pass the original model to the --pretrained_checkpoint parameter. The inference results will be written to the /eval_logs folder, and the rollout videos will be saved in the /rollouts/vla-adapter folder.
# Launch LIBERO-Spatial-Pro evals (Background running)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
--use_proprio True \
--num_images_in_input 2 \
--use_film False \
--pretrained_checkpoint outputs/LIBERO-Spatial-Pro \
--task_suite_name libero_spatial \
--use_pro_version True \
> eval_logs/Spatial--chkpt.log 2>&1 &
# Launch LIBERO-Object-Pro evals (Background running)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
--use_proprio True \
--num_images_in_input 2 \
--use_film False \
--pretrained_checkpoint outputs/LIBERO-Object-Pro \
--task_suite_name libero_object \
--use_pro_version True \
> eval_logs/Object--chkpt.log 2>&1 &
# Launch LIBERO-Goal-Pro evals (Background running)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
--use_proprio True \
--num_images_in_input 2 \
--use_film False \
--pretrained_checkpoint outputs/LIBERO-Goal-Pro \
--task_suite_name libero_goal \
--use_pro_version True \
> eval_logs/Goal--chkpt.log 2>&1 &
# Launch LIBERO-Long-Pro (LIBERO-10) evals (Background running)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
--use_proprio True \
--num_images_in_input 2 \
--use_film False \
--pretrained_checkpoint outputs/LIBERO-Long-Pro \
--task_suite_name libero_10 \
--use_pro_version True \
> eval_logs/Long--chkpt.log 2>&1 &
# Launch CALVIN ABCโD-Pro evals (Background running)
CUDA_VISIBLE_DEVICES=0 python vla-scripts/evaluate_calvin.py \
--pretrained_checkpoint outputs/CALVIN-ABC-Pro \
> eval_logs/CALVIN--ABC.log 2>&1 &
If you want to measure the inference throughput, you can do so in the run_libero_eval.py file: add start = time.time() and end = time.time() before and after lines 334-345 and compute the difference between the two. This difference is the time it takes to generate 8 chunks, which gives you the inference throughput. We measured it multiple times and the average value was 0.036s.
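As a self-contained illustration of that timing pattern (the helper and the stand-in call below are ours; in practice you would wrap the actual chunk-generation call inside run_libero_eval.py):

import time

def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.time()
    result = fn(*args, **kwargs)
    end = time.time()
    return result, end - start

# Stand-in for the chunk-generation call; replace the lambda with the real call when measuring.
_, elapsed = time_call(lambda: time.sleep(0.036))  # ~0.036s per chunk was our average on an H100
print(f"Chunk generation time: {elapsed:.3f}s")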
All our results were obtained on an H100. You can find the inference log file in the models released on HF. The evaluation script runs 500 trials by default (10 tasks × 50 episodes each) for LIBERO and 1,000 task sequences for CALVIN. Use the same card for training and inference whenever possible; results may vary slightly if you use a GPU other than the H100. This phenomenon is also mentioned in the OpenVLA-OFT README.
In the table below, the best, second-best, and third-best results are highlighted; XX* marks the third-best performance.
| LIBERO | Methods | Scale | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|---|
| Large-scale | FlowVLA (Zhong et al., 2025) | 8.5B | 93.2 | 95.0 | 91.6 | 72.6 | 88.1 |
| | UnifiedVLA (Wang et al., 2025) | 8.5B | 95.4 | 98.8* | 93.6 | 94.0 | 95.5 |
| | OpenVLA (Kim et al., 2024) | 7B | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| | OpenVLA-OFT (Kim et al., 2025) | 7B | 97.6* | 98.4 | 97.9 | 94.5* | 97.1* |
| | UniVLA (Bu et al., 2025) | 7B | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| | CoT-VLA (Zhao et al., 2025) | 7B | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| | WorldVLA (Cen et al., 2025) | 7B | 87.6 | 96.2 | 83.4 | 60.0 | 81.8 |
| | TraceVLA (Zheng et al., 2025) | 7B | 84.6 | 85.2 | 75.1 | 54.1 | 74.8 |
| | MolmoAct (Lee et al., 2025) | 7B | 87.0 | 95.4 | 87.6 | 77.2 | 86.6 |
| | ThinkAct (Huang et al., 2025) | 7B | 88.3 | 91.4 | 87.1 | 70.9 | 84.4 |
| Small-scale | 4D-VLA (Zhang et al., 2025) | 4B | 88.9 | 95.2 | 90.9 | 79.1 | 88.6 |
| | SpatialVLA (Qu et al., 2025) | 4B | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| | π0 (Black et al., 2024) | 3B | 96.8 | 98.8* | 95.8 | 85.2 | 94.2 |
| | π0-FAST (Pertsch et al., 2025) | 3B | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| | NORA (Hung et al., 2025) | 3B | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| | SmolVLA (Shukor et al., 2025) | 2.2B | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 |
| | GR00T N1 (NVIDIA et al., 2025) | 2B | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| Tiny-scale | Seer (Tian et al., 2025) | 0.57B | - | - | - | 78.7 | 78.7 |
| | VLA-OS (Gao et al., 2025) | 0.5B | 87.0 | 96.5 | 92.7 | 66.0 | 85.6 |
| | Diffusion Policy (Chi et al., 2023) | - | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| | VLA-Adapter (Ours) | 0.5B | 97.8 | 99.2 | 97.2* | 95.0 | 97.3 |
| | VLA-Adapter-Pro (Ours) | 0.5B | 99.6 | 99.6 | 98.2 | 96.4 | 98.5 |
In the table below, the best, second-best, and third-best results are highlighted; XX* marks the third-best performance.
| CALVIN | Methods | Scale | 1 | 2 | 3 | 4 | 5 | Avg. len |
|---|---|---|---|---|---|---|---|---|
| Large-scale | UniVLA (Bu et al., 2025) | 7B | 95.5 | 85.8 | 75.4 | 66.9 | 56.5 | 3.80 |
| | OpenVLA (Kim et al., 2024) | 7B | 91.3 | 77.8 | 62.0 | 52.1 | 43.5 | 3.27 |
| | OpenVLA-OFT (Kim et al., 2025) | 7B | 96.3 | 89.1 | 82.4 | 75.8 | 66.5 | 4.10 |
| | VLAS (Zhao et al., 2025b) | 7B | 87.2 | 64.2 | 40.9 | 28.1 | 19.6 | 2.40 |
| | LCB (Shentu et al., 2024) | 7B | 73.6 | 50.2 | 28.5 | 16.0 | 9.9 | 1.78 |
| | RoboDual (Bu et al., 2024a) | 7B | 94.4 | 82.7 | 72.1 | 62.4 | 54.4 | 3.66 |
| | OpenHelix (Cui et al., 2025) | 7B | 97.1* | 91.4 | 82.8 | 72.6 | 64.1 | 4.08 |
| | ReconVLA (Song et al., 2025c) | 7B | 95.6 | 87.6 | 76.9 | 69.3 | 64.1 | 3.95 |
| Small-scale | DeeR (Yue et al., 2024) | 3B | 86.2 | 70.1 | 51.8 | 41.5 | 30.4 | 2.82 |
| | RoboFlamingo (Li et al., 2024b) | 3B | 82.4 | 61.9 | 46.6 | 33.1 | 23.5 | 2.48 |
| | VPP (Hu et al., 2025) | 1.5B | 95.7 | 91.2 | 86.3* | 81.0* | 75.0* | 4.33* |
| | SuSIE (Black et al., 2024) | 1.3B | 87.0 | 69.0 | 49.0 | 38.0 | 26.0 | 2.69 |
| Tiny-scale | Seer-Large (Tian et al., 2025) | 0.57B | 96.3 | 91.6* | 86.1 | 80.3 | 74.0 | 4.28 |
| | MoDE (Reuss et al., 2025) | 0.44B | 96.2 | 88.9 | 81.1 | 71.8 | 63.5 | 4.01 |
| | Seer (Tian et al., 2025) | 0.32B | 94.4 | 87.2 | 79.9 | 72.2 | 64.3 | 3.98 |
| | VLA-Adapter (Ours) | 0.5B | 99.1 | 94.6 | 88.8 | 82.8 | 76.5 | 4.42 |
| | VLA-Adapter-Pro (Ours) | 0.5B | 98.5 | 95.0 | 90.5 | 85.3 | 80.0 | 4.50 |
If you find this paper, our models, or our code helpful, please cite our paper. Thanks for your support of VLA-Adapter!
@article{wang2025vlaadapter,
author={Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
title={VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
journal={arXiv preprint arXiv:2509.09372},
year={2025}
}
We thank OpenVLA-OFT, MiniVLA, and RoboDual for their open-sourced work!