Paper: https://arxiv.org/abs/2509.09372
Project page: https://vla-adapter.github.io/
HuggingFace: https://huggingface.co/VLA-Adapter
GitHub: https://github.com/OpenHelix-Team/VLA-Adapter
- [2025/09/22] We released our code! An enhanced Pro version is also released (it follows the pipeline in the original paper but with an optimized implementation). Everyone is welcome to use it!
- [2025/09/13] Our paper won first place on the HF daily papers list, second place on the weekly list, and third place on the monthly list!
- [2025/09/13] Our paper was listed as a Trending Paper on HF!
- [2025/09/12] We released the original version of the VLA-Adapter for four LIBERO models on HuggingFace.
- [2025/09/11] We released our paper on arXiv.
- Release checkpoints for reproduction.
- Release VLA-Adapter v2 paper.
- A more powerful version, VLA-Adapter++, and a detailed technical report will be released soon.
- Continue to update the code to support various real-world system deployments, including the configuration used in our paper, Franka, UR-5, and AGILE Piper.
- It will soon be compatible with various foundation models, including but not limited to VPP and π0.5.
- We will add diffusion-transformer and flow-matching policy networks in the future, and the results will be reported in the subsequent VLA-Adapter++ technical report.
- We will also provide more experiments with a frozen backbone.
- We will further expand its generalization in the future. Work is in progress, so please stay tuned!
- RL post-training is also in progress. Interested researchers are welcome to join us in building this foundation!
- The dual-system compatibility of VLA-Adapter is under exploration!
- Quick Start
- Data Preparation
- VLM backbone
- Training for Different Configurations => Provides training configurations for GPUs ranging from 10GB to 80GB of VRAM.
- Related Files for Training
- How to Train on Extremely Limited VRAM GPUs => A card with 10GB-12GB (e.g. NVIDIA GeForce RTX 2080Ti, 3060, 3080, 4070, 4080, and 5070)
- How to Train on Low VRAM GPUs => A card with 24GB (e.g. NVIDIA GeForce RTX 3090 and 4090)
- How to Train on Larger VRAM GPUs => A Consumer GPU with 32GB (e.g. NVIDIA GeForce RTX 5090) or a Professional-Grade GPU with 40GB-48GB (e.g. NVIDIA A100-40GB, A800-40GB, L20, and RTX A6000)
- How to Train on Sufficient VRAM GPUs => Professional-Grade GPUs with ≥80GB (e.g. NVIDIA A100-80GB, A800-80GB, H100, H800, H20-NVLink, and GB200)
- Inference
- Success Rate Comparison
- Citation
- Acknowledgment
# Create and activate conda environment
conda create -n vla-adapter python=3.10.16 -y
conda activate vla-adapter
# Install PyTorch
# Use a command specific to your machine: https://pytorch.org/get-started/locally/
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0
# Clone vla-adapter repo and pip install to download dependencies
git clone https://github.com/OpenHelix-Team/VLA-Adapter.git
cd VLA-Adapter
pip install -e .
pip install packaging ninja
ninja --version; echo $? # Verify Ninja --> should return exit code "0"
# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
pip install "flash-attn==2.5.5" --no-build-isolation
# If you run into difficulty, try `pip cache remove flash_attn` first, or visit the
# website to download it. (https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.5)
# You can download the corresponding `.whl` file according to the cuda version of `nvidia-smi`,
# and then run `pip install flash_attn-2.5.5+cuXX...whl` to install it.
# We use the `flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl` file.
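To quickly sanity-check the installation before moving on, you can run a minimal script like the one below (our own convenience check, not part of the released code):

# Minimal environment check: verifies the packages installed above are importable.
import torch

print("torch:", torch.__version__)              # expected 2.2.0
print("cuda available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)  # expected 2.5.5
except ImportError:
    print("flash-attn not installed; see the notes above")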
- (Optional) LIBERO
Clone and install the LIBERO repo and required packages:
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO
pip install -r experiments/robot/libero/libero_requirements.txt # From vla-adapter base dir
To download the LIBERO datasets that we used in our fine-tuning experiments, run the command below. This will download the Spatial, Object, Goal, and Long datasets in RLDS format, i.e., libero_spatial_no_noops, libero_object_no_noops, libero_goal_no_noops, and libero_10_no_noops ("_no_noops" stands for no no-op actions, i.e., training samples with near-zero actions are filtered out). These datasets require ~10GB of disk space in total. If needed, see details on how to download the original non-RLDS datasets here. You can use these to fine-tune Prismatic-VLMs (built on Qwen2.5-0.5B) or other VLMs.
git clone git@hf.co:datasets/openvla/modified_libero_rlds
Attention! The dataset downloaded this way needs the modified_ prefix removed from its folder name(s) so that it matches the paths described in the Benchmark Location section below!
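One way to do this is a small rename script such as the following (a sketch only; it assumes the folder(s) carrying the modified_ prefix sit in the current directory, so adjust root to wherever you cloned):

# Strip the "modified_" prefix from downloaded dataset folder names (illustrative sketch).
from pathlib import Path

root = Path(".")  # directory containing the cloned dataset folder(s)
for p in root.glob("modified_*"):
    if p.is_dir():
        new_name = p.name.removeprefix("modified_")
        p.rename(p.with_name(new_name))
        print(f"renamed {p.name} -> {new_name}")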
When using LIBERO, you may get an error message like AttributeError: 'NoneType' object has no attribute 'eglQueryString'. You can use:
sudo apt-get update
sudo apt-get install libgl1-mesa-dev libegl1-mesa-dev libgles2-mesa-dev libglew-dev
- (Optional) CALVIN
git clone --recurse-submodules https://github.com/mees/calvin.git
export CALVIN_ROOT=$(pwd)/calvin
cd $CALVIN_ROOT
# Installation of `pyhash` may fail on some machines. If it fails, you can solve it by lowering the `setuptools` version: `pip install setuptools==57.5.0`
sh install.sh
To download the CALVIN ABC→D datasets that we used in our fine-tuning experiments, run the command below.
cd $CALVIN_ROOT/dataset
sh download_data.sh ABC
If you want to download the RLDS format, you can visit here to download it. This dataset requires ~50GB of disk space.
When using CALVIN, you may get an error message like AttributeError: 'NoneType' object has no attribute 'eglQueryString'. You can use:
sudo apt-get update
sudo apt-get install libgl1-mesa-dev libegl1-mesa-dev libgles2-mesa-dev libglew-dev
- Benchmark Location (including LIBERO and CALVIN)

At this point, the environment is fully installed. If you want to confirm that your environment is correct, you can compare it against the our_envs.txt file we released.

The downloaded datasets should be placed in the /data folder. The overall directory structure is as follows:
data
├── libero
│   ├── libero_10_no_noops
│   │   └── 1.0.0 (contains some JSON files and 32 tfrecord files)
│   ├── libero_goal_no_noops
│   │   └── 1.0.0 (contains some JSON files and 16 tfrecord files)
│   ├── libero_object_no_noops
│   │   └── 1.0.0 (contains some JSON files and 32 tfrecord files)
│   └── libero_spatial_no_noops
│       └── 1.0.0 (contains some JSON files and 16 tfrecord files)
├── calvin_abc
│   └── 1.0.0 (contains some JSON files, 512 train tfrecord files, and 32 valid tfrecord files)
└── other benchmarks ...
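To confirm that the datasets ended up where the training scripts expect them, a quick check like this can help (our own convenience snippet, not part of the repo):

# Verify the expected RLDS dataset layout under data/.
from pathlib import Path

expected = [
    "data/libero/libero_spatial_no_noops/1.0.0",
    "data/libero/libero_object_no_noops/1.0.0",
    "data/libero/libero_goal_no_noops/1.0.0",
    "data/libero/libero_10_no_noops/1.0.0",
    "data/calvin_abc/1.0.0",
]
for d in expected:
    print(d, "OK" if Path(d).is_dir() else "MISSING")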
We use the Prismatic-VLMs architecture. Since the file is large, please download it from here and put it in the /pretrained_models folder. The file structure is:

pretrained_models
├── configs
└── prism-qwen25-extra-dinosiglip-224px-0_5b
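One way to fetch it programmatically is via huggingface_hub (a sketch; the repo id comes from the link given in the training notes below):

# Download the Prismatic-VLMs (Qwen2.5-0.5B) backbone into pretrained_models/.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b",
    local_dir="pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b",
)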
We provide different training configurations for different users. You can choose the configuration suitable for training based on your GPU card type.
vla-scripts/finetune.py: VLA fine-tuning script
=> Extremely Limited VRAM (A card with 10GB-12GB) (e.g. NVIDIA GeForce RTX 2080Ti, 3060, 3080, 4070, 4080, and 5070).
About batch_size, lora_rank, grad_accumulation_steps, and max_steps.
If your resources are extremely limited, you can set --batch_size 1 and --lora_rank 64; this requires only 9.6GB of VRAM. Of course, with batch size = 1, gradient updates are strongly affected by extreme values and loss convergence is unstable. In this case, you can adjust the grad_accumulation_steps parameter to simulate a larger batch. For example, --batch_size 1 with --grad_accumulation_steps 8 has a similar effect to --batch_size 8, although training will be slower. Note that you cannot train the OpenVLA-OFT model on a 10GB card, because even with batch size = 1 it requires 25GB of VRAM. Fortunately, you can use VLA-Adapter. Since the batch size is still small, you can increase --max_steps to achieve the performance reported in the paper.
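For intuition, the effective batch size under gradient accumulation is simply the product of the per-step batch size, the number of accumulation steps, and the number of GPUs, as the small illustrative calculation below shows (the variable names mirror the script flags, but the snippet itself is not part of the repo):

# Effective batch size when simulating larger batches with gradient accumulation.
batch_size = 1               # --batch_size
grad_accumulation_steps = 8  # --grad_accumulation_steps
num_gpus = 1                 # processes launched by torchrun (--nproc-per-node)
effective_batch_size = batch_size * grad_accumulation_steps * num_gpus
print(effective_batch_size)  # 8 -> comparable to --batch_size 8 without accumulation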
About vlm_path.

The VLM in VLA-Adapter uses the Prismatic-VLMs architecture, with Qwen2.5-0.5B as the LLM backbone. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in /pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b.
About data_name.

Launch the fine-tuning script with the VLA-Adapter configuration below. It can run in the background, and its progress can be seen in the /logs folder. You can replace libero_spatial_no_noops with libero_object_no_noops, libero_goal_no_noops, or libero_10_no_noops. If you are using the CALVIN benchmark, remove /libero from --data_root_dir and replace libero_spatial_no_noops with calvin_abc.
About use_pro_version.

In addition, we recently released an enhanced Pro version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance; we therefore strongly recommend using the Pro version! The Pro version's policy is 207MB, and training speed is virtually unchanged. The original version uses nearly 1GB less VRAM than the Pro version, requiring only 8.6GB. You can choose whether to use the Pro version with the use_pro_version parameter, i.e., --use_pro_version True selects the Pro version.
current_time=$(date +"%Y%m%d_%H%M%S")  # timestamp used in the run id and log file name
data_name=libero_spatial_no_noops
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--config_file_path pretrained_models/configs \
--data_root_dir data/libero \
--dataset_name $data_name \
--run_root_dir outputs \
--use_film False \
--num_images_in_input 2 \
--use_proprio True \
--use_lora True \
--use_fz False \
--use_minivlm True \
--image_aug True \
--num_steps_before_decay 400000 \
--max_steps 400005 \
--save_freq 5000 \
--save_latest_checkpoint_only False \
--merge_lora_during_training True \
--batch_size 1 \
--grad_accumulation_steps 8 \
--learning_rate 2e-4 \
--lora_rank 64 \
--use_pro_version True \
--wandb_entity "YOUR_WANDB_ENTITY" \
--wandb_project "$data_name" \
--run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \
> logs/VLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &
Please note that the trained models will be stored in the /outputs folder. Each model takes up nearly 3GB of disk space, so reserve enough space. We strongly recommend downloading our trained models from the VLA-Adapter HuggingFace page and placing them in this folder for inference.
=> Low VRAM (A card with 24GB) (e.g. NVIDIA GeForce RTX 3090 and 4090).
About batch_size, lora_rank, grad_accumulation_steps, and max_steps.
If you have such a device, you can increase the batch size and LoRA rank: --batch_size 4 and --lora_rank 64, which takes only about 20GB of VRAM. This rank is consistent with the one in our paper. Note that you cannot train the OpenVLA-OFT model on a 24GB card, because even with batch size = 1 it requires 25GB of VRAM. Fortunately, you can use VLA-Adapter. Since the batch size is still small, you can increase --max_steps to achieve the performance reported in the paper.
About vlm_path.

The VLM in VLA-Adapter uses the Prismatic-VLMs architecture, with Qwen2.5-0.5B as the LLM backbone. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in /pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b.
About data_name.

Launch the fine-tuning script with the VLA-Adapter configuration below. It can run in the background, and its progress can be seen in the /logs folder. You can replace libero_spatial_no_noops with libero_object_no_noops, libero_goal_no_noops, or libero_10_no_noops. If you are using the CALVIN benchmark, remove /libero from --data_root_dir and replace libero_spatial_no_noops with calvin_abc.
About use_pro_version.

In addition, we recently released an enhanced Pro version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance; we therefore strongly recommend using the Pro version! The Pro version's policy is 207MB, and training speed is virtually unchanged. The original version uses nearly 1GB less VRAM than the Pro version (1 batch), requiring only 17.6GB. You can choose whether to use the Pro version with the use_pro_version parameter, i.e., --use_pro_version True selects the Pro version.
current_time=$(date +"%Y%m%d_%H%M%S")  # timestamp used in the run id and log file name
data_name=libero_spatial_no_noops
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--config_file_path pretrained_models/configs \
--data_root_dir data/libero \
--dataset_name $data_name \
--run_root_dir outputs \
--use_film False \
--num_images_in_input 2 \
--use_proprio True \
--use_lora True \
--use_fz False \
--use_minivlm True \
--image_aug True \
--num_steps_before_decay 200000 \
--max_steps 200005 \
--save_freq 5000 \
--save_latest_checkpoint_only False \
--merge_lora_during_training True \
--batch_size 4 \
--grad_accumulation_steps 4 \
--learning_rate 2e-4 \
--lora_rank 64 \
--use_pro_version True \
--wandb_entity "YOUR_WANDB_ENTITY" \
--wandb_project "$data_name" \
--run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \
> logs/VLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &
Please note that the trained models will be stored in the /outputs folder. Each model takes up nearly 3GB of disk space, so reserve enough space. We strongly recommend downloading our trained models from the VLA-Adapter HuggingFace page and placing them in this folder for inference.
=> A Consumer GPU with 32GB (e.g. NVIDIA GeForce RTX 5090)
=> A Professional-Grade GPU with 40GB-48GB (e.g. NVIDIA A100-40GB, A800-40GB, L20, and RTX A6000).
About batch_size, lora_rank, grad_accumulation_steps, and max_steps.
If you have such a device, you can increase the batch size and LoRA rank: --batch_size 8 and --lora_rank 64, which takes only about 29GB of VRAM.
About vlm_path.

The VLM in VLA-Adapter uses the Prismatic-VLMs architecture, with Qwen2.5-0.5B as the LLM backbone. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in /pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b.
About data_name.

Launch the fine-tuning script with the VLA-Adapter configuration below. It can run in the background, and its progress can be seen in the /logs folder. You can replace libero_spatial_no_noops with libero_object_no_noops, libero_goal_no_noops, or libero_10_no_noops. If you are using the CALVIN benchmark, remove /libero from --data_root_dir and replace libero_spatial_no_noops with calvin_abc.
With this configuration, you can achieve the same results as in our paper on the LIBERO-Object benchmark, reaching a 99.2% success rate in just 8 hours. The LIBERO-Spatial benchmark requires approximately 10 hours of training. The LIBERO-Long benchmark takes longer because its tasks are longer and more difficult, requiring more training steps to achieve superior performance.
About use_pro_version.

In addition, we recently released an enhanced Pro version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance; we therefore strongly recommend using the Pro version! The Pro version's policy is 207MB, and training speed is virtually unchanged. The original version uses nearly 1GB less VRAM than the Pro version (1 batch). You can choose whether to use the Pro version with the use_pro_version parameter, i.e., --use_pro_version True selects the Pro version.
current_time=$(date +"%Y%m%d_%H%M%S")  # timestamp used in the run id and log file name
data_name=libero_spatial_no_noops
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--config_file_path pretrained_models/configs \
--data_root_dir data/libero \
--dataset_name $data_name \
--run_root_dir outputs \
--use_film False \
--num_images_in_input 2 \
--use_proprio True \
--use_lora True \
--use_fz False \
--use_minivlm True \
--image_aug True \
--num_steps_before_decay 200000 \
--max_steps 200005 \
--save_freq 5000 \
--save_latest_checkpoint_only False \
--merge_lora_during_training True \
--batch_size 8 \
--grad_accumulation_steps 2 \
--learning_rate 2e-4 \
--lora_rank 64 \
--use_pro_version True \
--wandb_entity "YOUR_WANDB_ENTITY" \
--wandb_project "$data_name" \
--run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \
> logs/VLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &
Please note that the trained models will be stored in the /outputs folder. Each model takes up nearly 3GB of disk space, so reserve enough space. We strongly recommend downloading our trained models from the VLA-Adapter HuggingFace page and placing them in this folder for inference.
=> Professional-Grade GPUs with ≥80GB (e.g. NVIDIA A100-80GB, A800-80GB, H100, H800, H20-NVLink, and GB200).
About batch_size, lora_rank, grad_accumulation_steps, and max_steps.
You can use 1 to 8 GPUs for training by setting CUDA_VISIBLE_DEVICES to the GPU ids and adjusting the number of GPUs after --nproc-per-node. In our paper, we used 4×H100 GPUs for training. With this configuration, training on the four LIBERO suites takes: Spatial only five hours, Object less than one hour, Goal three hours, and Long half a day; the CALVIN benchmark takes eight hours.
About vlm_path.

The VLM in VLA-Adapter uses the Prismatic-VLMs architecture, with Qwen2.5-0.5B as the LLM backbone. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in /pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b.
About data_name.

Launch the fine-tuning script with the VLA-Adapter configuration below. It can run in the background, and its progress can be seen in the /logs folder. You can replace libero_spatial_no_noops with libero_object_no_noops, libero_goal_no_noops, or libero_10_no_noops. If you are using the CALVIN benchmark, remove /libero from --data_root_dir and replace libero_spatial_no_noops with calvin_abc.
About use_pro_version.

In addition, we recently released an enhanced Pro version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance; we therefore strongly recommend using the Pro version! The Pro version's policy is 207MB, and training speed is virtually unchanged. The original version uses nearly 1GB less VRAM than the Pro version (1 batch). You can choose whether to use the Pro version with the use_pro_version parameter, i.e., --use_pro_version True selects the Pro version.
current_time=$(date +"%Y%m%d_%H%M%S")  # timestamp used in the run id and log file name
data_name=libero_spatial_no_noops
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nnodes 1 --nproc-per-node 4 vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--config_file_path pretrained_models/configs \
--data_root_dir data/libero \
--dataset_name $data_name \
--run_root_dir outputs \
--use_film False \
--num_images_in_input 2 \
--use_proprio True \
--use_lora True \
--use_fz False \
--use_minivlm True \
--image_aug True \
--num_steps_before_decay 150000 \
--max_steps 150005 \
--save_freq 5000 \
--save_latest_checkpoint_only False \
--merge_lora_during_training True \
--batch_size 16 \
--grad_accumulation_steps 1 \
--learning_rate 2e-4 \
--lora_rank 64 \
--use_pro_version True \
--wandb_entity "YOUR_WANDB_ENTITY" \
--wandb_project "$data_name" \
--run_id_note VLA-Adapter--spatial--$current_time \
> logs/VLA-Adapter--spatial--$current_time.log 2>&1 &
Please note that the trained models will be stored in the /outputs folder. Each model takes up nearly 3GB of disk space, so reserve enough space. We strongly recommend downloading our trained models from the VLA-Adapter HuggingFace page and placing them in this folder for inference.
- experiments/robot/libero/: LIBERO eval files
  - run_libero_eval.py: LIBERO eval script
  - libero_utils.py: LIBERO eval utils
- experiments/robot/: general eval utils files
  - openvla_utils.py: VLA-specific eval utils
  - robot_utils.py: other eval utils
We fine-tuned Qwen2.5-0.5B with our adapter bridge paradigm on four LIBERO task suites independently: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long.
The four VLA-Adapter checkpoints for LIBERO are available on Hugging Face:
- VLA-Adapter/LIBERO-Spatial
- VLA-Adapter/LIBERO-Object
- VLA-Adapter/LIBERO-Goal
- VLA-Adapter/LIBERO-Long
In addition, we also provide Pro versions, trained with 4×H100 GPUs, --batch_size 16, --lora_rank 64, and --max_steps 100000. The Pro checkpoints are:
- VLA-Adapter/LIBERO-Spatial-Pro (97.8 -> 99.6)
- VLA-Adapter/LIBERO-Object-Pro (99.2 -> 99.6)
- VLA-Adapter/LIBERO-Goal-Pro (97.2 -> 98.2)
- VLA-Adapter/LIBERO-Long-Pro (95.0 -> 96.4)
- VLA-Adapter/CALVIN-ABC-Pro (4.42 -> 4.50)
These files need to be placed in the /outputs folder. If you train your own models, they will also be stored there. The subsequent eval code loads models from this folder for inference.
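If you prefer to fetch a released checkpoint programmatically rather than through the website, a sketch like the following works (uses huggingface_hub; the repo id below is the LIBERO-Spatial Pro checkpoint listed above):

# Download the LIBERO-Spatial Pro checkpoint into the folder the eval commands read from.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="VLA-Adapter/LIBERO-Spatial-Pro",
    local_dir="outputs/LIBERO-Spatial-Pro",  # matches --pretrained_checkpoint below
)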
We strongly recommend using our open-sourced Pro version of the model, which has stronger performance. To start evaluations with one of these checkpoints, run one of the commands below. Each will automatically download the appropriate checkpoint listed above. If you want to use the original version of the model, set --use_pro_version to False and pass the original model to the --pretrained_checkpoint parameter. The inference results will be written to the /eval_logs folder, and the rollout videos will be saved in the /rollouts/vla-adapter folder.
# Launch LIBERO-Spatial-Pro evals (Background running)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
--use_proprio True \
--num_images_in_input 2 \
--use_film False \
--pretrained_checkpoint outputs/LIBERO-Spatial-Pro \
--task_suite_name libero_spatial \
--use_pro_version True \
> eval_logs/Spatial--chkpt.log 2>&1 &
# Launch LIBERO-Object-Pro evals (Background running)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
--use_proprio True \
--num_images_in_input 2 \
--use_film False \
--pretrained_checkpoint outputs/LIBERO-Object-Pro \
--task_suite_name libero_object \
--use_pro_version True \
> eval_logs/Object--chkpt.log 2>&1 &
# Launch LIBERO-Goal-Pro evals (Background running)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
--use_proprio True \
--num_images_in_input 2 \
--use_film False \
--pretrained_checkpoint outputs/LIBERO-Goal-Pro \
--task_suite_name libero_goal \
--use_pro_version True \
> eval_logs/Goal--chkpt.log 2>&1 &
# Launch LIBERO-Long-Pro (LIBERO-10) evals (Background running)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
--use_proprio True \
--num_images_in_input 2 \
--use_film False \
--pretrained_checkpoint outputs/LIBERO-Long-Pro \
--task_suite_name libero_10 \
--use_pro_version True \
> eval_logs/Long--chkpt.log 2>&1 &
# Launch CALVIN ABCโD-Pro evals (Background running)
CUDA_VISIBLE_DEVICES=0 python vla-scripts/evaluate_calvin.py \
--pretrained_checkpoint outputs/CALVIN-ABC-Pro \
> eval_logs/CALVIN--ABC.log 2>&1 &
If you want to measure the inference throughput, you can do so in the run_libero_eval.py file: add start = time.time() and end = time.time() before and after lines 334-345 and compute the difference between the two. This difference is the time it takes to generate 8 chunks, which gives you the inference throughput. We measured it multiple times and the average value was 0.036s.
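As a self-contained illustration of that timing pattern (the helper and the stand-in call below are ours; in practice you would wrap the actual chunk-generation call inside run_libero_eval.py):

import time

def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.time()
    result = fn(*args, **kwargs)
    end = time.time()
    return result, end - start

# Stand-in for the chunk-generation call; replace the lambda with the real call when measuring.
_, elapsed = time_call(lambda: time.sleep(0.036))  # ~0.036s per chunk was our average on an H100
print(f"Chunk generation time: {elapsed:.3f}s")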
All our results were obtained on an H100. You can find the inference log file in the models released on HF. The evaluation script runs 500 trials by default (10 tasks × 50 episodes each) for LIBERO and 1,000 task sequences for CALVIN. Use the same card for training and inference whenever possible; results may vary slightly if you use a GPU other than the H100. This phenomenon is also mentioned in the OpenVLA-OFT README.
In the table below, the best, second-best, and third-best results are highlighted; XX* marks the third-best performance.
| LIBERO | Methods | Scale | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|---|
| Large-scale | FlowVLA (Zhong et al., 2025) | 8.5B | 93.2 | 95.0 | 91.6 | 72.6 | 88.1 |
| | UnifiedVLA (Wang et al., 2025) | 8.5B | 95.4 | 98.8* | 93.6 | 94.0 | 95.5 |
| | OpenVLA (Kim et al., 2024) | 7B | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| | OpenVLA-OFT (Kim et al., 2025) | 7B | 97.6* | 98.4 | 97.9 | 94.5* | 97.1* |
| | UniVLA (Bu et al., 2025) | 7B | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| | CoT-VLA (Zhao et al., 2025) | 7B | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| | WorldVLA (Cen et al., 2025) | 7B | 87.6 | 96.2 | 83.4 | 60.0 | 81.8 |
| | TraceVLA (Zheng et al., 2025) | 7B | 84.6 | 85.2 | 75.1 | 54.1 | 74.8 |
| | MolmoAct (Lee et al., 2025) | 7B | 87.0 | 95.4 | 87.6 | 77.2 | 86.6 |
| | ThinkAct (Huang et al., 2025) | 7B | 88.3 | 91.4 | 87.1 | 70.9 | 84.4 |
| Small-scale | 4D-VLA (Zhang et al., 2025) | 4B | 88.9 | 95.2 | 90.9 | 79.1 | 88.6 |
| | SpatialVLA (Qu et al., 2025) | 4B | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| | π0 (Black et al., 2024) | 3B | 96.8 | 98.8* | 95.8 | 85.2 | 94.2 |
| | π0-FAST (Pertsch et al., 2025) | 3B | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| | NORA (Hung et al., 2025) | 3B | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| | SmolVLA (Shukor et al., 2025) | 2.2B | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 |
| | GR00T N1 (NVIDIA et al., 2025) | 2B | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| Tiny-scale | Seer (Tian et al., 2025) | 0.57B | - | - | - | 78.7 | 78.7 |
| | VLA-OS (Gao et al., 2025) | 0.5B | 87.0 | 96.5 | 92.7 | 66.0 | 85.6 |
| | Diffusion Policy (Chi et al., 2023) | - | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| | VLA-Adapter (Ours) | 0.5B | 97.8 | 99.2 | 97.2* | 95.0 | 97.3 |
| | VLA-Adapter-Pro (Ours) | 0.5B | 99.6 | 99.6 | 98.2 | 96.4 | 98.5 |
In the table below, the best, second-best, and third-best results are highlighted; XX* marks the third-best performance.
| CALVIN | Methods | Scale | 1 | 2 | 3 | 4 | 5 | Avg. len |
|---|---|---|---|---|---|---|---|---|
| Large-scale | UniVLA (Bu et al., 2025) | 7B | 95.5 | 85.8 | 75.4 | 66.9 | 56.5 | 3.80 |
| | OpenVLA (Kim et al., 2024) | 7B | 91.3 | 77.8 | 62.0 | 52.1 | 43.5 | 3.27 |
| | OpenVLA-OFT (Kim et al., 2025) | 7B | 96.3 | 89.1 | 82.4 | 75.8 | 66.5 | 4.10 |
| | VLAS (Zhao et al., 2025b) | 7B | 87.2 | 64.2 | 40.9 | 28.1 | 19.6 | 2.40 |
| | LCB (Shentu et al., 2024) | 7B | 73.6 | 50.2 | 28.5 | 16.0 | 9.9 | 1.78 |
| | RoboDual (Bu et al., 2024a) | 7B | 94.4 | 82.7 | 72.1 | 62.4 | 54.4 | 3.66 |
| | OpenHelix (Cui et al., 2025) | 7B | 97.1* | 91.4 | 82.8 | 72.6 | 64.1 | 4.08 |
| | ReconVLA (Song et al., 2025c) | 7B | 95.6 | 87.6 | 76.9 | 69.3 | 64.1 | 3.95 |
| Small-scale | DeeR (Yue et al., 2024) | 3B | 86.2 | 70.1 | 51.8 | 41.5 | 30.4 | 2.82 |
| | RoboFlamingo (Li et al., 2024b) | 3B | 82.4 | 61.9 | 46.6 | 33.1 | 23.5 | 2.48 |
| | VPP (Hu et al., 2025) | 1.5B | 95.7 | 91.2 | 86.3* | 81.0* | 75.0* | 4.33* |
| | SuSIE (Black et al., 2024) | 1.3B | 87.0 | 69.0 | 49.0 | 38.0 | 26.0 | 2.69 |
| Tiny-scale | Seer-Large (Tian et al., 2025) | 0.57B | 96.3 | 91.6* | 86.1 | 80.3 | 74.0 | 4.28 |
| | MoDE (Reuss et al., 2025) | 0.44B | 96.2 | 88.9 | 81.1 | 71.8 | 63.5 | 4.01 |
| | Seer (Tian et al., 2025) | 0.32B | 94.4 | 87.2 | 79.9 | 72.2 | 64.3 | 3.98 |
| | VLA-Adapter (Ours) | 0.5B | 99.1 | 94.6 | 88.8 | 82.8 | 76.5 | 4.42 |
| | VLA-Adapter-Pro (Ours) | 0.5B | 98.5 | 95.0 | 90.5 | 85.3 | 80.0 | 4.50 |
If you find this paper, our models, or our code helpful, please cite our paper. Thanks for your support of VLA-Adapter!
@article{wang2025vlaadapter,
author={Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
title={VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
journal={arXiv preprint arXiv:2509.09372},
year={2025}
}
We thank OpenVLA-OFT, MiniVLA, and RoboDual for their open-sourced work!