VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model


The official implementation of VLA-Adapter.


📝 Paper: https://arxiv.org/abs/2509.09372
🌐 Project page: https://vla-adapter.github.io/
🤗 HuggingFace: https://huggingface.co/VLA-Adapter
GitHub: https://github.com/OpenHelix-Team/VLA-Adapter


📢 News!

  • [2025/09/22] We released our code! An enhanced Pro version is also available (it follows the pipeline of the original paper but is optimized in implementation). Everyone is welcome to use it! 🎉
  • [2025/09/13] Our paper ranked 🥇 first on the daily list, 🥈 second on the weekly list, and 🥉 third on the monthly list on Hugging Face! ⭐
  • [2025/09/13] Our paper was listed as a Trending Paper on Hugging Face! ⭐
  • [2025/09/12] We released the original VLA-Adapter checkpoints for the four LIBERO suites on Hugging Face.
  • [2025/09/11] We released our paper on arXiv.

✒️ TODO List

  • Release checkpoints for reproduction.
  • Release the VLA-Adapter v2 paper.
  • A more powerful version, VLA-Adapter++, and a detailed technical report 📝 will be released soon.
  • Continue updating the code to support deployment on various real-world systems, including the configuration in our paper, Franka, UR-5, and AGILE Piper.
  • Compatibility with various foundation models, including but not limited to VPP and π0.5, is coming soon.
  • We will add diffusion-transformer and flow-matching policy networks in the future; their results will be reported in the subsequent VLA-Adapter++ technical report.
  • We will also provide more experiments with a frozen backbone.
  • We will further expand its generalization. Work is in progress, so please stay tuned!
  • RL post-training is also in progress. Interested researchers are welcome to join us in building this foundation!
  • The dual-system compatibility of VLA-Adapter is under exploration!



🚀 Quick Start

Conda Environment of VLA-Adapter

# Create and activate conda environment
conda create -n vla-adapter python=3.10.16 -y
conda activate vla-adapter

Install Dependencies

# Install PyTorch
# Use a command specific to your machine: https://pytorch.org/get-started/locally/
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0

# Clone vla-adapter repo and pip install to download dependencies
git clone https://github.com/OpenHelix-Team/VLA-Adapter.git
cd VLA-Adapter
pip install -e .

pip install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"

# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
pip install "flash-attn==2.5.5" --no-build-isolation
# If you run into difficulty, try `pip cache remove flash_attn` first, or visit the
# website to download it. (https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.5)
# You can download the corresponding `.whl` file according to the cuda version of `nvidia-smi`,
# and then run `pip install flash_attn-2.5.5+cuXX...whl` to install it. 
# We use the `flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl` file.
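
After installation, you can quickly check that PyTorch and Flash Attention 2 import correctly and that CUDA is visible (a minimal sanity check; the exact version strings depend on the wheels you installed):

# Optional sanity check for torch + flash-attn
python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available(), flash_attn.__version__)"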


📁 Data Preparation

LIBERO Benchmark

  • (Optional) Clone and install the LIBERO repo and required packages:

git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO
pip install -r experiments/robot/libero/libero_requirements.txt  # From vla-adapter base dir

To download the LIBERO datasets that we used in our fine-tuning experiments, run the command below. This will download the Spatial, Object, Goal, and Long datasets in RLDS format, i.e., libero_spatial_no_noops, libero_object_no_noops, libero_goal_no_noops, and libero_10_no_noops ("_no_noops" stands for "no no-op actions", i.e., training samples with near-zero actions are filtered out). These datasets require ~10 GB of disk space in total. If needed, see details on how to download the original non-RLDS datasets here. You can use these to fine-tune Prismatic-VLMs (built on Qwen2.5-0.5B) or other VLMs.

git clone git@hf.co:datasets/openvla/modified_libero_rlds

🌟 Attention! The dataset directories downloaded this way need the modified_ prefix removed so that they match the paths in 📌 Benchmark Location below (see the example that follows).
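
For example, one way to move the data into place (a sketch, assuming the clone above produced a modified_libero_rlds/ directory containing the four *_no_noops datasets; adjust the paths to whatever the clone actually contains):

mkdir -p data/libero
mv modified_libero_rlds/* data/libero/
# If a directory name still carries the `modified_` prefix, rename it, e.g.:
# mv data/libero/modified_libero_spatial_no_noops data/libero/libero_spatial_no_noops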

When using LIBERO, you may get an error message like AttributeError: 'NoneType' object has no attribute 'eglQueryString'. You can use:

sudo apt-get update
sudo apt-get install libgl1-mesa-dev libegl1-mesa-dev libgles2-mesa-dev libglew-dev

CALVIN Benchmark

  • (Optional) Clone the CALVIN repo and run its install script:
git clone --recurse-submodules https://github.com/mees/calvin.git
export CALVIN_ROOT=$(pwd)/calvin
cd $CALVIN_ROOT

# Installation of `pyhash` may fail on some machines. If it fails, you can solve it by lowering the `setuptools` version: `pip install setuptools==57.5.0`
sh install.sh

To download the CALVIN ABC→D datasets that we used in our fine-tuning experiments, run the command below.

cd $CALVIN_ROOT/dataset
sh download_data.sh ABC

If you want to download the RLDS format, you can visit here to download it. This dataset requires ~50 GB of disk space.

When using CALVIN, you may get an error message like AttributeError: 'NoneType' object has no attribute 'eglQueryString'. You can use:

sudo apt-get update
sudo apt-get install libgl1-mesa-dev libegl1-mesa-dev libgles2-mesa-dev libglew-dev

🎮 Our Dependencies

At this point, the environment (including LIBERO and CALVIN) is fully installed. If you want to confirm that your environment is correct, you can compare it against the our_envs.txt file we released, for example as sketched below.
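
A rough comparison (a sketch; our_envs.txt may have been exported in a different format, such as conda list output, so adjust accordingly):

pip list --format=freeze > my_env.txt
diff my_env.txt our_envs.txt  # small version differences are usually fine; large gaps are worth checking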

📌 Benchmark Location

The downloaded dataset can be placed in the /data folder. The overall directory structure is as follows:

.
├── data
│   ├── libero
│   │   ├── libero_10_no_noops
│   │   │   └── 1.0.0  (contains some json files and 32 tfrecord files)
│   │   ├── libero_goal_no_noops
│   │   │   └── 1.0.0  (contains some json files and 16 tfrecord files)
│   │   ├── libero_object_no_noops
│   │   │   └── 1.0.0  (contains some json files and 32 tfrecord files)
│   │   └── libero_spatial_no_noops
│   │       └── 1.0.0  (contains some json files and 16 tfrecord files)
│   ├── calvin_abc
│   │   └── 1.0.0  (contains some json files, 512 train tfrecord files, and 32 valid tfrecord files)
│   └── other benchmarks ...


⚓ VLM backbone

We use the Prismatic-VLMs architecture. Since the file is large, please download it from here. Then put it in the /pretrained_models folder. The file structure is:

.
├── pretrained_models
│   ├── configs
│   └── prism-qwen25-extra-dinosiglip-224px-0_5b
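
If you prefer the command line, the backbone can also be fetched with the Hugging Face CLI (a sketch; it assumes the huggingface_hub CLI is available in your environment):

# Download the Prismatic-VLMs backbone (Qwen2.5-0.5B LLM) into pretrained_models/
huggingface-cli download Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b \
  --local-dir pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b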


🔥 Training for Different Configurations

We provide different training configurations for different users. You can choose the configuration suitable for training based on your GPU card type.

📚 Related File for Training

  • vla-scripts/finetune.py: VLA fine-tuning script

📒 How to Train on Extremely Limited VRAM GPUs

=> Extremely Limited VRAM (A card with 10GB-12GB) (e.g. NVIDIA GeForce RTX 2080Ti, 3060, 3080, 4070, 4080, and 5070).

About batch_size, lora_rank, grad_accumulation_steps, and max_steps.

If your resources are extremely limited, you can set --batch_size 1 and --lora_rank 64, which requires only 9.6GB of VRAM. Of course, with batch size = 1, gradient updates are strongly affected by outliers and loss convergence is unstable. In this case, you can use the grad_accumulation_steps parameter to simulate a larger batch: for example, --batch_size 1 with --grad_accumulation_steps 8 has an effect similar to --batch_size 8, at the cost of slower training. Note that you cannot use the OpenVLA-OFT model on a 10GB card, because even with batch size = 1 it requires 25GB of VRAM; fortunately, you can use VLA-Adapter. Since the batch size is still small, you can increase --max_steps to reach the performance reported in the paper.

About vlm_path.

The VLM in the VLA-Adapter uses the Prismatic-VLMs architecture, with the LLM backbone being Qwen2.5-0.5B. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in /pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b.

About data_name.

Launch the fine-tuning script with the vla-adapter configuration below. It can run in the background, and the running progress can be seen in the /logs folder. You can replace libero_spatial_no_noops with libero_object_no_noops, libero_goal_no_noops, or libero_10_no_noops. If you are using the CALVIN benchmark, remove /libero from --data_root_dir (i.e., use data instead of data/libero) and replace libero_spatial_no_noops with calvin_abc, as sketched below.
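
For example, for CALVIN ABC→D only these values change (a sketch; all other flags in the command below stay the same):

data_name=calvin_abc          # instead of libero_spatial_no_noops
# and pass --data_root_dir data (instead of data/libero) to finetune.py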

About use_pro_version.

In addition, we recently released an enhanced Pro version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance. Therefore, we strongly recommend using the Pro version! The Pro version's policy size is 207MB, and training speed is virtually unchanged. The original version uses roughly 1GB less VRAM than the Pro version, requiring only 8.6GB. You can choose whether to use the Pro version via the use_pro_version parameter, i.e., the Pro version is --use_pro_version True.

data_name=libero_spatial_no_noops
current_time=$(date +%Y%m%d_%H%M%S)  # timestamp used in the run ID and log file name below

CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--config_file_path pretrained_models/configs \
--data_root_dir data/libero \
--dataset_name $data_name \
--run_root_dir outputs \
--use_film False \
--num_images_in_input 2 \
--use_proprio True \
--use_lora True \
--use_fz False \
--use_minivlm True \
--image_aug True \
--num_steps_before_decay 400000 \
--max_steps 400005 \
--save_freq 5000 \
--save_latest_checkpoint_only False \
--merge_lora_during_training True \
--batch_size 1 \
--grad_accumulation_steps 8 \
--learning_rate 2e-4 \
--lora_rank 64 \
--use_pro_version True \
--wandb_entity "YOUR_WANDB_ENTITY" \
--wandb_project "$data_name" \
--run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \
> logs/VLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &

Please note that the trained models will be stored in the /outputs folder. Each model takes up nearly 3GB of disk space, so you need to reserve enough space. We strongly recommend that you get our trained models from the VLA-Adapter HuggingFace page and place them in this folder for inference.


📒 How to Train on Low VRAM GPUs

=> Low VRAM (A card with 24GB) (e.g. NVIDIA GeForce RTX 3090 and 4090).

About batch_size, lora_rank, grad_accumulation_steps, and max_steps.

If you have such a device, you can increase the batch size: --batch_size 4 with --lora_rank 64, which takes nearly 20GB and matches the rank used in our paper. Note that you cannot use the OpenVLA-OFT model on a 24GB card, because even with batch size = 1 it requires 25GB of VRAM; fortunately, you can use VLA-Adapter. Since the batch size is still small, you can increase --max_steps to reach the performance reported in the paper.

About vlm_path.

The VLM in the VLA-Adapter uses the Prismatic-VLMs architecture, with the LLM backbone being Qwen2.5-0.5B. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in /pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b.

About data_name.

Launch the fine-tuning script with the vla-adapter configuration below. It can run in the background, and the running progress can be seen in the /logs folder. You can replace libero_spatial_no_noops with libero_object_no_noops, libero_goal_no_noops, or libero_10_no_noops. If you are using the CALVIN benchmark, remove /libero from --data_root_dir (i.e., use data instead of data/libero) and replace libero_spatial_no_noops with calvin_abc.

About use_pro_version.

In addition, we recently released an enhanced Pro version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance. Therefore, we strongly recommend using the Pro version! The Pro version's policy size is 207MB, and training speed is virtually unchanged. The original version uses roughly 1GB less VRAM than the Pro version per unit of batch size; with this configuration it requires only 17.6GB of VRAM. You can choose whether to use the Pro version via the use_pro_version parameter, i.e., the Pro version is --use_pro_version True.

data_name=libero_spatial_no_noops
current_time=$(date +%Y%m%d_%H%M%S)  # timestamp used in the run ID and log file name below

CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--config_file_path pretrained_models/configs \
--data_root_dir data/libero \
--dataset_name $data_name \
--run_root_dir outputs \
--use_film False \
--num_images_in_input 2 \
--use_proprio True \
--use_lora True \
--use_fz False \
--use_minivlm True \
--image_aug True \
--num_steps_before_decay 200000 \
--max_steps 200005 \
--save_freq 5000 \
--save_latest_checkpoint_only False \
--merge_lora_during_training True \
--batch_size 4 \
--grad_accumulation_steps 4 \
--learning_rate 2e-4 \
--lora_rank 64 \
--use_pro_version True \
--wandb_entity "YOUR_WANDB_ENTITY" \
--wandb_project "$data_name" \
--run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \
> logs/VLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &

Please note that the trained models will be stored in the /outputs folder. Each model takes up nearly 3GB of disk space, so you need to reserve enough space. We strongly recommend that you get our trained models from the VLA-Adapter HuggingFace page and place them in this folder for inference.


📒 How to Train on Larger VRAM GPUs

=> A Consumer GPU with 32GB (e.g. NVIDIA GeForce RTX 5090)
=> A Professional-Grade GPU with 40GB-48GB (e.g. NVIDIA A100-40GB, A800-40GB, L20, and RTX A6000).

About batch_size, lora_rank, grad_accumulation_steps, and max_steps.

If you have such a device, you can increase the batch size and lora rank: --batch_size 8 and --lora_rank 64. This only takes nearly 29GB.

About vlm_path.

The VLM in the VLA-Adapter uses the Prismatic-VLMs architecture, with the LLM backbone being Qwen2.5-0.5B. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in /pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b.

About data_name.

Launch the fine-tuning script with the vla-adapter configuration below. It can run in the background, and the running progress can be seen in the /logs folder. You can replace libero_spatial_no_noops with libero_object_no_noops, libero_goal_no_noops, or libero_10_no_noops. If you are using the CALVIN benchmark, remove /libero from --data_root_dir (i.e., use data instead of data/libero) and replace libero_spatial_no_noops with calvin_abc.

With this configuration, you can match the results in our paper on the LIBERO-Object benchmark (a 99.2% success rate) in just 8 hours. The LIBERO-Spatial benchmark requires approximately 10 hours of training. The LIBERO-Long benchmark takes longer, because its tasks are longer and more difficult and require more training steps to achieve superior performance.

About use_pro_version.

In addition, we recently released an enhanced Pro version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance. Therefore, we strongly recommend using the Pro version! The Pro version's policy size is 207MB, and training speed is virtually unchanged. The original version uses roughly 1GB less VRAM than the Pro version per unit of batch size. You can choose whether to use the Pro version via the use_pro_version parameter, i.e., the Pro version is --use_pro_version True.

data_name=libero_spatial_no_noops
current_time=$(date +%Y%m%d_%H%M%S)  # timestamp used in the run ID and log file name below

CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--config_file_path pretrained_models/configs \
--data_root_dir data/libero \
--dataset_name $data_name \
--run_root_dir outputs \
--use_film False \
--num_images_in_input 2 \
--use_proprio True \
--use_lora True \
--use_fz False \
--use_minivlm True \
--image_aug True \
--num_steps_before_decay 200000 \
--max_steps 200005 \
--save_freq 5000 \
--save_latest_checkpoint_only False \
--merge_lora_during_training True \
--batch_size 8 \
--grad_accumulation_steps 2 \
--learning_rate 2e-4 \
--lora_rank 64 \
--use_pro_version True \
--wandb_entity "YOUR_WANDB_ENTITY" \
--wandb_project "$data_name" \
--run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \
> logs/VLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &

Please note that the trained models will be stored in the /outputs folder. Each model takes up nearly 3GB of disk space, so you need to reserve enough space. We strongly recommend that you get our trained models from the VLA-Adapter HuggingFace page and place them in this folder for inference.


📒 How to Train on Sufficient VRAM GPUs

=> Professional-Grade GPUs with ≥80GB (e.g. NVIDIA A100-80GB, A800-80GB, H100, H800, H20-NVLink, and GB200).

About batch_size, lora_rank, grad_accumulation_steps, and max_steps.

You can train with 1 to 8 GPUs by setting CUDA_VISIBLE_DEVICES to the GPU IDs and matching the count after --nproc-per-node. In our paper, we used 4×H100 GPUs for training. With this configuration, the four suites of the LIBERO benchmark take about five hours (Spatial), less than one hour (Object), three hours (Goal), and half a day (Long); the CALVIN benchmark takes about eight hours.

About vlm_path.

The VLM in the VLA-Adapter uses the Prismatic-VLMs architecture, with the LLM backbone being Qwen2.5-0.5B. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in /pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b.

About data_name.

Launch the fine-tuning script with the vla-adapter configuration below. It can run in the background, and the running progress can be seen in the /logs folder. You can replace libero_spatial_no_noops with libero_object_no_noops, libero_goal_no_noops, or libero_10_no_noops. If you are using the CALVIN benchmark, remove /libero from --data_root_dir (i.e., use data instead of data/libero) and replace libero_spatial_no_noops with calvin_abc.

About use_pro_version.

In addition, we recently released an enhanced Pro version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance. Therefore, we strongly recommend using the Pro version! The Pro version's policy size is 207MB, and training speed is virtually unchanged. The original version uses roughly 1GB less VRAM than the Pro version per unit of batch size. You can choose whether to use the Pro version via the use_pro_version parameter, i.e., the Pro version is --use_pro_version True.

data_name=libero_spatial_no_noops
current_time=$(date +%Y%m%d_%H%M%S)  # timestamp used in the run ID and log file name below

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nnodes 1 --nproc-per-node 4 vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--config_file_path pretrained_models/configs \
--data_root_dir data/libero \
--dataset_name $data_name \
--run_root_dir outputs \
--use_film False \
--num_images_in_input 2 \
--use_proprio True \
--use_lora True \
--use_fz False \
--use_minivlm True \
--image_aug True \
--num_steps_before_decay 150000 \
--max_steps 150005 \
--save_freq 5000 \
--save_latest_checkpoint_only False \
--merge_lora_during_training True \
--batch_size 16 \
--grad_accumulation_steps 1 \
--learning_rate 2e-4 \
--lora_rank 64 \
--use_pro_version True \
--wandb_entity "YOUR_WANDB_ENTITY" \
--wandb_project "$data_name" \
--run_id_note VLA-Adapter--spatial--$current_time \
> logs/VLA-Adapter--spatial--$current_time.log 2>&1 &

Please note that the trained models will be stored in the /outputs folder. Each model takes up nearly 3GB of disk space, so you need to reserve enough space. We strongly recommend that you get our trained models from the VLA-Adapter HuggingFace page and place them in this folder for inference.

🦾 Inference

📚 Related Files for Inference

  • experiments/robot/libero/: LIBERO eval files
    • run_libero_eval.py: LIBERO eval script
    • libero_utils.py: LIBERO eval utils
  • experiments/robot/: General eval utils files
    • openvla_utils.py: VLA-specific eval utils
    • robot_utils.py: Other eval utils

🤗 Checkpoints of VLA-Adapter

We fine-tuned Qwen2.5-0.5B with our adapter bridge paradigm on four LIBERO task suites independently: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long. The four VLA-Adapter checkpoints for LIBERO are available on Hugging Face.

In addition, we also provide Pro checkpoints, trained on 4×H100 GPUs with --batch_size 16, --lora_rank 64, and --max_steps 100000. They are available on Hugging Face as well.

These files need to be placed in the /outputs folder. If you trained your own models, they will also be stored here. The subsequent eval code loads the models from this folder for inference.
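
For example, a checkpoint can be fetched with the Hugging Face CLI (a sketch; the repository name below is illustrative, so check the VLA-Adapter Hugging Face page for the exact checkpoint names):

# Hypothetical example: download a Pro checkpoint into outputs/
huggingface-cli download VLA-Adapter/LIBERO-Spatial-Pro --local-dir outputs/LIBERO-Spatial-Pro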


📓 How to Eval

We strongly recommend that you use our open-source Pro models, which have stronger performance. To start evaluation with one of these checkpoints, run one of the commands below. Each will automatically download the appropriate checkpoint listed above. If you want to use the original (non-Pro) version of the model, set the --use_pro_version parameter to False and pass the original checkpoint to the --pretrained_checkpoint parameter (see the sketch after the commands below). Finally, the inference results will be written to the /eval_logs folder, and the inference videos to the /rollouts/vla-adapter folder.

# Launch LIBERO-Spatial-Pro evals (Background running)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
  --use_proprio True \
  --num_images_in_input 2 \
  --use_film False \
  --pretrained_checkpoint outputs/LIBERO-Spatial-Pro \
  --task_suite_name libero_spatial \
  --use_pro_version True \
  > eval_logs/Spatial--chkpt.log 2>&1 &


# Launch LIBERO-Object-Pro evals (Background running)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
  --use_proprio True \
  --num_images_in_input 2 \
  --use_film False \
  --pretrained_checkpoint outputs/LIBERO-Object-Pro \
  --task_suite_name libero_object \
  --use_pro_version True \
  > eval_logs/Object--chkpt.log 2>&1 &


# Launch LIBERO-Goal-Pro evals (Background running)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
  --use_proprio True \
  --num_images_in_input 2 \
  --use_film False \
  --pretrained_checkpoint outputs/LIBERO-Goal-Pro \
  --task_suite_name libero_goal \
  --use_pro_version True \
  > eval_logs/Goal--chkpt.log 2>&1 &


# Launch LIBERO-Long-Pro (LIBERO-10) evals (Background running)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
  --use_proprio True \
  --num_images_in_input 2 \
  --use_film False \
  --pretrained_checkpoint outputs/LIBERO-long-Pro \
  --task_suite_name libero_10 \
  --use_pro_version True \
  > eval_logs/Long--chkpt.log 2>&1 &


# Launch CALVIN ABC→D-Pro evals (Background running)
CUDA_VISIBLE_DEVICES=0 python vla-scripts/evaluate_calvin.py \
  --pretrained_checkpoint outputs/CALVIN-ABC-Pro \
  > eval_logs/CALVIN--ABC.log 2>&1 &
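
If you instead want to evaluate an original (non-Pro) checkpoint, only the checkpoint path and the --use_pro_version flag change (a sketch; the checkpoint directory name here is an example):

# Launch LIBERO-Spatial evals with an original (non-Pro) checkpoint (Background running)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
  --use_proprio True \
  --num_images_in_input 2 \
  --use_film False \
  --pretrained_checkpoint outputs/LIBERO-Spatial \
  --task_suite_name libero_spatial \
  --use_pro_version False \
  > eval_logs/Spatial--original--chkpt.log 2>&1 &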

If you want to measure the inference throughput, you can do so in the run_libero_eval.py file: add start = time.time() before line 334 and end = time.time() after line 345, then compute the difference. This difference is the time it takes to generate 8 action chunks, which gives you the inference throughput. We measured it multiple times and obtained an average of 0.036s.


🌈 Success Rate Comparison

All our results were obtained by running inference on an H100. You can find the inference log files alongside the models released on HF. The evaluation script runs 500 trials by default (10 tasks × 50 episodes each) for LIBERO and 1,000 task sequences for CALVIN. Use the same card for training and inference whenever possible; results may vary slightly if you use a different GPU than the H100, a phenomenon also mentioned in the OpenVLA-OFT README.

Performance on the LIBERO benchmark.

Values marked with * denote the third-best performance; the best and second-best results are highlighted in the original table.

| LIBERO | Methods | Scale | Spatial | Object | Goal | Long | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Large-scale | FlowVLA (Zhong et al., 2025) | 8.5B | 93.2 | 95.0 | 91.6 | 72.6 | 88.1 |
| | UnifiedVLA (Wang et al., 2025) | 8.5B | 95.4 | 98.8* | 93.6 | 94.0 | 95.5 |
| | OpenVLA (Kim et al., 2024) | 7B | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| | OpenVLA-OFT (Kim et al., 2025) | 7B | 97.6* | 98.4 | 97.9 | 94.5* | 97.1* |
| | UniVLA (Bu et al., 2025) | 7B | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| | CoT-VLA (Zhao et al., 2025) | 7B | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| | WorldVLA (Cen et al., 2025) | 7B | 87.6 | 96.2 | 83.4 | 60.0 | 81.8 |
| | TraceVLA (Zheng et al., 2025) | 7B | 84.6 | 85.2 | 75.1 | 54.1 | 74.8 |
| | MolmoAct (Lee et al., 2025) | 7B | 87.0 | 95.4 | 87.6 | 77.2 | 86.6 |
| | ThinkAct (Huang et al., 2025) | 7B | 88.3 | 91.4 | 87.1 | 70.9 | 84.4 |
| Small-scale | 4D-VLA (Zhang et al., 2025) | 4B | 88.9 | 95.2 | 90.9 | 79.1 | 88.6 |
| | SpatialVLA (Qu et al., 2025) | 4B | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| | π0 (Black et al., 2024) | 3B | 96.8 | 98.8* | 95.8 | 85.2 | 94.2 |
| | π0-FAST (Pertsch et al., 2025) | 3B | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| | NORA (Hung et al., 2025) | 3B | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| | SmolVLA (Shukor et al., 2025) | 2.2B | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 |
| | GR00T N1 (NVIDIA et al., 2025) | 2B | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| Tiny-scale | Seer (Tian et al., 2025) | 0.57B | - | - | - | 78.7 | 78.7 |
| | VLA-OS (Gao et al., 2025) | 0.5B | 87.0 | 96.5 | 92.7 | 66.0 | 85.6 |
| | Diffusion Policy (Chi et al., 2023) | - | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| | VLA-Adapter (Ours) | 0.5B | 97.8 | 99.2 | 97.2* | 95.0 | 97.3 |
| | VLA-Adapter-Pro (Ours) | 0.5B | 99.6 | 99.6 | 98.2 | 96.4 | 98.5 |

Performance on the CALVIN ABC→D benchmark.

Values marked with * denote the third-best performance; the best and second-best results are highlighted in the original table.

| CALVIN | Methods | Scale | 1 | 2 | 3 | 4 | 5 | Avg. len |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Large-scale | UniVLA (Bu et al., 2025) | 7B | 95.5 | 85.8 | 75.4 | 66.9 | 56.5 | 3.80 |
| | OpenVLA (Kim et al., 2024) | 7B | 91.3 | 77.8 | 62.0 | 52.1 | 43.5 | 3.27 |
| | OpenVLA-OFT (Kim et al., 2025) | 7B | 96.3 | 89.1 | 82.4 | 75.8 | 66.5 | 4.10 |
| | VLAS (Zhao et al., 2025b) | 7B | 87.2 | 64.2 | 40.9 | 28.1 | 19.6 | 2.40 |
| | LCB (Shentu et al., 2024) | 7B | 73.6 | 50.2 | 28.5 | 16.0 | 9.9 | 1.78 |
| | RoboDual (Bu et al., 2024a) | 7B | 94.4 | 82.7 | 72.1 | 62.4 | 54.4 | 3.66 |
| | OpenHelix (Cui et al., 2025) | 7B | 97.1* | 91.4 | 82.8 | 72.6 | 64.1 | 4.08 |
| | ReconVLA (Song et al., 2025c) | 7B | 95.6 | 87.6 | 76.9 | 69.3 | 64.1 | 3.95 |
| Small-scale | DeeR (Yue et al., 2024) | 3B | 86.2 | 70.1 | 51.8 | 41.5 | 30.4 | 2.82 |
| | RoboFlamingo (Li et al., 2024b) | 3B | 82.4 | 61.9 | 46.6 | 33.1 | 23.5 | 2.48 |
| | VPP (Hu et al., 2025) | 1.5B | 95.7 | 91.2 | 86.3* | 81.0* | 75.0* | 4.33* |
| | SuSIE (Black et al., 2024) | 1.3B | 87.0 | 69.0 | 49.0 | 38.0 | 26.0 | 2.69 |
| Tiny-scale | Seer-Large (Tian et al., 2025) | 0.57B | 96.3 | 91.6* | 86.1 | 80.3 | 74.0 | 4.28 |
| | MoDE (Reuss et al., 2025) | 0.44B | 96.2 | 88.9 | 81.1 | 71.8 | 63.5 | 4.01 |
| | Seer (Tian et al., 2025) | 0.32B | 94.4 | 87.2 | 79.9 | 72.2 | 64.3 | 3.98 |
| | VLA-Adapter (Ours) | 0.5B | 99.1 | 94.6 | 88.8 | 82.8 | 76.5 | 4.42 |
| | VLA-Adapter-Pro (Ours) | 0.5B | 98.5 | 95.0 | 90.5 | 85.3 | 80.0 | 4.50 |

📝 Citation

🫶 If you find this paper, the models, or the code helpful, please cite our paper. Thank you for supporting VLA-Adapter!

@article{wang2025vlaadapter,
  author={Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
  title={VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
  journal={arXiv preprint arXiv:2509.09372},
  year={2025}
}

โค๏ธ Acknowledgment

We thank OpenVLA-OFT, MiniVLA, and RoboDual for their open-source work!

🌟 Star History
