TokenBench is a comprehensive benchmark that standardizes evaluation for Cosmos-Tokenizer. It covers a wide variety of domains, including robotic manipulation, driving, egocentric, and web videos. It consists of high-resolution, long-duration videos and is designed to evaluate the performance of video tokenizers. The videos are drawn from existing datasets that are commonly used for various tasks: BDD100K, EgoExo-4D, BridgeData V2, and Panda-70M. This repo provides instructions on how to download and preprocess the videos for TokenBench.
- Clone the source code
git clone https://github.com/NVlabs/TokenBench.git
cd TokenBench
- Install via pip
pip3 install -r requirements.txt
apt-get install -y ffmpeg
Preferably, build a Docker image using the provided Dockerfile:
docker build -t token-bench -f Dockerfile .
# You can run the container as:
docker run --gpus all -it --rm -v /home/${USER}:/home/${USER} \
--workdir ${PWD} token-bench /bin/bash
You can use this snippet to download StyleGAN checkpoints from huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0:
from huggingface_hub import login, snapshot_download
import os
login(token="<YOUR-HF-TOKEN>", add_to_git_credential=True)
model_name="LanguageBind/Open-Sora-Plan-v1.0.0"
local_dir = "pretrained_ckpts/" + model_name
os.makedirs(local_dir, exist_ok=True)
print(f"downloading `{model_name}` ...")
snapshot_download(repo_id=model_name, local_dir=local_dir)
Under pretrained_ckpts/LanguageBind/Open-Sora-Plan-v1.0.0 (the local_dir set in the snippet above), you can find the StyleGAN checkpoints required for FVD metrics.
├── opensora/eval/fvd/styleganv/
│ ├── fvd.py
│ ├── i3d_torchscript.pt
- Download the datasets from the official websites:
- EgoExo4D: https://docs.ego-exo4d-data.org/
- BridgeData V2: https://rail-berkeley.github.io/bridgedata/
- Panda70M: https://snap-research.github.io/Panda-70M/
- BDD100K: http://bdd-data.berkeley.edu/
- Pick the videos as specified in the token_bench/video/list.txt file.
- Preprocess the videos using the script token_bench/video/preprocessing_script.py.
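The selection step above can be sketched in Python as follows. This is a minimal illustration, not the repository's own logic: it assumes that list.txt holds one video file name per line, which may not match the actual file format, and `select_videos` is a hypothetical helper name.

```python
from pathlib import Path


def select_videos(list_file: str, video_dir: str) -> list[Path]:
    """Return files under video_dir whose names appear in list_file.

    Assumption: list_file contains one video file name per line.
    """
    wanted = {
        line.strip()
        for line in Path(list_file).read_text().splitlines()
        if line.strip()  # skip blank lines
    }
    # Walk the dataset directory recursively and keep only the listed files.
    return sorted(
        p for p in Path(video_dir).rglob("*") if p.is_file() and p.name in wanted
    )
```

The returned paths could then be passed to the preprocessing script one by one.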
We provide basic scripts to compute the common evaluation metrics for video tokenizer reconstruction, including PSNR, SSIM, and LPIPS. Use the code to compute metrics between two folders as shown below:
python3 -m token_bench.metrics_cli --mode=lpips \
--gtpath <ground truth folder> \
--targetpath <reconstruction folder>
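For orientation, PSNR between a ground-truth frame and its reconstruction is 10 · log10(MAX² / MSE). The sketch below is a generic NumPy illustration of that formula, assuming 8-bit frames (peak value 255). It is not the code behind token_bench.metrics_cli, which may aggregate over frames or videos differently, and `psnr` is a hypothetical helper name.

```python
import numpy as np


def psnr(reference: np.ndarray, reconstruction: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shaped arrays."""
    # Cast to float first so uint8 subtraction cannot wrap around.
    ref = reference.astype(np.float64)
    rec = reconstruction.astype(np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Higher values indicate a reconstruction closer to the ground truth; identical inputs give infinity.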
Continuous tokenizers:

| Tokenizer | Compression Ratio (T × H × W) | Formulation | PSNR | SSIM | rFVD |
|---|---|---|---|---|---|
| CogVideoX | 4 × 8 × 8 | VAE | 33.149 | 0.908 | 6.970 |
| OmniTokenizer | 4 × 8 × 8 | VAE | 29.705 | 0.830 | 35.867 |
| Cosmos-CV | 4 × 8 × 8 | AE | 37.270 | 0.928 | 6.849 |
| Cosmos-CV | 8 × 8 × 8 | AE | 36.856 | 0.917 | 11.624 |
| Cosmos-CV | 8 × 16 × 16 | AE | 35.158 | 0.875 | 43.085 |

Discrete tokenizers:

| Tokenizer | Compression Ratio (T × H × W) | Quantization | PSNR | SSIM | rFVD |
|---|---|---|---|---|---|
| VideoGPT | 4 × 4 × 4 | VQ | 35.119 | 0.914 | 13.855 |
| OmniTokenizer | 4 × 8 × 8 | VQ | 30.152 | 0.827 | 53.553 |
| Cosmos-DV | 4 × 8 × 8 | FSQ | 35.137 | 0.887 | 19.672 |
| Cosmos-DV | 8 × 8 × 8 | FSQ | 34.746 | 0.872 | 43.865 |
| Cosmos-DV | 8 × 16 × 16 | FSQ | 33.718 | 0.828 | 113.481 |
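
When comparing rows, it can help to reduce each compression ratio to a single number: the product of the temporal and spatial factors. The sketch below does this arithmetic for the ratio strings used in the tables. `total_compression` is a hypothetical helper, and the product ignores the latent channel count and codebook size, so treat it as a rough measure only.

```python
import math


def total_compression(ratio: str) -> int:
    """Multiply the factors of a ratio written like '4 × 8 × 8'."""
    # Accept both the multiplication sign and a plain 'x' as separators.
    factors = [int(part) for part in ratio.replace("x", "×").split("×")]
    return math.prod(factors)


for ratio in ("4 × 8 × 8", "8 × 8 × 8", "8 × 16 × 16"):
    print(ratio, "->", total_compression(ratio))
```

Under this measure, the 8 × 16 × 16 setting compresses eight times more aggressively than 4 × 8 × 8 (2048 versus 256), which is worth keeping in mind when reading the metrics across rows.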
Fitsum Reda, Jinwei Gu, Xian Liu, Songwei Ge, Ting-Chun Wang, Haoxiang Wang, Ming-Yu Liu
If you find TokenBench useful in your work, please acknowledge it appropriately by citing:
@article{agarwal2025cosmos,
title={Cosmos World Foundation Model Platform for Physical AI},
author={NVIDIA et. al.},
journal={arXiv preprint arXiv:2501.03575},
year={2025}
}