¹The University of Hong Kong  ²ByteDance Seed
*Work partly done as an intern at ByteDance. ✉ Corresponding author
- [2025/06/26] GigaTok is accepted by ICCV 2025!
- [2025/04/14] Research paper, code, and models are released for GigaTok!
We introduce GigaTok, the first method for scaling visual tokenizers to 3 billion parameters. We reveal that the reconstruction vs. generation dilemma in scaling tokenizers is caused by increasing latent space complexity, and that it can be resolved by semantic regularization. To scale a visual tokenizer up to 3B parameters, we find that:
- 1D tokenizers are more scalable than 2D tokenizers.
- It is better to prioritize decoder scaling when expanding both the encoder and decoder.
- Entropy loss helps stabilize training for billion-scale tokenizers.
🚀 In this codebase, we release
- A series of tokenizers ranging from 136M to 3B parameters, along with AR models trained on them.
- A comprehensive framework for experimental exploration of tokenizer training and evaluation, beyond the reconstruction objective.
To set up the environment for GigaTok, follow these steps:
# A working CUDA version: 12.1
# Corresponds to TORCH_RUN_PATH in set_env_vars.sh
conda create -n gigatok python=3.9
conda activate gigatok
# Install required packages using the provided script
bash env_install.sh

All the tokenizers are for 256x256 images.
| Tokenizer | Config | Param. (Tokenizer) | rFID | LPIPS | Tokenizer Download Link | AR Model | Param. (AR) | gFID | Acc. | AR Model Download Link |
|---|---|---|---|---|---|---|---|---|---|---|
| S-S | VQ_SS256.yaml | 136M | 1.01 | 0.2226 | VQ_SS256_e100.pt | GPT-B | 111M | 4.05 | 62.6% | GPT_B256_e300_VQ_SS.pt |
| S-B | VQ_SB256.yaml | 232M | 0.89 | 0.2121 | VQ_SB256_e200.pt | GPT-B | 111M | 3.83 | 62.9% | GPT_B256_e300_VQ_SB.pt |
| B-L | VQ_BL256.yaml | 622M | 0.81 | 0.2059 | VQ_BL256_e200.pt | GPT-B | 111M | 3.26 | 67.6% | GPT_B256_e300_VQ_BL.pt |
| B-L (dino disc) | VQ_BL256_dino_disc.yaml | 622M | 0.51 | 0.2056 | VQ_BL256_dino_disc.pt | GPT-B | 111M | 3.33 | 67.7% | GPT_B256_e300_VQ_BL_dino_disc.pt |
| XL-XXL | VQ_XLXXL256.yaml | 2.9B | 0.79 | 0.1947 | VQ_XLXXL256_e300.pt | GPT-B | 111M | 3.15 | 72.0% | GPT_B256_e300_VQ_XLXXL.pt |
Downloading Larger AR Models
| Tokenizer | Config | AR Model | Param. (AR) | gFID | Acc. | AR Model Download Link |
|---|---|---|---|---|---|---|
| B-L | VQ_BL256.yaml | GPT-XL | 775M | 2.13 | 70.6% | GPT_XL256_e300_VQ_BL.pt |
| B-L | VQ_BL256.yaml | GPT-XXL | 1.4B | 2.03 | 69.4% | GPT_XXL256_e300_VQ_BL.pt |
| XL-XXL | VQ_XLXXL256.yaml | GPT-XXL | 1.4B | 1.98 | 74.0% | GPT_XXL256_e300_VQ_XLXXL.pt |
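The example commands later in this README reference checkpoints under results/recheck/ and results/ckpts/ relative to the project root. Below is a minimal sketch of that layout; `<download-url>` is a placeholder rather than a real endpoint, so substitute the actual links from the tables above.

```bash
# Sketch: place downloaded checkpoints where the later example commands expect them.
# <download-url> is a placeholder; use the download links from the tables above.
mkdir -p results/recheck results/ckpts
# Smaller tokenizers are referenced from results/recheck/ in the examples below
wget -P results/recheck "<download-url>/VQ_SS256_e100.pt"
# The 2.9B tokenizer and AR checkpoints are referenced from results/ckpts/
wget -P results/ckpts "<download-url>/VQ_XLXXL256_e300.pt"
wget -P results/ckpts "<download-url>/GPT_B256_e300_VQ_XLXXL.pt"
```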
To perform tokenizer reconstruction, you need to set up the required environment variables and run the reconstruction script. Follow the instructions below:
- Set Environment Variables
Modify the `set_env_vars.sh` script according to the comments in it. For this reconstruction task, you only need to specify the following variables: `PROJECT_ROOT` and `TORCH_RUN_PATH`.
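For reference, here is a minimal, hypothetical sketch of what those two exports might look like. The variable names come from this README; the paths, and the assumption that `TORCH_RUN_PATH` points at the torchrun binary of the gigatok conda environment created above, are placeholders to adapt to your setup.

```bash
# Hypothetical sketch of the two exports in set_env_vars.sh (paths are placeholders).
# PROJECT_ROOT: absolute path to your GigaTok checkout.
export PROJECT_ROOT=/path/to/GigaTok
# TORCH_RUN_PATH: presumably the torchrun executable inside the gigatok conda env
# created in the environment setup step above.
export TORCH_RUN_PATH=/path/to/conda/envs/gigatok/bin/torchrun
```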
# Define the required path/env related variables
. set_env_vars.sh
# Choose the tokenizer configuration
# For S-S Tokenizer (128M)
export TOK_CONFIG="configs/vq/VQ_SS256.yaml"
export VQ_CKPT=results/recheck/VQ_SS256_e100.pt
# Uncomment the following for S-B (232M)
# export TOK_CONFIG="configs/vq/VQ_SB256.yaml"
# export VQ_CKPT=results/recheck/VQ_SB256_e200.pt
# Uncomment the following for B-L (622M)
# export TOK_CONFIG="configs/vq/VQ_BL256.yaml"
# export VQ_CKPT=results/recheck/VQ_BL256_e200.pt
# Uncomment the following for B-L (dino disc) (622M)
# export TOK_CONFIG="configs/vq/VQ_BL256_dinodisc.yaml"
# export VQ_CKPT=results/ckpts/VQ_BL256_dino_disc.pt
# Uncomment the following for XL-XXL (2.9B)
# export TOK_CONFIG="configs/vq/VQ_XLXXL256.yaml"
# export VQ_CKPT=results/ckpts/VQ_XLXXL256_e300.pt

- Run the Qualitative Reconstruction Script
DATA_PATH=${PROJECT_ROOT}/tests/
# this is the output directory
SAMPLE_DIR=results/reconstructions/
gpus=1 \
PORT=11086 \
bash scripts/reconstruction.sh \
--quant-way=vq \
--data-path=${DATA_PATH} \
--image-size=256 \
--sample-dir=$SAMPLE_DIR \
--vq-ckpt=${VQ_CKPT} \
--model-config ${TOK_CONFIG} \
--qualitative \
--lpips \
--clear-cache

For the quantitative reconstruction evaluation, see Detailed_instructions.
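If you want to reproduce the qualitative reconstruction for every released tokenizer in one go, a simple loop over the (config, checkpoint) pairs from the table works. This is only a sketch: it assumes the checkpoints sit at the paths used in the configuration examples above and reuses the exact flags of the command above.

```bash
# Sketch: run qualitative reconstruction for each released tokenizer.
# Assumes checkpoints are stored at the paths used in the examples above.
declare -A CKPTS=(
  ["configs/vq/VQ_SS256.yaml"]="results/recheck/VQ_SS256_e100.pt"
  ["configs/vq/VQ_SB256.yaml"]="results/recheck/VQ_SB256_e200.pt"
  ["configs/vq/VQ_BL256.yaml"]="results/recheck/VQ_BL256_e200.pt"
  ["configs/vq/VQ_XLXXL256.yaml"]="results/ckpts/VQ_XLXXL256_e300.pt"
)
for TOK_CONFIG in "${!CKPTS[@]}"; do
  VQ_CKPT=${CKPTS[$TOK_CONFIG]}
  # One output sub-directory per tokenizer, named after its config file
  gpus=1 PORT=11086 bash scripts/reconstruction.sh \
    --quant-way=vq \
    --data-path=${PROJECT_ROOT}/tests/ \
    --image-size=256 \
    --sample-dir=results/reconstructions/$(basename ${TOK_CONFIG} .yaml) \
    --vq-ckpt=${VQ_CKPT} \
    --model-config ${TOK_CONFIG} \
    --qualitative \
    --lpips \
    --clear-cache
done
```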
Qualitative Sampling
# Try these classes!
# [388]='giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca'
# [90]='lorikeet'
# [323]='monarch, monarch butterfly, milkweed butterfly, Danaus plexippus'
# [84]='peacock'
# [980]='volcano'
# [977]='sandbar, sand bar'
# [978]='seashore, coast, seacoast, sea-coast'
# [979]='valley, vale'
# [972]='cliff, drop, drop-off'
# [105]='koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus'
# [22]='bald eagle, American eagle, Haliaeetus leucocephalus'
. set_env_vars.sh
export TOK_CONFIG="configs/vq/VQ_XLXXL256.yaml"
export VQ_CKPT=results/ckpts/VQ_XLXXL256_e300.pt
export LM_CKPT=results/ckpts/GPT_B256_e300_VQ_XLXXL.pt
CFG=4.0
CFG_SCHEDULE="constant"
GPT_MODEL="GPT-B"
SAMPLE_DIR=results/gpt_eval/GPT_B256_e300_VQ_XLXXL
# Uncomment for testing GPT-XXL
# export LM_CKPT=results/ckpts/GPT_XXL256_e300_VQ_XLXXL.pt
# CFG=4.0
# CFG_SCHEDULE="constant"
# GPT_MODEL="GPT-XXL"
# SAMPLE_DIR=results/gpt_eval/GPT_XXL256_e300_VQ_XLXXL
# sample results:
bash scripts/sample_c2i_visualization.sh \
--quant-way=vq \
--image-size=256 \
--sample-dir=$SAMPLE_DIR \
--vq-ckpt ${VQ_CKPT} \
--tok-config ${TOK_CONFIG} \
--gpt-model ${GPT_MODEL} \
--cfg-schedule ${CFG_SCHEDULE} \
--cfg-scale ${CFG} \
--gpt-ckpt ${LM_CKPT} \
--precision fp16 \
--class-idx "22,388,90,978" \
--per-proc-batch-size 8 \
--qual-num 40

For the quantitative evaluation, see Detailed_instructions.
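To see how the classifier-free guidance strength affects the samples, you can sweep `--cfg-scale` while keeping everything else fixed. The sketch below reuses the variables set above (TOK_CONFIG, VQ_CKPT, LM_CKPT, GPT_MODEL); the particular CFG values are arbitrary examples, not recommended settings.

```bash
# Sketch: sweep classifier-free guidance scales for the same set of classes.
# Reuses TOK_CONFIG, VQ_CKPT, LM_CKPT, and GPT_MODEL exported above.
for CFG in 1.5 2.0 4.0; do
  bash scripts/sample_c2i_visualization.sh \
    --quant-way=vq \
    --image-size=256 \
    --sample-dir=results/gpt_eval/cfg_${CFG} \
    --vq-ckpt ${VQ_CKPT} \
    --tok-config ${TOK_CONFIG} \
    --gpt-model ${GPT_MODEL} \
    --cfg-schedule constant \
    --cfg-scale ${CFG} \
    --gpt-ckpt ${LM_CKPT} \
    --precision fp16 \
    --class-idx "22,388,90,978" \
    --per-proc-batch-size 8 \
    --qual-num 40
done
```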
- The authors sincerely thank Qihang Yu and Liang-Chieh Chen for their valuable discussions during the development of GigaTok.
- This codebase is built on LlamaGen. Important reference codebases for this project include REPA and DETR.
- We also include some experimental implementations from vaex, vector-quantize-pytorch, LARP, rotation_trick, etc. More references can be found in the corresponding files.
This project is licensed under the MIT License - see the LICENSE file for details.
@article{gigatok,
title={GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation},
author={Tianwei Xiong and Jun Hao Liew and Zilong Huang and Jiashi Feng and Xihui Liu},
journal={arXiv preprint arXiv:2504.08736},
year={2025}
}


