We are actively gathering feedback from the community to improve our benchmark. We welcome your input and encourage you to stay updated through our repository!
To add your own model to the leaderboard, please send an email to Yibin Wang; we will then help with the evaluation and update the leaderboard.
Please leave us a star ⭐ if you find our benchmark helpful.
- [2025/10] We release the offline evaluation model UniGenBench-EvalModel-qwen-72b-v1, which achieves an average accuracy of 94% relative to evaluations by Gemini 2.5 Pro.

- [2025/9] Lumina-DiMOO, OmniGen2, Infinity, X-Omni, OneCAT, Echo-4o, and MMaDA are added to all Leaderboards.
- [2025/9] Seedream-4.0, Nano Banana, GPT-4o, Qwen-Image, and FLUX-Kontext-[Max/Pro] are added to all Leaderboards.
- [2025/9] We release the UniGenBench Leaderboard (Chinese), Leaderboard (English Long), and Leaderboard (Chinese Long). We will continue to update them regularly. The test prompts are provided in `./data`.
- [2025/9] We release all generated images from the T2I models evaluated in our UniGenBench on UniGenBench-Eval-Images. Feel free to use any evaluation model that is convenient and suitable for you to assess and compare the performance of your models.
- [2025/8] We release the paper, project page, and UniGenBench Leaderboard (English).
We propose UniGenBench, a unified and versatile benchmark for image generation that integrates diverse prompt themes with a comprehensive suite of fine-grained evaluation criteria.

- Comprehensive and Fine-grained Evaluation: covers 10 primary dimensions and 27 sub-dimensions, enabling systematic and fine-grained assessment of diverse model capabilities.
- Rich Prompt Theme Coverage: organized into 5 primary themes and 20 sub-themes, comprehensively spanning both realistic and imaginative generation scenarios.
- Efficient yet Comprehensive: unlike other benchmarks, UniGenBench requires only 600 prompts, with each prompt targeting 1–10 specific testpoints, ensuring both coverage and efficiency.
- Streamlined MLLM Evaluation: each testpoint of a prompt is accompanied by a detailed description explaining how the testpoint is reflected in the prompt, assisting the MLLM in conducting precise evaluations.
- Bilingual and Length-variant Prompt Support: provides both English and Chinese test prompts in short and long forms, together with evaluation pipelines for both languages, enabling fair and broad cross-lingual benchmarking.
- Reliable Evaluation Model for Offline Assessment: to facilitate community use, we train a robust evaluation model that supports offline assessment of T2I model outputs.

Each prompt in our benchmark is recorded as a row in a `.csv` file, together with structured annotations for evaluation (see the loading sketch after the file list below).
- `index`
- `prompt`: the full English prompt to be tested
- `sub_dims`: a JSON-encoded field that organizes rich metadata, including:
  - Primary / Secondary Categories: the prompt theme (e.g., Creative Divergence → Imaginative Thinking)
  - Subjects: the main entities involved in the prompt (e.g., Animal)
  - Sentence Structure: the linguistic form of the prompt (e.g., Descriptive)
  - Testpoints: key aspects to evaluate (e.g., Style, World Knowledge, Attribute - Quantity)
  - Testpoint Description: evaluation cues extracted from the prompt (e.g., classical ink painting, Egyptian pyramids, two pandas)
- English test set: `data/test_prompts_en.csv`
- Chinese test set: `data/test_prompts_zh.csv`
- Training set: `train_prompt.txt`
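As a quick illustration, the sketch below loads the English test set with pandas and decodes the `sub_dims` field. The column names follow the description above; the exact key names inside the JSON metadata are illustrative and may differ from the actual schema.

```python
import json

import pandas as pd

# Load the English test prompts (path listed above).
df = pd.read_csv("data/test_prompts_en.csv")

# Inspect the first prompt and its JSON-encoded metadata.
row = df.iloc[0]
meta = json.loads(row["sub_dims"])  # assumes sub_dims is stored as a JSON string

print("Prompt:", row["prompt"])
# The key name below is illustrative; adjust it to the actual schema.
print("Testpoints:", meta.get("Testpoints"))
```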
We provide reference code for multi-node inference based on FLUX.1-dev.
# English Prompt
bash inference/flux_en_dist_infer.sh
# Chinese Prompt
bash inference/flux_zh_dist_infer.sh
For each test prompt, 4 images are generated and stored in the following folder structure:
output_directory/
├── 0_0.png
├── 0_1.png
├── 0_2.png
├── 0_3.png
├── 1_0.png
├── 1_1.png
...
The file naming follows the pattern `promptID_imageID.png`.
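For reference, here is a minimal sketch of how one might group the generated images by prompt according to this naming pattern (the `flux_output` directory name matches the evaluation scripts below):

```python
from collections import defaultdict
from pathlib import Path

# Group generated images by prompt ID based on the
# promptID_imageID.png naming convention described above.
images_by_prompt = defaultdict(list)
for path in sorted(Path("flux_output").glob("*.png")):
    prompt_id, image_id = path.stem.split("_")
    images_by_prompt[int(prompt_id)].append(path)

# Each prompt should have 4 generated images.
print(len(images_by_prompt[0]))  # expected: 4
```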
We use the following Gemini API version:
- gemini-2.5-pro
  - Release stage: General Availability (GA)
  - Release date: June 17, 2025
#!/bin/bash
# API
API_KEY="sk-xxxxxxx"
BASE_URL=""
DATA_PATH="flux_output" # Directory of generated images
CSV_FILE="data/test_prompts_en.csv" # English test prompt file
# English Evaluation
python eval/gemini_en_eval.py \
--data_path "$DATA_PATH" \
--api_key "$API_KEY" \
--base_url "$BASE_URL" \
--csv_file "$CSV_FILE"
# Chinese Evaluation
CSV_FILE="data/test_prompts_zh.csv" # Chinese test prompt file
python eval/gemini_zh_eval.py \
--data_path "$DATA_PATH" \
--api_key "$API_KEY" \
--base_url "$BASE_URL" \
--csv_file "$CSV_FILE"
- After evaluation, scores across all dimensions will be printed to the console.
- A detailed `.csv` results file will also be saved in the `./results` directory.

You can also load the results file to re-print or further analyze the scores:
python eval/calculate_score.py
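If you prefer to work with the results directly, a minimal pandas sketch is shown below. The file name and the `dimension`/`score` column names are illustrative assumptions; the actual schema is defined by the evaluation scripts.

```python
import pandas as pd

# Load a results file from ./results.
# NOTE: the file name and the "dimension"/"score" column names are
# assumptions for illustration; adjust them to the actual output schema.
results = pd.read_csv("results/flux_output_results.csv")

# Average score per evaluation dimension.
print(results.groupby("dimension")["score"].mean())
```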
- Install vLLM
pip install vllm==0.9.0.1 transformers==4.52.4
- Start server
echo ${LOCAL_IP}
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve CodeGoat24/UniGenBench-EvalModel-qwen-72b-v1 \
--host ${LOCAL_IP} \
--trust-remote-code \
--served-model-name QwenVL \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 1 \
--limit-mm-per-prompt image=2 \
--port 8080
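Before launching the evaluation, you can optionally verify that the server is reachable by querying vLLM's OpenAI-compatible `/v1/models` endpoint (host and port must match the `vllm serve` command above):

```python
import os

import requests

# Query vLLM's OpenAI-compatible model listing endpoint to confirm
# the server started correctly (host/port must match the serve command).
api_url = f"http://{os.environ['LOCAL_IP']}:8080"
resp = requests.get(f"{api_url}/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())  # should list the served model name, e.g. "QwenVL"
```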
#!/bin/bash
# vLLM request url
API_URL=http://${LOCAL_IP}:8080
DATA_PATH="flux_output" # Directory of generated images
CSV_FILE="data/test_prompts_en.csv" # English test prompt file
# English Evaluation
python eval/qwenvl_72b_en_eval.py \
--data_path "$DATA_PATH" \
--api_url "$API_URL" \
--csv_file "$CSV_FILE"
# Chinese Evaluation
CSV_FILE="data/test_prompts_zh.csv" # Chinese test prompt file
python eval/qwenvl_72b_zh_eval.py \
--data_path "$DATA_PATH" \
--api_url "$API_URL" \
--csv_file "$CSV_FILE"
- After evaluation, scores across all dimensions will be printed to the console.
- A detailed `.csv` results file will also be saved in the `./results` directory.

You can also load the results file to re-print or further analyze the scores:
python eval/calculate_score.py
If you have any comments or questions, please open a new issue or feel free to contact Yibin Wang.
@article{UniGenBench++,
title={UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation},
author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Bu, Jiazi and Zhou, Yujie and Xin, Yi and He, Junjun and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
journal={arXiv preprint arXiv:2510.18701},
year={2025}
}
@article{Pref-GRPO&UniGenBench,
title={Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning},
author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Zhou, Yujie and Bu, Jiazi and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
journal={arXiv preprint arXiv:2508.20751},
year={2025}
}