
UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

Hunyuan, Tencent & UnifiedReward Team

Paper PDF | Project Page | Hugging Face Spaces

🔥 News

😊 We are actively gathering feedback from the community to improve our benchmark. We welcome your input and encourage you to stay updated through our repository!

πŸ“ To add your own model to the leaderboard, please send an Email to Yibin Wang, then we will help with the evaluation and updating the leaderboard.

Please leave us a star ⭐ if you find our benchmark helpful.

  • [2025/10] 🔥🔥🔥 We release the offline evaluation model UniGenBench-EvalModel-qwen-72b-v1, which achieves an average accuracy of 94% relative to evaluations by Gemini 2.5 Pro.
  • [2025/9] 🔥🔥 Lumina-DiMOO, OmniGen2, Infinity, X-Omni, OneCAT, Echo-4o, and MMaDA are added to all 🏅 leaderboards.

  • [2025/9] 🔥🔥 Seedream-4.0, Nano Banana, GPT-4o, Qwen-Image, and FLUX-Kontext-[Max/Pro] are added to all 🏅 leaderboards.

  • [2025/9] 🔥🔥 We release the UniGenBench 🏅 Leaderboard (Chinese), 🏅 Leaderboard (English Long), and 🏅 Leaderboard (Chinese Long), and will continue to update them regularly. The test prompts are provided in ./data.

  • [2025/9] 🔥🔥 We release all images generated by the T2I models evaluated in UniGenBench at UniGenBench-Eval-Images. Feel free to use whichever evaluation model is most convenient for assessing and comparing the performance of your models.

  • [2025/8] 🔥🔥 We release the paper, project page, and UniGenBench 🏅 Leaderboard (English).

Introduction

We propose UniGenBench, a unified and versatile benchmark for image generation that integrates diverse prompt themes with a comprehensive suite of fine-grained evaluation criteria.


✨ Highlights:

  • Comprehensive and Fine-grained Evaluation: covering 10 primary dimensions and 27 sub-dimensions, enabling systematic and fine-grained assessment of diverse model capabilities.

  • Rich Prompt Theme Coverage: organized into 5 primary themes and 20 sub-themes, comprehensively spanning both realistic and imaginative generation scenarios.

  • Efficient yet Comprehensive: unlike other benchmarks, UniGenBench requires only 600 prompts, with each prompt targeting 1–10 specific testpoints, ensuring both coverage and efficiency.

  • Streamlined MLLM Evaluation: each testpoint of a prompt is accompanied by a detailed description explaining how the testpoint is reflected in the prompt, assisting the MLLM in conducting precise evaluations.

  • Bilingual and Length-variant Prompt Support: providing both English and Chinese test prompts in short and long forms, together with evaluation pipelines for both languages, thus enabling fair and broad cross-lingual benchmarking.

  • Reliable Evaluation Model for Offline Assessment: To facilitate community use, we train a robust evaluation model that supports offline assessment of T2I model outputs.


📑 Prompt Introduction

Each prompt in our benchmark is recorded as a row in a .csv file, combined with structured annotations for evaluation (a loading sketch follows the field list below).

  • index: The prompt index (used as the promptID in generated image filenames)

  • prompt: The full English prompt to be tested

  • sub_dims: A JSON-encoded field that organizes rich metadata, including:

    • Primary / Secondary Categories – prompt theme (e.g., Creative Divergence β†’ Imaginative Thinking)
    • Subjects – the main entities involved in the prompt (e.g., Animal)
    • Sentence Structure – the linguistic form of the prompt (e.g., Descriptive)
    • Testpoints – key aspects to evaluate (e.g., Style, World Knowledge, Attribute - Quantity)
    • Testpoint Description – evaluation cues extracted from the prompt (e.g., classical ink painting, Egyptian pyramids, two pandas)
  • English Test set: data/test_prompts_en.csv

  • Chinese Test set: data/test_prompts_zh.csv

  • Training set: train_prompt.txt
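
To see how these fields fit together, here is a minimal loading sketch using pandas. It relies only on the columns documented above (index, prompt, sub_dims); the exact keys inside sub_dims come from the released CSV, not from this example.

import json

import pandas as pd

# Load the English test set (one benchmark prompt per row).
df = pd.read_csv("data/test_prompts_en.csv")

row = df.iloc[0]
print(row["index"], row["prompt"])

# sub_dims is stored as a JSON string holding the structured annotations
# (theme categories, subjects, sentence structure, testpoints, descriptions).
sub_dims = json.loads(row["sub_dims"])
for key, value in sub_dims.items():
    print(f"{key}: {value}")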

🚀 Inference

We provide reference code for multi-node inference based on FLUX.1-dev.

# English Prompt
bash inference/flux_en_dist_infer.sh

# Chinese Prompt
bash inference/flux_zh_dist_infer.sh

For each test prompt, 4 images are generated and stored in the following folder structure:

output_directory/
  ├── 0_0.png
  ├── 0_1.png
  ├── 0_2.png
  ├── 0_3.png
  ├── 1_0.png
  ├── 1_1.png
  ...

The file naming follows the pattern promptID_imageID.png, where promptID is the prompt's index in the test CSV and imageID ranges from 0 to 3.
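
If you are not using the provided multi-node scripts, a minimal single-GPU sketch with diffusers' FluxPipeline that reproduces this folder layout and naming convention looks roughly like the following. The output directory name, seeds, and sampling settings here are illustrative assumptions, not the settings used by inference/flux_en_dist_infer.sh.

import os

import pandas as pd
import torch
from diffusers import FluxPipeline

# Single-GPU sketch; the repo's inference/*.sh scripts are the reference
# multi-node implementation. Sampling settings below are illustrative.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

df = pd.read_csv("data/test_prompts_en.csv")
os.makedirs("flux_output", exist_ok=True)

for _, row in df.iterrows():
    for image_id in range(4):  # 4 images per prompt
        image = pipe(
            row["prompt"],
            num_inference_steps=28,
            guidance_scale=3.5,
            generator=torch.Generator("cuda").manual_seed(image_id),
        ).images[0]
        # Naming convention expected by the eval scripts: promptID_imageID.png
        image.save(os.path.join("flux_output", f"{row['index']}_{image_id}.png"))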

✨ Evaluation with Gemini 2.5 Pro

We use the gemini-2.5-pro API, released as General Availability (GA) on June 17, 2025.

1. Evaluation

#!/bin/bash

# API
API_KEY="sk-xxxxxxx"
BASE_URL=""

DATA_PATH="flux_output"  # Directory of generated images
CSV_FILE="data/test_prompts_en.csv" # English test prompt file

# English Evaluation
python eval/gemini_en_eval.py \
  --data_path "$DATA_PATH" \
  --api_key "$API_KEY" \
  --base_url "$BASE_URL" \
  --csv_file "$CSV_FILE"

# Chinese Evaluation
CSV_FILE="data/test_prompts_zh.csv" # Chinese test prompt file

python eval/gemini_zh_eval.py \
  --data_path "$DATA_PATH" \
  --api_key "$API_KEY" \
  --base_url "$BASE_URL" \
  --csv_file "$CSV_FILE"
  • After evaluation, scores across all dimensions will be printed to the console.
  • A detailed .csv results file will also be saved in the ./results directory.
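
For orientation, the sketch below shows roughly what a single judging request might look like when routed through an OpenAI-compatible endpoint. eval/gemini_en_eval.py is the authoritative implementation; the base_url, the judge_testpoint helper, the question wording, and the Yes/No answer format here are all illustrative assumptions.

import base64

from openai import OpenAI

# Hypothetical single judging call; eval/gemini_en_eval.py implements the
# actual batching, prompting, and score parsing.
client = OpenAI(api_key="sk-xxxxxxx", base_url="https://your-proxy.example/v1")

def judge_testpoint(image_path, prompt, testpoint, description):
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gemini-2.5-pro",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"Prompt: {prompt}\nTestpoint: {testpoint}\n"
                    f"Testpoint description: {description}\n"
                    "Does the image satisfy this testpoint? "
                    "Answer Yes or No, then give a brief reason."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# In practice, prompt/testpoint/description would be pulled from the test CSV
# and its sub_dims annotations; the values below are examples from the README.
print(judge_testpoint("flux_output/0_0.png", "<prompt text>",
                      "Attribute - Quantity", "two pandas"))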

2. Calculate Score

You can also load the results file to re-print or further analyze the scores.

python eval/calculate_score.py
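
If you prefer to work with the saved results directly, a pandas one-off along these lines is enough. Note that the results file name and the column names used here (testpoint, score) are assumptions about the output schema; eval/calculate_score.py remains the authoritative aggregation.

import pandas as pd

# Hypothetical post-hoc analysis; the results file name and the
# "testpoint"/"score" column names are assumptions about the saved CSV.
results = pd.read_csv("results/your_results_file.csv")

# Mean score per evaluation dimension, plus the overall average.
per_dim = results.groupby("testpoint")["score"].mean().sort_values(ascending=False)
print(per_dim)
print("Overall:", results["score"].mean())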

✨ Evaluation with UniGenBench-EvalModel

1. Deploy vLLM server

  1. Install vLLM
pip install vllm==0.9.0.1 transformers==4.52.4
  2. Start server
# LOCAL_IP should hold the IP address of the host running the vLLM server
echo ${LOCAL_IP}

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve CodeGoat24/UniGenBench-EvalModel-qwen-72b-v1 \
    --host ${LOCAL_IP} \
    --trust-remote-code \
    --served-model-name QwenVL \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1 \
    --limit-mm-per-prompt image=2 \
    --port 8080 
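
Once the server reports it is ready, you can sanity-check its OpenAI-compatible endpoint before launching the eval scripts. The snippet below simply lists the served models and should print QwenVL, the name set via --served-model-name.

from openai import OpenAI

# Connectivity check against the vLLM OpenAI-compatible server.
# Replace LOCAL_IP with the address used when starting the server.
client = OpenAI(api_key="EMPTY", base_url="http://LOCAL_IP:8080/v1")

for model in client.models.list().data:
    print(model.id)  # expected: QwenVL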

2. Evaluation

#!/bin/bash

# vLLM request url
API_URL=http://${LOCAL_IP}:8080

DATA_PATH="flux_output"  # Directory of generated images
CSV_FILE="data/test_prompts_en.csv" # English test prompt file

# English Evaluation
python eval/qwenvl_72b_en_eval.py \
  --data_path "$DATA_PATH" \
  --api_url "$API_URL" \
  --csv_file "$CSV_FILE"

# Chinese Evaluation
CSV_FILE="data/test_prompts_zh.csv" # Chinese test prompt file

python eval/qwenvl_72b_zh_eval.py \
  --data_path "$DATA_PATH" \
  --api_url "$API_URL" \
  --csv_file "$CSV_FILE"
  • After evaluation, scores across all dimensions will be printed to the console.
  • A detailed .csv results file will also be saved in the ./results directory.

3. Calculate Score

You can also load the results file to re-print or further analyze the scores.

python eval/calculate_score.py

📧 Contact

If you have any comments or questions, please open a new issue or feel free to contact Yibin Wang.

⭐ Citation

@article{UniGenBench++,
  title={UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Bu, Jiazi and Zhou, Yujie and Xin, Yi and He, Junjun and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2510.18701},
  year={2025}
}

@article{Pref-GRPO&UniGenBench,
  title={Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Zhou, Yujie and Bu, Jiazi and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2508.20751},
  year={2025}
}

πŸ… Evaluation Leaderboards

English Short Prompt Evaluation


English Long Prompt Evaluation


Chinese Short Prompt Evaluation


Chinese Long Prompt Evaluation

