CodeGoat24/UnifiedReward

Official implementation of UnifiedReward & [NeurIPS 2025] UnifiedReward-Think

UnifiedReward Series Works

UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation: We propose UniGenBench++, a unified semantic benchmark for T2I generation. It supports both short and long prompts in Chinese and English, featuring a streamlined evaluation pipeline and a robust offline evaluation model. Project Page

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning: We propose Pref-GRPO and UniGenBench, the first preference reward-based GRPO method for stable T2I reinforcement learning, together with a unified T2I generation benchmark for fine-grained semantic consistency evaluation. Project Page

[NeurIPS 2025] Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning: We propose UnifiedReward-Think, the first unified multimodal CoT reward model. Project Page

Unified Reward Model for Multimodal Understanding and Generation: We release UnifiedReward, the first unified reward model for multimodal understanding and generation assessment, enabling both pairwise ranking and pointwise scoring.

Awesome Works using UnifiedReward

😊 Meta, Transition Matching: Scalable and Flexible Generative Modeling.

😊 NVIDIA, Stanford, Tsinghua, DiffusionNFT: Online Diffusion Reinforcement with Forward Process. [code]

😊 University of California, USTC, PKU, BIGAI, MILR: Improving Multimodal Image Generation via Test-time Latent Reasoning.

😊 Kuaishou, Tsinghua, CUHK, Flow-GRPO: Training Flow Matching Models via Online RL. [code]

😊 Tencent Hunyuan, MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE. [code]

😊 Kling Team, CUHK MMLab, NJU, VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning. [code]

😊 CUHK MMLab, Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO. [code]

| Method | HPS | ImageReward | UnifiedReward |
|---|---|---|---|
| Janus-Pro + DPO | 77.3 | 77.7 | 80.0 |
| Janus-Pro + GRPO | 79.2 | 79.3 | 81.0 |
| Janus-Pro + Best-of-4 | 82.1 | 82.4 | 84.5 |

😊 Tencent Hunyuan X, X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again. [code]

🔥 News

[2025/10/23] 🔥🔥🔥 We release UnifiedReward-Edit-[3b/7b], a unified reward model for both Text-to-Image and Image-to-Image generation, trained on approximately 700K unified image generation and editing reward data!! For the image editing reward task, our models support:

  1. Pairwise Rank — directly judge which of two edited images is better.

  2. Pairwise Score — assign a separate score to each image in a pair.

  3. Pointwise Score — rate a single image on two axes: instruction-following and overall image quality.

🚀 The image editing reward inference code is available in the UnifiedReward-Edit/ directory, while the T2I inference code is unchanged from previous models. The editing training data is preprocessed from EditScore and EditReward and will be released soon. We sincerely appreciate all contributors!!
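
For illustration, a pairwise-rank editing query can be assembled as a standard Qwen2.5-VL chat message containing the source image, the two edited results, and the edit instruction. The sketch below is only an assumption-laden example: the prompt wording and image paths are illustrative, not the exact template used in UnifiedReward-Edit/.

# Minimal sketch of a pairwise-rank editing query (illustrative prompt wording and paths).
instruction = "Replace the red car with a blue bicycle."
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "examples/source.png"},  # original image
        {"type": "image", "image": "examples/edit_A.png"},  # edited result 1
        {"type": "image", "image": "examples/edit_B.png"},  # edited result 2
        {"type": "text", "text": (
            f"The first image is the source image and the editing instruction is: '{instruction}'. "
            "The second and third images are two edited results. "
            "Which edited result follows the instruction better while preserving image quality? "
            "Answer with 'Image 1' or 'Image 2' and briefly explain why."
        )},
    ],
}]
# `messages` can then be run through the same Qwen2.5-VL chat pipeline used for the
# other UnifiedReward checkpoints (see the inference sketch in the Inference section below).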

[2025/9/25] 🔥🔥🔥 We release UnifiedReward-2.0-qwen-[3b/7b/32b/72b]. This version introduces several new capabilities:

  1. Pairwise scoring for image and video generation assessment on the Alignment, Coherence, and Style dimensions.

  2. Pointwise scoring for image and video generation assessment on the Alignment, Coherence/Physics, and Style dimensions.

The added inference code is available in the inference_qwen/UnifiedReward-2.0-inference directory. The newly added training data has been released here 😊.

😊 We are actively gathering feedback from the community to improve our models. We welcome your input and encourage you to stay updated through our repository!!

Unified Reward Model for Multimodal Understanding and Generation

Paper PDF Project Page


😊 We appreciate the mradermacher team for providing the GGUF version of our models, and the Tencent Hunyuan team for providing the evaluation results on several T2I models using UnifiedReward-qwen-7b!! The evaluation was conducted on 400 prompts sourced from here.

Evaluation results on several T2I models:

| Model | Alignment | Coherence | Style |
|---|---|---|---|
| Flux-pro-ultra | 3.6453 | 3.8193 | 3.4971 |
| Imagen-4.0 | 3.6792 | 3.8049 | 3.4756 |
| Recraft-v3 | 3.6611 | 3.8409 | 3.5158 |
| OpenAI-GPT-image-1 | 3.6890 | 3.8448 | 3.4960 |
| Imagen-3.0 | 3.6733 | 3.8027 | 3.4674 |
| Seedream-3.0 | 3.6927 | 3.8218 | 3.4887 |

🔥🔥🔥 [NeurIPS 2025] UnifiedReward-Think

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Paper PDF Project Page

We release UnifiedReward-Think -- the first unified multimodal CoT reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks.

Please refer to the README.md for training and inference details.

UnifiedReward-Think-qwen-2.0 [3b/7b/32b/72b] are coming soon!!

🔥🔥 We release UnifiedReward-Think-qwen-7b, a more powerful unified multimodal CoT reward model built upon UnifiedReward-qwen-7b!!

🔥🔥 We have released a Gradio demo for UnifiedReward-Think!

🏁 Compared with Current Reward Models

| Reward Model | Method | Image Generation | Image Understanding | Video Generation | Video Understanding | CoT Reasoning |
|---|---|---|---|---|---|---|
| PickScore | Point | ✔️ | | | | |
| HPS | Point | ✔️ | | | | |
| ImageReward | Point | ✔️ | | | | |
| LLaVA-Critic | Pair/Point | | ✔️ | | | |
| IXC-2.5-Reward | Pair/Point | | ✔️ | | ✔️ | |
| VideoScore | Point | | | ✔️ | | |
| LiFT | Point | | | ✔️ | | |
| VisionReward | Point | ✔️ | | ✔️ | | |
| VideoReward | Point | | | ✔️ | | |
| UnifiedReward (Ours) | Pair/Point | ✔️ | ✔️ | ✔️ | ✔️ | |
| UnifiedReward-Think (Ours) | Pair/Point | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |

🔧 Environment Set Up

  1. Clone this repository and navigate to the UnifiedReward folder:
git clone https://github.com/CodeGoat24/UnifiedReward.git
cd UnifiedReward
  2. Install the inference packages:
conda create -n unifiedreward python=3.10 -y
conda activate unifiedreward
pip install --upgrade pip  
pip install -e ".[train]"
pip install flash_attn==2.5.8 --no-build-isolation

🚀 Inference

For Qwen2.5-VL-based UnifiedReward models, first install the inference packages as follows:

pip install git+https://github.com/huggingface/transformers accelerate "qwen-vl-utils[decord]==0.0.8"

We provide reference pairwise ranking and pointwise scoring inference code for each task in the ./inference and ./inference_qwen directories.

inference
├── image_generation
│   ├── pair_rank_image_generation.py
│   └── point_score_image_generation.py
├── video_understanding
│   ├── pair_rank_video_understanding.py
│   └── point_score_video_understanding.py
└── ...

Note that our model is not constrained to a fixed input prompt style. You can flexibly adjust inputs based on your requirements.
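
For reference, a minimal pointwise-scoring sketch for the Qwen2.5-VL-based checkpoints using the standard transformers/qwen-vl-utils API is shown below. The model id, image path, and prompt wording are assumptions, so prefer the scripts in ./inference_qwen for the exact templates.

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Assumed checkpoint id; replace with the UnifiedReward model you actually use.
model_path = "CodeGoat24/UnifiedReward-qwen-7b"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# Illustrative pointwise query: one generated image plus its text prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "examples/generated.png"},
        {"type": "text", "text": "The image above was generated from the prompt "
                                 "'a corgi surfing at sunset'. Rate its alignment with "
                                 "the prompt and its overall quality."},
    ],
}]

# Standard Qwen2.5-VL preprocessing: chat template plus packed vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
answers = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)],
    skip_special_tokens=True,
)
print(answers[0])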

1. vLLM Inference

We provide vLLM inference code for UnifiedReward-qwen in the vllm_qwen directory.

  1. Install vLLM
pip install vllm==0.9.0.1 transformers==4.52.4
  2. Deploy vLLM Server
bash vllm_qwen/vllm_server.sh
  3. Inference Request to vLLM Server
python vllm_qwen/vllm_inference.py
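
As an illustration of step 3, the request can also be issued with the OpenAI Python client, assuming vllm_server.sh exposes vLLM's OpenAI-compatible endpoint on the default port 8000; the served model name, image, and prompt below are placeholders.

import base64
from openai import OpenAI  # pip install openai

# Assumes an OpenAI-compatible vLLM server on localhost:8000; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("examples/generated.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="CodeGoat24/UnifiedReward-qwen-7b",  # must match the name the server registers
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Rate this image for alignment with the prompt "
                                     "'a corgi surfing at sunset' and for overall quality."},
        ],
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)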

2. SGLang Inference

We provide SGLang inference code for UnifiedReward-llava in the sglang_llava directory.

  1. Install SGLang
pip install "sglang[all]"
  2. Deploy SGLang Server
bash sglang_llava/sglang_server.sh
  3. Inference Request to SGLang Server
python sglang_llava/sglang_inference.py
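
Similarly, the SGLang server can be queried over its OpenAI-compatible HTTP endpoint. The sketch below assumes the default port 30000 and uses a placeholder model name, image URL, and prompt, so check sglang_llava/sglang_inference.py for the exact request format.

import requests

# Assumes sglang_server.sh launches an OpenAI-compatible endpoint on localhost:30000.
payload = {
    "model": "UnifiedReward-7b",  # placeholder; use the model name the server reports
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/generated.png"}},
            {"type": "text", "text": "Rate this image for prompt alignment and overall quality."},
        ],
    }],
    "temperature": 0.0,
}
resp = requests.post("http://localhost:30000/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])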

💻 Training UnifiedReward

1. Training based on Qwen2.5-VL-Instruct (Recommended)

We use LLaMA-Factory to train the SFT model.

  1. Clone the LLaMA-Factory repository and install the dependencies.
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

  2. Follow this README (Multimodal Image Dataset) to prepare our released datasets; a sketch of the expected data format is shown after the training command below.

  3. Run the following command to train the SFT model.
llamafactory-cli train examples/train_full/qwen2_5vl_full_sft.yaml
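
For reference, a single training sample prepared in step 2 follows LLaMA-Factory's multimodal (sharegpt-style) format and looks roughly like the sketch below; the conversation content is a made-up placeholder, and the actual samples come from our released reward datasets prepared per the linked README.

[
    {
        "messages": [
            {"role": "user", "content": "<image>You are given a text prompt and a generated image. Prompt: '...'. Rate the image on alignment, coherence, and style."},
            {"role": "assistant", "content": "Alignment: ..., Coherence: ..., Style: ..."}
        ],
        "images": ["path/to/generated_image.jpg"]
    },
    ...
]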

2. Training based on LLaVA-Onevision

2.1 Unified Preference Training Dataset Preparation

Please download our constructed unified preference dataset from Huggingface and put it in ./dataset/.

dataset
├── EvalMuse
│   ├── pairwise
│   ├── pointwise
│   └── ...
├── HPD
├── LiFT-HRA
├── LLaVA-Critic
│   ├── pairwise
│   ├── pointwise
│   └── ...
├── OIP
├── ShareGPTVideo
│   ├── pairwise
│   ├── pointwise
│   └── ...
├── VideoDPO
├── VideoFeedback
└── train_data.yaml

2.2 Training

bash train.sh

✨ Direct Preference Optimization

🎨 Image and Video Understanding DPO

1. Construct Preference data

The data for preference data construction should adhere to the following structure:

[
    {
        "prompt": "",
        "image": ""
    },
    ...
]

Then

# image understanding 
cd preference_data_construction/image_understanding
python infer+sift.py # you need to fill the 'image_folder' and 'data_path' in this file

# video understanding 
cd preference_data_construction/video_understanding
python infer+sift.py # you need to fill the 'image_folder' and 'data_path' in this file

2. Training

The training data format in data.json should adhere to the following structure:

[
    {
        "id": "",
        "image": "",
        "prompt": "",
        "chosen": "",
        "rejected": ""
    },
    ...
]

Then start training:

# image understanding 
bash dpo_image_understand_ov7b.sh 

# video understanding 
bash dpo_video_understand_llava_video_7b.sh
🖼️ Image Generation DPO

0. Prepare Environments

cd DiffusionDPO
conda create -n diffdpo python=3.10 -y
conda activate diffdpo
pip install -r requirements.txt

1. Construct Preference data

Image Generation

The data for preference data construction should adhere to the following structure:

[
    {
        "prompt": ""
    },
    ...
]

Then

python data_generation.py # you need to fill the 'data_path' in this file

Preference Pair Data Construction

python sift_dpo_data.py

2. Training

The training data format in data.json should adhere to the following structure:

[
    {
        "id": "",
        "caption": "",
        "jpg_0": "", # chosen image path
        "jpg_1": "", # rejected image path
        "label_0": 1 # 1 indicates that jpg_0 is the chosen image
    },
    ...
]

Then start training:

bash launchers/turbo_dpo.sh
🎬 Video Generation DPO

0. Prepare Environments

cd VideoDPO
conda create -n videodpo python=3.10 -y
conda activate videodpo
pip install -r requirements.txt

Run the following commands to download the VideoCrafter2 checkpoint.

mkdir -p checkpoints/vc2
wget -P checkpoints/vc2 https://huggingface.co/VideoCrafter/VideoCrafter2/resolve/main/model.ckpt

Please download our constructed T2V-Turbo model and its reference model from Huggingface and put them in ./checkpoints/t2v-turbo.

1. Construct Preference data

Video Generation

The data for preference data construction should adhere to the following structure:

[
    {
        "prompt": ""
    },
    ...
]

Then

bash data_generation.sh # you need to fill '--prompts_file' in this file

Preference Pair Data Construction

python sift_dpo_data.py

2. Training

The training data format in data.json should adhere to the following structure:

[
    {
        "id": "",
        "caption": "",
        "chosen": "", # chosen video path
        "rejected": "", # rejected video path
    },
    ...
]

Then start training:

bash run.sh

🚀 Evaluation

We provide evaluation code for several benchmarks in the ./benchmark_evaluation directory.

Reward Model

We provide evaluation code for the GenAI-Bench-Video, GenAI-Bench-Image, VideoGen-RewardBench, and VL-RewardBench benchmarks.

Video Understanding

We provide evaluation code for the MSRVTT, MSVD, and TGIF benchmarks, and use the VLMEvalKit toolkit to evaluate the LongVideoBench, MLVU, and Video-MME benchmarks with 64 input frames.

Image Understanding

We use the LMMs-Eval toolkit to evaluate the LLaVABench, WildVision, LLaVABench-Wilder, LiveBench, and MMHal benchmarks.

Image Generation

We utilize image reward models, i.e., PickScore, HPS, and ImageReward, for quality assessment.

Video Generation

VBench is used for video generation assessment.

📧 Contact

If you have any comments or questions, please open a new issue or feel free to contact Yibin Wang.

🤗 Acknowledgments

In this work, the reward model and the image/video understanding DPO code are based on LLaVA-Next, while the image and video generation DPO code is based on DiffusionDPO and VideoDPO.

We also utilize the LMMs-Eval and VLMEvalKit toolkits for evaluation.

Thanks to all the contributors!

⭐ Citation

@article{unifiedreward-think,
  title={Unified multimodal chain-of-thought reward model through reinforcement fine-tuning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2505.03318},
  year={2025}
}
@article{unifiedreward,
  title={Unified reward model for multimodal understanding and generation},
  author={Wang, Yibin and Zang, Yuhang and Li, Hao and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2503.05236},
  year={2025}
}
