2025-10-10: Migrated the codebase to veRL and LLaMA-Factory.
Since the agentic control flow differs across codebases, we are still working on unifying it under the veRL style so everyone can get started more quickly. Stay tuned!
2025-09-22: More details and demos are coming soon.
Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations:
- Visual inputs consume a large share of the context budget, forcing the use of fewer frames and losing fine-grained detail
- All visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning
To overcome these issues, we introduce VR-Thinker, a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability.
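To make the mechanism concrete, below is a minimal sketch (not the actual implementation) of a thinking-with-image scoring loop with a bounded visual memory window; the class `VisualMemory`, the reward-model interface (`rm.generate_step`, `rm.force_judgment`), and the operation names are hypothetical placeholders.

```python
# Illustrative sketch only: the reward-model interface and field names below
# are hypothetical placeholders, not VR-Thinker's actual API.
from collections import deque

class VisualMemory:
    """Bounded window of visual evidence kept inside the RM's context."""
    def __init__(self, max_frames: int = 4):
        self.frames = deque(maxlen=max_frames)  # oldest frames are evicted

    def add(self, frame):
        self.frames.append(frame)

    def as_context(self):
        return list(self.frames)

def score_with_visual_reasoning(rm, video_frames, prompt, max_steps: int = 8):
    """Interleave text reasoning with frame-selection operations."""
    memory = VisualMemory(max_frames=4)
    trace = [prompt]
    for _ in range(max_steps):
        # the RM emits either a reasoning step, a visual operation such as
        # select_frame(i), or a final per-dimension / overall judgment
        step = rm.generate_step(text=trace, images=memory.as_context())
        if step.op == "select_frame":
            memory.add(video_frames[step.frame_index])  # fetch new evidence
        elif step.op == "final_judgment":
            return step.scores
        trace.append(step.text)
    # fall back to a forced judgment if the step budget is exhausted
    return rm.force_judgment(text=trace, images=memory.as_context())
```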
We activate visual reasoning via a reinforcement fine-tuning pipeline:
- Cold Start: fine-tune on curated visual chain-of-thought data to distill basic reasoning skills and operation formatting
- Rejection Sampling Fine-Tuning (RFT): select samples whose per-dimension and overall judgments are all correct, then fine-tune on these high-quality traces to further enhance reasoning
- Group Relative Policy Optimization (GRPO): apply GRPO to further strengthen the model's reasoning
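For reference, GRPO replaces a learned value baseline with group-relative advantages: each response is scored against the other responses sampled for the same prompt. A minimal sketch of that normalization (the clipped policy-gradient objective around it is omitted):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: normalize each reward within its own group.

    `rewards` has shape (num_groups, group_size), where each row holds the
    rewards of the responses sampled for one prompt.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# e.g. 8 rollouts for one prompt, rewarded by whether the judgment is correct
rewards = np.array([[1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
```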
Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos. A 7B VR-Thinker achieves:
- 80.5% on VideoGen Reward
- 82.3% on GenAI-Bench
- 75.6% on MJ-Bench-Video
These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
We propose a three-stage post-training pipeline. The Cold Start and Rejection Sampling Fine-Tuning code is adapted from Open-R1, and the GRPO training code is adapted from PixelReasoner.
Follow these steps to start the instruction tuning process:
- Installation

```bash
conda create -n vr-thinker python=3.10 -y
conda activate vr-thinker
cd rl_train_verl
pip install -e '.[vllm]'
pip install -e '.[sglang]'
cd sft_train_llama_factory
pip install -e ".[torch,metrics,qwen]" --no-build-isolation
```
- Configuration
  - configure the model and data paths in `sft.sh`
  - use the corresponding data for Cold Start and Rejection Sampling Fine-Tuning, respectively
  - configure the corresponding environment variables
- Launch Training

```bash
cd sft_train_trl
bash sft.sh
# or, to use LLaMA-Factory
bash examples/sft_llmafactory.sh
bash examples/sft_llmafactory.slurm
```
- Data Sampling

For Rejection Sampling Fine-Tuning, we need to sample and filter data (a sketch of the filtering rule is shown after these steps). To sample from a VR-Thinker checkpoint:

```bash
cd rl_train_openrlhf
bash scripts/sampling.sh
```
- Configuration
  - configure the model and data paths in `training.sh`
  - configure the corresponding environment variables
  - then run the following commands to start training
- Launch Training

```bash
cd rl_train_openrlhf
bash scripts/training.sh
```
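As referenced in the Data Sampling step, here is a minimal sketch of the filtering rule used to build RFT data: keep only traces whose per-dimension judgments and overall judgment all match the ground-truth preferences. The JSONL field names (`pred_dims`, `gt_dims`, `pred_overall`, `gt_overall`) are illustrative assumptions, not the repo's actual schema.

```python
import json

def keep_trace(record: dict) -> bool:
    """Keep a sampled trace only if every per-dimension judgment and the
    overall judgment agree with the ground-truth labels."""
    dims_correct = all(
        pred == gt for pred, gt in zip(record["pred_dims"], record["gt_dims"])
    )
    return dims_correct and record["pred_overall"] == record["gt_overall"]

def filter_sampled_traces(in_path: str, out_path: str) -> None:
    """Read sampled traces (one JSON object per line) and write the kept ones."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if keep_trace(json.loads(line)):
                fout.write(line)
```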
This repo is built on Pixel-Reasoner, OpenRLHF, LLaMA-Factory, and veRL. We thank the authors for their valuable contributions to the AIGC community.
If you find VR-Thinker useful for your research or projects, we would greatly appreciate it if you could cite the following paper:
```bibtex
@misc{wang2025vrthinkerboostingvideoreward,
      title={VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning},
      author={Qunzhong Wang and Jie Liu and Jiajun Liang and Yilei Jiang and Yuanxing Zhang and Jinyuan Chen and Yaozhi Zheng and Xintao Wang and Pengfei Wan and Xiangyu Yue and Jiaheng Liu},
      year={2025},
      eprint={2510.10518},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.10518},
}
```