2025-10-10: Migrated the codebase to veRL and LLaMA-Factory.
Since the agentic control flow differs across codebases, we are still working on unifying it under the veRL style so everyone can get started more quickly. Stay tuned!
2025-09-22: More details and demos are coming soon.
Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations:
- Visual inputs consume a large share of the context budget, forcing the use of fewer frames and losing fine-grained detail
- All visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning
To overcome these issues, we introduce VR-Thinker, a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability.
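To make the mechanism concrete, below is a minimal sketch (not the actual implementation) of a thinking-with-image scoring loop with a bounded visual memory window; the class `VisualMemory`, the reward-model interface (`rm.generate_step`, `rm.force_judgment`), and the operation names are hypothetical placeholders.

```python
# Illustrative sketch only: the reward-model interface and field names below
# are hypothetical placeholders, not VR-Thinker's actual API.
from collections import deque

class VisualMemory:
    """Bounded window of visual evidence kept inside the RM's context."""
    def __init__(self, max_frames: int = 4):
        self.frames = deque(maxlen=max_frames)  # oldest frames are evicted

    def add(self, frame):
        self.frames.append(frame)

    def as_context(self):
        return list(self.frames)

def score_with_visual_reasoning(rm, video_frames, prompt, max_steps: int = 8):
    """Interleave text reasoning with frame-selection operations."""
    memory = VisualMemory(max_frames=4)
    trace = [prompt]
    for _ in range(max_steps):
        # the RM emits either a reasoning step, a visual operation such as
        # select_frame(i), or a final per-dimension / overall judgment
        step = rm.generate_step(text=trace, images=memory.as_context())
        if step.op == "select_frame":
            memory.add(video_frames[step.frame_index])  # fetch new evidence
        elif step.op == "final_judgment":
            return step.scores
        trace.append(step.text)
    # fall back to a forced judgment if the step budget is exhausted
    return rm.force_judgment(text=trace, images=memory.as_context())
```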
We activate visual reasoning via a reinforcement fine-tuning pipeline:
- Cold Start: fine-tune on curated visual chain-of-thought data to distill basic reasoning skills and operation formatting
- Rejection Sampling Fine-Tuning (RFT): select samples whose per-dimension and overall judgments are all correct, then fine-tune on these high-quality traces to further enhance reasoning
- Group Relative Policy Optimization (GRPO): apply GRPO to further strengthen the model's reasoning
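For reference, GRPO replaces a learned value baseline with group-relative advantages: each response is scored against the other responses sampled for the same prompt. A minimal sketch of that normalization (the clipped policy-gradient objective around it is omitted):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: normalize each reward within its own group.

    `rewards` has shape (num_groups, group_size), where each row holds the
    rewards of the responses sampled for one prompt.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# e.g. 8 rollouts for one prompt, rewarded by whether the judgment is correct
rewards = np.array([[1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
```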
Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos. A 7B VR-Thinker achieves:
- 80.5% on VideoGen Reward
- 82.3% on GenAI-Bench
- 75.6% on MJ-Bench-Video
These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
We propose a three-stage post-training pipeline. The Cold Start and Rejection Sampling Fine-Tuning code is adapted from Open-R1, and the GRPO training code is adapted from PixelReasoner.
Follow these steps to start the instruction tuning process:
- Installation

```bash
conda create -n vr-thinker python=3.10 -y
conda activate vr-thinker
cd rl_train_verl
pip install -e '.[vllm]'
pip install -e '.[sglang]'
cd sft_train_llama_factory
pip install -e ".[torch,metrics,qwen]" --no-build-isolation
```
- Configuration
  - configure the model and data paths in `sft.sh`
  - use the corresponding data for Cold Start and Rejection Sampling Fine-Tuning, respectively
  - configure the corresponding environment variables
- Launch Training

```bash
cd sft_train_trl
bash sft.sh
# or, to use LLaMA-Factory
bash examples/sft_llmafactory.sh
bash examples/sft_llmafactory.slurm
```
- Data Sampling

For Rejection Sampling Fine-Tuning, we need to sample and filter data (a sketch of the filtering rule is shown after these steps). To sample from a VR-Thinker checkpoint:

```bash
cd rl_train_openrlhf
bash scripts/sampling.sh
```
- Configuration
  - configure the model and data paths in `training.sh`
  - configure the corresponding environment variables
  - then run the following commands to start training
- Launch Training

```bash
cd rl_train_openrlhf
bash scripts/training.sh
```
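As referenced in the Data Sampling step, here is a minimal sketch of the filtering rule used to build RFT data: keep only traces whose per-dimension judgments and overall judgment all match the ground-truth preferences. The JSONL field names (`pred_dims`, `gt_dims`, `pred_overall`, `gt_overall`) are illustrative assumptions, not the repo's actual schema.

```python
import json

def keep_trace(record: dict) -> bool:
    """Keep a sampled trace only if every per-dimension judgment and the
    overall judgment agree with the ground-truth labels."""
    dims_correct = all(
        pred == gt for pred, gt in zip(record["pred_dims"], record["gt_dims"])
    )
    return dims_correct and record["pred_overall"] == record["gt_overall"]

def filter_sampled_traces(in_path: str, out_path: str) -> None:
    """Read sampled traces (one JSON object per line) and write the kept ones."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if keep_trace(json.loads(line)):
                fout.write(line)
```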
This repo is built on Pixel-Reasoner, OpenRLHF, LLaMA-Factory, and veRL. We thank the authors for their valuable contributions to the AIGC community.
If you find VR-Thinker useful for your research or projects, we would greatly appreciate it if you could cite the following paper:
```bibtex
@misc{wang2025vrthinkerboostingvideoreward,
      title={VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning},
      author={Qunzhong Wang and Jie Liu and Jiajun Liang and Yilei Jiang and Yuanxing Zhang and Jinyuan Chen and Yaozhi Zheng and Xintao Wang and Pengfei Wan and Xiangyu Yue and Jiaheng Liu},
      year={2025},
      eprint={2510.10518},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.10518},
}
```