VideoReward Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

🗓️ Logs

2025-10-10

Migrated the codebase to veRL and LLaMA-Factory.

Since the agentic control flow differs for every codebase, we are still working on unifying everything in the veRL style so that everyone can get started more quickly. Stay tuned!

2025-09-22

More details and demos are coming soon.

✨ Overview

Figure: VideoReward Thinker (VR-Thinker) overview.

Recent advances in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations:

  1. Visual inputs consume large context budgets, forcing the model to see fewer frames and lose fine-grained detail.
  2. Packing all visual information into the initial prompt exacerbates hallucination and forgetting during chain-of-thought reasoning.

To overcome these issues, we introduce VR-Thinker, a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., selecting frames) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within its context limits, improving reasoning fidelity and reliability.
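The exact operation set and memory layout are defined in the training code; as a minimal illustrative sketch (all names below, such as VisualMemory and select_frames, are hypothetical assumptions, not the repo's API), the visual memory window can be pictured as a bounded frame buffer that the RM refills through a select-frame operation:

    from collections import deque

    class VisualMemory:
        """Hypothetical sketch of a configurable visual memory window."""

        def __init__(self, all_frames, window_size=8):
            self.all_frames = all_frames              # full decoded video, kept out of context
            self.window = deque(maxlen=window_size)   # frames currently visible to the RM

        def select_frames(self, indices):
            """Visual reasoning op: pull the requested frames into the window.

            The deque evicts the oldest frames automatically, so the RM can
            keep acquiring evidence without exceeding its context budget.
            """
            for i in indices:
                self.window.append(self.all_frames[i])
            return list(self.window)

    # During chain-of-thought the RM might request frames 12-14 as evidence,
    # and the runtime answers with the updated window:
    # memory = VisualMemory(decoded_frames, window_size=8)
    # visible = memory.select_frames([12, 13, 14])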

Figure: Qualitative case.

🤗 Training Pipeline

We activate visual reasoning via a three-stage reinforcement fine-tuning pipeline:

  1. Cold Start: distill basic reasoning skills and operation formatting from curated visual chain-of-thought data.

  2. Rejection-sampling Fine-Tuning (RFT): select samples whose per-dimension and overall judgments are all correct, then fine-tune on these high-quality traces to further enhance reasoning.

  3. Group Relative Policy Optimization (GRPO): apply GRPO to further strengthen reasoning (see the sketch after this list).
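The GRPO stage follows the standard group-relative recipe: several reasoning traces are sampled per prompt, each is scored, and advantages are computed relative to the group. A minimal sketch of that advantage computation (the reward shown, 1 for a correct final judgment and 0 otherwise, is an illustrative assumption, not necessarily the repo's exact reward):

    import statistics

    def group_relative_advantages(rewards):
        """Standard GRPO-style advantages: normalize each rollout's reward
        against the mean and std of its own group of rollouts."""
        mean = statistics.mean(rewards)
        std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
        return [(r - mean) / (std + 1e-6) for r in rewards]

    # e.g. four rollouts for the same video-comparison prompt, rewarded 1
    # when the final preference judgment is correct:
    print(group_relative_advantages([1.0, 0.0, 1.0, 1.0]))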

🔥 Results

Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos:

A 7B VR-Thinker achieves:

  • 80.5% on VideoGen Reward
  • 82.3% on GenAI-Bench
  • 75.6% on MJ-Bench-Video

These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.

🚀 Quick Start

We propose a three-stage post-training pipeline. The Cold Start and Rejection-sampling Fine-Tuning code is adapted from Open-R1, and the GRPO training code is adapted from Pixel-Reasoner.

Cold Start and Rejection-sampling Fine-Tuning

Follow these steps to start the instruction tuning process:

  1. Installation

    # create and activate the environment
    conda create -n vr-thinker python=3.10 -y
    conda activate vr-thinker

    # RL training dependencies (veRL)
    cd rl_train_verl
    pip install -e '.[vllm]'
    pip install -e '.[sglang]'

    # SFT dependencies (LLaMA-Factory); assumes the directory sits next to rl_train_verl
    cd ../sft_train_llama_factory
    pip install -e ".[torch,metrics,qwen]" --no-build-isolation
  2. Configuration

    • Configure the model and data paths in sft.sh
    • Use the corresponding datasets for Cold Start and Rejection-sampling Fine-Tuning, respectively
    • Set the corresponding environment variables
  3. Launch Training

    cd sft_train_trl
    bash sft.sh

    # or, to use LLaMA-Factory
    bash examples/sft_llmafactory.sh
    bash examples/sft_llmafactory.slurm
  4. Data Sampling

    For Rejection-sampling Fine-Tuning, we first sample and then filter data. To sample from a VR-Thinker checkpoint (a sketch of the filtering predicate follows the commands):

    cd rl_train_openrlhf
    bash scripts/sampling.sh
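The filter applied to the sampled traces implements the stage-2 criterion above: a trace is kept only when every per-dimension judgment and the overall judgment agree with the labels. A minimal sketch, with a hypothetical record layout (the actual fields are defined by the sampling scripts):

    def keep_trace(trace, labels):
        """Keep a sampled trace for RFT only if all per-dimension judgments
        and the overall judgment are correct. The dict layout is hypothetical."""
        dims_ok = all(trace["dims"][d] == labels["dims"][d] for d in labels["dims"])
        return dims_ok and trace["overall"] == labels["overall"]

    # e.g. a trace judging two quality dimensions plus an overall winner:
    trace = {"dims": {"visual_quality": "A", "alignment": "A"}, "overall": "A"}
    labels = {"dims": {"visual_quality": "A", "alignment": "B"}, "overall": "A"}
    print(keep_trace(trace, labels))  # False: the alignment judgment is wrong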

Running GRPO Training

  1. Configuration

    • Configure the model and data paths in training.sh
    • Set the corresponding environment variables
  2. Launch Training

    cd rl_train_openrlhf
    bash scripts/training.sh

🤗 Acknowledgement

This repo builds on Pixel-Reasoner, OpenRLHF, LLaMA-Factory, and veRL. We thank the authors for their valuable contributions to the AIGC community.

⭐ Citation

If you find VR-Thinker useful for your research or projects, we would greatly appreciate it if you could cite the following paper:

@misc{wang2025vrthinkerboostingvideoreward,
      title={VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning},
      author={Qunzhong Wang and Jie Liu and Jiajun Liang and Yilei Jiang and Yuanxing Zhang and Jinyuan Chen and Yaozhi Zheng and Xintao Wang and Pengfei Wan and Xiangyu Yue and Jiaheng Liu},
      year={2025},
      eprint={2510.10518},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.10518},
}
