Ditto: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen
[Paper] [Project Page] [Model Weights] [Dataset]
To address the data scarcity problem, we introduce Ditto, a scalable pipeline for generating high-quality video editing data, which we use to train a new state-of-the-art instruction-based video editing model, Editto.
We introduce Ditto, a holistic framework designed to tackle the fundamental challenge of instruction-based video editing. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new SOTA in instruction-based video editing.
- Add code for Denoising Enhancing.
- 10/22/2025 - We have uploaded the CSVs that can be used directly for model training with DiffSynth-Studio, as well as the metadata JSON for the sim2real setting.
- 10/22/2025 - We have finished uploading all videos of the dataset!
# Create conda environment (if you already have a DiffSynth conda environment, you can reuse it)
conda create -n ditto python=3.10
conda activate ditto
pip install -e .
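Optionally, you can sanity-check the environment afterwards; this sketch assumes the editable install above pulls in PyTorch through DiffSynth-Studio's dependencies:
# Optional check: confirm PyTorch is installed and a CUDA device is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"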
Download the base model and our models from Google Drive or Hugging Face:
# Download Wan-AI/Wan2.1-VACE-14B from Hugging Face to models/Wan-AI/
hf download Wan-AI/Wan2.1-VACE-14B --local-dir models/Wan-AI/
# Download Ditto models
hf download QingyanBai/Ditto_models --include="models/*" --local-dir ./
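If the hf command is not available in your environment (it ships with recent versions of huggingface_hub), the older huggingface-cli entry point accepts the same arguments; a sketch of the equivalent commands:
# Fallback using the legacy huggingface-cli entry point (same arguments as above)
pip install -U huggingface_hub
huggingface-cli download Wan-AI/Wan2.1-VACE-14B --local-dir models/Wan-AI/
huggingface-cli download QingyanBai/Ditto_models --include="models/*" --local-dir ./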
You can either use the provided script or run Python directly:
# Option 1: Use the provided script
bash infer.sh
# Option 2: Run Python directly
python inference/infer_ditto.py \
--input_video /path/to/input_video.mp4 \
--output_video /path/to/output_video.mp4 \
--prompt "Editing instruction." \
--lora_path /path/to/model.safetensors \
--num_frames 73 \
--device_id 0
Some test cases can be found in the HF Dataset. You can also find some reference editing prompts in inference/example_prompts.txt.
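To run a batch of edits, a minimal sketch along these lines should work; it assumes inference/example_prompts.txt holds one instruction per line, and the input video and LoRA paths below are placeholders you adapt to your setup:
# Run the same input video through every prompt in the example prompt file
# INPUT and LORA are placeholders; point them at your own video and downloaded weights
mkdir -p outputs
INPUT=/path/to/input_video.mp4
LORA=/path/to/model.safetensors
i=0
while IFS= read -r prompt; do
  [ -z "$prompt" ] && continue  # skip empty lines
  python inference/infer_ditto.py \
    --input_video "$INPUT" \
    --output_video "outputs/edit_${i}.mp4" \
    --prompt "$prompt" \
    --lora_path "$LORA" \
    --num_frames 73 \
    --device_id 0 < /dev/null  # keep Python from consuming the prompt stream
  i=$((i+1))
done < inference/example_prompts.txt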
Note: ComfyUI runs faster with lower computational requirements (an 832×480×73 video needs ~11 GB of GPU memory and about 4 minutes on an A6000), but due to the use of quantized and distilled models there may be some quality degradation.
First, follow the ComfyUI installation guide to set up the base ComfyUI environment. We strongly recommend installing ComfyUI-Manager for easy custom node management:
# Install ComfyUI-Manager
cd ComfyUI/custom_nodes
git clone https://github.com/Comfy-Org/ComfyUI-Manager.git
After installing ComfyUI, you can either:
Option 1 (Recommended): Use ComfyUI-Manager to automatically install all required custom nodes with the function Install Missing Custom Nodes.
Option 2: Manually install the required custom nodes (you can refer to this page):
Download the required model weights from Kijai/WanVideo_comfy to subfolders of models/ (example download commands are sketched below). Required files include:
- Wan2_1-T2V-14B_fp8_e4m3fn.safetensors to diffusion_models/
- Wan21_CausVid_14B_T2V_lora_rank32_v2.safetensors to loras/ (for inference acceleration)
- Wan2_1_VAE_bf16.safetensors to vae/wan/
- umt5-xxl-enc-bf16.safetensors to text_encoders/
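One way to fetch these files is with the hf CLI, as in the sketch below; it assumes the filenames sit at the root of the Kijai/WanVideo_comfy repository and that ComfyUI/models/ is your ComfyUI model directory, so check the repository layout and adjust the paths if they differ:
# Hedged sketch: download each required file into the matching ComfyUI subfolder
# The exact paths inside Kijai/WanVideo_comfy are assumptions; verify them on the repo page first
hf download Kijai/WanVideo_comfy Wan2_1-T2V-14B_fp8_e4m3fn.safetensors --local-dir ComfyUI/models/diffusion_models/
hf download Kijai/WanVideo_comfy Wan21_CausVid_14B_T2V_lora_rank32_v2.safetensors --local-dir ComfyUI/models/loras/
hf download Kijai/WanVideo_comfy Wan2_1_VAE_bf16.safetensors --local-dir ComfyUI/models/vae/wan/
hf download Kijai/WanVideo_comfy umt5-xxl-enc-bf16.safetensors --local-dir ComfyUI/models/text_encoders/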
Download our models from Google Drive or Hugging Face to diffusion_models/ (use the VACE Module Select node for loading).
Use the workflow ditto_comfyui_workflow.json in this repo to get started.
We provide some reference prompts in the note. Some test cases can be found in the HF Dataset.
Note: If you want to test sim2real cases, you can try prompts like 'Turn it into the real domain'.
To train a model, first download the training CSV files from the csvs directory on Hugging Face, then use the provided train.sh script for training.
# Download the training CSVs from HF dataset to your local directory
hf download QingyanBai/Ditto-1M --include="csvs_for_DiffSynth/*" --local-dir ./
# Run training
bash train.sh
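Before launching training, it can help to take a quick look at what you downloaded; the snippet below only inspects the CSVs (the csvs_for_DiffSynth/ path comes from the download command above, and the column layout is whatever DiffSynth-Studio expects, which is not assumed here):
# Quick sanity check of the downloaded training CSVs before running train.sh
ls csvs_for_DiffSynth/
# Peek at the header and first rows of each CSV to confirm the expected columns
for f in csvs_for_DiffSynth/*.csv; do
  echo "== $f =="
  head -n 3 "$f"
done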
Thanks to DiffSynth-Studio, this codebase supports multi-node training; you can use DLRover to scale training across multiple machines, e.g. as sketched below.
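As a rough sketch of what this could look like, the command below uses dlrover-run, DLRover's torchrun-compatible launcher; the node count, per-node GPU count, and training entry script are placeholders, and you would reuse the arguments that train.sh passes to its underlying training script rather than the ones shown here:
# Hedged sketch: launch multi-node training with DLRover's torchrun-compatible runner
# NODE_NUM, GPUS_PER_NODE, and your_training_entry.py are placeholders
pip install "dlrover[torch]"
dlrover-run --nnodes=$NODE_NUM --nproc_per_node=$GPUS_PER_NODE \
  your_training_entry.py --your-training-args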
If you find this work useful, please consider citing our paper:
@article{bai2025ditto,
title={Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset},
author={Bai, Qingyan and Wang, Qiuyu and Ouyang, Hao and Yu, Yue and Wang, Hanlin and Wang, Wen and Cheng, Ka Leong and Ma, Shuailei and Zeng, Yanhong and Liu, Zichen and Xu, Yinghao and Shen, Yujun and Chen, Qifeng},
journal={arXiv preprint arXiv:2510.15742},
year={2025}
}
We thank Wan, VACE, and Qwen-Image for providing powerful foundation models, and QwenVL for its advanced visual understanding capabilities. We also thank DiffSynth-Studio for serving as the codebase of this repository.
This project is licensed under CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License).
The code is provided for academic research purposes only.
For any questions, please contact qingyanbai@hotmail.com.