Ditto: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen
[Paper] [Project Page] [Model Weights] [Dataset]
To address the data scarcity problem, we introduce Ditto, a scalable pipeline for generating high-quality video editing data, which we use to train a new state-of-the-art instruction-based video editing model, Editto.
We introduce Ditto, a holistic framework designed to tackle the fundamental challenge of instruction-based video editing. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new SOTA in instruction-based video editing.
- Add code for Denoising Enhancing.
- 10/22/2025 - We have uploaded the CSVs that can be used directly for model training with DiffSynth-Studio, as well as the metadata JSON for the sim2real setting.
- 10/22/2025 - We have finished uploading all videos of the dataset!
# Create conda environment (if you already have a DiffSynth conda environment, you can reuse it)
conda create -n ditto python=3.10
conda activate ditto
pip install -e .
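Optionally, you can sanity-check the environment afterwards; this sketch assumes the editable install above pulls in PyTorch through DiffSynth-Studio's dependencies:
# Optional check: confirm PyTorch is installed and a CUDA device is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"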
Download the base model and our models from Google Drive or Hugging Face:
# Download Wan-AI/Wan2.1-VACE-14B from Hugging Face to models/Wan-AI/
hf download Wan-AI/Wan2.1-VACE-14B --local-dir models/Wan-AI/
# Download Ditto models
hf download QingyanBai/Ditto_models --include="models/*" --local-dir ./
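If the hf command is not available in your environment (it ships with recent versions of huggingface_hub), the older huggingface-cli entry point accepts the same arguments; a sketch of the equivalent commands:
# Fallback using the legacy huggingface-cli entry point (same arguments as above)
pip install -U huggingface_hub
huggingface-cli download Wan-AI/Wan2.1-VACE-14B --local-dir models/Wan-AI/
huggingface-cli download QingyanBai/Ditto_models --include="models/*" --local-dir ./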
You can either use the provided script or run Python directly:
# Option 1: Use the provided script
bash infer.sh
# Option 2: Run Python directly
python inference/infer_ditto.py \
--input_video /path/to/input_video.mp4 \
--output_video /path/to/output_video.mp4 \
--prompt "Editing instruction." \
--lora_path /path/to/model.safetensors \
--num_frames 73 \
--device_id 0
Some test cases can be found in the HF Dataset. You can also find some reference editing prompts in inference/example_prompts.txt.
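To run a batch of edits, a minimal sketch along these lines should work; it assumes inference/example_prompts.txt holds one instruction per line, and the input video and LoRA paths below are placeholders you adapt to your setup:
# Run the same input video through every prompt in the example prompt file
# INPUT and LORA are placeholders; point them at your own video and downloaded weights
mkdir -p outputs
INPUT=/path/to/input_video.mp4
LORA=/path/to/model.safetensors
i=0
while IFS= read -r prompt; do
  [ -z "$prompt" ] && continue  # skip empty lines
  python inference/infer_ditto.py \
    --input_video "$INPUT" \
    --output_video "outputs/edit_${i}.mp4" \
    --prompt "$prompt" \
    --lora_path "$LORA" \
    --num_frames 73 \
    --device_id 0 < /dev/null  # keep Python from consuming the prompt stream
  i=$((i+1))
done < inference/example_prompts.txt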
Note: ComfyUI runs faster with lower computational requirements (an 832×480×73 video needs ~11 GB of GPU memory and about 4 minutes on an A6000), but due to the use of quantized and distilled models there may be some quality degradation.
First, follow the ComfyUI installation guide to set up the base ComfyUI environment. We strongly recommend installing ComfyUI-Manager for easy custom node management:
# Install ComfyUI-Manager
cd ComfyUI/custom_nodes
git clone https://github.com/Comfy-Org/ComfyUI-Manager.git
After installing ComfyUI, you can either:
Option 1 (Recommended): Use ComfyUI-Manager to automatically install all required custom nodes with the function Install Missing Custom Nodes.
Option 2: Manually install the required custom nodes (you can refer to this page):
Download the required model weights from Kijai/WanVideo_comfy to subfolders of models/ (example download commands are sketched below). Required files include:
- Wan2_1-T2V-14B_fp8_e4m3fn.safetensors to diffusion_models/
- Wan21_CausVid_14B_T2V_lora_rank32_v2.safetensors to loras/ (for inference acceleration)
- Wan2_1_VAE_bf16.safetensors to vae/wan/
- umt5-xxl-enc-bf16.safetensors to text_encoders/
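One way to fetch these files is with the hf CLI, as in the sketch below; it assumes the filenames sit at the root of the Kijai/WanVideo_comfy repository and that ComfyUI/models/ is your ComfyUI model directory, so check the repository layout and adjust the paths if they differ:
# Hedged sketch: download each required file into the matching ComfyUI subfolder
# The exact paths inside Kijai/WanVideo_comfy are assumptions; verify them on the repo page first
hf download Kijai/WanVideo_comfy Wan2_1-T2V-14B_fp8_e4m3fn.safetensors --local-dir ComfyUI/models/diffusion_models/
hf download Kijai/WanVideo_comfy Wan21_CausVid_14B_T2V_lora_rank32_v2.safetensors --local-dir ComfyUI/models/loras/
hf download Kijai/WanVideo_comfy Wan2_1_VAE_bf16.safetensors --local-dir ComfyUI/models/vae/wan/
hf download Kijai/WanVideo_comfy umt5-xxl-enc-bf16.safetensors --local-dir ComfyUI/models/text_encoders/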
Download our models from Google Drive or Hugging Face to diffusion_models/ (use the VACE Module Select node for loading).
Use the workflow ditto_comfyui_workflow.json in this repo to get started.
We provide some reference prompts in the note. Some test cases can be found in the HF Dataset.
Note: If you want to test sim2real cases, you can try prompts like 'Turn it into the real domain'.
To train a model, first download the training CSV files from the csvs directory on Hugging Face, then use the provided train.sh script for training.
# Download the training CSVs from HF dataset to your local directory
hf download QingyanBai/Ditto-1M --include="csvs_for_DiffSynth/*" --local-dir ./
# Run training
bash train.sh
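Before launching training, it can help to take a quick look at what you downloaded; the snippet below only inspects the CSVs (the csvs_for_DiffSynth/ path comes from the download command above, and the column layout is whatever DiffSynth-Studio expects, which is not assumed here):
# Quick sanity check of the downloaded training CSVs before running train.sh
ls csvs_for_DiffSynth/
# Peek at the header and first rows of each CSV to confirm the expected columns
for f in csvs_for_DiffSynth/*.csv; do
  echo "== $f =="
  head -n 3 "$f"
done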
Thanks to DiffSynth-Studio, this codebase supports multi-node training; you can use DLRover to scale training across multiple machines, e.g. as sketched below.
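As a rough sketch of what this could look like, the command below uses dlrover-run, DLRover's torchrun-compatible launcher; the node count, per-node GPU count, and training entry script are placeholders, and you would reuse the arguments that train.sh passes to its underlying training script rather than the ones shown here:
# Hedged sketch: launch multi-node training with DLRover's torchrun-compatible runner
# NODE_NUM, GPUS_PER_NODE, and your_training_entry.py are placeholders
pip install "dlrover[torch]"
dlrover-run --nnodes=$NODE_NUM --nproc_per_node=$GPUS_PER_NODE \
  your_training_entry.py --your-training-args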
If you find this work useful, please consider citing our paper:
@article{bai2025ditto,
title={Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset},
author={Bai, Qingyan and Wang, Qiuyu and Ouyang, Hao and Yu, Yue and Wang, Hanlin and Wang, Wen and Cheng, Ka Leong and Ma, Shuailei and Zeng, Yanhong and Liu, Zichen and Xu, Yinghao and Shen, Yujun and Chen, Qifeng},
journal={arXiv preprint arXiv:2510.15742},
year={2025}
}
We thank Wan, VACE, and Qwen-Image for providing powerful foundation models, and QwenVL for its advanced visual understanding capabilities. We also thank DiffSynth-Studio for serving as the codebase of this repository.
This project is licensed under CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License).
The code is provided for academic research purposes only.
For any questions, please contact qingyanbai@hotmail.com.