Official PyTorch implementation of "ControlVideo: Training-free Controllable Text-to-Video Generation"
ControlVideo adapts ControlNet to the video domain without any finetuning, aiming to directly inherit its high-quality and consistent generation.
- [07/16/2023] Add HuggingFace demo!
- [07/11/2023] Support ControlNet 1.1 based version!
- [05/28/2023] Thanks to chenxwh for adding a Replicate demo!
- [05/25/2023] ControlVideo code released!
- [05/23/2023] ControlVideo paper released!
Download all pre-trained weights into the checkpoints/ directory. These include Stable Diffusion v1.5, the ControlNet 1.0 models conditioned on canny edges, depth maps, and human poses, and ControlNet 1.1.
flownet.pkl contains the weights of RIFE.
The final file tree looks like this:
checkpoints
├── stable-diffusion-v1-5
├── sd-controlnet-canny
├── sd-controlnet-depth
├── sd-controlnet-openpose
├── ...
├── flownet.pkl
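Before running inference, it can help to confirm that the directory matches the tree above. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and it only checks the entries named explicitly in the tree (the elided `...` entries are skipped because the source does not list them).

```python
from pathlib import Path

# Entries named explicitly in the file tree above. The "..." entries are
# omitted on purpose: the source does not list their names.
EXPECTED_ENTRIES = [
    "stable-diffusion-v1-5",
    "sd-controlnet-canny",
    "sd-controlnet-depth",
    "sd-controlnet-openpose",
    "flownet.pkl",
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected entries that do not exist under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root_path / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing from checkpoints/:", ", ".join(missing))
    else:
        print("All listed checkpoint entries are present.")
```

Run it from the repository root so that the relative path `checkpoints` resolves correctly.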
conda create -n controlvideo python=3.10
conda activate controlvideo
pip install -r requirements.txt
Note: xformers is recommended to save memory and running time. controlnet-aux is updated to version 0.0.6.
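To see whether the two packages mentioned in the note are present in the active environment, a small check like the one below may be useful. It is a sketch using only the standard library. The helper name `installed_version` is hypothetical, and the distribution names `xformers` and `controlnet-aux` are assumed to match what pip installed.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions based on the note above.
    for package in ("xformers", "controlnet-aux"):
        version = installed_version(package)
        print(f"{package}: {version or 'not installed'}")
```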
To perform text-to-video generation, just run this command in inference.sh:
python inference.py \
--prompt "A striking mallard floats effortlessly on the sparkling pond." \
--condition "depth" \
--video_path "data/mallard-water.mp4" \
--output_path "outputs/" \
--video_length 15 \
--smoother_steps 19 20 \
--width 512 \
--height 512 \
--frame_rate 2 \
--version v10 \
# --is_long_video
where --video_length is the length of the synthesized video, --condition represents the type of structure sequence, --smoother_steps determines at which timesteps to perform smoothing, --version selects the version of ControlNet (e.g., v10 or v11), and --is_long_video denotes whether to enable efficient long-video synthesis.
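When generating several videos, assembling the command programmatically can be less error-prone than editing inference.sh by hand. The sketch below is not part of the repository. `build_inference_command` is a hypothetical helper that uses only the flags shown above, with the example values as defaults, and it assumes that --is_long_video is a boolean switch that takes no value.

```python
import shlex


def build_inference_command(prompt, condition, video_path, output_path="outputs/",
                            video_length=15, smoother_steps=(19, 20),
                            width=512, height=512, frame_rate=2,
                            version="v10", is_long_video=False):
    """Assemble the argv list for inference.py from the flags shown above."""
    command = [
        "python", "inference.py",
        "--prompt", prompt,
        "--condition", condition,
        "--video_path", video_path,
        "--output_path", output_path,
        "--video_length", str(video_length),
        "--smoother_steps", *[str(step) for step in smoother_steps],
        "--width", str(width),
        "--height", str(height),
        "--frame_rate", str(frame_rate),
        "--version", version,
    ]
    if is_long_video:
        # Assumption: a boolean switch, as the commented-out flag suggests.
        command.append("--is_long_video")
    return command


if __name__ == "__main__":
    command = build_inference_command(
        prompt="A striking mallard floats effortlessly on the sparkling pond.",
        condition="depth",
        video_path="data/mallard-water.mp4",
    )
    # Print a shell-quoted line; pass `command` to subprocess.run to execute it.
    print(shlex.join(command))
```

Returning a list rather than a single string lets `subprocess.run` pass the prompt as one argument without shell-quoting problems.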
| "James bond moonwalk on the beach, animation style." | "Goku in a mountain range, surreal style." | "Hulk is jumping on the street, cartoon style." | "A robot dances on a road, animation style." |

| "A steamship on the ocean, at sunset, sketch style." | "Hulk is dancing on the beach, cartoon style." |
If you make use of our work, please cite our paper.
@article{zhang2023controlvideo,
title={ControlVideo: Training-free Controllable Text-to-Video Generation},
author={Zhang, Yabo and Wei, Yuxiang and Jiang, Dongsheng and Zhang, Xiaopeng and Zuo, Wangmeng and Tian, Qi},
journal={arXiv preprint arXiv:2305.13077},
year={2023}
}
This repository borrows heavily from Diffusers, ControlNet, Tune-A-Video, and RIFE. The code for the HuggingFace demo borrows from fffiloni/ControlVideo. Thanks for their contributions!
There are also many interesting works on video generation, such as Tune-A-Video, Text2Video-Zero, Follow-Your-Pose, and Control-A-Video.

















