KEMBAR78
Add enable_vae_tiling to AllegroPipeline, fix example by hlky · Pull Request #10212 · huggingface/diffusers · GitHub
Skip to content

Conversation

@hlky
Copy link
Contributor

@hlky hlky commented Dec 13, 2024

What does this PR do?

Allegro doesn't support VAE without tiling and is missing enable_vae_tiling etc on the pipeline.

raise NotImplementedError("Encoding without tiling has not been implemented yet.")

Note that the default num_inference_steps=100 is very slow (~1h on A40, only slightly faster on A6000 Ada) so the example could have the value changed as well.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul @yiyixuxu @DN6

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Contributor

@a-r-r-o-w a-r-r-o-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Regarding the number of inference steps being high, inference with 50 steps produces noisy videos. The default steps in the official repo is 100 as well. For faster inference, we could explore the different cache techniques soon.

@hlky
Copy link
Contributor Author

hlky commented Dec 13, 2024

@a-r-r-o-w Is lowering num_frames also known to produce noisy videos? I'm getting noisy video with num_inference_steps=100 and num_frames=22. I'm trying to test if outputs are changed with #10156 so I wanted to to be faster. Edit: this is on main.

import torch
from diffusers import AutoencoderKLAllegro, AllegroPipeline
from diffusers.utils import export_to_video
vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32)
pipe = AllegroPipeline.from_pretrained("rhymes-ai/Allegro", vae=vae, torch_dtype=torch.bfloat16).to("cuda")
pipe.vae.enable_tiling()
prompt = (
    "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, "
    "the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this "
    "location might be a popular spot for docking fishing boats."
)
video = pipe(prompt, num_inference_steps=100, num_frames=22, guidance_scale=7.5, max_sequence_length=512, generator=torch.Generator().manual_seed(0)).frames[0]
export_to_video(video, "AllegroPipeline.mp4", fps=15)
AllegroPipeline.mp4

@a-r-r-o-w
Copy link
Contributor

I believe that is expected. My memory is a bit hazy since it's been a while since we integrated it, but IIRC Allegro only works at 88 frames and specific resolutions (?).

I think it is okay to not run the full test if it's time consuming. What you could do instead is - save outputs of all intermediate transformers blocks and after decoding for say 2 inference steps, and compare that to your sincos/rope changes by looking at absmax of the intermediates from both runs. This should be a sufficient check imo

@yiyixuxu yiyixuxu merged commit 7186bb4 into huggingface:main Dec 16, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants