KEMBAR78
[core] Allegro T2V by a-r-r-o-w · Pull Request #9736 · huggingface/diffusers · GitHub
Skip to content

Conversation

@a-r-r-o-w
Copy link
Contributor

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

a-r-r-o-w and others added 6 commits October 23, 2024 00:55
Co-Authored-By: Huan Yang <hyang@fastmail.com>
Co-Authored-By: YiYi Xu <yixu310@gmail.com>
@a-r-r-o-w a-r-r-o-w marked this pull request as ready for review October 22, 2024 23:10
@a-r-r-o-w
Copy link
Contributor Author

It looks like something broke when doing the VAE refactor - looking into it at the moment. Will fix the broken tests afterwards

frames = frames.permute(0, 2, 1, 3, 4) # [batch_size, channels, num_frames, height, width]
return frames

def _prepare_rotary_positional_embeddings(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker to merge.

We currently have a mix of creating rotary embeddings like this in a few pipelines (Cog, Lumina, Hunyuan)
Was there a specific reason I missed to go this route as opposed to creating a dedicated layer in the transformer (Flux)? Is it because we need access to height, width etc to create the embedding

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is on my mind. Will take up dedicated RoPE layer refactor for existing models that do it in the pipeline in a future PR

).frames

video = videos[0]
expected_video = torch.randn(1, 88, 720, 1280, 3).numpy()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this will be updated to a real video?

@DN6
Copy link
Collaborator

DN6 commented Oct 23, 2024

LGTM. There's a failing test that looks related to saving/loading the transformer.

Copy link
Member

@stevhliu stevhliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, just a few typos :)

a-r-r-o-w and others added 2 commits October 24, 2024 01:43
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
if self.use_tiling:
return self.tiled_decode(z)

raise NotImplementedError("Decoding without tiling has not been implemented yet.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yiyixuxu Is this okay for now? There are some followups that we could look into later and maybe rewrite the tiling implementation similar to our other VAEs.

I don't think the model works well with lower number of frames (in which case not using tiling would be faster when decoding), so we should probably always just use tiling since we have 88 frames as the default (and recommended)

hidden_dtype: Optional[torch.dtype] = None,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
# No modulation happening here.
added_cond_kwargs = added_cond_kwargs or {"resolution": None, "aspect_ratio": None}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh what is this? is this fixing a current bug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. We accept None in added_cond_kwargs here, but we actually need to pass values for resolution and aspect_ratio in the following pixart embedding layer (which requires it as non-defaulted arg)

Copy link
Collaborator

@yiyixuxu yiyixuxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!
we can merge once the tests are fixed
I can look into refactoring the 3d rope in a follow up PR (or if you want to leave this a little bit longer it is ok too, up to you!)

@Ednaordinary
Copy link
Contributor

I have something really weird happening here. Feeding in a list of prompts (even if the list is only one prompt) results in a really bad video. It might be related to #9769 (comment), but I can't be sure. It seems cfg baked, so it would make sense if it was a cfg error.

prompt = ["Orbital shot of a squirrel nibbles on a nut while sitting in a tree"]:

listprompt.mp4

prompt = "Orbital shot of a squirrel nibbles on a nut while sitting in a tree":

stringprompt.mp4

@DN6
Copy link
Collaborator

DN6 commented Oct 28, 2024

@Ednaordinary Can you share a code snippet? Could you include how you're loading the pipeline? I'm unable to reproduce the issue.

@Ednaordinary
Copy link
Contributor

Ednaordinary commented Oct 28, 2024

I checked further, and it happens when the prompt and negative prompt are length one lists (even with the list being [None], I think), but not when the prompt is a list and negative is unspecified (my mistake). I have yet to test further

that's kinda incoherent, here's a snippet:
video = model(["A squirrel sitting on a tree and nibbling on an acorn."], negative_prompt=[""], num_frames=88, num_inference_steps=20, guidance_scale=7.5)

I'm using UniPC since its way faster. Testing without the negative_prompt specified at all, it works fine. I haven't tested with commit 9214f4a merged yet, though

@foreverpiano
Copy link

@Ednaordinary So the problem is that u are using empty negative prompt?

@Ednaordinary
Copy link
Contributor

Ednaordinary commented Oct 29, 2024

@foreverpiano I don't believe so, as passing None in a list to negative_prompt also seems to trigger it. It also looks suspiciously like cfg baking, which I can't be certain but I feel as if negative prompting with nothing wouldn't cause

The way I'm passing in arguments is using the same interface I pass them in for other pipelines, which ive never had issues with. I use a batching mechanism that passes in multiple prompts, in a list, even if there's only one prompt. negative_prompt is converted to None of its blank. Changing this to only passing in a string instead of a list and only batching one image (multi-prompt doesn't currently seem to work on this pipeline regardless) fixed things, for whatever reason

[None] as negative_prompt:

allegro.mp4

@a-r-r-o-w
Copy link
Contributor Author

@Ednaordinary I think should be fixed with the latest commit. LMK if it still persists - if so, will fix in a follow-up PR

@a-r-r-o-w a-r-r-o-w merged commit 0d1d267 into main Oct 29, 2024
18 checks passed
@a-r-r-o-w a-r-r-o-w deleted the allegro-impl branch October 29, 2024 07:44
sayakpaul pushed a commit that referenced this pull request Dec 23, 2024
* update

* refactor transformer part 1

* refactor part 2

* refactor part 3

* make style

* refactor part 4; modeling tests

* make style

* refactor part 5

* refactor part 6

* gradient checkpointing

* pipeline tests (broken atm)

* update

* add coauthor

Co-Authored-By: Huan Yang <hyang@fastmail.com>

* refactor part 7

* add docs

* make style

* add coauthor

Co-Authored-By: YiYi Xu <yixu310@gmail.com>

* make fix-copies

* undo unrelated change

* revert changes to embeddings, normalization, transformer

* refactor part 8

* make style

* refactor part 9

* make style

* fix

* apply suggestions from review

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* update example

* remove attention mask for self-attention

* update

* copied from

* update

* update

---------

Co-authored-by: Huan Yang <hyang@fastmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants