Labels: module: dtensor, module: fsdp, triaged
Description
🐛 Describe the bug
We have been training DTensor-based models off torch nightly (in anticipation of 2.2), and we very often see the loss flatline. We do not see this at all on the current nightly (as of 4 days ago), so at this point we are very confident there is a regression/bug in the current 2.2 release candidate that breaks FSDP training (at least with DTensor).
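For reference, here is a minimal sketch of the kind of FSDP training loop where the flatline shows up. This is a hypothetical repro, not the actual training job: it assumes a torchrun launch with the NCCL backend and a toy model, and it omits the DTensor/2D-parallel sharding used in the real setup.

```python
# Hypothetical minimal sketch (not the actual training job): an FSDP loop that
# logs the loss so a flatline is easy to spot. Launch with torchrun, e.g.
#   torchrun --nproc_per_node=8 repro.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Toy model standing in for the real network; in the real 2D-parallel
    # setup the sharded parameters are DTensor-backed.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()
    model = FSDP(model, use_orig_params=True)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(200):
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).pow(2).mean()
        loss.backward()
        optim.step()
        optim.zero_grad()
        if rank == 0 and step % 10 == 0:
            # On the affected 2.2 RC the reported symptom is that this value
            # stops improving; on a recent nightly it keeps decreasing.
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```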
Our best guess is that one of the two PRs linked below fixes it:
- [2d] unflatten_tensor on compute stream for DTensorExtension #116559
- [reland] unflatten_tensor on compute stream for DTensorExtension #117020
To be safe, I would personally also want to include the no-grad bug fix:
- [FSDP] enable autograd in forward prefetching #116792
Versions
Torch 2.2 branch
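As a hedged aside (not from the original report), the exact build matters here since "Torch 2.2 branch" can correspond to several different RC commits; the standard torch attributes below pin it down:

```python
# Capture the exact build under test; these attributes exist on stock PyTorch.
import torch

print(torch.__version__)          # e.g. "2.2.0a0+git<sha>" for a source build
print(torch.version.git_version)  # exact commit the build was produced from
print(torch.version.cuda)         # CUDA toolkit version the build targets
```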
cc @zhaojuanmao @mrshenli @rohan-varma @awgu @fegin @penguinwu @kwen2501 @wanchaol @XilunWu @tianyu-l