[FSDP2] Move to public `torch.distributed.fsdp` by awgu · Pull Request #141868 · pytorch/pytorch · GitHub

@awgu (Collaborator) commented Dec 2, 2024

Stack from ghstack (oldest at bottom):

Overview
This PR moves torch/distributed/_composable/fsdp to torch/distributed/fsdp/_fully_shard and makes public APIs available from torch.distributed.fsdp, e.g.:

from torch.distributed.fsdp import fully_shard

This targets the 2.6 release. I also rewrote some of the documentation with (hopefully) improved phrasing.
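
For downstream code that has to run on PyTorch builds from both before and after this move, one option is to try the new public path first and fall back to the old private one. The helper below is only a sketch under that assumption: `import_first` and `FSDP2_MODULE_CANDIDATES` are hypothetical names, not PyTorch APIs, and the only module paths used are the two named in this PR.

```python
import importlib
from typing import Any, Optional, Sequence

# Module paths taken from this PR: new public location first, old private one second.
FSDP2_MODULE_CANDIDATES = (
    "torch.distributed.fsdp",
    "torch.distributed._composable.fsdp",
)


def import_first(
    name: str, candidates: Sequence[str] = FSDP2_MODULE_CANDIDATES
) -> Optional[Any]:
    """Return `name` from the first candidate module that provides it, else None.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    for module_path in candidates:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            # The module (or torch itself) is not importable; try the next path.
            continue
        if hasattr(module, name):
            return getattr(module, name)
    return None
```

A caller would then write `fully_shard = import_first("fully_shard")` and handle the `None` case explicitly, for example by raising a clear error that FSDP2 is unavailable in the installed build.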

Changes for Reland

  • Preserved the public objects in torch/distributed/_composable/fsdp/fully_shard.py so that the old import path still works for internal callers
  • Added a unit test checking that from torch.distributed._composable.fsdp.fully_shard import FSDPModule still works
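
The compatibility guarantee in the second bullet could be checked along the following lines. This is a sketch, not the unit test added in this PR: `resolve` and `same_object` are hypothetical helpers, and whether both paths return the very same object (rather than merely both importing) is an assumption to confirm against your own PyTorch build.

```python
import importlib
from typing import Any, Optional

# Paths named in this PR description.
NEW_PATH = "torch.distributed.fsdp"
OLD_PATH = "torch.distributed._composable.fsdp.fully_shard"


def resolve(module_path: str, name: str) -> Optional[Any]:
    """Return attribute `name` of `module_path`, or None if either is missing."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return None
    return getattr(module, name, None)


def same_object(name: str, new_path: str = NEW_PATH, old_path: str = OLD_PATH) -> bool:
    """True only if `name` resolves from both paths to the identical object."""
    new_obj = resolve(new_path, name)
    old_obj = resolve(old_path, name)
    return new_obj is not None and new_obj is old_obj
```

Calling `same_object("FSDPModule")` returns `False` when PyTorch is not installed or when either path no longer exposes the name, so it can serve as a quick smoke check before upgrading.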

cc @H-Huang @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn

Differential Revision: D66890387

@pytorch-bot (bot) commented Dec 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141868

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 1b534f8 with merge base bab15df:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

cc H-Huang kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o LucasLLC MeetVadakkanchery mhorowitz pradeepfn

[ghstack-poisoned]
will need several CI iterations to get this right -- please do not review yet

@awgu awgu added the keep-going Don't stop on first failure, keep running tests until the end label Dec 2, 2024
awgu pushed a commit that referenced this pull request Dec 2, 2024
ghstack-source-id: 99573d0
Pull Request resolved: #141868
@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.

pytorchmergebot added a commit that referenced this pull request Dec 6, 2024
This reverts commit 45583a5.

Reverted #141868 on behalf of https://github.com/atalman due to failing internally ([comment](#141868 (comment)))
@pytorchmergebot (Collaborator)

@awgu your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Dec 6, 2024
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This targets the 2.6 release. I also rewrote some of the documentation with (hopefully) improved phrasing.

**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`


awgu pushed a commit that referenced this pull request Dec 6, 2024
ghstack-source-id: e9a6453
Pull Request resolved: #141868
@awgu (Collaborator, Author) commented Dec 6, 2024

@awgu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@awgu awgu added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Dec 6, 2024
@awgu (Collaborator, Author) commented Dec 6, 2024

The test_nestedtensor.py::TestNestedTensorOpInfoCUDA::test_compile_backward_nn_functional_linear_cuda_float32 failure is not related to this PR.

@awgu (Collaborator, Author) commented Dec 6, 2024

internal tests look good!

@awgu (Collaborator, Author) commented Dec 6, 2024

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 1 checks: periodic / linux-focal-cuda11.8-py3.10-gcc9-debug / test (default, 3, 5, lf.linux.4xlarge.nvidia.gpu, oncall:debug-build)


pytorch-bot bot pushed a commit that referenced this pull request Dec 9, 2024
Pull Request resolved: #141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
AmdSampsa pushed a commit to AmdSampsa/pytorch that referenced this pull request Dec 9, 2024
This reverts commit ad93aa8.

Reverted pytorch#141398 on behalf of https://github.com/atalman due to Sorry need to revert pytorch#141868, we can try rebase and reland this after ([comment](pytorch#141398 (comment)))
pytorch-bot bot pushed a commit that referenced this pull request Dec 9, 2024
pytorch-bot bot pushed a commit that referenced this pull request Dec 9, 2024
Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387)
Pull Request resolved: #141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Esquains pushed a commit to Esquains/study1 that referenced this pull request Dec 15, 2024
@github-actions github-actions bot deleted the gh/awgu/659/head branch January 6, 2025 02:09
aostrowski-hbn pushed a commit to HabanaAI/pytorch-fork that referenced this pull request May 21, 2025
jedrzejmyrcha pushed a commit to HabanaAI/pytorch-fork that referenced this pull request Jul 29, 2025

Labels

ci-no-td (Do not run TD on this PR); ciflow/inductor; ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR); ciflow/trunk (Trigger trunk jobs on your pull request); keep-going (Don't stop on first failure, keep running tests until the end); Merged; module: dynamo; module: inductor; oncall: distributed (Add this issue/PR to distributed oncall triage queue); release notes: distributed (fsdp2) (release notes category); Reverted


9 participants