[PT] support custom all_gather and reduce_scatter comms by xunnanxu · Pull Request #155189 · pytorch/pytorch · GitHub

Conversation

@xunnanxu
Contributor

@xunnanxu xunnanxu commented Jun 5, 2025

Summary:
This change introduces two comm override APIs, `set_custom_all_gather` and `set_custom_reduce_scatter`, to allow custom behavior for each collective.

These let users control how the comm buffers are allocated and the exact comm implementation, for flexibility.
For details, see the docstring of `Comm` in `_fsdp_api.py`.
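
Below is a minimal sketch of how these hooks might be wired up. The exact protocol lives in the `Comm`/`BaseComm` docstrings; the method names, signatures, and the `MyAllGatherComm` class here are illustrative assumptions, not the actual API surface.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard

class MyAllGatherComm:
    # Hypothetical comm object; the real interface is defined by
    # `Comm`/`BaseComm` in `_fsdp_api.py`, and these method names and
    # signatures are assumptions for illustration only.

    def allocate(self, size, *, dtype, device):
        # Control where the all-gather buffer comes from, e.g. a custom
        # memory pool instead of the default caching allocator.
        return torch.empty(size, dtype=dtype, device=device)

    def all_gather(self, output, inp, group):
        # Swap in any collective implementation, e.g. a fused or
        # topology-aware all-gather; here we just call the stock one.
        dist.all_gather_into_tensor(output, inp, group=group)

model = fully_shard(model)
# Assumed call site: the setter added by this PR, applied on the
# fully_shard-wrapped root module; `set_custom_reduce_scatter` would
# take an analogous reduce-scatter comm object.
model.set_custom_all_gather(MyAllGatherComm())
```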

Related PR:
#150564

Test Plan: CI

Differential Revision: D75714362

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

@pytorch-bot

pytorch-bot bot commented Jun 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155189

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1cacb20 with merge base f79689b (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor, module: inductor, oncall: distributed, and release notes: distributed (fsdp) labels Jun 5, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D75714362

@xunnanxu xunnanxu force-pushed the export-D75714362 branch 2 times, most recently from 2b428e6 to 569a4d6 Compare June 5, 2025 06:27

@xunnanxu xunnanxu force-pushed the export-D75714362 branch 2 times, most recently from e806f3a to b03004b Compare June 5, 2025 07:19

@xunnanxu xunnanxu marked this pull request as draft June 5, 2025 16:40
@xunnanxu xunnanxu force-pushed the export-D75714362 branch from 37d962b to 995e205 Compare June 6, 2025 05:22

Summary:
Pull Request resolved: pytorch#155189

This change introduces two comm override APIs, `set_custom_all_gather` and `set_custom_reduce_scatter`, to allow custom behavior for each collective.

These let users control how the comm buffers are allocated and the exact comm implementation, for flexibility.
For details, see the docstrings of `Comm` and `BaseComm` in `_fsdp_api.py`.

Test Plan: CI

Differential Revision: D75714362

@xunnanxu xunnanxu marked this pull request as ready for review June 30, 2025 08:13
@xunnanxu xunnanxu requested review from kwen2501, lw and weifengpy June 30, 2025 08:14
Contributor

@weifengpy weifengpy left a comment


appreciate your persistence in pushing this to the very end

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Jul 1, 2025
@xunnanxu
Contributor Author

xunnanxu commented Jul 2, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try re-importing/re-exporting the PR!


@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

xunnanxu added a commit to xunnanxu/pytorch that referenced this pull request Jul 2, 2025
…ther with custom comm hooks (pytorch#157487)

Summary:

This is a follow up after the PR to add comm override support: pytorch#155189

The previous PR loosely checks the allocation mixin classes, which isn't really safe, as the actual hook may still override the behavior.
This may lead to unnecessary confusion for no good use case. So for now we just make the 2 sets of APIs largely incompatible (see the sketch below):
1. Setting custom comms after `set_allocate_memory_from_process_group_for_comm()` is OK.
2. Setting `set_allocate_memory_from_process_group_for_comm()` after custom comms is not OK.

Basically, `set_allocate_memory_from_process_group_for_comm()` is a drop-in hammer, while `set_custom_all_gather()`/`set_custom_reduce_scatter()` are finer-grained scalpels that require more crafted code.

We can revisit this if a use case in between comes up, but for now the two can be viewed as largely independent of each other (even though they share some of the underlying pieces for now; that could be subject to change and should not be exposed to end users).
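
A sketch of the ordering rule, reusing the hypothetical `MyAllGatherComm` from the earlier example plus an analogous, equally hypothetical `MyReduceScatterComm`; the exact failure mode is an assumption:

```python
# OK: the coarse-grained switch first, then the finer-grained hooks.
model.set_allocate_memory_from_process_group_for_comm()
model.set_custom_all_gather(MyAllGatherComm())

# Not OK: once custom comms are set, the coarse-grained switch is
# rejected (assumed to raise; the exact exception type may differ).
model.set_custom_reduce_scatter(MyReduceScatterComm())
model.set_allocate_memory_from_process_group_for_comm()  # error
```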

Test Plan: added UT

Reviewed By: weifengpy

Differential Revision: D77681620
pytorchmergebot pushed a commit that referenced this pull request Jul 3, 2025
…ther with custom comm hooks (#157487)


Pull Request resolved: #157487
Approved by: https://github.com/weifengpy
