[PT][FSDP] fail `set_allocate_memory_from_process_group` if used together with custom comm hooks by xunnanxu · Pull Request #157487 · pytorch/pytorch

Conversation

@xunnanxu
Contributor

@xunnanxu xunnanxu commented Jul 2, 2025

Summary:
This is a follow up after the PR to add comm override support: #155189

The previous PR loosely checked the allocation mixin classes, which isn't really safe, as the actual hook may still override the allocation behavior.
This could lead to unnecessary confusion with no good use case to justify it. So for now we simply make the two sets of APIs largely incompatible:

  1. setting custom comms after `set_allocate_memory_from_process_group_for_comm()` is allowed.
  2. setting `set_allocate_memory_from_process_group_for_comm()` after custom comms is rejected.

Basically, `set_allocate_memory_from_process_group_for_comm()` is a drop-in hammer, while `set_custom_all_gather()`/`set_custom_reduce_scatter()` are finer-grained scalpels that require more hand-crafted code (see the sketch below).

We can revisit this if a use case arises in between, but for now the two can largely be viewed as independent of each other (even though they currently share some underlying pieces, that is subject to change and should not be exposed to end users).
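To make the ordering contract concrete, here is a minimal, self-contained sketch of the check, assuming a simplified `CommConfig` stand-in for the real FSDP state; the class and attribute names here are hypothetical, not the actual implementation:

```python
# Hypothetical sketch of the ordering rule above; CommConfig and its
# attributes are illustrative stand-ins, not PyTorch's actual classes.
class CommConfig:
    def __init__(self):
        self._custom_comm_hooks_set = False
        self._allocate_from_process_group = False

    def set_custom_all_gather(self, hook):
        # Case 1: custom comms may be installed at any point; they simply
        # take precedence over any earlier allocation setting.
        self._custom_comm_hooks_set = True

    def set_custom_reduce_scatter(self, hook):
        self._custom_comm_hooks_set = True

    def set_allocate_memory_from_process_group_for_comm(self, enable=True):
        # Case 2: fail fast if custom comm hooks were already installed,
        # since those hooks may override the allocation behavior anyway.
        if self._custom_comm_hooks_set:
            raise RuntimeError(
                "set_allocate_memory_from_process_group_for_comm() cannot "
                "be called after custom comm hooks have been set"
            )
        self._allocate_from_process_group = enable
```

Under this contract, calling `set_allocate_memory_from_process_group_for_comm()` first and installing custom hooks afterwards succeeds, while the reverse order raises.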

Test Plan: added UT

Differential Revision: D77681620

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@pytorch-bot

pytorch-bot bot commented Jul 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157487

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 11702fa with merge base af9c92b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor, oncall: distributed, and release notes: distributed (fsdp) labels on Jul 2, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D77681620

@xunnanxu xunnanxu changed the title from [PT][FSDP] fail set_allocate_memory_from_process_group if used together with to [PT][FSDP] fail set_allocate_memory_from_process_group if used together with custom comm hooks on Jul 2, 2025
@pytorch-bot pytorch-bot bot added the ciflow/trunk label on Jul 2, 2025
@xunnanxu xunnanxu requested review from kwen2501, lw and weifengpy July 2, 2025 19:44
@xunnanxu xunnanxu force-pushed the export-D77681620 branch from 8885ee2 to 11702fa on July 2, 2025 21:18
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D77681620

@xunnanxu
Contributor Author

xunnanxu commented Jul 2, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@xunnanxu
Contributor Author

xunnanxu commented Jul 3, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.


Labels

ciflow/inductor, ciflow/trunk, fb-exported, Merged, oncall: distributed, release notes: distributed (fsdp)
