[PT][FSDP] fail `set_allocate_memory_from_process_group` if used together with custom comm hooks by xunnanxu · Pull Request #157487 · pytorch/pytorch

Conversation

@xunnanxu
Contributor

@xunnanxu xunnanxu commented Jul 2, 2025

Summary:
This is a follow up after the PR to add comm override support: #155189

The previous PR loosely checked the allocation mixin classes, which isn't really safe, as the actual hook may still override the allocation behavior.
This could lead to unnecessary confusion with no good use case to justify it. So for now we simply make the two sets of APIs largely incompatible:

  1. setting custom comms after `set_allocate_memory_from_process_group_for_comm()` is allowed.
  2. setting `set_allocate_memory_from_process_group_for_comm()` after custom comms is rejected.

Basically, `set_allocate_memory_from_process_group_for_comm()` is a drop-in hammer, while `set_custom_all_gather()`/`set_custom_reduce_scatter()` are finer-grained scalpels that require more hand-crafted code (see the sketch below).

We can revisit this if a use case arises in between, but for now the two can largely be viewed as independent of each other (even though they currently share some underlying pieces, that is subject to change and should not be exposed to end users).
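To make the ordering contract concrete, here is a minimal, self-contained sketch of the check, assuming a simplified `CommConfig` stand-in for the real FSDP state; the class and attribute names here are hypothetical, not the actual implementation:

```python
# Hypothetical sketch of the ordering rule above; CommConfig and its
# attributes are illustrative stand-ins, not PyTorch's actual classes.
class CommConfig:
    def __init__(self):
        self._custom_comm_hooks_set = False
        self._allocate_from_process_group = False

    def set_custom_all_gather(self, hook):
        # Case 1: custom comms may be installed at any point; they simply
        # take precedence over any earlier allocation setting.
        self._custom_comm_hooks_set = True

    def set_custom_reduce_scatter(self, hook):
        self._custom_comm_hooks_set = True

    def set_allocate_memory_from_process_group_for_comm(self, enable=True):
        # Case 2: fail fast if custom comm hooks were already installed,
        # since those hooks may override the allocation behavior anyway.
        if self._custom_comm_hooks_set:
            raise RuntimeError(
                "set_allocate_memory_from_process_group_for_comm() cannot "
                "be called after custom comm hooks have been set"
            )
        self._allocate_from_process_group = enable
```

Under this contract, calling `set_allocate_memory_from_process_group_for_comm()` first and installing custom hooks afterwards succeeds, while the reverse order raises.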

Test Plan: added UT

Differential Revision: D77681620

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@pytorch-bot

pytorch-bot bot commented Jul 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157487

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 11702fa with merge base af9c92b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor, oncall: distributed, and release notes: distributed (fsdp) labels on Jul 2, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D77681620

@xunnanxu xunnanxu changed the title from [PT][FSDP] fail set_allocate_memory_from_process_group if used together with to [PT][FSDP] fail set_allocate_memory_from_process_group if used together with custom comm hooks on Jul 2, 2025
@pytorch-bot pytorch-bot bot added the ciflow/trunk label on Jul 2, 2025
@xunnanxu xunnanxu requested review from kwen2501, lw and weifengpy July 2, 2025 19:44
@xunnanxu xunnanxu force-pushed the export-D77681620 branch from 8885ee2 to 11702fa on July 2, 2025 21:18
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D77681620

@xunnanxu
Contributor Author

xunnanxu commented Jul 2, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@xunnanxu
Contributor Author

xunnanxu commented Jul 3, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.


Labels

ciflow/inductor, ciflow/trunk, fb-exported, Merged, oncall: distributed, release notes: distributed (fsdp)
