Fix unsafe collective reorder past wait #157489

wconstab · 2025-07-02T19:51:22Z

Stack from ghstack (oldest at bottom):

Covers the case where the output of one collective feeds the input of another collective,
e.g. TP + FSDP: all_gather (TP+DP-sharded param, on the TP dim) -> all_gather (DP-sharded buffer, on the DP dim).

Fixes a bug where the reordering pass specifically exempted wait nodes from being treated as dependencies.
Note: this exemption was incorrect, so it is removed. It was put there for a reason, though: to help move collectives past wait nodes that are unrelated to that collective. After this fix, reordering performance may be worse, and we need to find a smarter way to decide whether a particular wait node is a blocker for a given collective.
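
To illustrate the failure mode described above, here is a minimal toy sketch. It is not the Inductor reordering pass: `Node`, `hoist`, the `exempt_waits` flag, and the node names are all hypothetical, and only direct dependencies are tracked. It shows how exempting wait nodes lets a dependent collective be hoisted above the wait that produces its input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical stand-in for a scheduler node; not a PyTorch class.
    name: str
    kind: str  # "collective" or "wait"
    deps: frozenset = frozenset()  # names of nodes this node reads from


def hoist(schedule, name, exempt_waits):
    """Move node `name` as early as possible; return the new order of names."""
    order = list(schedule)
    i = next(k for k, n in enumerate(order) if n.name == name)
    node = order[i]
    while i > 0:
        prev = order[i - 1]
        is_dep = prev.name in node.deps
        # The old behavior corresponds to exempt_waits=True:
        # a wait node never blocks, even if it is a real dependency.
        if is_dep and not (exempt_waits and prev.kind == "wait"):
            break
        order[i - 1], order[i] = order[i], order[i - 1]
        i -= 1
    return [n.name for n in order]


# TP all_gather -> wait -> DP all_gather that consumes the waited result.
schedule = [
    Node("ag_tp", "collective"),
    Node("wait_tp", "wait", frozenset({"ag_tp"})),
    Node("ag_dp", "collective", frozenset({"wait_tp"})),
]

unsafe = hoist(schedule, "ag_dp", exempt_waits=True)
safe = hoist(schedule, "ag_dp", exempt_waits=False)
print(unsafe)  # ag_dp ends up before wait_tp
print(safe)    # ag_dp stays after wait_tp
```

With the exemption, `ag_dp` is issued before `wait_tp` has completed, so it would read a buffer that is not yet ready. Without it, the wait is honored, at the cost of also treating waits conservatively in cases where a finer-grained check might allow the move.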

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

pytorch-bot · 2025-07-02T19:51:26Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157489

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 832f5d7 with merge base 156bc24:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu, unstable) (gh) (#153987)
MISSING REGRESSION TEST

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab · 2025-07-02T20:19:05Z

@pytorchbot merge

pytorchmergebot · 2025-07-02T20:20:57Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2025-07-02T21:13:52Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

pull / linux-jammy-cuda12.8-py3.10-gcc11-sm89 / test (default, 3, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu)

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

ghstack-source-id: f946d9b
Pull Request resolved: #157489

wconstab · 2025-07-03T00:18:16Z

@pytorchbot merge -i

pytorchmergebot · 2025-07-03T00:20:09Z

Merge started

Your change will be merged while ignoring the following 1 checks: pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu, unstable)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

wconstab mentioned this pull request Jul 2, 2025

Move logging into inner method for reorder pass #156879

Closed

pytorch-bot bot added the ciflow/inductor, module: inductor, and oncall: distributed labels Jul 2, 2025

wconstab added the release notes: inductor label Jul 2, 2025

IvanKobzarev approved these changes Jul 2, 2025

pytorch-bot bot added the ciflow/trunk label Jul 2, 2025

pytorchmergebot added the merging label Jul 2, 2025

pytorchmergebot removed the merging label Jul 2, 2025

pytorchmergebot added the merging label Jul 3, 2025

pytorchmergebot added the Merged label Jul 3, 2025

pytorchmergebot closed this in 382598e Jul 3, 2025

pytorchmergebot removed the merging label Jul 3, 2025

sawaraken bot mentioned this pull request Jul 3, 2025

PyTorch Fixes Bug in Unsafe Collective Reordering Past Wait Nodes / PyTorch、非同期集合通信の待機ノード越えの危険な並び替えに関するバグを修正 xhiroga/news#815

Open

github-actions bot deleted the gh/wconstab/422/head branch August 3, 2025 02:20
