Do not incorrectly chain each of the strings as iterables #160709

ezyang · 2025-08-15T04:06:38Z

Stack from ghstack (oldest at bottom):

-> Do not incorrectly chain each of the strings as iterables #160709

Signed-off-by: Edward Yang <ezyang@meta.com>

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

[ghstack-poisoned]

Signed-off-by: Edward Yang <ezyang@meta.com> ghstack-source-id: c2ca2f6 Pull-Request: #160709

pytorch-bot · 2025-08-15T04:06:42Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160709

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9310376 with merge base d678674:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Skylion007 · 2025-08-15T17:01:19Z

torch/distributed/device_mesh.py

            self.flatten_name_to_root_dims.setdefault(root_mesh, {})
            invalid_dim_names = chain(
-                *list(not_none(root_mesh.mesh_dim_names)),
+                list(not_none(root_mesh.mesh_dim_names)),
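
To make the one-character change concrete, here is a minimal sketch in plain Python with no torch dependency. The list `mesh_dim_names` is a stand-in for `root_mesh.mesh_dim_names` with assumed values, not the real attribute. `itertools.chain` treats each positional argument as an iterable, so unpacking a list of strings chains their characters, while passing the list itself chains the names.

```python
from itertools import chain

# Stand-in for root_mesh.mesh_dim_names (assumed values for illustration).
mesh_dim_names = ["cp", "tp"]

# Before the fix: the star unpacks the list, so each string becomes its own
# iterable and chain() yields individual characters.
chained_chars = list(chain(*mesh_dim_names))

# After the fix: the list is passed as a single iterable of names.
chained_names = list(chain(mesh_dim_names))

print(chained_chars)  # ['c', 'p', 't', 'p']
print(chained_names)  # ['cp', 'tp']
```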


No tests for this regression?

fduwjj · 2025-08-15

To clarify, this bug does not affect the happy path, meaning the cases where `_flatten` works as expected. It only shows up when the mesh_dim_name we pass in is the same as one of the mesh_dim_names of the root mesh.

For example:
mesh = init_device_mesh((4,2), ["cp", "tp"])
mesh["cp", "tp"]._flatten("tp")  # this should fail instead of being allowed to pass
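
A minimal sketch of why a name like "tp" could slip through, in plain Python. The helper `conflicts` is hypothetical and only mimics a membership test against the chained names; it is not the DeviceMesh implementation, and the dim names are assumed from the example.

```python
from itertools import chain

root_dim_names = ["cp", "tp"]  # assumed root mesh dim names


def conflicts(name, invalid_dim_names):
    """Hypothetical check: is the requested flatten name already taken?"""
    return name in list(invalid_dim_names)


# Buggy form: the invalid names are single characters, so "tp" is not found.
buggy = conflicts("tp", chain(*root_dim_names))

# Fixed form: the invalid names are whole dim names, so "tp" is found.
fixed = conflicts("tp", chain(root_dim_names))

print(buggy, fixed)  # False True
```

Under the same assumptions, the buggy form would also flag a single-character name such as "p" even though no mesh dim has that name.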

wconstab · 2025-08-15

You're saying that it's pointless to flatten a single dim, right?
Well, shouldn't we still add a test for how flatten behaves in this case? If we want to throw an exception, we can use assertRaises.

fduwjj · 2025-08-18

Sure, we definitely should add a unit test case for it. Also, it is not just that "it's pointless to flatten a single dim"; we are also flattening two mesh dims into a conflicting mesh_dim_name, which causes ambiguity during slicing. This won't be a problem if we do it the functional way, though. (So I guess the test, if added, will be removed later.)

fduwjj · 2025-08-15T20:41:38Z

@pytorchbot merge

pytorchmergebot · 2025-08-15T20:43:24Z

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

fduwjj · 2025-08-15T20:47:34Z

@pytorchbot merge

pytorchmergebot · 2025-08-15T20:49:12Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

…0709) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: pytorch#160709 Approved by: https://github.com/Skylion007, https://github.com/fduwjj

While refactoring DeviceMesh and simplifying its bookkeeping, we found an interesting bug in the DeviceMesh flatten implementation. Here are the findings:

1. In the unit tests, we assume users can call `dp_cp_mesh._flatten()` many times without a new backend being created (that is, the result is cached).
2. In the implementation of slicing, we actually throw an exception when `_flatten` is called more than once. That check had a bug that was partially fixed in #160709, but the fix does not cover the case where `_flatten` is called twice.

The more important question is what behavior we want for `_flatten`. Should we allow calling `_flatten` multiple times with the same mesh_name? I think we should, for two reasons:

1. We allow slicing with the same mesh_name or name_list multiple times, and we cache the PG behind it. Although we return a new device mesh object every time, the objects all compare equal (according to `__eq__`).
2. We already cache the flattened mesh in `root_to_flatten_mapping` and return early, but that line is never reached if we error out before it.

We should also allow flattening a 1D mesh into its own mesh_dim_name as a no-op; I added a unit test for it.

Pull Request resolved: #161311
Approved by: https://github.com/fegin
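
The ordering issue in the second reason can be sketched with a toy cache. Everything below is hypothetical: `ToyMesh` and its `flatten` method only borrow the name `root_to_flatten_mapping` from the description and do not reflect the real DeviceMesh code. The sketch shows that placing the cache lookup before the name-conflict check lets a repeated flatten with the same name return the cached result instead of raising.

```python
class ToyMesh:
    """Hypothetical stand-in for a root mesh; not the real DeviceMesh."""

    def __init__(self, dim_names):
        self.dim_names = list(dim_names)
        # Name borrowed from the PR description; maps flatten name -> result.
        self.root_to_flatten_mapping = {}

    def flatten(self, dims, name):
        key = tuple(dims)
        cached = self.root_to_flatten_mapping.get(name)
        # Early return first: the same dims flattened under the same name
        # reuse the cached result.
        if cached is not None and cached[0] == key:
            return cached
        # Only then reject names that are already taken.
        if name in self.dim_names or name in self.root_to_flatten_mapping:
            raise ValueError(f"{name!r} conflicts with an existing mesh dim name")
        result = (key, name)
        self.root_to_flatten_mapping[name] = result
        return result


mesh = ToyMesh(["dp", "cp", "tp"])
first = mesh.flatten(["dp", "cp"], "dp_cp")
second = mesh.flatten(["dp", "cp"], "dp_cp")  # served from the cache
print(first is second)  # True
```

If the two checks were swapped, the second call would raise because "dp_cp" is already registered, which is the situation the description calls out.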

Update

9310376

[ghstack-poisoned]

ezyang added a commit that referenced this pull request Aug 15, 2025

Do not incorrectly chain each of the strings as iterables

afdc113

Signed-off-by: Edward Yang <ezyang@meta.com> ghstack-source-id: c2ca2f6 Pull-Request: #160709

pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Aug 15, 2025

github-actions bot requested review from SherlockNoMad, albanD, antoniojkim, bdhirsh and miladm August 15, 2025 04:06

Skylion007 approved these changes Aug 15, 2025

Skylion007 reviewed Aug 15, 2025

fduwjj approved these changes Aug 15, 2025

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 15, 2025

pytorchmergebot added the merging label Aug 15, 2025

pytorchmergebot removed the merging label Aug 15, 2025

fduwjj added DeviceMesh topic: bug fixes topic category release notes: DeviceMesh labels Aug 15, 2025

pytorchmergebot added the merging label Aug 15, 2025

pytorchmergebot added the Merged label Aug 15, 2025

pytorchmergebot closed this in 838f22c Aug 15, 2025

pytorchmergebot removed the merging label Aug 15, 2025

fduwjj mentioned this pull request Aug 22, 2025

[DeviceMesh] Clarifying flatten use case #161311

Closed

github-actions bot deleted the gh/ezyang/3137/head branch September 18, 2025 02:08
