[DeviceMesh] Update get_group and add get_all_groups by wz337 · Pull Request #128097 · pytorch/pytorch · GitHub

Conversation

wz337
Contributor

@wz337 wz337 commented Jun 6, 2024

@pytorch-bot

pytorch-bot bot commented Jun 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128097

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 4 Unrelated Failures

As of commit a2640e5 with merge base 65aa16f:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels Jun 6, 2024
@wz337 wz337 changed the title [WIP]update get_group and add get_groups → [DeviceMesh] Update get_group and add get_groups Jun 6, 2024
@wz337 wz337 requested review from wanchaol and wconstab June 6, 2024 16:59
@wz337 wz337 marked this pull request as ready for review June 6, 2024 16:59
@wz337
Contributor Author

wz337 commented Jun 6, 2024

@wconstab @wanchaol As suggested, I updated get_group to return a single PG and added a new get_groups API. I am wondering how we should warn users about the change, since the API signature of get_group stays the same while the return type changes.

I believe most of the use cases I've seen call get_group with a mesh_dim passed in, so those won't be affected. The few use cases where the list is returned are actually in our own code base or tests. Should we throw a warning in get_group when the user does not pass in a mesh_dim and redirect them to get_groups if they are looking for the list of all the PGs?
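
For reference, a minimal usage sketch of the behavior being proposed here (the mesh shape and dim names are made up, and get_all_groups is the name the new API ends up with after the later rename in this PR):

```python
# Sketch only: assumes torch.distributed is already initialized across 8 ranks;
# the "dp"/"tp" dim names are placeholders, not something defined by this PR.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

tp_group = mesh.get_group("tp")      # a single ProcessGroup for this rank's "tp" dim
dp_group = mesh.get_group("dp")      # a single ProcessGroup for this rank's "dp" dim

all_groups = mesh.get_all_groups()   # the list of per-dim ProcessGroups

# On a >1D mesh, get_group() without mesh_dim now errors instead of returning a list.
```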

@facebook-github-bot
Contributor

@wz337 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@wz337 wz337 added module: dtensor distributed tensor tag and removed release notes: distributed (fsdp) release notes category labels Jun 6, 2024
@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Jun 6, 2024
@wz337 wz337 added release notes: distributed (dtensor) release notes category and removed release notes: distributed (fsdp) release notes category labels Jun 6, 2024
```python
return not_none(
    _find_pg_by_ranks_and_tag(*self._dim_group_infos[0][:2])

if self.mesh.ndim > 1 and mesh_dim is None:
    raise RuntimeError(
```
Contributor

do we want to raise the error here, or just issue a deprecation warning?

  • how many users do we think are already using this API and would hit this error?

cc @wanchaol

Contributor Author

@wz337 wz337 Jun 6, 2024

I have reviewed all the internal use cases I can find via fbgs mesh.get_group( (unless people named it something else), and it looks like they are all either:

  1. already calling get_group with mesh_dim specified, or
  2. calling get_group on a 1D child mesh without mesh_dim specified.

For these two cases we already return a single PG anyway, so they won't be affected.
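
Illustrated concretely (hypothetical mesh shape and dim names), the two unaffected patterns look like:

```python
# Hypothetical examples of the two call patterns listed above; both already
# received a single ProcessGroup before this change, so they keep working.
from torch.distributed.device_mesh import init_device_mesh

mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

# 1. get_group with mesh_dim specified
dp_group = mesh_2d.get_group("dp")

# 2. get_group on a 1D child mesh without mesh_dim specified
tp_mesh = mesh_2d["tp"]          # slice out a 1D sub-mesh
tp_group = tp_mesh.get_group()   # only one dim, so no mesh_dim is needed
```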

Contributor

ok. it may be fine. if you want to derisk further against a revert, you could do a warning in this PR and stack a PR on top that changes the warning to an error. but i'll stamp to unblock.
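
For illustration only, a rough sketch of the warn-first option (this is not code from the PR; it is written as a standalone wrapper so it stays self-contained, and it assumes get_all_groups exists as added here):

```python
import warnings
from typing import Optional, Union

def get_group_with_deprecation(mesh, mesh_dim: Optional[Union[int, str]] = None):
    """Hypothetical warn-first behavior: keep the old list return for now,
    emit a FutureWarning, and flip to an error in a stacked follow-up PR."""
    if mesh.ndim > 1 and mesh_dim is None:
        warnings.warn(
            "Calling get_group() without mesh_dim on a >1D DeviceMesh is "
            "deprecated; pass mesh_dim, or use get_all_groups() for the full list.",
            FutureWarning,
            stacklevel=2,
        )
        return mesh.get_all_groups()  # preserve the old list-returning behavior for now
    return mesh.get_group(mesh_dim)
```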

Contributor Author

@wz337 wz337 Jun 6, 2024

> ok. it may be fine. if you want to derisk further against a revert, you could do a warning in this PR and stack a PR on top that changes the warning to an error. but i'll stamp to unblock.

I think it should be fine. I am importing this PR as a diff to let internal tests run on it. Either way, even if we only issue a warning, anything that still relies on a list being returned at this point would end up with an error anyway.

Collaborator

yeah, I think either a warning or an error works. Although this is a corner case, it's technically BC breaking, so we would probably need to put a line in the release notes to explain the change.

"""
Returns a list of ProcessGroups corresponding to the mesh dimensions, or
returns a single ProcessGroup if mesh_dim is specified or the given mesh has
Returns a single ProcessGroup if mesh_dim is specified or the given mesh has
Contributor

nit: rephrase as something like

"Returns the single ProcessGroup specified by mesh_dim, or, if mesh_dim is unspecified and the DeviceMesh is 1-dimensional, returns the only ProcessGroup in the mesh."

```python
dim_groups = mesh.get_group()
assert isinstance(dim_groups, list)
return dim_groups[0]
return dim_groups
```
Contributor

nit: it's not groups anymore; maybe just say return mesh.get_group() instead?
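
Applied to the snippet above, the call site would reduce to something like (the surrounding helper is hypothetical):

```python
# On a 1D mesh, get_group() now returns a single ProcessGroup directly,
# so the intermediate list handling can be dropped.
def _group_from_1d_mesh(mesh):
    return mesh.get_group()
```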

@wz337 wz337 force-pushed the fix_get_group branch 2 times, most recently from 2bd9ffc to 7b0a7e9 on June 6, 2024 22:31
@wz337
Contributor Author

wz337 commented Jun 6, 2024

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fix_get_group onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_get_group && git pull --rebase)

@wz337
Contributor Author

wz337 commented Jun 7, 2024

@pytorchmergebot rebase

@wz337 wz337 changed the title [DeviceMesh] Update get_group and add get_groups → [DeviceMesh] Update get_group and add get_all_groups Jun 7, 2024
@facebook-github-bot
Contributor

@wz337 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Collaborator

@wanchaol wanchaol left a comment

lgtm! one nit inlined


@wanchaol wanchaol added the topic: bc breaking topic category label Jun 7, 2024
@facebook-github-bot
Contributor

@wz337 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@wz337
Contributor Author

wz337 commented Jun 8, 2024

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 8, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cpu-py3 / build

Details for Dev Infra team Raised by workflow job

@wz337
Contributor Author

wz337 commented Jun 8, 2024

@pytorchmergebot merge -i "unreleated trunk errror"

@pytorch-bot

pytorch-bot bot commented Jun 8, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: unreleated trunk errror

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@wz337
Contributor Author

wz337 commented Jun 8, 2024

@pytorchmergebot merge -i

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 3, 5, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team Raised by workflow job

@wz337
Contributor Author

wz337 commented Jun 8, 2024

@pytorchmergebot merge -i


Labels

  • ciflow/inductor
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • Merged
  • module: dtensor (distributed tensor tag)
  • oncall: distributed (Add this issue/PR to distributed oncall triage queue)
  • release notes: distributed (dtensor) (release notes category)
  • topic: bc breaking (topic category)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DeviceMesh] get_group() docs and behavior inconsistent for mesh_dim=None

5 participants