[dtensor] relax device_mesh argument constraint in local_map by wanchaol · Pull Request #157049 · pytorch/pytorch · GitHub

Conversation

@wanchaol
Collaborator

@wanchaol wanchaol commented Jun 27, 2025

This PR relaxes the device_mesh argument constraint in the local_map API. The current restriction is too strict: all input arguments that are DTensors must live on the same device mesh. But users often want to pass DTensors that live on different device meshes, e.g. a weight and an activation could live on different meshes.

When using local_map we extract the local tensors from the input DTensors, and as long as the placements the user specifies match the actual DTensor placements, the user is clearly stating that the inputs are intended to live on different meshes. So this PR removes the same-mesh check and updates the doc to clearly document the behavior.

The `device_mesh` argument now serves one main purpose: allowing the user to specify the device mesh for the output DTensor reconstruction.
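
A minimal runnable sketch of the relaxed behavior described above (not taken from this PR's tests): the mesh shapes, tensor sizes, and the `local_matmul` helper are illustrative, and the snippet assumes a 4-rank run launched via torchrun with CUDA devices.

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard, distribute_tensor
from torch.distributed.tensor.experimental import local_map

def local_matmul(activation, weight):
    # Runs on the plain local shards that local_map extracts from the DTensors.
    return torch.mm(activation, weight)

# Two different meshes over the same 4 ranks: a 2-D mesh and a 1-D mesh.
mesh_2d = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
mesh_1d = init_device_mesh("cuda", (4,))

torch.manual_seed(0)  # same global tensors on every rank
# After this PR the inputs may live on different meshes, as long as the
# placements passed to local_map match each DTensor's actual placements.
activation = distribute_tensor(torch.randn(8, 16), mesh_2d, [Shard(0), Replicate()])
weight = distribute_tensor(torch.randn(16, 32), mesh_1d, [Replicate()])

wrapped = local_map(
    local_matmul,
    out_placements=[Shard(0), Replicate()],  # placements of the output DTensor
    in_placements=([Shard(0), Replicate()], [Replicate()]),  # per-input placements
    device_mesh=mesh_2d,  # mesh used to reconstruct the output DTensor
)
out = wrapped(activation, weight)  # DTensor of global shape (8, 32) on mesh_2d
```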

Fixes #ISSUE_NUMBER

cc @H-Huang @awgu @fegin @fduwjj @wz337 @wconstab @d4l3k

@pytorch-bot

pytorch-bot bot commented Jun 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157049

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 1 Unrelated Failure

As of commit aa77732 with merge base 81759af (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jun 27, 2025
@wanchaol wanchaol requested review from Chillee, XilunWu, awgu and zpcore June 27, 2025 21:35
@wanchaol wanchaol added release notes: distributed (dtensor) release notes category ciflow/trunk Trigger trunk jobs on your pull request labels Jun 27, 2025
Collaborator

@Chillee Chillee left a comment


Makes sense to me!

@zpcore
Member

zpcore commented Jun 27, 2025

LGTM!

Overall this seems to be a hack to get around the cross-mesh assertion. We basically leave it to the user to make sure the shapes are correct so that we don't need to write the complicated "cross-mesh sharding propagation rule". My concern is that the code can easily fail once we run the model with a different world_size or input.

@wanchaol
Collaborator Author

Overall this seems to be a hack to get around the cross-mesh assertion. We basically leave it to the user to make sure the shapes are correct so that we don't need to write the complicated "cross-mesh sharding propagation rule". My concern is that the code can easily fail once we run the model with a different world_size or input.

@zpcore The role of the local_map API is to give control back to the user to operate directly on the input DTensors' local shards, and it returns a DTensor according to the user-provided out_placements, so the user function wrapped with local_map can do whatever the user wants. For the wrapped user function there is no "cross-mesh sharding propagation rule" to write, since the in_placements and out_placements have already been specified by the user as part of the local_map API/contract (and in some sense they act like a propagation rule).

I think the previous assertion that different input DTensors must live on the same mesh was there for extra safety in the beginning, but it turns out to be unnecessary. The restriction limits the expressiveness of the in_placements the user can specify (suppose the user passes two DTensors, one on a 1-D mesh and the other on a 2-D mesh; even though the user can express placements for this, the previous assertion would error out). A sketch of that case follows below. Let me know if that does not make sense to you.
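
To make the contract concrete, here is a hedged sketch of the 1-D vs 2-D mesh case mentioned above (mesh shapes, tensor sizes, and the bias-add setup are illustrative, assuming a 4-rank torchrun launch): each entry of in_placements is checked only against the corresponding input DTensor's own placements, so the entries can have different lengths when the inputs live on meshes of different dimensionality.

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard, distribute_tensor
from torch.distributed.tensor.experimental import local_map

mesh_2d = init_device_mesh("cuda", (2, 2))  # activation mesh (2-D)
mesh_1d = init_device_mesh("cuda", (4,))    # bias mesh (1-D)

activation = distribute_tensor(torch.ones(8, 8), mesh_2d, [Shard(0), Replicate()])
bias = distribute_tensor(torch.ones(8), mesh_1d, [Replicate()])

biased = local_map(
    torch.add,
    out_placements=[Shard(0), Replicate()],  # placements of the single output on mesh_2d
    # One placement sequence per input; the lengths differ because the inputs
    # live on a 2-D and a 1-D mesh respectively. With the default
    # redistribute_inputs=False, local_map raises if an entry here does not
    # match the corresponding DTensor's actual placements.
    in_placements=([Shard(0), Replicate()], [Replicate()]),
    device_mesh=mesh_2d,  # mesh used to rebuild the output DTensor
)
out = biased(activation, bias)  # DTensor of global shape (8, 8) on mesh_2d
```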

@wanchaol
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@wanchaol
Collaborator Author

@pytorchbot merge -f "ci failure not related"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
