[dtensor] Rework partial propagation in pointwise op and support mul #157340
Conversation
I am trying to see if I can easily add linearity support for aten.mul so that Partial placements can propagate through it. It turns out that this requires a complete rework of the current linearity propagation.

In short, before this PR linearity mainly supported aten.add and a few trivial ops. It works by allowing Partial inputs to propagate through, while redistributing Replicate inputs to Partial to preserve single-device semantics. For example, suppose we run aten.add(lhs, rhs) on 2 ranks:

* lhs is partial, with value r0 on rank 0 and r1 on rank 1
* rhs is replicate, with value a

To preserve single-device semantics (the result should be `a + r0 + r1`), we first compute rhs / world_size and then add rhs to lhs. In other words, every operand is first turned into a partial, and then the partials are added together.

This no longer holds for multiplicative operations such as aten.mul: for the same `aten.mul(lhs, rhs)` with the same values, we do not need to divide the replicated rhs by world_size to preserve single-device semantics, because `a * (r0 + r1) = a * r0 + a * r1` (see the sketch below).

To accommodate the difference between add and mul, this PR:

* changes linearity to an int to support different linearity types; additive and multiplicative linearity are kept separate
* adds checks to ensure only a subset of partial types support linearity (namely partial-sum/avg)
* plumbs the linearity type through the pointwise ops
* registers mul.Tensor/Scalar as multiplicative linearity
* adds tests showing that Partial placements can be propagated through aten.mul
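To make the single-device-semantics argument above concrete, here is a minimal sketch (not part of this PR) that simulates the two ranks' values as plain tensors in a single process. The names `r0`, `r1`, `a`, and `world_size` follow the description above; no DTensor APIs are used.

```python
import torch

world_size = 2

# Simulated Partial(sum) operand: the "true" value is r0 + r1,
# with each rank holding one piece of the sum.
r0 = torch.tensor([1.0, 2.0])   # lhs partial value on rank 0
r1 = torch.tensor([3.0, 4.0])   # lhs partial value on rank 1
a = torch.tensor([10.0, 20.0])  # rhs replicated value (same on both ranks)

# Single-device reference results.
add_ref = (r0 + r1) + a
mul_ref = (r0 + r1) * a

# aten.add: the replicated operand must be pre-divided by world_size so
# that summing the per-rank outputs reproduces the reference exactly once.
add_rank0 = r0 + a / world_size
add_rank1 = r1 + a / world_size
assert torch.allclose(add_rank0 + add_rank1, add_ref)

# aten.mul: no division is needed, because a * (r0 + r1) == a*r0 + a*r1,
# so the outputs can stay Partial(sum) as-is.
mul_rank0 = r0 * a
mul_rank1 = r1 * a
assert torch.allclose(mul_rank0 + mul_rank1, mul_ref)

print("both linearity types preserve single-device semantics")
```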
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157340
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 326eeea with merge base 81759af.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
cc @H-Huang @awgu @fegin @fduwjj @wz337 @wconstab @d4l3k