Support complex numbers in DTensor redistribute by wconstab · Pull Request #157329 · pytorch/pytorch

Conversation

@wconstab (Contributor) commented Jun 30, 2025:

Stack from ghstack (oldest at bottom):

Add complex number unwrapping in functional collectives used by DTensor.

Complex tensors are not directly supported by the underlying comm kernels
(e.g. NCCL), but a complex tensor can be viewed as a real tensor of higher
rank (the added size-2 trailing dim represents the real vs. imaginary
component). The collective output is then viewed as complex to restore the
original/expected shape and dtype.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k
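The unwrap/rewrap pattern described above can be sketched end-to-end without any comm backend. The following is a hypothetical, dependency-free illustration: the function names mimic `torch.view_as_real`/`torch.view_as_complex`, and the "allreduce" is a stand-in that sums elementwise across ranks (real kernels such as NCCL would do this step).

```python
# A minimal sketch of the complex-unwrapping pattern. Complex values are
# viewed as real pairs (adding a trailing size-2 dim), the real collective
# runs, and the result is viewed back as complex.

def view_as_real(z):
    # Complex vector -> real "tensor" of rank+1 with a trailing size-2 dim.
    return [[v.real, v.imag] for v in z]

def view_as_complex(r):
    # Inverse of view_as_real: collapse the trailing size-2 dim.
    return [complex(re, im) for re, im in r]

def fake_allreduce_sum(per_rank):
    # Stand-in for the real collective: elementwise sum across "ranks".
    out = []
    for parts in zip(*per_rank):
        out.append([sum(c) for c in zip(*parts)])
    return out

def allreduce_complex(per_rank_complex):
    # Unwrap complex -> real, run the real collective, re-wrap as complex.
    real_inputs = [view_as_real(z) for z in per_rank_complex]
    return view_as_complex(fake_allreduce_sum(real_inputs))

result = allreduce_complex([[1 + 2j, 3 - 1j], [4 + 0.5j, -1 + 1j]])
print(result)  # [(5+2.5j), (2+0j)]
```

Because summing the real and imaginary components independently equals the complex sum, the round trip restores the expected shape, dtype, and values.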

@pytorch-bot (bot) commented Jun 30, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157329

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit d1d069e with merge base 070aa59:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the labels "oncall: distributed" (add this issue/PR to the distributed oncall triage queue) and "release notes: distributed (c10d)" (release notes category) on Jun 30, 2025.
wconstab added a commit that referenced this pull request Jun 30, 2025
Add complex number unwrapping in functional collectives used by DTensor.

ghstack-source-id: da2fdae
Pull Request resolved: #157329
@pytorch-bot added the label "ciflow/trunk" (trigger trunk jobs on your pull request) on Jun 30, 2025.
Differential Revision: [D77564148](https://our.internmc.facebook.com/intern/diff/D77564148)
Comment on lines 130 to 131
# TODO(whc) it appears complex-allreduce is already being supported because this test passes,
# but I did not see where the support is
@wconstab (Author) replied:
Ah, thanks for pointing that out. That was the last piece I was missing.

The new piece of info was that CI failed this test for Gloo, because Gloo does not support allreduce on complex tensors. I therefore added the complex support in Functional.cpp for allreduce, fixing the Gloo test, and now I understand why NCCL was already passing without it. Let me remove this TODO.

@XilunWu (Contributor) reviewed:
The test change LGTM, but the functional collective part needs a small change. Thanks Will for adding complex number support!

Comment on lines +84 to +88
auto input_real = input.is_complex() ? at::view_as_real(input) : input;
auto output = input_real.clone(at::MemoryFormat::Contiguous);
auto output_ret =
all_reduce_(output, std::move(reduce_op), std::move(group_name));
return input.is_complex() ? at::view_as_complex(output_ret) : output_ret;
@XilunWu commented:
We need the same logic here (checking reduce_op, as in https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L4393-L4400), because this approach only preserves numeric correctness for those 4 ops.
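The numeric-correctness point can be checked directly: reductions that are linear in the real and imaginary components (such as sum) commute with the view-as-real trick, while nonlinear ones do not. A minimal illustration in plain Python (not PyTorch code):

```python
# Why only linear reduce ops are safe on complex-viewed-as-real data:
# componentwise sum of (real, imag) pairs matches the complex sum, but
# componentwise product does not match the complex product.

a, b = 1 + 2j, 3 + 4j

# Sum: reducing real and imaginary parts independently is correct.
sum_componentwise = complex(a.real + b.real, a.imag + b.imag)
assert sum_componentwise == a + b  # (4+6j)

# Product: complex multiplication mixes real and imaginary parts,
# so a componentwise reduction gives the wrong answer.
prod_componentwise = complex(a.real * b.real, a.imag * b.imag)
assert prod_componentwise != a * b  # (3+8j) vs (-5+10j)
```

This is why the check must reject reduce ops outside the small allowed set before unwrapping a complex tensor.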

@wconstab (Author) replied:

I don't know what the best choice is here to reuse the existing helper. I ended up writing a new string-matching check. Let me know if you think it's worthwhile to do more refactoring to use the same helper from ProcessGroupNCCL, but note that to use it I'd first have to convert into the ReduceOp enum, and the conversion helper in this file is incomplete for some reason (premul_sum).

wconstab added a commit that referenced this pull request Jul 1, 2025
Add complex number unwrapping in functional collectives used by DTensor.

ghstack-source-id: 052ebcd
Pull Request resolved: #157329
self.assertEqual(new_tensor.stride(), new_meta_tensor.stride())


instantiate_parametrized_tests(RedistributeTest)
A contributor commented:
Note to myself: use instantiate_parametrized_tests in DTensor tests.

@XilunWu (Contributor) approved:

Stamp to unblock. We can address the comment in a follow-up PR.

TORCH_CHECK(
// TODO - ideally use 'to_reduce_op' helper but it currently errors on
// premul_sum
reduce_op == "sum" || reduce_op == "avg" || reduce_op == "premul_sum" ||
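As a plain-Python sketch, the string-matching check above amounts to an allowlist lookup. Only the op names visible in the excerpt are included here; this is a hypothetical illustration, not PyTorch's actual helper.

```python
# Hypothetical allowlist check mirroring the TORCH_CHECK above: reject
# reduce ops whose componentwise reduction would be numerically wrong
# on a complex tensor viewed as real.

ALLOWED_COMPLEX_REDUCE_OPS = {"sum", "avg", "premul_sum"}

def check_reduce_op_for_complex(reduce_op: str) -> None:
    # Raise if the op cannot be applied componentwise to (real, imag) pairs.
    if reduce_op not in ALLOWED_COMPLEX_REDUCE_OPS:
        raise ValueError(
            f"all_reduce: reduce op {reduce_op!r} is not supported for complex tensors"
        )

check_reduce_op_for_complex("sum")  # ok; "max" would raise ValueError
```

A string allowlist avoids the string-to-ReduceOp-enum conversion, which (as noted below) is incomplete for premul_sum in this file.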
@XilunWu commented:

IMO it would be good to reuse complexViewAsRealAllowed.

@wconstab (Author) replied:

I agree, but it is not trivial.

  1. Convert string to enum: there is a helper in this file, but it is not complete. If I completed it by filling in the missing premul_sum, I would be affecting the behavior of other ops using this helper.
  2. Then I could refactor complexViewAsRealAllowed out of ProcessGroupNCCL.cpp, make it a util, and use it.

Happy to do this in another PR; let me know if you have thoughts about (1).

@XilunWu replied:

Low prio though.

> lmk if you have thoughts about (1)

I think you're right. And I don't have a good solution either that can ensure consistency among the helpers and potential extensions of the ReduceOp enum.

@wconstab (Author) commented Jul 2, 2025:

@pytorchbot merge -i

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged while ignoring the following 3 checks: pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu, unstable), trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m1-13), trunk / win-vs2022-cpu-py3 / test (default, 1, 3, lf.windows.4xlarge.nonephemeral)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@wconstab wconstab deleted the gh/wconstab/421/head branch July 3, 2025 00:18

Labels: ciflow/trunk, Merged, oncall: distributed, release notes: distributed (c10d)

3 participants