[DTensor][XLA] Support Xla backend in distribute_tensor API #110275
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110275
Note: links to docs will display an error until the doc builds have completed.
✅ No failures as of commit d435177 with merge base 3ca81ae. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Looks pretty good and almost ready to merge! I left a few more comments with testing and doc suggestions; we can land this once they're addressed :)
Force-pushed from 0348a41 to dff47ce
Force-pushed from ada34d3 to d435177
Looks good; we can do a follow-up PR to move most of the logic to pytorch/xla directly.
@with_xla
def xla_distribute_tensor(
Something suggested by @bdhirsh that could further simplify the PyTorch integration and be more flexible on your side: we can essentially move the whole logic to pytorch/xla and expose an `xla_distribute_tensor` in the pytorch/xla package; then on the PyTorch side we can just use that API directly without having this _xla.py.
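For illustration, a rough sketch of what that follow-up could look like on the PyTorch side; the `torch_xla` import path and the exact signature are assumptions, not the actual integration:

```python
# Hypothetical sketch (not the actual integration): dispatch distribute_tensor
# to an entry point exposed by the pytorch/xla package. The torch_xla import
# path below is an assumption for illustration.
from typing import Sequence

import torch
from torch.distributed._tensor import DeviceMesh
from torch.distributed._tensor.placement_types import Placement


def distribute_tensor(
    tensor: torch.Tensor,
    device_mesh: DeviceMesh,
    placements: Sequence[Placement],
):
    if device_mesh.device_type == "xla":
        # Call straight into the API owned by pytorch/xla, so no _xla.py shim
        # is needed on the PyTorch side.
        from torch_xla.distributed.spmd import xla_distribute_tensor  # assumed path
        return xla_distribute_tensor(tensor, device_mesh, placements)
    # ... fall through to the existing eager DTensor implementation ...
```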
Sounds good, thanks Brian @bdhirsh
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
LGTM except one comment.
assert dt_mesh.size() == xr.global_runtime_device_count()
return Mesh(
Unclear to me how HybridMesh would work here. But we can follow up later.
Good question - HybridMesh is a wrapper class (`class HybridMesh(Mesh)`) for multi-pod (ICI & DCN shapes), which still uses xs.Mesh. It's probably not relevant in DTensor yet; we should define a new mesh type, like HybridDeviceMesh, and integrate there when the time comes for a follow-up. Thanks @alanwaketan
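For context, a rough sketch of the DeviceMesh-to-xs.Mesh conversion this thread refers to; the helper name `convert_to_xla_mesh` and the exact constructor arguments are illustrative rather than the PR's code verbatim:

```python
# Illustrative sketch, assuming a single-pod setup (no HybridMesh / DCN axes).
import torch_xla.runtime as xr
import torch_xla.experimental.xla_sharding as xs
from torch.distributed._tensor import DeviceMesh


def convert_to_xla_mesh(dt_mesh: DeviceMesh) -> xs.Mesh:
    # The logical DTensor mesh must cover every addressable XLA device.
    assert dt_mesh.size() == xr.global_runtime_device_count()
    return xs.Mesh(
        dt_mesh.mesh.flatten().tolist(),  # flat list of global device ids
        tuple(dt_mesh.mesh.size()),       # same logical shape as the DeviceMesh
    )
```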
) -> None:
    if TORCH_XLA_INITIALIZED:
        # TODO(yeounoh) replace this with xr.use_spmd() when we deprecate the flag.
        os.environ["XLA_USE_SPMD"] = "1"
Any reason we don't use xr.use_spmd() directly today?
Yeah, we should do some more testing and possibly update the error message. For instance, we should block if there is pre-existing non-XLA or non-sharded data and tell the user about it. We'll add more testing downstream before deprecating the flag.
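For reference, a minimal sketch of the guard under discussion, assuming `TORCH_XLA_INITIALIZED` is set by a guarded `torch_xla` import (the try/except shape is an assumption):

```python
import os

try:
    import torch_xla.runtime as xr  # noqa: F401
    TORCH_XLA_INITIALIZED = True
except ImportError:
    TORCH_XLA_INITIALIZED = False

if TORCH_XLA_INITIALIZED:
    # Enable XLA SPMD mode via the environment flag for now; once the flag is
    # deprecated this is intended to become a direct xr.use_spmd() call.
    os.environ["XLA_USE_SPMD"] = "1"
```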
| ), "XLAShardedTensor `tensor` is already annotated with non-replication sharding. " | ||
| "Clear the existing sharding annotation first, by callling torch_xla.experimental.xla_sharding.clear_sharding API." | ||
| global_tensor = tensor.global_tensor # type:ignore[attr-defined] | ||
| assert global_tensor is not None, "distributing a tensor should not be None" |
Do we need an `else: raise ValueError` here? Can we handle a non-XLAShardedTensor?
We handle torch.Tensor or XLAShardedTensor (via its global representation). We decided to block DTensor inputs to be consistent with DTensor's eager API.
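A hedged sketch of the input handling described above; the helper name and the XLAShardedTensor import path are assumptions for illustration:

```python
import torch
# Assumed import path; XLAShardedTensor exposes a .global_tensor attribute.
from torch_xla.experimental.xla_sharded_tensor import XLAShardedTensor


def _unwrap_input(tensor: torch.Tensor) -> torch.Tensor:
    if isinstance(tensor, XLAShardedTensor):
        # Only accept sharded tensors that are not already annotated with a
        # non-replication sharding (otherwise the caller must clear it first).
        global_tensor = tensor.global_tensor
        assert global_tensor is not None, "distributing a tensor should not be None"
        return global_tensor
    if type(tensor) is torch.Tensor:
        return tensor
    # DTensor (and anything else) is rejected, mirroring DTensor's eager API.
    raise ValueError(f"Unsupported input type: {type(tensor)}")
```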
mostly lgtm
…110275) This addresses pytorch#92909 and enables XLA backend support for the `distribute_tensor` API. Test plan: added a unit test case & tested with Cloud TPU. The CI should skip this unless it's an XLA workflow. Pull Request resolved: pytorch#110275. Approved by: https://github.com/wanchaol, https://github.com/alanwaketan, https://github.com/JackCaoG
This addresses #92909 and enables XLA backend support for the `distribute_tensor` API.

Test plan: added a unit test case & tested with Cloud TPU. The CI should skip this unless it's an XLA workflow.
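For reference, a usage sketch of the path this PR enables; the mesh size and tensor shape are illustrative, and it assumes an XLA runtime with SPMD mode enabled:

```python
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

# Illustrative 1-D mesh over four XLA devices.
mesh = DeviceMesh("xla", list(range(4)))
big_tensor = torch.randn(100_000, 88)

# With an "xla" device mesh, distribute_tensor routes to the XLA SPMD backend
# instead of the eager DTensor path, sharding dim 0 across the mesh.
sharded = distribute_tensor(big_tensor, mesh, [Shard(0)])
```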
cc @bdhirsh @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @avikchaudhuri @gmagogsfm @zhxchen17 @tugsbayasgalan @wanchaol