Use device-agnostic runtime API in distributed DDP/FSDP instead of `cuda` device specific. by zhangxiaoli73 · Pull Request #137678 · pytorch/pytorch · GitHub

Conversation

@zhangxiaoli73
Contributor

@zhangxiaoli73 zhangxiaoli73 commented Oct 10, 2024

Motivation

This PR aims to use device-agnostic runtime APIs in distributed DDP/FSDP instead of CUDA-specific ones.
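
For context, a minimal sketch of the direction this PR takes (not the actual diff): runtime calls hard-coded to torch.cuda are replaced with lookups through a device-agnostic entry point such as torch.get_device_module.

import torch

# Before: CUDA-specific runtime calls.
#   torch.cuda.synchronize()
#   torch.cuda.current_stream()

# After: resolve the module for whatever accelerator is active
# (cuda, xpu, ...), falling back to the CPU module if none is present.
device_module = torch.get_device_module()
device_module.synchronize()
stream = device_module.current_stream()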

cc @jgong5 @gujinghui @EikanWang @fengyuan14 @guangyey

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot

pytorch-bot bot commented Oct 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137678

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit df87094 with merge base 034b105:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (ddp) release notes category labels Oct 10, 2024
@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 10, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

torch/_utils.py Outdated

@functools.lru_cache(2)
def _get_device_module(device_type: str):
def _get_device_module(device_type: str = None):
Collaborator


Suggested change
def _get_device_module(device_type: str = None):
def _get_device_module(device_type: Optional[str] = None):

Implicit optionals should not be used.

Contributor Author


Thanks, got it. Change done.

# Make sure the allreduce will not conflict with any other ongoing process group.
if torch.cuda.is_available():
    torch.cuda.synchronize()
elif torch.xpu.is_available():
Collaborator


nit: what about the case where CUDA and XPU devices are both available, but only one is in use? Shouldn't this be based on the parameters and not whether the backend is available? Or based on the distributed group in some way?

Contributor Author


Currently, only one accelerator can be available at once on a given host (see https://github.com/pytorch/pytorch/blob/main/docs/source/torch.rst#accelerators). We can still make it more generic in this case.

I used the device type of flat_params (passed to allreduce later) to query the corresponding device module and perform synchronize() on CUDA and XPU.

If you need any further adjustments, let me know!

Suggested change
elif torch.xpu.is_available():
params_device_type = flat_params.device.type
if params_device_type in ["cuda", "xpu"]:
    _get_device_module(params_device_type).synchronize()

@soulitzer soulitzer requested a review from awgu October 10, 2024 14:46
@soulitzer soulitzer added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Oct 10, 2024
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Oct 11, 2024
@zhangxiaoli73
Contributor Author

@awgu @Skylion007 Thanks for your review comments; all of them should be addressed now. Could you please review again?

@zhangxiaoli73 zhangxiaoli73 force-pushed the cherry/distributed-frontend branch from cb096fd to 0975698 Compare November 1, 2024 08:56
ret_fut = torch.futures.Future()
stream = hook_state.upcast_stream
with torch.cuda.stream(stream):
with _get_device_module().stream(stream):
Contributor Author


There is no corresponding API in torch.acc right now, so I call _get_device_module() to get the device module.

Collaborator


Why don't we have torch.acc.stream?

Collaborator


As discussed in #132204 (comment), we will provide the support of with statement for torch.Stream as a context manager.

Collaborator


OK, but I guess if with stream is supported, we wouldn't need to change _get_device_module and make all the related changes in this PR, which sounds simpler? Is the with stream change more complex?

Collaborator


Yes, with stream: is simpler than with _get_device_module().stream(stream)

Collaborator


So, does it make sense to support with stream to avoid complicating this PR?

Contributor Author


@jgong5 It doesn't seem easy to support with stream in this PR. @guangyey has a WIP PR (#140138) that provides with stream by calling some new accelerator APIs.
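
For illustration, a minimal sketch of the two styles being discussed, assuming CUDA is the active accelerator; stream stands in for hook_state.upcast_stream from the hook code.

import torch

stream = torch.cuda.Stream()

# Style used in this PR: look up the backend module, then use its
# stream() context manager (torch.cuda.stream, torch.xpu.stream, ...).
with torch.get_device_module().stream(stream):
    pass  # work launched here is enqueued on `stream`

# Style discussed in #132204 / #140138: once torch.Stream supports the
# with statement directly, the module lookup is no longer needed:
#
# with stream:
#     ...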

Contributor


This is really nice discussion, thanks folks.

If we were to use get_device_module, I think it would be safer if we provide a device argument to it.
Previous:
torch.cuda <-- the device module is explicit
Current:
get_device_module() <-- assumes the user has set a "current device" context

Since the user may not have done so in their program, I was just a little cautious about whether this would cause a BC break.

Contributor


Oh nvm. get_device_module() gives priority to the accelerator over CPU. Since that's guaranteed, the current code is safe :)
Source of torch.get_device_module:

    elif device is None:
        # Using default accelerator type. If no accelerator is available, it automatically returns CPU device.

@zhangxiaoli73
Contributor Author

I have refined this PR to use torch.acc, which offers device-agnostic runtime APIs. @awgu @Skylion007 Could you please review again?

@zhangxiaoli73 zhangxiaoli73 changed the title Use detected device module in distributed DDP/FSDP instead of cuda device specific. Use device-agnostic runtime API in distributed DDP/FSDP instead of cuda device specific. Nov 1, 2024
Contributor

@kwen2501 kwen2501 left a comment


I like the PR. Thanks for the contribution. Had some minor questions.
Can you please sign the CLA? Thanks.

torch/_utils.py Outdated
Comment on lines 926 to 929
if device_type is None:
    device_type = torch._C._get_accelerator().type
Contributor

@kwen2501 kwen2501 Nov 2, 2024


Some comment would be appreciated here.
cc @albanD @janeyx99 to review this change.
Also, considering the significance of this change itself, does it make sense to put this change into a separate, base PR?

Contributor Author

@zhangxiaoli73 zhangxiaoli73 Nov 4, 2024


Comments added. Let me know if you want this split into a separate PR.

Contributor


I see that the needed functionality is now supported by:

torch.get_device_module(device=None)

Returns the module associated with a given device (e.g., torch.device('cuda'), "mtia:0", "xpu", ...). If no device is given, return the module for the current accelerator or CPU if none is present.

https://pytorch.org/docs/stable/generated/torch.get_device_module.html

Maybe use that API?

Contributor Author


Agreed. Removed the change in _get_device_module(...) and switched to torch.get_device_module(device=None).
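
A minimal usage sketch of that API, following the documented behavior quoted above (the flat_params name is just the tensor from the surrounding FSDP code):

import torch

# With device=None, torch picks the module for the current accelerator,
# or torch.cpu if no accelerator is present.
mod = torch.get_device_module(device=None)
mod.synchronize()

# An explicit device type also works, e.g. derived from a tensor:
# torch.get_device_module(flat_params.device.type).synchronize()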

# backward and set all DDP managed grads to None.
def wait_for_optim_stream_callback():
    torch.cuda.current_stream().wait_stream(optim_stream_state.optim_stream)
    torch.acc.current_stream().wait_stream(optim_stream_state.optim_stream)
Contributor


Curious: why do we sometimes use _get_device_module().stream and sometimes torch.acc.current_stream()?

Contributor Author


Difference in API semantics: torch.cuda.current_stream() returns a torch.Stream, while torch.cuda.stream(...) returns a StreamContext. https://github.com/pytorch/pytorch/blob/main/torch/cuda/__init__.py#L600

Currently, torch.acc doesn't provide a StreamContext API, so I have to get the device module and then obtain the stream context via _get_device_module().stream.
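
A small illustration of that semantic difference (assuming CUDA is available):

import torch

s = torch.cuda.Stream()

# torch.cuda.current_stream() returns a Stream object you can wait on:
torch.cuda.current_stream().wait_stream(s)

# torch.cuda.stream(s) returns a StreamContext that makes `s` the current
# stream for the ops launched inside the with block:
with torch.cuda.stream(s):
    pass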

Collaborator


I asked in another hunk: why don't we have torch.acc.stream?

Contributor Author


@jgong5 torch.acc.stream is not ready right now and will be supported by @guangyey

Collaborator

@guangyey guangyey Nov 5, 2024


As discussed in #132204 (comment), we will provide the support of with statement for torch.Stream as a context manager.

if torch.cuda.is_available():
    torch.cuda.synchronize()
params_device_type = flat_params.device.type
if params_device_type in ["cuda", "xpu"]:
Contributor


Is there a device-agnostic way?

Contributor Author


synchronize should be common to every accelerator (at least with no functional impact), so I changed it to:

if torch.acc.is_available():
    torch.acc.synchronize()

# enqueue a callback to wait for this stream at end of backward
def wait_for_stream_cb():
    torch.cuda.current_stream().wait_stream(stream)
    torch.acc.current_stream().wait_stream(stream)
Collaborator


Suggested change
torch.acc.current_stream().wait_stream(stream)
torch.accelerator.current_stream().wait_stream(stream)
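
For reference, a hedged sketch of how the callback might look with the torch.accelerator namespace suggested here, assuming an accelerator is present (stream is the side stream from the surrounding code):

import torch

def wait_for_stream_cb(stream: torch.Stream) -> None:
    # Make the active accelerator's current stream (CUDA, XPU, ...) wait
    # for work queued on `stream`, without hard-coding torch.cuda.
    if torch.accelerator.is_available():
        torch.accelerator.current_stream().wait_stream(stream)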

Contributor

@kwen2501 kwen2501 left a comment


I think we are getting close to landing. Just two (edit: one) questions left around get_device_module.


Contributor

@kwen2501 kwen2501 left a comment


LGTM. Maybe consider using torch.get_device_module(...)? It is more formal.

Collaborator

@guangyey guangyey left a comment


Please fix the lint error.

@guangyey
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cherry/distributed-frontend onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cherry/distributed-frontend && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the cherry/distributed-frontend branch from 2268767 to bafe615 Compare November 11, 2024 01:54
@zhangxiaoli73
Contributor Author

LGTM. Maybe consider using torch.get_device_module(...)? It is more formal.

Thanks. torch.get_device_module(...) looks more formal. Let me use this API.

import torch.distributed as dist
from torch.autograd import Variable
from torch.distributed.utils import _free_storage
from torch._utils import _get_device_module


this import is no longer needed

import torch.distributed as dist
from torch.autograd import Variable

from torch._utils import _get_device_module


this import is no longer needed.

@guangyey guangyey added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 12, 2024
@zhangxiaoli73
Contributor Author

@pytorchbot rebase

@pytorch-bot

pytorch-bot bot commented Nov 13, 2024

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

@guangyey
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cherry/distributed-frontend onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cherry/distributed-frontend && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the cherry/distributed-frontend branch from 03559d3 to df87094 Compare November 13, 2024 01:58
@guangyey
Collaborator

"Unrelated failures"
@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
pytorchmergebot pushed a commit that referenced this pull request Jan 17, 2025
# Motivation
In #137678, we used device-agnostic APIs to generalize the distributed module. As this [comment](#137678 (comment)) said, we will use the with statement of `torch.Stream` once #140138 lands.

Pull Request resolved: #144951
Approved by: https://github.com/kwen2501, https://github.com/albanD

Labels

ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (ddp) release notes category triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

9 participants