[SymmetricMemory] introduce user-facing APIs empty() and rendezvous() #139677

yifuwang · 2024-11-05T00:02:39Z

Stack from ghstack (oldest at bottom):

Previously SymmetricMemory only had private pybind APIs:

import torch
from torch.distributed._symmetric_memory import _SymmetricMemory
t = _SymmetricMemory.empty_strided_p2p(
    size=(64,),
    stride=(1,),
    dtype=torch.float32,
    device=device,
)
symm_mem_hdl = _SymmetricMemory.rendezvous(t, group_name=group.group_name)

This PR introduces user-facing APIs empty() and rendezvous():

import torch.distributed._symmetric_memory as symm_mem
t = symm_mem.empty(64, device="cuda")
symm_mem_hdl = symm_mem.rendezvous(t, group_name=group.group_name)

Notable differences compared to the pybind APIs:

- empty() now resembles torch.empty():
  - shape can be either an integer sequence or a pack of integers
  - stride can no longer be specified (and no longer needs to be)
  - device can be either a torch.device or a string
- group_name must be specified at rendezvous time rather than at allocation time. See [SymmetricMemory] support specifying group_name at rendezvous time #139529 for the rationale. I feel the new semantics are superior, hence enforcing them in the public API.
  - Currently, the pybind API still supports specifying group_name at rendezvous time.
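
To make the first bullet concrete, here is a minimal pure-Python sketch of how a torch.empty()-style signature can accept a shape as either a pack or a sequence, and why an explicit stride becomes unnecessary if the buffer is assumed to be contiguous. The helper names `normalize_shape` and `contiguous_strides` are hypothetical and exist only for illustration; this is not the code in this PR.

```python
from typing import Sequence, Tuple, Union


def normalize_shape(*size: Union[int, Sequence[int]]) -> Tuple[int, ...]:
    """Hypothetical helper: accept f(2, 3) (a pack) or f((2, 3)) (a sequence)."""
    # A single non-int argument is treated as the whole shape.
    if len(size) == 1 and not isinstance(size[0], int):
        return tuple(int(d) for d in size[0])
    # Otherwise every positional argument is one dimension.
    return tuple(int(d) for d in size)


def contiguous_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Hypothetical helper: row-major strides (in elements) for a shape."""
    strides = []
    running = 1
    # Walk dimensions from innermost to outermost, accumulating the product.
    for dim in reversed(shape):
        strides.append(running)
        running *= max(dim, 1)
    return tuple(reversed(strides))


shape = normalize_shape(64)          # same result as normalize_shape((64,))
strides = contiguous_strides(shape)  # matches size=(64,), stride=(1,) above
```

Under that contiguity assumption, the stride in the old `empty_strided_p2p(size=(64,), stride=(1,), ...)` call is fully determined by the size, which is one way to read "no need to specify stride".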

This PR does not change the behavior of the pybind APIs.

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]

pytorch-bot · 2024-11-05T00:02:43Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139677

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[DomainsOnly] Jobs fail with GLIBC version not found

✅ No Failures

As of commit 760b83f with merge base b86b534 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

lw

I think the API looks good. It boils down to the same "verbs" as before (it used to be two steps, and it is still two steps), but they have been simplified somewhat (no more explicit strides, one can pass PG instances, ...) and made public, which makes it feel more "ok" to call them :)

The only question I have is how you intend them to be imported in user code:

- The function names (empty and rendezvous) are quite common, so I guess one won't import them directly (i.e., no from torch.distributed._symmetric_memory import ...).
- However, their fully qualified names are long, so it's perhaps a bit unwieldy to invoke torch.distributed._symmetric_memory.rendezvous directly.
- Is the intended usage that people import the module with a shorthand (import torch.distributed._symmetric_memory as symm_mem) and then use that to invoke the functions (symm_mem.empty(...))? This is what is done in the tests.

lw · 2024-11-15T10:27:52Z

torch/distributed/_symmetric_memory/__init__.py

        _SymmetricMemory: the symmetric memory workspace associated with the
        group.
    """
+    enable_symm_mem_for_group(group_name)


Ah this looks like a key change! Does it mean that now, if one calls the fused matmul+comms ops, they lazily initialize the SymmetricMemory on their own?

Ah, right! I realized it's redundant to require users to call enable_symm_mem_for_group(group_name) when they are already explicitly using the group for symmetric memory. This indirectly addresses the tracing issue you encountered.

yifuwang · 2024-11-16T22:03:41Z

> The only question I have is how you intend them to be imported in user code:
>
> - The function names (empty and rendezvous) are quite common, so I guess one won't import them directly (i.e., no from torch.distributed._symmetric_memory import ...).
> - However, their fully qualified names are long, so it's perhaps a bit unwieldy to invoke torch.distributed._symmetric_memory.rendezvous directly.
> - Is the intended usage that people import the module with a shorthand (import torch.distributed._symmetric_memory as symm_mem) and then use that to invoke the functions (symm_mem.empty(...))? This is what is done in the tests.

The third bullet point is what I had in mind.

My thinking is that, since the API is already scoped at the module level, it might be redundant to further distinguish it at the function level. Here are two approaches we can recommend to users for working with the API:

import torch.distributed._symmetric_memory as symm_mem

t = symm_mem.empty(...)

from torch.distributed._symmetric_memory import empty as symm_mem_empty

t = symm_mem_empty(...)

This approach is also consistent with other empty() variants in distributed (e.g., the DTensor API).

Let me know if you have further suggestions!

yifuwang · 2024-11-17T18:32:15Z

@pytorchbot merge

pytorchmergebot · 2024-11-17T18:34:03Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Pull Request resolved: pytorch#139677
Approved by: https://github.com/lw
ghstack dependencies: pytorch#139529

[SymmetricMemory] initial user-facing API

dd40cf6

[ghstack-poisoned]

yifuwang mentioned this pull request Nov 5, 2024

[SymmetricMemory] resolve a modernize-use-transparent-functors linter warning #139528

Closed

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Nov 5, 2024

yifuwang mentioned this pull request Nov 5, 2024

[SymmetricMemory] support specifying group_name at rendezvous time #139529

Closed

yifuwang pushed a commit that referenced this pull request Nov 5, 2024

[SymmetricMemory] initial user-facing API

17ede3d

ghstack-source-id: da9afa9 Pull Request resolved: #139677

yifuwang changed the title ~~[SymmetricMemory] initial user-facing API~~ [SymmetricMemory] initial user-facing APIs Nov 5, 2024

Update on "[SymmetricMemory] initial user-facing APIs"

8af5664

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

Update on "[SymmetricMemory] initial user-facing APIs"

9059032

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

yifuwang pushed a commit that referenced this pull request Nov 14, 2024

[SymmetricMemory] initial user-facing API

91c6845

ghstack-source-id: 34c4cab Pull Request resolved: #139677

yifuwang changed the title ~~[SymmetricMemory] initial user-facing APIs~~ [SymmetricMemory] introduce user-facing APIs: empty() and rendezvous() Nov 14, 2024

yifuwang requested review from Chillee, lw and weifengpy November 14, 2024 20:37

yifuwang marked this pull request as ready for review November 14, 2024 20:52

yifuwang changed the title ~~[SymmetricMemory] introduce user-facing APIs: empty() and rendezvous()~~ [SymmetricMemory] introduce user-facing API empty() and rendezvous() Nov 14, 2024

yifuwang changed the title ~~[SymmetricMemory] introduce user-facing API empty() and rendezvous()~~ [SymmetricMemory] introduce user-facing APIs empty() and rendezvous() Nov 14, 2024

yifuwang pushed a commit that referenced this pull request Nov 14, 2024

[SymmetricMemory] introduce user-facing APIs: empty() and rendezvous()

64ba2f1

ghstack-source-id: da3c51a Pull Request resolved: #139677

lw approved these changes Nov 15, 2024

yifuwang pushed a commit that referenced this pull request Nov 16, 2024

[SymmetricMemory] introduce user-facing APIs: empty() and rendezvous()

94a4b9d

ghstack-source-id: c7ae713 Pull Request resolved: #139677

yifuwang pushed a commit that referenced this pull request Nov 17, 2024

[SymmetricMemory] introduce user-facing APIs: empty() and rendezvous()

91f95e2

ghstack-source-id: b7fabf1 Pull Request resolved: #139677

pytorch-bot bot added the ciflow/trunk label Nov 17, 2024

pytorchmergebot added the merging label Nov 17, 2024

pytorchmergebot added the Merged label Nov 17, 2024

pytorchmergebot closed this in 5a7e147 Nov 17, 2024

pytorchmergebot removed the merging label Nov 17, 2024

github-actions bot deleted the gh/yifuwang/167/head branch December 19, 2024 02:11
