[SymmMem] Add runtime detection of NVSHMEM #156291

kwen2501 · 2025-06-18T07:38:37Z

Stack from ghstack (oldest at bottom):

-> [SymmMem] Add runtime detection of NVSHMEM #156291

This PR adds runtime detection of NVSHMEM so that the default backend for SymmetricMemory can be picked without relying solely on the env var `TORCH_SYMMMEM=CUDA | NVSHMEM`.

On the Python side, the following API is added:
`torch.distributed._symmetric_memory.is_nvshmem_available()`
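
As a rough illustration of how a caller might combine the env var with the new probe, here is a minimal sketch. Only `is_nvshmem_available()` and the `TORCH_SYMMMEM` values come from this PR; the helper names `detect_nvshmem` and `pick_symm_mem_backend`, and the policy that an explicit env var takes precedence over detection, are assumptions made for the example and not necessarily what PyTorch does internally.

```python
import os
from typing import Callable, Mapping, Optional


def detect_nvshmem() -> bool:
    """Return True only if the installed PyTorch reports NVSHMEM support.

    Falls back to False when torch is missing, when the build predates
    is_nvshmem_available(), or when the probe itself raises.
    """
    try:
        import torch.distributed._symmetric_memory as symm_mem
    except Exception:
        return False
    probe = getattr(symm_mem, "is_nvshmem_available", None)
    if probe is None:
        return False
    try:
        return bool(probe())
    except Exception:
        return False


def pick_symm_mem_backend(
    env: Optional[Mapping[str, str]] = None,
    probe: Callable[[], bool] = detect_nvshmem,
) -> str:
    """Hypothetical selection policy: explicit env var first, then detection."""
    env = os.environ if env is None else env
    requested = env.get("TORCH_SYMMMEM", "").strip().upper()
    if requested in ("CUDA", "NVSHMEM"):
        # Assumption: an explicit user choice overrides runtime detection.
        return requested
    # No usable env var: prefer NVSHMEM when the probe says it is available.
    return "NVSHMEM" if probe() else "CUDA"


if __name__ == "__main__":
    print(pick_symm_mem_backend())
```

Passing the probe as a parameter keeps the selection logic testable on machines without a GPU or without PyTorch installed.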

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

[ghstack-poisoned]

pytorch-bot · 2025-06-18T07:38:41Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156291

✅ You can merge normally! (1 Unrelated Failure)

As of commit 6ba7684 with merge base 4d9d884:

BROKEN TRUNK - The following job failed, but the same failure is present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu) (gh) (trunk failure)
MISSING REGRESSION TEST

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Skylion007 · 2025-06-18T16:37:01Z

torch/csrc/distributed/c10d/init.cpp

      py::arg("module"));
 #endif

+  module.def(


Nit: this is a C++ keyword. Not good for future versions.

Ah I see

In C++, `module` is a context-sensitive keyword introduced in C++20 for the modules feature, used in declarations such as `export module MyModule;`.

I guess this file needs a big refactor then.

caffe2/CMakeLists.txt

torch/csrc/distributed/c10d/symm_mem/CUDASymmetricMemoryUtils.cpp

kwen2501 · 2025-06-19T07:46:44Z

@pytorchbot merge

pytorchmergebot · 2025-06-19T07:48:44Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Update

795a304

[ghstack-poisoned]

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Jun 18, 2025

kwen2501 requested review from fduwjj, fegin and ngimel June 18, 2025 15:29

Skylion007 requested changes Jun 18, 2025

Skylion007 approved these changes Jun 18, 2025

Skylion007 reviewed Jun 18, 2025

caffe2/CMakeLists.txt (outdated, resolved)

fegin reviewed Jun 18, 2025

torch/csrc/distributed/c10d/symm_mem/CUDASymmetricMemoryUtils.cpp (resolved)

Update

e7ae83c

[ghstack-poisoned]

Update

9a135a1

[ghstack-poisoned]

Update

b085f2e

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request Jun 18, 2025

[SymmMem] Add runtime detection of NVSHMEM

b3fc990

so that we can pick the default backend for SymmetricMemory without relying on env var.

ghstack-source-id: 4588d63
Pull-Request-resolved: #156291

Update

8b26449

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request Jun 18, 2025

[SymmMem] Add runtime detection of NVSHMEM

548567f

so that we can pick the default backend for SymmetricMemory without relying on env var.

ghstack-source-id: b4cc209
Pull-Request-resolved: #156291

Update

c81c67b

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request Jun 19, 2025

[SymmMem] Add runtime detection of NVSHMEM

548b7b8

so that we can pick the default backend for SymmetricMemory without relying on env var.

ghstack-source-id: de17da3
Pull-Request-resolved: #156291

Update

6ba7684

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request Jun 19, 2025

[SymmMem] Add runtime detection of NVSHMEM

b9932a8

so that we can pick the default backend for SymmetricMemory without relying on env var.

ghstack-source-id: 2941a79
Pull-Request-resolved: #156291

kwen2501 added the ciflow/trunk label Jun 19, 2025

pytorchmergebot added the merging label Jun 19, 2025

pytorchmergebot added the Merged label Jun 19, 2025

pytorchmergebot closed this in 8fcda2c Jun 19, 2025

pytorchmergebot removed the merging label Jun 19, 2025

kwen2501 mentioned this pull request Jun 25, 2025

[SymmMem] Allow selection of allocation backend #156661

Closed

github-actions bot deleted the gh/kwen2501/175/head branch July 20, 2025 02:20

4 participants
