[SymmMem] Enable NVSHMEM for Triton by kwen2501 · Pull Request #155506 · pytorch/pytorch · GitHub

Conversation

@kwen2501
Contributor

@kwen2501 kwen2501 commented Jun 10, 2025

Stack from ghstack (oldest at bottom):

(This is an experimental feature.)
Allow Triton kernels to invoke NVSHMEM device functions.

Example Triton program

Key parts:

  • Call nvshmem.enable_triton() to initialize;
  • Call nvshmem.putmem_block in Triton kernel;
  • Add extern_libs kwarg at kernel invocation.
import torch.distributed as dist
import torch.distributed._symmetric_memory._nvshmem_triton as nvshmem
import triton
import triton.language as tl


@triton.jit
def put_kernel(
    dst_ptr,
    src_ptr,
    numel: tl.constexpr,
    peer: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
):
    # NVSHMEM device-side put: push `numel` elements from the local
    # symmetric buffer to the corresponding buffer on rank `peer`.
    nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer)


if __name__ == "__main__":
    # Enable NVSHMEM for Triton; returns the extern-lib mapping to pass
    # at kernel invocation.
    nvshmem_lib = nvshmem.enable_triton()

    # Use torch Symmetric Memory to allocate symmetric tensors
    # (the elided setup defines rank, dst_ptr, src_ptr, numel, BLOCK_SIZE, out)
    ...

    peer = 1 - rank
    if rank == 0:
        kernel = put_kernel[(1, 1, 1)](
            dst_ptr,
            src_ptr,
            numel=numel,
            peer=peer,
            BLOCK_SIZE=BLOCK_SIZE,
            extern_libs=nvshmem_lib,
        )

    dist.barrier()
    if rank == 1:
        print(f"Rank {rank}: received {out=}")

Test output:

$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put
Rank 0: writing value 5 to Peer 1
Rank 1: received out=tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:1', dtype=torch.int8)

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

pytorch-bot bot commented Jun 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155506

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 14e167b with merge base 4d9d884:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jun 10, 2025
@kwen2501 kwen2501 requested review from fduwjj, fegin and ngimel June 10, 2025 00:19
// operations.
void nvshmemx_cumodule_init(uintptr_t module) {
  auto cumodule = reinterpret_cast<CUmodule>(module);
  TORCH_CHECK(
Collaborator

I think it's time to implement NVSHMEM_CHECK similar to AT_CUDA_CHECK

Contributor Author

Yes!

kwen2501 added a commit that referenced this pull request Jun 10, 2025
ghstack-source-id: 37a8e96
Pull-Request-resolved: #155506
@kwen2501 kwen2501 added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 10, 2025
@kwen2501
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@fduwjj
Contributor

fduwjj commented Jun 10, 2025

Can you kindly make the linter happy? And is the unit test failure real?

"""
from triton.runtime.jit import JITFunction

from torch._C._distributed_c10d import (
Contributor

I think you need to update the .pyi file
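A minimal sketch of what the stub update might look like in `torch/_C/_distributed_c10d.pyi`; the binding name below is assumed from the `nvshmemx_cumodule_init` function in this PR, and the actual symbols depend on what the truncated import above pulls in:

```
# torch/_C/_distributed_c10d.pyi (sketch; hypothetical binding name)
def _nvshmemx_cumodule_init(module: int) -> None: ...
```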

Contributor

@fegin fegin left a comment

Pretty cool! Some nits.

torch.testing.assert_close(received_chunk, chunk)

@skipIfRocm
@requires_triton()
Contributor

Do we have a decorator to skip the test if NVSHMEM is not available?
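A minimal sketch of such a decorator, built on the `is_nvshmem_available()` API that #156291 later adds in this stack (the decorator name is hypothetical):

```
import unittest

import torch.distributed._symmetric_memory as symm_mem


# Hypothetical decorator: skip a test when NVSHMEM is unavailable.
def requires_nvshmem(func):
    return unittest.skipUnless(
        symm_mem.is_nvshmem_available(), "NVSHMEM not available"
    )(func)
```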

# Detect NVSHMEM device library path from python library path
if lib_dir is None:
    py_lib_path = sysconfig.get_path("purelib")
    lib_dir = py_lib_path + "/nvidia/nvshmem/lib"
Contributor

nit: should we use os.path.join?
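For reference, the suggested form would look like:

```
import os
import sysconfig

# Join path components portably instead of concatenating strings.
py_lib_path = sysconfig.get_path("purelib")
lib_dir = os.path.join(py_lib_path, "nvidia", "nvshmem", "lib")
```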

jit_function = kwargs["fn"].jit_function
kernel_cache, _, _, _ = jit_function.device_caches[device]
kernel = kernel_cache.get(key, None)
kernel.run  # bare attribute access; intentionally not a call (see discussion below)
Collaborator

Is this missing parens?

Contributor Author

Thanks for the careful look.
It is interesting that without this line, things don't work.
And I don't actually want the run to execute.



@core.extern
def putmem_block(dst, src, nelems, pe, _builder=None):  # type: ignore[no-untyped-def]
Member

Can we also add a generic "torch.putmem_block" abstraction that can do dispatching? How hard is dynamic dispatch for cuda/triton kernel?

cc @wconstab

@fduwjj
Contributor

fduwjj commented Jun 11, 2025

Also, since #155573 is merged, you might need to rebase your PR on top of it :)

kwen2501 added a commit that referenced this pull request Jun 11, 2025
ghstack-source-id: c17af34
Pull-Request-resolved: #155506
@kwen2501
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Jun 13, 2025
This is a requirement of most SHMEM backends; otherwise, allocations may be misaligned across ranks.

In this PR, we make the (total) input size and output size constant, even though the split sizes are generated randomly; a sketch of this follows after this commit message. (Previously we summed the splits to get the input size, which created misalignment in the SHMEM heap across ranks.)

Pull Request resolved: #155835
Approved by: https://github.com/fduwjj, https://github.com/fegin, https://github.com/Skylion007
ghstack dependencies: #155506
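A minimal sketch of the fixed-total approach described above (names and sizes are illustrative, not from the PR):

```
import torch

# Draw `n_splits` random split sizes that always sum to a fixed `total`,
# so every rank allocates the same symmetric-memory size.
total, n_splits = 1024, 4
cuts, _ = torch.sort(torch.randint(0, total + 1, (n_splits - 1,)))
splits = torch.diff(cuts, prepend=torch.tensor([0]), append=torch.tensor([total]))
assert splits.sum().item() == total
```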
pytorchmergebot pushed a commit that referenced this pull request Jun 14, 2025
No code adds entries to `ptr_to_symm_mem_`, so it is always empty.
This PR removes it and routes the functionality that relied on it through the `allocations_` map.

Pull Request resolved: #155968
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835
pytorchmergebot pushed a commit that referenced this pull request Jun 14, 2025
`NVSHMEMSymmetricMemory.cu` and `nvshmem_extension.cu` are under the same compilation condition now (i.e. only when `USE_NVSHMEM=True`), see https://github.com/pytorch/pytorch/blob/main/caffe2/CMakeLists.txt#L1013-L1018.

Therefore there is no need to build an extra layer to hide the dependency.

Pull Request resolved: #155971
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835, #155968
pytorchmergebot pushed a commit that referenced this pull request Jun 15, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Call `nvshmem_free` when an `NVSHMEMAllocation` is destructed.

Use an `is_finalizing()` guard, as done in `CUDASymmetricMemory.cu`, to avoid the "driver shutting down" error (destruction fiasco).

Pull Request resolved: #155975
Approved by: https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971
pytorchmergebot pushed a commit that referenced this pull request Jun 17, 2025
The rank-to-global-rank exchange is a major overhead in `NVSHMEMSymmetricMemory` creation.
We should cache its result on a per-group basis (sketched after this commit message).

Before this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 18
```

After this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 1
```

Pull Request resolved: #156116
Approved by: https://github.com/fegin, https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971, #155975
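A minimal sketch of the per-group caching described above, using `dist.get_process_group_ranks` as a stand-in for the actual exchange (the cache and function names are illustrative):

```
import torch.distributed as dist

# Cache the rank-to-global-rank result per group so the expensive
# exchange runs once per group rather than on every creation.
_global_ranks_cache: dict = {}


def global_ranks_for(group: dist.ProcessGroup) -> list:
    name = group.group_name
    if name not in _global_ranks_cache:
        _global_ranks_cache[name] = dist.get_process_group_ranks(group)
    return _global_ranks_cache[name]
```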
pytorchmergebot pushed a commit that referenced this pull request Jun 17, 2025
This avoids a copy; we do not expect a caller to change its value.

Pull Request resolved: #156117
Approved by: https://github.com/fegin
ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116
pytorchmergebot pushed a commit that referenced this pull request Jun 19, 2025
This lets us pick the default backend for SymmetricMemory without
fully relying on the env var `TORCH_SYMMMEM=CUDA | NVSHMEM`.

On the Python side, the following API is added (a usage sketch follows after this commit message):
`torch.distributed._symmetric_memory.is_nvshmem_available()`

Pull Request resolved: #156291
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116, #156117
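A usage sketch of the new API (the selection logic shown is illustrative):

```
import torch.distributed._symmetric_memory as symm_mem

# Prefer the NVSHMEM backend when available, else fall back to CUDA.
backend = "NVSHMEM" if symm_mem.is_nvshmem_available() else "CUDA"
```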
@github-actions github-actions bot deleted the gh/kwen2501/166/head branch July 14, 2025 02:21