Turn on compile with NVSHMEM #154538

kwen2501 · 2025-05-28T18:07:14Z

Stack from ghstack (oldest at bottom):

Before:
`USE_NVSHMEM=1` needs to be explicitly set in the build environment.

After:
`USE_NVSHMEM=1` is the default for CUDA/ROCm on Linux.
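
As a rough illustration of the behavior change, the sketch below resolves an effective `USE_NVSHMEM` value: an explicit environment setting wins, otherwise the flag defaults to on for CUDA/ROCm builds on Linux. The function name `resolve_use_nvshmem` and its parameters are hypothetical and do not mirror PyTorch's actual build code.

```python
import os
import sys


def resolve_use_nvshmem(env=None, platform=None, has_cuda=False, has_rocm=False):
    """Return True if NVSHMEM support should be compiled in (hypothetical helper)."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform

    explicit = env.get("USE_NVSHMEM")
    if explicit is not None:
        # An explicit setting always overrides the default.
        return explicit.strip().upper() in ("1", "ON", "TRUE", "YES")

    # Default described above: on for CUDA/ROCm builds on Linux, off elsewhere.
    return platform.startswith("linux") and (has_cuda or has_rocm)
```

With this shape, setting `USE_NVSHMEM=0` in the environment still opts out, which matters for builders who do not have NVSHMEM available.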

[ghstack-poisoned]

pytorch-bot · 2025-05-28T18:07:17Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154538

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a9095a4 with merge base 241f8dc:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: f738679 Pull Request resolved: #154538

Skylion007 · 2025-05-28T19:01:10Z

We'll need https://pypi.nvidia.com/nvidia-nvshmem-cu12/ added to the release builds at some point if they aren't already

kwen2501 · 2025-05-28T19:09:57Z

Thanks @Skylion007. I was thinking the same. Do you know how I can do that?

Skylion007 · 2025-05-28T19:17:06Z

At the very least, you need to add it here:

pytorch/.github/scripts/generate_binary_build_matrix.py

Line 74 in 24980d2

"12.8": (

and have @atalman or someone with write access upload a binary to S3 so the nightly builds do not fail. (You need to regenerate the YAMLs after editing this file.) Trigger CI Flows binary
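
A minimal sketch of the kind of edit being suggested, assuming the file maps each CUDA version to a single string of pip requirement specifiers joined by ` | `. The table below is a toy stand-in, not the real contents of `generate_binary_build_matrix.py`; the helper `add_requirement`, the placeholder package `example-cuda-lib-cu12`, and the `3.2.5` pin (the NVSHMEM version mentioned elsewhere in this thread) are illustrative assumptions.

```python
# Toy stand-in for the per-CUDA-version requirements table (assumed layout).
EXTRA_INSTALL_REQUIREMENTS = {
    "12.8": "example-cuda-lib-cu12==1.0.0; platform_system == 'Linux'",
}

SEPARATOR = " | "


def add_requirement(table, cuda_version, requirement):
    """Append a requirement specifier for one CUDA version, skipping duplicates."""
    existing = [r for r in table.get(cuda_version, "").split(SEPARATOR) if r]
    if requirement not in existing:
        existing.append(requirement)
    table[cuda_version] = SEPARATOR.join(existing)
    return table


add_requirement(
    EXTRA_INSTALL_REQUIREMENTS,
    "12.8",
    "nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux'",
)
```

The environment marker restricts the dependency to Linux, matching the platforms where the build default is being turned on.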

kwen2501 · 2025-05-28T22:19:00Z

I added the extra dependency in #154568.


ngimel · 2025-05-29T03:55:06Z

If we want binaries to have nvshmem, we need to make sure it's installed and discoverable for binary builds?

kwen2501 · 2025-05-29T15:54:47Z

@ngimel Yes. At compile time, NVSHMEM is auto-detected by `tools/setup_helpers/cmake.py`. (I am going to refactor this into `findNVSHMEM.cmake`.)

On the user side, I added #154568 to make NVSHMEM an install dependency.
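
To illustrate what compile-time auto-detection of this kind can look like, here is a sketch that searches Python's library directories for an NVSHMEM installation laid down by a pip wheel. The `nvidia/nvshmem` subdirectory layout and the `NVSHMEM_HOME` option name are assumptions made for the example, not a description of what `tools/setup_helpers/cmake.py` actually does.

```python
import os
import sysconfig


def find_nvshmem_home(search_roots=None):
    """Return the first directory that looks like an NVSHMEM install, else None."""
    if search_roots is None:
        paths = sysconfig.get_paths()
        search_roots = [paths["purelib"], paths["platlib"]]
    for root in search_roots:
        candidate = os.path.join(root, "nvidia", "nvshmem")  # assumed wheel layout
        if os.path.isdir(candidate):
            return candidate
    return None


build_options = {}
nvshmem_home = find_nvshmem_home()
if nvshmem_home is not None:
    # Hypothetical option name that a CMake find module could consume.
    build_options["NVSHMEM_HOME"] = nvshmem_home
```

If nothing is found, `build_options` stays empty and the build would have to fall back to an explicitly provided location or skip NVSHMEM.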


kwen2501 · 2025-06-03T15:16:56Z

@pytorchbot merge

pytorchmergebot · 2025-06-03T15:18:50Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Before: `USE_NVSHMEM=1` needs to be explicitly set in the build environment.
After: `USE_NVSHMEM=1` is the default for CUDA/ROCm on Linux.
Pull Request resolved: pytorch#154538
Approved by: https://github.com/ngimel

NVSHMEM 3.2.5 (released Mar 2025) has both cu11 and cu12 builds. See:
https://pypi.nvidia.com/nvidia-nvshmem-cu12/
https://pypi.nvidia.com/nvidia-nvshmem-cu11/
Pull Request resolved: #154568
Approved by: https://github.com/atalman
ghstack dependencies: #154538



Turn on compile with NVSHMEM

d889d2e

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request May 28, 2025

Turn on compile with NVSHMEM

ce6fb87

ghstack-source-id: f738679 Pull Request resolved: #154538

kwen2501 requested review from atalman and malfet May 28, 2025 18:22

kwen2501 added the release notes: distributed (c10d) release notes category label May 28, 2025

kwen2501 mentioned this pull request May 28, 2025

Add NVSHMEM to PYTORCH_EXTRA_INSTALL_REQUIREMENTS #154568

Closed

kwen2501 requested a review from Skylion007 May 28, 2025 22:17

Update on "Turn on compile with NVSHMEM"

d0aef9c


kwen2501 added the ciflow/trunk Trigger trunk jobs on your pull request label May 29, 2025

kwen2501 requested a review from ngimel May 29, 2025 02:49

ngimel approved these changes May 29, 2025


Update on "Turn on compile with NVSHMEM"

a9095a4


pytorchmergebot added the merging label Jun 3, 2025

pytorchmergebot added the Merged label Jun 3, 2025

pytorchmergebot closed this in 3685b10 Jun 3, 2025

pytorchmergebot removed the merging label Jun 3, 2025

github-actions bot deleted the gh/kwen2501/160/head branch July 4, 2025 02:20

atalman added a commit to atalman/pytorch that referenced this pull request Jul 10, 2025

Revert "Turn on compile with NVSHMEM (pytorch#154538)"

6ee46e4

This reverts commit 3685b10.

atalman added a commit that referenced this pull request Jul 11, 2025

Revert "Turn on compile with NVSHMEM (#154538)" (#158040)

058b58a

This reverts commit 3685b10.
