Turn on compile with NVSHMEM by kwen2501 · Pull Request #154538 · pytorch/pytorch · GitHub

Conversation

@kwen2501
Contributor

@kwen2501 kwen2501 commented May 28, 2025

Stack from ghstack (oldest at bottom):

Before:
`USE_NVSHMEM=1` needs to be explicitly set in the build environment.

After:
`USE_NVSHMEM=1` is the default for CUDA/ROCm on Linux.
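
To illustrate the before/after behavior, here is a minimal sketch of how a build script could resolve such a flag. It is not the code from this PR: the function name `resolve_use_nvshmem`, the `gpu_build` parameter, the accepted truthy/falsy spellings, and the rule that an explicit setting always wins over the platform default are all assumptions made for illustration.

```python
import sys
from typing import Mapping

# Hypothetical spellings accepted for an on/off build flag (an assumption,
# not taken from the PyTorch build scripts).
_TRUTHY = {"1", "ON", "YES", "TRUE", "Y"}
_FALSY = {"0", "OFF", "NO", "FALSE", "N"}


def resolve_use_nvshmem(
    env: Mapping[str, str],
    platform: str = sys.platform,
    gpu_build: bool = True,
) -> bool:
    """Decide whether NVSHMEM support should be compiled in.

    An explicit USE_NVSHMEM value in `env` takes precedence. Otherwise the
    flag defaults to on for GPU builds on Linux and off everywhere else.
    """
    raw = env.get("USE_NVSHMEM")
    if raw is not None and raw.strip():
        value = raw.strip().upper()
        if value in _TRUTHY:
            return True
        if value in _FALSY:
            return False
        raise ValueError(f"Unrecognized USE_NVSHMEM value: {raw!r}")
    # No explicit setting: fall back to the platform default.
    return gpu_build and platform.startswith("linux")
```

Under these assumptions, a caller would pass `os.environ` as `env`; setting `USE_NVSHMEM=0` in the build environment would still opt out on Linux.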

@pytorch-bot

pytorch-bot bot commented May 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154538

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a9095a4 with merge base 241f8dc:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request May 28, 2025
ghstack-source-id: f738679
Pull Request resolved: #154538
@kwen2501 kwen2501 requested review from atalman and malfet May 28, 2025 18:22
@kwen2501 kwen2501 added the release notes: distributed (c10d) release notes category label May 28, 2025
@Skylion007
Collaborator

We'll need https://pypi.nvidia.com/nvidia-nvshmem-cu12/ added to the release builds at some point, if it isn't already.

@kwen2501
Contributor Author

Thanks @Skylion007. I was thinking the same. Do you know how I can do that?

@Skylion007
Collaborator

Skylion007 commented May 28, 2025

At the very least, you need to add it here:

and have @atalman or someone with write access upload a binary to S3 so the nightly builds do not fail (you need to regenerate the YAMLs after editing this file). Trigger CI Flows binary

@kwen2501
Contributor Author

I added the extra dependency in #154568.

@kwen2501 kwen2501 added the ciflow/trunk Trigger trunk jobs on your pull request label May 29, 2025
@kwen2501 kwen2501 requested a review from ngimel May 29, 2025 02:49
@ngimel
Collaborator

ngimel commented May 29, 2025

If we want binaries to have nvshmem, we need to make sure it's installed and discoverable for binary builds?

@kwen2501
Contributor Author

@ngimel Yes, at compile time, NVSHMEM is auto-detected by tools/setup_helpers/cmake.py. (I am going to refactor that into findNVSHMEM.cmake.)

On the user side, I added #154568 to make NVSHMEM an install dependency.
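
For readers who want to confirm that the dependency is present in their environment, a minimal sketch follows. It only asks the Python packaging metadata whether a distribution is installed. The helper name `installed_version` is hypothetical, and the distribution name `nvidia-nvshmem-cu12` is taken from the PyPI URL mentioned earlier in this thread; this is not how PyTorch itself discovers NVSHMEM.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution name assumed from https://pypi.nvidia.com/nvidia-nvshmem-cu12/
    name = "nvidia-nvshmem-cu12"
    version = installed_version(name)
    if version is None:
        print(f"{name} is not installed")
    else:
        print(f"{name} {version} is installed")
```

This checks only that the wheel is installed, not that the library can be loaded at runtime.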

@kwen2501
Contributor Author

kwen2501 commented Jun 3, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

iupaikov-amd pushed a commit to ROCm/pytorch that referenced this pull request Jun 4, 2025
Before:
`USE_NVSHMEM=1` needs to be explicitly set in the build environment.

After:
`USE_NVSHMEM=1` is the default for CUDA/ROCm on Linux.
Pull Request resolved: pytorch#154538
Approved by: https://github.com/ngimel
pytorchmergebot pushed a commit that referenced this pull request Jun 4, 2025
NVSHMEM 3.2.5 (released Mar 2025) has both cu11 and cu12 builds.
See:
https://pypi.nvidia.com/nvidia-nvshmem-cu12/
https://pypi.nvidia.com/nvidia-nvshmem-cu11/
Pull Request resolved: #154568
Approved by: https://github.com/atalman
ghstack dependencies: #154538
angelayi pushed a commit to angelayi/pytorch that referenced this pull request Jun 5, 2025
Before:
`USE_NVSHMEM=1` needs to be explicitly set in the build environment.

After:
`USE_NVSHMEM=1` is the default for CUDA/ROCm on Linux.
Pull Request resolved: pytorch#154538
Approved by: https://github.com/ngimel
angelayi pushed a commit to angelayi/pytorch that referenced this pull request Jun 5, 2025
@github-actions github-actions bot deleted the gh/kwen2501/160/head branch July 4, 2025 02:20
atalman added a commit to atalman/pytorch that referenced this pull request Jul 10, 2025
atalman added a commit that referenced this pull request Jul 11, 2025

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
release notes: distributed (c10d) (release notes category)


4 participants