Update NVSHMEM to 3.3.24 and fix download link by tinglvv · Pull Request #161321 · pytorch/pytorch · GitHub

Conversation

@tinglvv
Collaborator

@tinglvv tinglvv commented Aug 22, 2025

#159779

Update NVSHMEM to 3.3.24 to address "PyTorch CUDA13 Binary Cannot Be Built with SM_75 with NVSHMEM" (#160980).
Re-enable sm_75 for NVSHMEM.
Fix the NVSHMEM download link to resolve the 3.3.20 download problem reported in "[CD] nvshem-3.3.9 wheels for aarch64 is not manylinux2_28 compliant" (#160425).

Todo: Also re-enable the ARM build with NVSHMEM, since it is compatible with manylinux2_28.
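
For readers who want to confirm which NVSHMEM build ended up in their environment after this bump, here is a minimal sketch. It is not part of the PR. The distribution names `nvidia-nvshmem-cu12` and `nvidia-nvshmem-cu13` are assumptions about how the wheels are published, and the helper names `installed_version` and `meets_minimum` are hypothetical; only the version numbers 3.3.24 and 3.3.20 come from this thread.

```python
from importlib import metadata

# Assumed distribution names (not confirmed by this PR); adjust as needed.
CANDIDATE_PACKAGES = ("nvidia-nvshmem-cu12", "nvidia-nvshmem-cu13")


def installed_version(dist_name):
    """Return the installed version string of dist_name, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(version, minimum=(3, 3, 24)):
    """Compare the leading numeric components of version against minimum."""
    parts = []
    for piece in version.split(".")[: len(minimum)]:
        # Keep only the leading digits so suffixes like "24rc1" parse as 24.
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts) >= minimum


if __name__ == "__main__":
    for name in CANDIDATE_PACKAGES:
        found = installed_version(name)
        if found is None:
            print(f"{name}: not installed")
        else:
            status = "ok" if meets_minimum(found) else "older than 3.3.24"
            print(f"{name}: {found} ({status})")
```

The comparison is deliberately simple and looks only at the first three numeric components; a full version parser would be more robust for unusual version strings.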

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @atalman @malfet @ptrblck @eqy @nWEIdia

@tinglvv tinglvv requested a review from jeffdaily as a code owner August 22, 2025 22:43
@pytorch-bot

pytorch-bot bot commented Aug 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161321

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 39 Pending

As of commit 96a4ae4 with merge base 7376111:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Aug 22, 2025
@tinglvv tinglvv changed the title Update nvshmem 3.3.24 and fix download link Update NVSHMEM to 3.3.24 and fix download link Aug 22, 2025
@tinglvv tinglvv requested a review from a team as a code owner August 22, 2025 22:46
@tinglvv
Collaborator Author

tinglvv commented Aug 25, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased nvshmem-link-fix-3.3.24 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout nvshmem-link-fix-3.3.24 && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the nvshmem-link-fix-3.3.24 branch from b3df5ca to 184d56e Compare August 25, 2025 15:27
@pytorch-bot pytorch-bot bot added ciflow/h100-symm-mem oncall: distributed Add this issue/PR to distributed oncall triage queue labels Aug 25, 2025
@tinglvv tinglvv added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Aug 25, 2025
@johnnynunez
Contributor

this resolves your issue about sbsa wheels @tinglvv

@tinglvv
Collaborator Author

tinglvv commented Aug 26, 2025

@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 26, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: windows-binary-wheel / wheel-py3_13t-xpu-build, windows-binary-wheel / wheel-py3_14t-xpu-build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_14-rocm6_4-test

Details for Dev Infra team Raised by workflow job

@tinglvv
Collaborator Author

tinglvv commented Aug 26, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased nvshmem-link-fix-3.3.24 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout nvshmem-link-fix-3.3.24 && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the nvshmem-link-fix-3.3.24 branch from 462efcd to 96a4ae4 Compare August 26, 2025 09:01
@atalman
Contributor

atalman commented Aug 26, 2025

@pytorchmergebot merge -f "all required builds look good"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

pytorchmergebot pushed a commit that referenced this pull request Sep 9, 2025
#161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20.
This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well.
Pull Request resolved: #162206
Approved by: https://github.com/tinglvv, https://github.com/Skylion007
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
pytorch#159779

Update NVSHMEM to 3.3.24 to address [PyTorch CUDA13 Binary Cannot Be Built with SM_75 with NVSHMEM](pytorch#160980).
Re-enable sm_75 for NVSHMEM.
Fix the NVSHMEM download link to resolve the 3.3.20 download problem reported in [[CD] nvshem-3.3.9 wheels for aarch64 is not manylinux2_28 compliant](pytorch#160425).

Todo: Also re-enable the ARM build with NVSHMEM, since it is compatible with manylinux2_28.

Pull Request resolved: pytorch#161321
Approved by: https://github.com/Skylion007, https://github.com/atalman
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
pytorch#161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20.
This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well.
Pull Request resolved: pytorch#162206
Approved by: https://github.com/tinglvv, https://github.com/Skylion007
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
pytorch#161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20.
This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well.
Pull Request resolved: pytorch#162206
Approved by: https://github.com/tinglvv, https://github.com/Skylion007
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
pytorch#161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20.
This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well.
Pull Request resolved: pytorch#162206
Approved by: https://github.com/tinglvv, https://github.com/Skylion007
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
pytorch#161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20.
This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well.
Pull Request resolved: pytorch#162206
Approved by: https://github.com/tinglvv, https://github.com/Skylion007

Labels

ciflow/binaries - Trigger all binary build and upload jobs on the PR
ciflow/h100-symm-mem
ciflow/trunk - Trigger trunk jobs on your pull request
Merged
oncall: distributed - Add this issue/PR to distributed oncall triage queue
open source
topic: not user facing - topic category
