[CD] Add CUDA 13.0 x86 nightly builds by tinglvv · Pull Request #160956 · pytorch/pytorch · GitHub
Conversation

@tinglvv
Collaborator

@tinglvv tinglvv commented Aug 19, 2025

#159779

CUDA 13.0.0
NVSHMEM 3.3.20
CUDNN 9.12.0.46

Adding x86 linux builds for CUDA 13.
Adding sbsa docker.
Adding libtorch docker.
Package naming changed for CUDA 13 (removed postfix -cu13 for some packages).

Preparation checklist:

  1. Update index https://download.pytorch.org/whl/nightly/cu130 with pypi packages
  2. Update packaging name based on https://pypi.org/project/cuda-toolkit/ metadata

cc @ptrblck @nWEIdia @atalman @malfet @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

@pytorch-bot pytorch-bot bot added the release notes: releng release notes category label Aug 19, 2025
@pytorch-bot

pytorch-bot bot commented Aug 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160956

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 3 Pending

As of commit b2ffcd6 with merge base 19c70c2:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@tinglvv tinglvv added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Aug 19, 2025
@atalman
Contributor

atalman commented Aug 19, 2025

@tinglvv current error is related to nvshmem and sm_75:

```
caffe2/CMakeFiles/torch_nvshmem.dir/cmake_device_link.o -L/usr/local/cuda/targets/x86_64-linux/lib/stubs  -L/usr/local/cuda/targets/x86_64-linux/lib /usr/local/cuda/lib64/libnvshmem_device.a -lcudadevrt -lcudart_static -lrt -lpthread -ldl
nvlink error   : Undefined reference to '_Z23nvshmemi_transfer_quietIL13threadgroup_t3EEvb' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
nvlink error   : Undefined reference to '_Z47nvshmemi_transfer_enforce_consistency_at_targetb' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
nvlink error   : Undefined reference to '_Z30nvshmemi_transfer_amo_nonfetchIlEvPvT_i14nvshmemi_amo_t' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
nvlink error   : Undefined reference to '_Z23nvshmemi_transfer_rma_pIlEvPvT_i' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
nvlink error   : Undefined reference to '_Z21nvshmemi_transfer_rmaIL13threadgroup_t3EL13nvshmemi_op_t4EEvPvS2_mi' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
nvlink error   : Undefined reference to 'nvshmemi_device_state_d' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
```
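
To see at a glance which target architectures are affected in a full build log, the nvlink lines can be grouped by their `(target: ...)` suffix. The following is a rough sketch, assuming the log lines keep the exact shape shown in the excerpt above; `undefined_symbols_by_arch` is a hypothetical helper and not part of any PyTorch tooling.

```python
import re
from collections import defaultdict

# Matches lines of the form seen in the log excerpt:
#   nvlink error   : Undefined reference to '<symbol>' in '<object file>' (target: sm_75)
NVLINK_UNDEFINED = re.compile(
    r"nvlink error\s*:\s*Undefined reference to '(?P<symbol>[^']+)'"
    r" in '(?P<obj>[^']+)' \(target: (?P<arch>sm_\w+)\)"
)


def undefined_symbols_by_arch(log_text: str) -> dict[str, list[str]]:
    """Group undefined (still mangled) symbol names by the target arch nvlink reports."""
    by_arch = defaultdict(list)
    for match in NVLINK_UNDEFINED.finditer(log_text):
        by_arch[match.group("arch")].append(match.group("symbol"))
    return dict(by_arch)
```

If the resulting dictionary has a single key, the failures are confined to that one architecture, at least for the portion of the log that was scanned.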

@atalman
Contributor

atalman commented Aug 19, 2025

Please note that the Docker builds are using the correct version: NVSHMEM 3.3.20 for CUDA 13 (x86_64).

@kwen2501
Contributor

Does the NVSHMEM error occur on sm_75 only? If so, would that point to NVSHMEM 3.3.20 dropping support for this arch?

@atalman
Contributor

atalman commented Aug 19, 2025

@pytorch-bot pytorch-bot bot added ciflow/h100-symm-mem oncall: distributed Add this issue/PR to distributed oncall triage queue labels Aug 19, 2025
@tinglvv tinglvv mentioned this pull request Aug 19, 2025
15 tasks
malfet pushed a commit to pytorch/test-infra that referenced this pull request Aug 19, 2025
@tinglvv
Collaborator Author

tinglvv commented Aug 20, 2025

https://download.pytorch.org/whl/nightly/cu130 is updated after pytorch/test-infra#7038. Rerunning the test.

@tinglvv tinglvv marked this pull request as ready for review August 20, 2025 17:45
@tinglvv tinglvv requested review from a team and jeffdaily as code owners August 20, 2025 17:45
atalman pushed a commit to pytorch/test-infra that referenced this pull request Aug 20, 2025
@tinglvv tinglvv changed the title Add CUDA 13.0 linux builds Add CUDA 13.0 x86 builds Aug 20, 2025
@tinglvv
Collaborator Author

tinglvv commented Aug 20, 2025

Temporarily disabled sm_75 for NVSHMEM in the CUDA 13 build (3069af6). We will need to re-enable it once the hotfix in 3.3.21 is released.
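
As an illustration of the idea only (this is not the change made in 3069af6, whose contents are not shown here), dropping one architecture from a semicolon-separated arch list for CUDA 13 builds could look like the sketch below. The function name, the `"7.5;8.0;9.0"` list format, and the version check are all assumptions made for the example.

```python
def nvshmem_arch_list(arch_list: str, cuda_version: str,
                      unsupported: frozenset = frozenset({"7.5"})) -> str:
    """Return arch_list without the unsupported entries when building for CUDA 13.

    Hypothetical helper: arch_list is assumed to be semicolon-separated,
    e.g. "7.5;8.0;9.0", optionally with a "+PTX" suffix on an entry.
    """
    if not cuda_version.startswith("13."):
        # Other CUDA versions keep their full arch list.
        return arch_list
    kept = [
        arch for arch in arch_list.split(";")
        if arch and arch.split("+")[0] not in unsupported
    ]
    return ";".join(kept)
```

Keeping the unsupported set as a parameter makes the workaround easy to lift once a fixed NVSHMEM release is picked up: passing an empty set restores the full list.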

@atalman
Contributor

atalman commented Aug 22, 2025

@pytorchmergebot merge -f "signal looks good"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@tinglvv tinglvv changed the title Add CUDA 13.0 x86 builds [CD] Add CUDA 13.0 x86 nightly builds Aug 22, 2025
@kwen2501
Contributor

kwen2501 commented Sep 4, 2025

@tinglvv @atalman Following up, NVSHMEM had a hot release last week (3.3.24) that fixes the above build problem.
Shall we re-enable NVSHMEM for CUDA 13 with a version bump of NVSHMEM?
Or, had we actually not disabled NVSHMEM?

@kwen2501
Contributor

kwen2501 commented Sep 4, 2025

@tinglvv
Collaborator Author

tinglvv commented Sep 4, 2025

> @tinglvv @atalman Following up, NVSHMEM had a hot release last week (3.3.24) that fixes the above build problem. Shall we re-enable NVSHMEM for CUDA 13 with a version bump of NVSHMEM? Or, had we actually not disabled NVSHMEM?

Hi @kwen2501, thanks for bringing this up. I updated to NVSHMEM 3.3.24 in a separate PR, #161321.
For the aarch64 build it is still disabled, though (#160465): libnvshmem_host.so.3 is present, but there is no build. We should re-enable it as well.

@kwen2501
Contributor

kwen2501 commented Sep 4, 2025

@tinglvv Thanks.
I uploaded #162206 to use the same NVSHMEM version across CUDA builds. Can you please take a look?

For ARM,

  • is the glibc issue fixed?
  • is the build available for download now?

@tinglvv
Collaborator Author

tinglvv commented Sep 4, 2025

> @tinglvv Thanks. I uploaded #162206 to use the same NVSHMEM version across CUDA builds. Can you please take a look?
>
> For ARM,
>
>   • is the glibc issue fixed?
>   • is the build available for download now?

The other PR looks good! For ARM, both issues (the glibc issue and the download link) are fixed, so we can re-enable it. I'll open a PR to do that.

pytorchmergebot pushed a commit that referenced this pull request Sep 5, 2025
Related to #159779

Adding CUDA 13.0 libtorch builds, followup after #160956
Removing CUDA 12.9 builds, See #159980

Pull Request resolved: #161916
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007

Co-authored-by: Ting Lu <tingl@nvidia.com>
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
pytorch#159779

CUDA 13.0.0
NVSHMEM 3.3.20
CUDNN 9.12.0.46

Adding x86 linux builds for CUDA 13.
Adding libtorch docker.
Package naming changed for CUDA 13 (removed postfix -cu13 for some packages).

Preparation checklist:
1. Update index https://download.pytorch.org/whl/nightly/cu130 with pypi packages
2. Update packaging name based on https://pypi.org/project/cuda-toolkit/ metadata

Pull Request resolved: pytorch#160956
Approved by: https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
malfet added a commit that referenced this pull request Sep 19, 2025
Undo changes introduced in #160956 as driver has been updated to 580 for both fleets
pytorchmergebot pushed a commit that referenced this pull request Sep 19, 2025
Undo changes introduced in #160956 as driver has been updated to 580 for both fleets

Fixes #163342
Pull Request resolved: #163349
Approved by: https://github.com/seemethere
atalman added a commit that referenced this pull request Sep 25, 2025
Undo changes introduced in #160956 as driver has been updated to 580 for both fleets

Fixes #163342
Pull Request resolved: #163349
Approved by: https://github.com/seemethere

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Labels

  • ciflow/binaries (Trigger all binary build and upload jobs on the PR)
  • ciflow/h100-symm-mem
  • Merged
  • oncall: distributed (Add this issue/PR to distributed oncall triage queue)
  • open source
  • release notes: releng (release notes category)
  • triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)
