[CD] Add CUDA 13.0 x86 nightly builds #160956

tinglvv · 2025-08-19T09:26:00Z

CUDA 13.0.0
NVSHMEM 3.3.20
CUDNN 9.12.0.46

Adding x86 linux builds for CUDA 13.
Adding sbsa docker.
Adding libtorch docker.
Package naming changed for CUDA 13 (removed postfix -cu13 for some packages).

Preparation checklist:

1. Update index https://download.pytorch.org/whl/nightly/cu130 with pypi packages
2. Update packaging name based on https://pypi.org/project/cuda-toolkit/ metadata
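
Once the index in the checklist is populated, a consumer would typically point pip at it. The sketch below only builds the command line and does not run it. The helper name `nightly_install_command` and the choice of `torch` as the default package are illustrative assumptions; `--pre` and `--index-url` are standard pip options.

```python
import shlex

# URL taken from the preparation checklist above.
NIGHTLY_CU130_INDEX = "https://download.pytorch.org/whl/nightly/cu130"


def nightly_install_command(
    package: str = "torch", index_url: str = NIGHTLY_CU130_INDEX
) -> list[str]:
    """Build (but do not run) a pip command targeting the nightly cu130 index.

    Hypothetical helper for illustration; not part of this PR.
    """
    # --pre lets pip select pre-release (nightly/dev) versions;
    # --index-url makes pip use the given index instead of PyPI.
    return [
        "python", "-m", "pip", "install",
        "--pre", package,
        "--index-url", index_url,
    ]


if __name__ == "__main__":
    # Print a shell-quoted form of the command for copy/paste.
    print(shlex.join(nightly_install_command()))
```

Returning an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.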

cc @ptrblck @nWEIdia @atalman @malfet @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

pytorch-bot · 2025-08-19T09:26:05Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160956


❌ 2 New Failures, 3 Pending

As of commit b2ffcd6 with merge base 19c70c2:

NEW FAILURES - The following jobs have failed:

windows-binary-wheel / wheel-py3_14-xpu-test (gh)
Process completed with exit code 1.
windows-binary-wheel / wheel-py3_14t-xpu-test (gh)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

.ci/manywheel/build_cuda.sh

atalman · 2025-08-19T15:14:09Z

@tinglvv the current error is related to NVSHMEM and sm_75:

```
caffe2/CMakeFiles/torch_nvshmem.dir/cmake_device_link.o -L/usr/local/cuda/targets/x86_64-linux/lib/stubs  -L/usr/local/cuda/targets/x86_64-linux/lib /usr/local/cuda/lib64/libnvshmem_device.a -lcudadevrt -lcudart_static -lrt -lpthread -ldl
nvlink error   : Undefined reference to '_Z23nvshmemi_transfer_quietIL13threadgroup_t3EEvb' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
nvlink error   : Undefined reference to '_Z47nvshmemi_transfer_enforce_consistency_at_targetb' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
nvlink error   : Undefined reference to '_Z30nvshmemi_transfer_amo_nonfetchIlEvPvT_i14nvshmemi_amo_t' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
nvlink error   : Undefined reference to '_Z23nvshmemi_transfer_rma_pIlEvPvT_i' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
nvlink error   : Undefined reference to '_Z21nvshmemi_transfer_rmaIL13threadgroup_t3EL13nvshmemi_op_t4EEvPvS2_mi' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
nvlink error   : Undefined reference to 'nvshmemi_device_state_d' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
```
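
To check whether failures like these are confined to one architecture, the log can be grouped by the `target:` field. This is a minimal sketch that assumes the line format shown in the log above; the function name `undefined_symbols_by_arch` is hypothetical and the sample reuses two lines from that log.

```python
import re
from collections import defaultdict

# Matches lines of the form seen in the log above:
#   nvlink error   : Undefined reference to '<symbol>' in '<object>' (target: sm_75)
NVLINK_UNDEFINED = re.compile(
    r"nvlink error\s*:\s*Undefined reference to '(?P<symbol>[^']+)'"
    r" in '(?P<obj>[^']+)' \(target: (?P<arch>sm_\d+\w*)\)"
)


def undefined_symbols_by_arch(log_text: str) -> dict[str, list[str]]:
    """Group undefined device symbols by the GPU target nvlink reported."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for match in NVLINK_UNDEFINED.finditer(log_text):
        grouped[match["arch"]].append(match["symbol"])
    return dict(grouped)


# Two lines copied from the build log above.
sample = """\
nvlink error   : Undefined reference to '_Z23nvshmemi_transfer_rma_pIlEvPvT_i' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
nvlink error   : Undefined reference to 'nvshmemi_device_state_d' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_75)
"""
result = undefined_symbols_by_arch(sample)
print(sorted(result))  # ['sm_75']
```

If every key in the result is the same architecture, that supports the reading that the problem is specific to that target rather than to the NVSHMEM link step in general.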

atalman · 2025-08-19T15:39:20Z

Please note that the Docker builds are using the correct version: NVSHMEM 3.3.20 for CUDA 13 (x86_64).

kwen2501 · 2025-08-19T16:26:48Z

Does the NVSHMEM error occur only for sm_75? If so, would that point to NVSHMEM 3.3.20 dropping support for this arch?

atalman · 2025-08-19T16:27:26Z

I don't see sm_75 mentioned here: https://docs.nvidia.com/nvshmem/release-notes-install-guide/release-notes/release-3320.html


.github/scripts/generate_binary_build_matrix.py


tinglvv · 2025-08-20T17:44:46Z

https://download.pytorch.org/whl/nightly/cu130 is updated after pytorch/test-infra#7038. Rerunning the test.

@atalman


tinglvv · 2025-08-20T18:55:03Z

Temporarily disabled sm_75 for NVSHMEM in the CUDA 13 build (3069af6); we will need to re-enable it when the hotfix in 3.3.21 is released.
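
The kind of gating described here can be sketched as a filter over a semicolon-separated architecture list. This is a hypothetical illustration, not the contents of commit 3069af6: the function name, its parameters, the example arch list, and the idea of passing the fixed NVSHMEM version as `fixed_in` are all assumptions.

```python
def nvshmem_arch_list(
    arch_list: str,
    cuda_version: tuple[int, int],
    nvshmem_version: tuple[int, int, int],
    fixed_in: tuple[int, int, int],
) -> str:
    """Drop 7.5 from a semicolon-separated arch list when the workaround applies.

    Hypothetical helper: the names and the `fixed_in` gate are illustrative
    assumptions, not the actual change made in this PR.
    """
    archs = [a for a in arch_list.split(";") if a]
    # Apply the workaround only for CUDA 13+ with an NVSHMEM older than the fix.
    needs_workaround = cuda_version >= (13, 0) and nvshmem_version < fixed_in
    if needs_workaround:
        archs = [a for a in archs if a != "7.5"]
    return ";".join(archs)


# NVSHMEM 3.3.20 on CUDA 13.0, with the fix assumed to land in 3.3.21.
print(nvshmem_arch_list("7.5;8.0;9.0", (13, 0), (3, 3, 20), (3, 3, 21)))  # 8.0;9.0
```

Keying the workaround on a version comparison means sm_75 comes back automatically once the NVSHMEM pin is bumped to a fixed release, instead of relying on someone remembering to revert it.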

atalman · 2025-08-22T11:29:18Z

@pytorchmergebot merge -f "signal looks good"

pytorchmergebot · 2025-08-22T11:30:47Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.


kwen2501 · 2025-09-04T20:12:39Z

@tinglvv @atalman Following up: NVSHMEM had a hotfix release last week (3.3.24) that fixes the build problem above.
Shall we re-enable NVSHMEM for CUDA 13 with an NVSHMEM version bump?
Or had we not actually disabled NVSHMEM?

kwen2501 · 2025-09-04T20:19:18Z

Is this the only code change related to "disable nvshmem"?
https://github.com/pytorch/pytorch/pull/160956/files#diff-65f865c422a6ac7945d21c7042c57699e2fa5ae0daae1320493bbb72a26100c2R13-R17

tinglvv · 2025-09-04T20:30:54Z

> @tinglvv @atalman Following up, NVSHMEM had a hot release last week (3.3.24) that fixes the above build problem. Shall we re-enable NVSHMEM for CUDA 13 with a version bump of NVSHMEM? Or, had we actually not disabled NVSHMEM?

Hi @kwen2501 , thanks for bringing this up. I updated to NVSHMEM 3.3.24 in a separate PR - #161321.
For the aarch64 build it is still disabled, though (#160465): libnvshmem_host.so.3 is present, but there is no build. We should re-enable it as well.

kwen2501 · 2025-09-04T20:41:17Z

@tinglvv Thanks.
I uploaded #162206 to use the same NVSHMEM version across CUDA builds. Can you please take a look?

For ARM:

- Is the glibc issue fixed?
- Is the build available for download now?

tinglvv · 2025-09-04T20:49:06Z

> @tinglvv Thanks. I uploaded #162206 to use the same NVSHMEM version across CUDA builds. Can you please take a look?
>
> For ARM:
>
> - Is the glibc issue fixed?
> - Is the build available for download now?

The other PR looks good! For ARM, both issues are fixed (the glibc issue and the download link), so we can re-enable it. I'll open a PR to do that.

Related to #159779
Adding CUDA 13.0 libtorch builds, followup after #160956
Removing CUDA 12.9 builds, see #159980
Pull Request resolved: #161916
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007
Co-authored-by: Ting Lu <tingl@nvidia.com>


pytorch#159779

CUDA 13.0.0
NVSHMEM 3.3.20
CUDNN 9.12.0.46

Adding x86 linux builds for CUDA 13.
Adding libtorch docker.
Package naming changed for CUDA 13 (removed postfix -cu13 for some packages).

Preparation checklist:
1. Update index https://download.pytorch.org/whl/nightly/cu130 with pypi packages
2. Update packaging name based on https://pypi.org/project/cuda-toolkit/ metadata

Pull Request resolved: pytorch#160956
Approved by: https://github.com/atalman
Co-authored-by: atalman <atalman@fb.com>


Undo changes introduced in #160956 as driver has been updated to 580 for both fleets
Fixes #163342
Pull Request resolved: #163349
Approved by: https://github.com/seemethere


Undo changes introduced in #160956 as driver has been updated to 580 for both fleets
Fixes #163342
Pull Request resolved: #163349
Approved by: https://github.com/seemethere
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>


pytorch-bot bot added the release notes: releng label Aug 19, 2025

tinglvv added the ciflow/binaries label Aug 19, 2025

pytorchbot added the open source label Aug 19, 2025

atalman reviewed Aug 19, 2025

.ci/manywheel/build_cuda.sh

nWEIdia mentioned this pull request Aug 19, 2025

PyTorch CUDA13 Binary Cannot Be Built with SM_75 with NVSHMEM #160980

Closed

pytorch-bot bot added the ciflow/h100-symm-mem and oncall: distributed labels Aug 19, 2025

tinglvv mentioned this pull request Aug 19, 2025

Enable CUDA 13.0 binaries #159779

Closed

15 tasks

atalman mentioned this pull request Aug 19, 2025

Update nvidia cuda 13.0 dependencies pytorch/test-infra#7030

Merged

malfet pushed a commit to pytorch/test-infra that referenced this pull request Aug 19, 2025

Update nvidia cuda 13.0 dependencies (#7030)

bd0ee9f

Adding dependencies to unblock pytorch/pytorch#160956

atalman reviewed Aug 20, 2025

.github/scripts/generate_binary_build_matrix.py

atalman mentioned this pull request Aug 20, 2025

Add cuda 13.0 packages to update index script pytorch/test-infra#7038

Merged

atalman added a commit to pytorch/test-infra that referenced this pull request Aug 20, 2025

Add cuda 13.0 packages to update index script (#7038)

4c38b67

Related to pytorch/pytorch#160956, please see: pytorch/pytorch#160956 (comment)

tinglvv marked this pull request as ready for review August 20, 2025 17:45

tinglvv requested review from a team and jeffdaily as code owners August 20, 2025 17:45

tinglvv mentioned this pull request Aug 20, 2025

fix cudnn naming for CUDA 13 package pytorch/test-infra#7039

Merged

atalman pushed a commit to pytorch/test-infra that referenced this pull request Aug 20, 2025

fix cudnn naming for CUDA 13 package (#7039)

cc278a5

Related to pytorch/pytorch#160956 follow up for #7038 cc @atalman

tinglvv changed the title ~~Add CUDA 13.0 linux builds~~ Add CUDA 13.0 x86 builds Aug 20, 2025

tinglvv and others added 4 commits August 20, 2025 14:21

Add cuda 13 linux builds

3e24c8e

test building x86 for now

a6f81c8

Run regenerate.sh

1877ace

Add cuda 13 to Docker build

4634a55

pytorchmergebot added the merging label Aug 22, 2025

pytorchmergebot closed this in 49ff884 Aug 22, 2025

pytorchmergebot added Merged and removed merging labels Aug 22, 2025

tinglvv changed the title ~~Add CUDA 13.0 x86 builds~~ [CD] Add CUDA 13.0 x86 nightly builds Aug 22, 2025

atalman mentioned this pull request Sep 1, 2025

[CD] Add cuda 13.0 libtorch builds, remove CUDA 12.9 builds #161916

Closed

malfet added a commit that referenced this pull request Sep 19, 2025

Simplify NVIDIA driver installation step

8cab9c5

Undo changes introduced in #160956 as driver has been updated to 580 for both fleets

malfet mentioned this pull request Sep 19, 2025

[CD] Simplify NVIDIA driver installation step #163349

Closed

atalman mentioned this pull request Sep 24, 2025

[CD] Simplify NVIDIA driver installation step (#163349) #163790

Merged
