[CD] [aarch64] Add CUDA 13.0 sbsa nightly build by tinglvv · Pull Request #161257 · pytorch/pytorch · GitHub

Conversation

@tinglvv
Collaborator

@tinglvv tinglvv commented Aug 22, 2025

#159779

CUDA SBSA build for CUDA 13.0

  1. Supported archs: sm_80 to sm_120, including support for Thor (sm_110), SPARK (sm_121), and GB300 (sm_103).
    "This release adds support of SM110 GPUs for arm64-sbsa on Linux." from the 13.0 release notes: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
  2. Use -compress-mode=size for binary size reduction: the 13.0 wheel is 2.18 GB versus 3.28 GB for 12.9, which is 1.1 GB of savings and ~33.5% smaller.
  3. Refactored the libs_to_copy list into common libs and version_specific_libs.

TODO: add the remaining CUDA archs from the existing x86 support matrix to the SBSA build as well.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ptrblck @atalman @nWEIdia @malfet

@pytorch-bot pytorch-bot bot added the release notes: releng release notes category label Aug 22, 2025
@pytorch-bot

pytorch-bot bot commented Aug 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161257

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Unrelated Failures

As of commit 2325f1b with merge base b2db293:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@tinglvv tinglvv added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Aug 22, 2025
@tinglvv tinglvv changed the title Test CUDA 13.0 sbsa build Add CUDA 13.0 sbsa build Aug 22, 2025
@tinglvv tinglvv mentioned this pull request Aug 22, 2025
@tinglvv tinglvv marked this pull request as ready for review August 22, 2025 16:18
@tinglvv tinglvv requested a review from a team as a code owner August 22, 2025 16:18
@tinglvv tinglvv changed the title Add CUDA 13.0 sbsa build [CD] [aarch64] Add CUDA 13.0 sbsa nightly build Aug 22, 2025
@ptrblck ptrblck moved this to In Progress in PyTorch + CUDA Aug 22, 2025
@tinglvv tinglvv self-assigned this Aug 24, 2025
@johnnynunez
Contributor

Is it possible to add those archs for the next releases?
How big is the wheel? #161378

Collaborator

@nWEIdia nWEIdia left a comment


Overall looks good, let's add THOR support.
"This release adds support of SM110 GPUs for arm64-sbsa on Linux." from 13.0 release notes https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

@tinglvv
Collaborator Author

tinglvv commented Aug 25, 2025

Is it possible to add those archs for the next releases? How big is the wheel? #161378

This PR is part of the CUDA 13 upstream bring-up #159779, adding sm80-120 support for the CUDA 13 SBSA build.

You can check the size of the wheel in https://github.com/pytorch/pytorch/actions/runs/17186608287: 2.18 GB with --compress-mode=size enabled. Compared with 3.28 GB for 12.9, that is 1.1 GB of savings and ~33.5% smaller.

#161378 seems to be a duplicate, as this PR will add support for Thor (sm_110), Spark (sm_120 compatible), and GB300 (sm_100 compatible). Thanks.

@johnnynunez
Contributor

johnnynunez commented Aug 25, 2025

Is it possible to add those archs for the next releases? How big is the wheel? #161378

This PR is part of the CUDA 13 upstream bring-up #159779, adding sm80-120 support for the CUDA 13 SBSA build.

You can check the size of the wheel in https://github.com/pytorch/pytorch/actions/runs/17186608287: 2.18 GB with --compress-mode=size enabled. Compared with 3.28 GB for 12.9, that is 1.1 GB of savings and ~33.5% smaller.

#161378 seems to be a duplicate, as this PR will add support for Thor (sm_110), Spark (sm_120 compatible), and GB300 (sm_100 compatible). Thanks.

Thanks! Very kind.
I'm very concerned because we now have more ARM devices than x86 to generate wheels for, which increases wheel size. Most frameworks don't want to distribute builds for edge/small devices like Jetson or Spark, but doing so would be very beneficial, because compiling on small devices is hard.

@tinglvv
Collaborator Author

tinglvv commented Aug 25, 2025

After adding sm_110, the build fails in NVSHMEM, which doesn't support building for sm_110: https://github.com/pytorch/pytorch/actions/runs/17200303693/job/48789533255?pr=161257
nvlink error : Undefined reference to '_Z23nvshmemi_transfer_quietIL13threadgroup_t3EEvb' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_110)
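
For illustration, a minimal reproducer sketch of that failure mode (not the PyTorch source; the kernel and file names are hypothetical). nvshmem.h declares the device-side APIs unconditionally, so any kernel that calls one needs matching device code from libnvshmem_device.a at nvlink time, and that library ships no sm_110 code:

// reproducer.cu -- a sketch; building with e.g. `nvcc -rdc=true -arch=sm_110 ... -lnvshmem_device`
// should fail at nvlink with an "Undefined reference" like the log above.
#include <nvshmem.h>

__global__ void put_kernel(int64_t* dst, int peer) {
  // Device-side put: the definition must come from libnvshmem_device.a,
  // which contains no sm_110 code, so nvlink cannot resolve it for that target.
  nvshmem_int64_p(dst, 1, peer);
}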

@johnnynunez
Contributor

johnnynunez commented Aug 25, 2025

After adding sm_110, the build fails in NVSHMEM, which doesn't support building for sm_110: https://github.com/pytorch/pytorch/actions/runs/17200303693/job/48789533255?pr=161257 nvlink error : Undefined reference to '_Z23nvshmemi_transfer_quietIL13threadgroup_t3EEvb' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_110)

Which NVSHMEM version are you using?
It is only supported in the latest release, 3.3.24. I've got it working in jetson-containers.

https://pypi.jetson-ai-lab.io/sbsa/cu130/torch/2.9.0.dev20250823+cu130.g3e5b021

Also, the minimum cuSPARSELt version for Thor is 0.8.0.

@tinglvv
Collaborator Author

tinglvv commented Aug 26, 2025

As mentioned by the NVSHMEM team, NVSHMEM is by definition intended for datacenter GPUs (clusters), and is therefore not expected to support Thor. https://docs.nvidia.com/nvshmem/release-notes-install-guide/release-notes/release-3324.html
Skipping sm_110 for NVSHMEM on SBSA.

@pytorch-bot pytorch-bot bot added ciflow/h100-symm-mem oncall: distributed Add this issue/PR to distributed oncall triage queue labels Aug 26, 2025
@tinglvv
Collaborator Author

tinglvv commented Aug 26, 2025

Previous build failures show that more places than just the header file call into the undefined references:
https://github.com/pytorch/pytorch/actions/runs/17232507556/job/48889683226

nvlink error   : Undefined reference to 'nvshmem_int64_p' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_110)
nvlink error   : Undefined reference to 'nvshmemx_barrier_all_block' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_110)
nvlink error   : Undefined reference to 'nvshmemx_getmem_block' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_110)

Introduced _NVSHMEM_DEVICELIB_SUPPORTED as a helper to decide whether to set NVSHMEM_HOSTLIB_ONLY before including nvshmem.h, and to guard calls to the unsupported functions. 83d2623
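
For reference, a minimal sketch of how such a guard can work. The two macro names come from the description above; the arch test and the kernel are hypothetical, and the actual PyTorch source may differ:

// Decide per compilation pass whether the NVSHMEM device library exists
// for the current target; 1100 is the __CUDA_ARCH__ value for sm_110.
#if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__ != 1100)
#define _NVSHMEM_DEVICELIB_SUPPORTED 1
#endif

#ifndef _NVSHMEM_DEVICELIB_SUPPORTED
// Host-only mode, so nvshmem.h exposes no device-side symbols.
#define NVSHMEM_HOSTLIB_ONLY
#endif
#include <nvshmem.h>

__global__ void barrier_kernel() {
#ifdef _NVSHMEM_DEVICELIB_SUPPORTED
  nvshmemx_barrier_all_block();  // resolved from libnvshmem_device.a
#endif
  // On the sm_110 device pass the call is compiled out, so nvlink has
  // no NVSHMEM device symbol to resolve for that target.
}

This way the sm_110 pass never references the missing device symbols, while every other target keeps the real calls.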

@tinglvv
Collaborator Author

tinglvv commented Aug 27, 2025

@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 27, 2025
@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@atalman
Contributor

atalman commented Aug 27, 2025

@pytorchmergebot merge -f "All required signals look good"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@github-project-automation github-project-automation bot moved this from In Progress to Done in PyTorch + CUDA Aug 27, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Pull Request resolved: pytorch#161257
Approved by: https://github.com/nWEIdia, https://github.com/atalman
@atalman atalman removed this from PyTorch + CUDA Sep 26, 2025