[CD] [aarch64] Add CUDA 13.0 sbsa nightly build by tinglvv · Pull Request #161257 · pytorch/pytorch · GitHub

Conversation

@tinglvv
Collaborator

@tinglvv tinglvv commented Aug 22, 2025

#159779

CUDA SBSA build for CUDA 13.0

  1. Supported archs: sm_80 to sm_120, including support for Thor (sm_110), SPARK (sm_121), and GB300 (sm_103).
    "This release adds support of SM110 GPUs for arm64-sbsa on Linux." from the 13.0 release notes: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
  2. Use -compress-mode=size for binary size reduction: the 13.0 wheel is 2.18 GB versus 3.28 GB for 12.9, which is 1.1 GB of savings and ~33.5% smaller.
  3. Refactored the libs_to_copy list into common libs and version_specific_libs.

TODO: add the remaining CUDA archs from the existing x86 support matrix to the SBSA build as well.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ptrblck @atalman @nWEIdia @malfet

@pytorch-bot pytorch-bot bot added the release notes: releng release notes category label Aug 22, 2025
@pytorch-bot

pytorch-bot bot commented Aug 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161257

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Unrelated Failures

As of commit 2325f1b with merge base b2db293:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@tinglvv tinglvv added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Aug 22, 2025
@tinglvv tinglvv changed the title Test CUDA 13.0 sbsa build Add CUDA 13.0 sbsa build Aug 22, 2025
@tinglvv tinglvv mentioned this pull request Aug 22, 2025
@tinglvv tinglvv marked this pull request as ready for review August 22, 2025 16:18
@tinglvv tinglvv requested a review from a team as a code owner August 22, 2025 16:18
@tinglvv tinglvv changed the title Add CUDA 13.0 sbsa build [CD] [aarch64] Add CUDA 13.0 sbsa nightly build Aug 22, 2025
@ptrblck ptrblck moved this to In Progress in PyTorch + CUDA Aug 22, 2025
@tinglvv tinglvv self-assigned this Aug 24, 2025
@johnnynunez
Contributor

Is it possible to add those archs for the next releases?
How big is the wheel? #161378

Collaborator

@nWEIdia nWEIdia left a comment


Overall looks good, let's add THOR support.
"This release adds support of SM110 GPUs for arm64-sbsa on Linux." from 13.0 release notes https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

@tinglvv
Collaborator Author

tinglvv commented Aug 25, 2025

Is it possible to add those archs for the next releases? How big is the wheel? #161378

This PR is part of the CUDA 13 upstream bring-up #159779, adding sm80-120 support for the CUDA 13 SBSA build.

You can check the size of the wheel in https://github.com/pytorch/pytorch/actions/runs/17186608287: 2.18 GB with --compress-mode=size enabled. Compared with 3.28 GB for 12.9, that is 1.1 GB of savings and ~33.5% smaller.

#161378 seems to be a duplicate, as this PR will add support for Thor (sm_110), Spark (sm_120 compatible), and GB300 (sm_100 compatible). Thanks.

@johnnynunez
Contributor

johnnynunez commented Aug 25, 2025

Is it possible to add those archs for the next releases? How big is the wheel? #161378

This PR is part of the CUDA 13 upstream bring-up #159779, adding sm80-120 support for the CUDA 13 SBSA build.

You can check the size of the wheel in https://github.com/pytorch/pytorch/actions/runs/17186608287: 2.18 GB with --compress-mode=size enabled. Compared with 3.28 GB for 12.9, that is 1.1 GB of savings and ~33.5% smaller.

#161378 seems to be a duplicate, as this PR will add support for Thor (sm_110), Spark (sm_120 compatible), and GB300 (sm_100 compatible). Thanks.

Thanks! Very kind.
I'm very concerned because we now have more ARM devices than x86 to generate wheels for, which increases wheel size. Most frameworks don't want to distribute builds for edge/small devices like Jetson or Spark, but doing so would be very beneficial, because compiling on small devices is hard.

@tinglvv
Collaborator Author

tinglvv commented Aug 25, 2025

After adding sm_110, the build fails in NVSHMEM, which doesn't support building for sm_110: https://github.com/pytorch/pytorch/actions/runs/17200303693/job/48789533255?pr=161257
nvlink error : Undefined reference to '_Z23nvshmemi_transfer_quietIL13threadgroup_t3EEvb' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_110)
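
For illustration, a minimal reproducer sketch of that failure mode (not the PyTorch source; the kernel and file names are hypothetical). nvshmem.h declares the device-side APIs unconditionally, so any kernel that calls one needs matching device code from libnvshmem_device.a at nvlink time, and that library ships no sm_110 code:

// reproducer.cu -- a sketch; building with e.g. `nvcc -rdc=true -arch=sm_110 ... -lnvshmem_device`
// should fail at nvlink with an "Undefined reference" like the log above.
#include <nvshmem.h>

__global__ void put_kernel(int64_t* dst, int peer) {
  // Device-side put: the definition must come from libnvshmem_device.a,
  // which contains no sm_110 code, so nvlink cannot resolve it for that target.
  nvshmem_int64_p(dst, 1, peer);
}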

@johnnynunez
Contributor

johnnynunez commented Aug 25, 2025

After adding sm_110, the build fails in NVSHMEM, which doesn't support building for sm_110: https://github.com/pytorch/pytorch/actions/runs/17200303693/job/48789533255?pr=161257 nvlink error : Undefined reference to '_Z23nvshmemi_transfer_quietIL13threadgroup_t3EEvb' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_110)

Which NVSHMEM version are you using?
It is only supported in the latest release, 3.3.24. I've got it working in jetson-containers.

https://pypi.jetson-ai-lab.io/sbsa/cu130/torch/2.9.0.dev20250823+cu130.g3e5b021

Also, the minimum cuSPARSELt version for Thor is 0.8.0.

@tinglvv
Collaborator Author

tinglvv commented Aug 26, 2025

As mentioned by the NVSHMEM team, NVSHMEM is by definition intended for datacenter GPUs (clusters), and is therefore not expected to support Thor. https://docs.nvidia.com/nvshmem/release-notes-install-guide/release-notes/release-3324.html
Skipping sm_110 for NVSHMEM on SBSA.

@pytorch-bot pytorch-bot bot added ciflow/h100-symm-mem oncall: distributed Add this issue/PR to distributed oncall triage queue labels Aug 26, 2025
@tinglvv
Collaborator Author

tinglvv commented Aug 26, 2025

Previous build failures show that more places than just the header file call into the undefined references:
https://github.com/pytorch/pytorch/actions/runs/17232507556/job/48889683226

nvlink error   : Undefined reference to 'nvshmem_int64_p' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_110)
nvlink error   : Undefined reference to 'nvshmemx_barrier_all_block' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_110)
nvlink error   : Undefined reference to 'nvshmemx_getmem_block' in 'caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu.o' (target: sm_110)

Introduced _NVSHMEM_DEVICELIB_SUPPORTED as a helper to decide whether to set NVSHMEM_HOSTLIB_ONLY before including nvshmem.h, and to guard calls to the unsupported functions. 83d2623
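
For reference, a minimal sketch of how such a guard can work. The two macro names come from the description above; the arch test and the kernel are hypothetical, and the actual PyTorch source may differ:

// Decide per compilation pass whether the NVSHMEM device library exists
// for the current target; 1100 is the __CUDA_ARCH__ value for sm_110.
#if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__ != 1100)
#define _NVSHMEM_DEVICELIB_SUPPORTED 1
#endif

#ifndef _NVSHMEM_DEVICELIB_SUPPORTED
// Host-only mode, so nvshmem.h exposes no device-side symbols.
#define NVSHMEM_HOSTLIB_ONLY
#endif
#include <nvshmem.h>

__global__ void barrier_kernel() {
#ifdef _NVSHMEM_DEVICELIB_SUPPORTED
  nvshmemx_barrier_all_block();  // resolved from libnvshmem_device.a
#endif
  // On the sm_110 device pass the call is compiled out, so nvlink has
  // no NVSHMEM device symbol to resolve for that target.
}

This way the sm_110 pass never references the missing device symbols, while every other target keeps the real calls.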

@tinglvv
Collaborator Author

tinglvv commented Aug 27, 2025

@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 27, 2025
@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@atalman
Contributor

atalman commented Aug 27, 2025

@pytorchmergebot merge -f "All required signals look good"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@github-project-automation github-project-automation bot moved this from In Progress to Done in PyTorch + CUDA Aug 27, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Pull Request resolved: pytorch#161257
Approved by: https://github.com/nWEIdia, https://github.com/atalman
@atalman atalman removed this from PyTorch + CUDA Sep 26, 2025