[CD] Add cuda 13.0 libtorch builds, remove CUDA 12.9 builds #161916

atalman · 2025-09-01T15:44:42Z

Related to #159779

Adding CUDA 13.0 libtorch builds, follow-up to #160956
Removing CUDA 12.9 builds, see #159980

cc @albanD @ptrblck @nWEIdia @malfet

pytorch-bot · 2025-09-01T15:44:46Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161916

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit b13f017 with merge base 403a3a3:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

windows-arm64-binary-libtorch-debug / libtorch-cpu-shared-with-deps-debug-build (gh) (detected as infra flaky with no log or failing log classifier)
windows-binary-wheel / wheel-py3_13t-xpu-build (gh) (matched win rule in flaky-rules.json)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

atalman · 2025-09-01T16:30:36Z

@pytorchmergebot rebase -b main

pytorchmergebot · 2025-09-01T16:32:09Z

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

pytorchmergebot · 2025-09-01T16:32:12Z

Successfully rebased remove_cuda129 onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout remove_cuda129 && git pull --rebase)

atalman · 2025-09-01T20:12:48Z

Starting from CUDA 13.0:

The cuFFT binary libcufft_static_nocallback.a has been removed. libcufft_static.a can be used as a replacement.

From: https://docs.nvidia.com/cuda/cufft/deprecated-functionality.html

atalman · 2025-09-02T11:58:55Z

Looks like a new error, cc @tinglvv:

2025-09-02T00:24:54.3285263Z   /usr/bin/ld: /usr/local/cuda/extras/CUPTI/lib64/libcupti_static.a(libcupti_static_partial_mangled.o): relocation R_X86_64_PC32 against symbol `_nv024841cupti' can not be used when making a shared object; recompile with -fPIC
  /usr/bin/ld: final link failed: bad value
  collect2: error: ld returned 1 exit status

Could it be that the static libcupti was not compiled with the -fPIC option?

atalman · 2025-09-02T14:46:51Z

.ci/manywheel/build_cuda.sh

+                "libcudart.so.13"
+                "libnvrtc.so.13"
+                "libcupti.so.13")
+            export USE_CUPTI_SO=1


We are packaging libcupti.so here anyway, so why are we still linking statically against CUPTI?
cc @malfet

malfet · 2025-09-02

I'm not sure I understand the question. We shouldn't be linking statically with CUPTI if it's being packaged...

atalman · 2025-09-02

Correct. I believe the issue was that USE_CUPTI_SO was set to 0, hence we linked against CUPTI statically.

atalman · 2025-09-02T15:26:43Z

@tinglvv new failure:

  FAILED: [code=1] bin/Dimname_test
  : && /usr/bin/c++ -Wno-deprecated-declarations -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -rdynamic -Wl,--no-as-needed caffe2/CMakeFiles/Dimname_test.dir/__/aten/src/ATen/test/Dimname_test.cpp.o -o bin/Dimname_test -L/lib/intel64   -L/lib/intel64_win   -L/lib/win-x64 -Wl,-rpath,/lib/intel64:/lib/intel64_win:/lib/win-x64:/pytorch/build/lib  lib/libgtest_main.a  lib/libgtest.a  lib/libgmock.a  -lstdc++  -Wl,--no-as-needed,"/pytorch/build/lib/libtorch.so" -Wl,--as-needed  -Wl,--no-as-needed,"/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed  lib/libprotobuf.a  /opt/intel/lib/libmkl_intel_lp64.a  /opt/intel/lib/libmkl_gnu_thread.a  /opt/intel/lib/libmkl_core.a  -fopenmp  /usr/lib/x86_64-linux-gnu/libpthread.so  -lm  /usr/lib/x86_64-linux-gnu/libdl.so  -Wl,--no-as-needed,"/pytorch/build/lib/libtorch_cuda.so" -Wl,--as-needed  lib/libc10_cuda.so  lib/libc10.so  /usr/local/cuda/lib64/libcudart_static.a  -ldl  /usr/lib/x86_64-linux-gnu/librt.so  lib/libgtest.a  -pthread && /opt/_internal/cpython-3.9.0/lib/python3.9/site-packages/cmake/data/bin/cmake -E __run_co_compile --lwyu="ldd;-u;-r" --source=bin/Dimname_test && :
  /usr/bin/ld: /pytorch/build/lib/libtorch_cuda.so: undefined reference to `__cudaRegisterLinkedBinary_0cc4b128_20_separate_callback_cu_0aba18a7_23160'
  collect2: error: ld returned 1 exit status
  [7708/7822] Linking CXX executable bin/Dict_test
  FAILED: [code=1] bin/Dict_test

tinglvv · 2025-09-02T17:58:39Z

Re the Dimname_test link failure above (undefined reference to `__cudaRegisterLinkedBinary_0cc4b128_20_separate_callback_cu_0aba18a7_23160' from libtorch_cuda.so):

The symbol name contains "separate_callback.cu", so the failure comes from a small amount of callback-related device code in cufft_static that was not present in cufft_static_nocallback. Because of this, the build needs to be modified slightly to include device linking. There is an extra build command that fixes the linking. Alternatively, we could first try turning off static linking for cuFFT.
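The way the source file was read off the symbol name can be sketched in Python. This is only an illustration: the layout `__cudaRegisterLinkedBinary_<id>_<length>_<name>_...` is inferred from the single symbol in the log above, not from documented behavior, and `source_hint` is a hypothetical helper name.

```python
import re

PREFIX = "__cudaRegisterLinkedBinary_"


def source_hint(symbol):
    """Return the source-file hint embedded in a linked-binary symbol, or None.

    Assumed layout (inferred from one symbol in the build log, not documented):
    PREFIX + <hex id> + "_" + <name length> + "_" + <name> + "_" + <suffix>
    """
    if not symbol.startswith(PREFIX):
        return None
    rest = symbol[len(PREFIX):]
    match = re.match(r"[0-9a-f]+_(\d+)_", rest)
    if match is None:
        return None
    length = int(match.group(1))
    name = rest[match.end():match.end() + length]
    if len(name) != length:
        return None
    # The dot of the file extension shows up as an underscore in the symbol.
    if name.endswith("_cu"):
        name = name[:-3] + ".cu"
    return name


SYMBOL = "__cudaRegisterLinkedBinary_0cc4b128_20_separate_callback_cu_0aba18a7_23160"
```

Calling `source_hint(SYMBOL)` yields `separate_callback.cu`, which matches the file name pointed out in the comment.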

atalman · 2025-09-03T00:15:03Z

Last failure, missing dependencies:

Missing dependency from libcufile:

/usr/bin/ld: warning: libmlx5.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: librdmacm.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libibverbs.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)

tinglvv · 2025-09-03T21:59:34Z

Re the missing libcufile dependencies above (libmlx5.so.1, librdmacm.so.1, libibverbs.so.1):

The missing deps belong to the RDMA/InfiniBand packages and can be installed with apt-get install -y libibverbs-dev. They need to be installed in the libtorch Docker image, which is based on ubuntu:20.04.
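For triaging logs like this one, a small Python sketch can group the missing sonames by the library that needs them. It assumes the GNU ld warning wording shown in the log above; `missing_libs` and `LOG` are hypothetical names used only for illustration.

```python
import re

# Matches GNU ld warnings of the form seen in the build log:
#   warning: <soname>, needed by <path>, not found
_MISSING = re.compile(r"warning: (\S+), needed by (\S+), not found")


def missing_libs(ld_output):
    """Map each library path to the list of sonames ld could not find for it."""
    result = {}
    for match in _MISSING.finditer(ld_output):
        missing, needed_by = match.groups()
        result.setdefault(needed_by, []).append(missing)
    return result


# The three warnings from the CI log, copied verbatim.
LOG = """\
/usr/bin/ld: warning: libmlx5.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: librdmacm.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libibverbs.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)
"""
```

`missing_libs(LOG)` returns a single entry for libcufile_rdma listing the three sonames, which makes it easy to see that all of the warnings originate from one bundled library.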

tinglvv · 2025-09-04T01:37:42Z

Two more issues after adding apt-get install -y libibverbs-dev to the installation:

Runtime dependencies: besides the 3 missing libs, their own runtime dependencies need to be copied as well.

root@35eac87debe4:/# ldd /usr/lib/x86_64-linux-gnu/libibverbs.so.1
        linux-vdso.so.1 (0x00007ffdaecbe000)
        libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x00007f03281d9000)
        libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x00007f03281b6000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f0328193000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f032818d000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0327f9b000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f032827d000)

libnl-route-3.so.200 and libnl-3.so.200 are needed too, so copy them as well.

Installation: besides apt-get install -y libibverbs-dev, apt-get install -y librdmacm-dev is also needed to get librdmacm.so.1.
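The ldd check above can be scripted. The following Python sketch parses ldd output and lists the resolved dependencies that are not part of the base C runtime. The `BASE_LIBS` set is an assumption made for this example (libraries expected on any target system), and `parse_ldd` and `extra_runtime_deps` are hypothetical helper names, not part of the PyTorch build scripts.

```python
# Assumption for this sketch: these sonames exist on every target system,
# so they do not need to be copied next to libtorch.
BASE_LIBS = {"libc.so.6", "libpthread.so.0", "libdl.so.2"}


def parse_ldd(output):
    """Return {soname: resolved path} for every 'name => path' line of ldd output."""
    deps = {}
    for line in output.splitlines():
        line = line.strip()
        if "=>" not in line:
            continue  # skips linux-vdso and the dynamic loader line
        name, _, rest = line.partition("=>")
        path = rest.split("(")[0].strip()
        if path and path != "not found":
            deps[name.strip()] = path
    return deps


def extra_runtime_deps(output):
    """List sonames from ldd output that are not in BASE_LIBS, in ldd order."""
    return [name for name in parse_ldd(output) if name not in BASE_LIBS]


# ldd output for libibverbs.so.1, copied from the comment above.
LDD_OUTPUT = """\
        linux-vdso.so.1 (0x00007ffdaecbe000)
        libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x00007f03281d9000)
        libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x00007f03281b6000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f0328193000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f032818d000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0327f9b000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f032827d000)
"""
```

For the output above, `extra_runtime_deps(LDD_OUTPUT)` gives the two libnl libraries, the same pair identified by hand in the comment.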

tinglvv · 2025-09-04T15:07:56Z

Good news: the build is passing now that the missing deps are fixed. There is a new failure in the test step: https://github.com/pytorch/pytorch/actions/runs/17454954118/job/49578127160

+ g++ /pytorch/.ci/pytorch/test_example_code/simple-torch-test.cpp -I/tmp/libtorch/include -I/tmp/libtorch/include/torch/csrc/api/include -std=gnu++17 -L/tmp/libtorch/lib -Wl,-R/tmp/libtorch/lib -Wl,--no-as-needed -ltorch -ltorch_cpu -ltorch_cuda -lc10 -o simple-torch-test
/usr/bin/ld: /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1: undefined reference to `cuFileLoggerErr(char const*, char const*)'
/usr/bin/ld: /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1: undefined reference to `cuFileLoggerDbg(char const*, char const*)'

tinglvv · 2025-09-05T05:05:48Z

CI is passing after disabling CUFILE to unblock. Merging for now.

tinglvv · 2025-09-05T05:05:57Z

@pytorchmergebot merge

pytorchmergebot · 2025-09-05T05:07:47Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

…161916)

Related to pytorch#159779

Adding CUDA 13.0 libtorch builds, followup after pytorch#160956
Removing CUDA 12.9 builds, See pytorch#159980

Pull Request resolved: pytorch#161916
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007

Co-authored-by: Ting Lu <tingl@nvidia.com>

Revert part of #161916 to continue building CUDA 12.9 nightly Pull Request resolved: #163029 Approved by: https://github.com/malfet

Revert part of #161916 to continue building CUDA 12.9 nightly Pull Request resolved: #163029 Approved by: https://github.com/malfet (cherry picked from commit 4400c5d)

* Continue to build nightly CUDA 12.9 for internal (#163029)

  Revert part of #161916 to continue building CUDA 12.9 nightly
  Pull Request resolved: #163029
  Approved by: https://github.com/malfet
  (cherry picked from commit 4400c5d)

* Fix lint

  Signed-off-by: Huy Do <huydhn@gmail.com>

---------

Signed-off-by: Huy Do <huydhn@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>


atalman requested a review from a team as a code owner September 1, 2025 15:44

pytorch-bot bot added the topic: not user facing topic category label Sep 1, 2025

atalman added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Sep 1, 2025

atalman added the skip-pr-sanity-checks label Sep 1, 2025

pytorchmergebot force-pushed the remove_cuda129 branch from 561a612 to 1ace630 Compare September 1, 2025 16:32

jeanschmidt approved these changes Sep 1, 2025


Skylion007 approved these changes Sep 1, 2025


atalman force-pushed the remove_cuda129 branch from 1ace630 to 70cd368 Compare September 1, 2025 20:12

atalman added 2 commits September 1, 2025 14:35

[CD] Add cuda 13.0 libtorch builds, remove CUDA 12.9 builds

de7ec0a

fix

666f389

atalman force-pushed the remove_cuda129 branch from 70cd368 to 666f389 Compare September 1, 2025 21:35

atalman added 2 commits September 1, 2025 15:12

fix

c38f0ae

static

cd1bb89

use_cupti_so_fix_deps

3ce2015

atalman commented Sep 2, 2025


turn_off_static_linking

abbbacf

tinglvv mentioned this pull request Sep 2, 2025

Enable CUDA 13.0 binaries #159779

Closed

15 tasks

Add missing libs related to RDMA

a73f046

tinglvv requested a review from jeffdaily as a code owner September 3, 2025 07:43

atalman added 2 commits September 3, 2025 09:23

Update libtorch Dockerfile

cbaec9f

Update Dockerfile

a131a76

Use libibverbs-dev instead of libibverbs-devel

2acda07

fix /usr/lib path

5141276

tinglvv added 2 commits September 3, 2025 18:40

Add the two runtime deps as well

04c932f

Fix the libnl-route-3.so.200 lib name

3510256

turn_off_cufile

b13f017

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 5, 2025

pytorchmergebot added the merging label Sep 5, 2025

pytorchmergebot added the Merged label Sep 5, 2025

pytorchmergebot closed this in bffc7dd Sep 5, 2025

pytorchmergebot removed the merging label Sep 5, 2025

tinglvv mentioned this pull request Sep 5, 2025

libcufile symbol mismatch error in CUDA 13.0 libtorch build #162280

Open

huydhn mentioned this pull request Sep 16, 2025

Continue to build nightly CUDA 12.9 for internal #163029

Closed

malfet mentioned this pull request Sep 18, 2025

[Docs] Get started Locally still points to CUDA-12.9 #163287

Closed

pytorchmergebot pushed a commit that referenced this pull request Oct 11, 2025

Continue to build nightly CUDA 12.9 for internal (#163029)

4400c5d

Revert part of #161916 to continue building CUDA 12.9 nightly Pull Request resolved: #163029 Approved by: https://github.com/malfet

pytorchbot mentioned this pull request Oct 14, 2025

Continue to build nightly CUDA 12.9 for internal #165466

Merged
