[CD] Add cuda 13.0 libtorch builds, remove CUDA 12.9 builds by atalman · Pull Request #161916 · pytorch/pytorch · GitHub

@atalman
Contributor

@atalman atalman commented Sep 1, 2025

Related to #159779

Adding CUDA 13.0 libtorch builds, a follow-up to #160956.
Removing CUDA 12.9 builds; see #159980.

cc @albanD @ptrblck @nWEIdia @malfet

@atalman atalman requested a review from a team as a code owner September 1, 2025 15:44
@pytorch-bot

pytorch-bot bot commented Sep 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161916

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit b13f017 with merge base 403a3a3 (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Sep 1, 2025
@atalman atalman added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Sep 1, 2025
@atalman
Contributor Author

atalman commented Sep 1, 2025

@pytorchmergebot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased remove_cuda129 onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout remove_cuda129 && git pull --rebase)

@atalman
Contributor Author

atalman commented Sep 1, 2025

Starting from CUDA 13.0:

The cuFFT binary libcufft_static_nocallback.a has been removed. libcufft_static.a can be used as a replacement.

From: https://docs.nvidia.com/cuda/cufft/deprecated-functionality.html
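
The replacement rule quoted above can be written as a small helper. This is a minimal sketch, not the build logic used in this PR: the function name `cufft_static_library` and the `"major.minor"` version-string format are assumptions made for illustration.

```python
def cufft_static_library(cuda_version: str) -> str:
    """Pick the cuFFT static archive name for a CUDA version string such as "13.0"."""
    major = int(cuda_version.split(".")[0])
    # Per the NVIDIA note quoted above, the nocallback archive is removed from 13.0 on.
    if major >= 13:
        return "libcufft_static.a"
    return "libcufft_static_nocallback.a"


print(cufft_static_library("13.0"))
print(cufft_static_library("12.9"))
```

As the later comments in this thread show, swapping the archive name may not be enough on its own, because `libcufft_static.a` carries device code that needs a device-link step.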

@atalman
Contributor Author

atalman commented Sep 2, 2025

Looks like a new error, cc @tinglvv:

2025-09-02T00:24:54.3285263Z   /usr/bin/ld: /usr/local/cuda/extras/CUPTI/lib64/libcupti_static.a(libcupti_static_partial_mangled.o): relocation R_X86_64_PC32 against symbol `_nv024841cupti' can not be used when making a shared object; recompile with -fPIC
  /usr/bin/ld: final link failed: bad value
  collect2: error: ld returned 1 exit status

Could it be that the static libcupti was not compiled with the -fPIC option?

"libcudart.so.13"
"libnvrtc.so.13"
"libcupti.so.13")
export USE_CUPTI_SO=1
Contributor Author

We are packaging cupti.so here anyway, so why are we still linking statically against CUPTI?
cc @malfet

Contributor

I'm not sure I understand the question. We shouldn't be linking statically with CUPTI if it's being packaged...

Contributor Author

Correct. I believe the issue was that USE_CUPTI_SO was set to 0, so we linked against CUPTI statically.

@atalman
Contributor Author

atalman commented Sep 2, 2025

@tinglvv new failure:

  FAILED: [code=1] bin/Dimname_test
  : && /usr/bin/c++ -Wno-deprecated-declarations -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -rdynamic -Wl,--no-as-needed caffe2/CMakeFiles/Dimname_test.dir/__/aten/src/ATen/test/Dimname_test.cpp.o -o bin/Dimname_test -L/lib/intel64   -L/lib/intel64_win   -L/lib/win-x64 -Wl,-rpath,/lib/intel64:/lib/intel64_win:/lib/win-x64:/pytorch/build/lib  lib/libgtest_main.a  lib/libgtest.a  lib/libgmock.a  -lstdc++  -Wl,--no-as-needed,"/pytorch/build/lib/libtorch.so" -Wl,--as-needed  -Wl,--no-as-needed,"/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed  lib/libprotobuf.a  /opt/intel/lib/libmkl_intel_lp64.a  /opt/intel/lib/libmkl_gnu_thread.a  /opt/intel/lib/libmkl_core.a  -fopenmp  /usr/lib/x86_64-linux-gnu/libpthread.so  -lm  /usr/lib/x86_64-linux-gnu/libdl.so  -Wl,--no-as-needed,"/pytorch/build/lib/libtorch_cuda.so" -Wl,--as-needed  lib/libc10_cuda.so  lib/libc10.so  /usr/local/cuda/lib64/libcudart_static.a  -ldl  /usr/lib/x86_64-linux-gnu/librt.so  lib/libgtest.a  -pthread && /opt/_internal/cpython-3.9.0/lib/python3.9/site-packages/cmake/data/bin/cmake -E __run_co_compile --lwyu="ldd;-u;-r" --source=bin/Dimname_test && :
  /usr/bin/ld: /pytorch/build/lib/libtorch_cuda.so: undefined reference to `__cudaRegisterLinkedBinary_0cc4b128_20_separate_callback_cu_0aba18a7_23160'
  collect2: error: ld returned 1 exit status
  [7708/7822] Linking CXX executable bin/Dict_test
  FAILED: [code=1] bin/Dict_test

@tinglvv
Collaborator

tinglvv commented Sep 2, 2025

@tinglvv new failure:

  FAILED: [code=1] bin/Dimname_test
  : && /usr/bin/c++ -Wno-deprecated-declarations -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -rdynamic -Wl,--no-as-needed caffe2/CMakeFiles/Dimname_test.dir/__/aten/src/ATen/test/Dimname_test.cpp.o -o bin/Dimname_test -L/lib/intel64   -L/lib/intel64_win   -L/lib/win-x64 -Wl,-rpath,/lib/intel64:/lib/intel64_win:/lib/win-x64:/pytorch/build/lib  lib/libgtest_main.a  lib/libgtest.a  lib/libgmock.a  -lstdc++  -Wl,--no-as-needed,"/pytorch/build/lib/libtorch.so" -Wl,--as-needed  -Wl,--no-as-needed,"/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed  lib/libprotobuf.a  /opt/intel/lib/libmkl_intel_lp64.a  /opt/intel/lib/libmkl_gnu_thread.a  /opt/intel/lib/libmkl_core.a  -fopenmp  /usr/lib/x86_64-linux-gnu/libpthread.so  -lm  /usr/lib/x86_64-linux-gnu/libdl.so  -Wl,--no-as-needed,"/pytorch/build/lib/libtorch_cuda.so" -Wl,--as-needed  lib/libc10_cuda.so  lib/libc10.so  /usr/local/cuda/lib64/libcudart_static.a  -ldl  /usr/lib/x86_64-linux-gnu/librt.so  lib/libgtest.a  -pthread && /opt/_internal/cpython-3.9.0/lib/python3.9/site-packages/cmake/data/bin/cmake -E __run_co_compile --lwyu="ldd;-u;-r" --source=bin/Dimname_test && :
  /usr/bin/ld: /pytorch/build/lib/libtorch_cuda.so: undefined reference to `__cudaRegisterLinkedBinary_0cc4b128_20_separate_callback_cu_0aba18a7_23160'
  collect2: error: ld returned 1 exit status
  [7708/7822] Linking CXX executable bin/Dict_test
  FAILED: [code=1] bin/Dict_test

`__cudaRegisterLinkedBinary_0cc4b128_20_separate_callback_cu_0aba18a7_23160' points to "separate_callback.cu", so the issue is that cufft_static contains a small amount of callback-related device code that was not present in cufft_static_nocallback. Because of this, the build needs to be modified slightly to include device linking; an extra build command fixes the link. We could also try turning off static linking for cuFFT first.
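
To show how the file name can be read off the undefined symbol, here is a minimal sketch. The layout it assumes (a hex hash, then a decimal length, then that many characters of mangled file name) is inferred only from the one symbol in this log and is not a documented format; `source_hint` is a hypothetical helper name.

```python
import re

# Layout inferred from the single symbol in this thread; not an official spec.
_SYMBOL_RE = re.compile(r"__cudaRegisterLinkedBinary_[0-9a-f]+_(\d+)_(.+)")


def source_hint(symbol: str):
    """Return the source-file hint embedded in a registration symbol, or None."""
    match = _SYMBOL_RE.search(symbol)
    if match is None:
        return None
    length = int(match.group(1))
    rest = match.group(2)
    if length > len(rest):
        return None
    mangled = rest[:length]  # e.g. "separate_callback_cu"
    stem, sep, ext = mangled.rpartition("_")
    # Assumption: the last underscore stands in for the dot before the extension.
    return f"{stem}.{ext}" if sep else mangled


symbol = "__cudaRegisterLinkedBinary_0cc4b128_20_separate_callback_cu_0aba18a7_23160"
print(source_hint(symbol))  # prints: separate_callback.cu
```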

@tinglvv tinglvv mentioned this pull request Sep 2, 2025
@atalman
Contributor Author

atalman commented Sep 3, 2025

Last failure, missing dependencies:

Missing dependency from libcufile /usr/bin/ld: warning: libmlx5.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: librdmacm.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libibverbs.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)

@tinglvv tinglvv requested a review from jeffdaily as a code owner September 3, 2025 07:43
@tinglvv
Collaborator

tinglvv commented Sep 3, 2025

Last failure, missing dependencies:

Missing dependency from libcufile /usr/bin/ld: warning: libmlx5.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: librdmacm.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libibverbs.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)

The missing dependencies belong to the RDMA/InfiniBand packages and can be installed with apt-get install -y libibverbs-dev. This needs to be installed in the libtorch Docker image, which is based on ubuntu:20.04.
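
For triaging warnings like the ones quoted above, a short script can list which sonames are missing and which library needs them. This is an illustrative sketch only: `missing_libraries` is a hypothetical helper, and the regular expression assumes the GNU ld warning wording seen in this log.

```python
import re

# Matches GNU ld warnings of the form seen in the log above.
_WARNING_RE = re.compile(r"warning: (\S+), needed by (\S+), not found")


def missing_libraries(log: str):
    """Map each library that needs something to the sorted list of missing sonames."""
    missing = {}
    for lib, needed_by in _WARNING_RE.findall(log):
        missing.setdefault(needed_by, set()).add(lib)
    return {owner: sorted(libs) for owner, libs in missing.items()}


log = """\
Missing dependency from libcufile /usr/bin/ld: warning: libmlx5.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: librdmacm.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libibverbs.so.1, needed by /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1, not found (try using -rpath or -rpath-link)
"""
for owner, libs in missing_libraries(log).items():
    print(owner, "->", ", ".join(libs))
```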

@tinglvv
Collaborator

tinglvv commented Sep 4, 2025

Two more issues after adding apt-get install -y libibverbs-dev to the installation:

  1. The runtime dependencies need to be copied as well, in addition to the 3 missing libs:
root@35eac87debe4:/# ldd /usr/lib/x86_64-linux-gnu/libibverbs.so.1
        linux-vdso.so.1 (0x00007ffdaecbe000)
        libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x00007f03281d9000)
        libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x00007f03281b6000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f0328193000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f032818d000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0327f9b000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f032827d000)

libnl-route-3.so.200 and libnl-3.so.200 are also needed. Copy them as well.

  2. Installation
    In addition to apt-get install -y libibverbs-dev, apt-get install -y librdmacm-dev is needed to install librdmacm.so.1.
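
The first point above (finding which extra shared objects must be copied next to libibverbs.so.1) can be sketched by parsing the ldd output. This is a rough illustration under stated assumptions: `libs_to_copy` and the `SYSTEM_LIBS` skip list are hypothetical, and a real packaging script may use a different rule for what counts as a system library.

```python
import re

# "name => /path (0xaddr)" lines; the vdso and loader lines have no "=>".
_LDD_RE = re.compile(r"(\S+) => (\S+) \(0x[0-9a-f]+\)")

# Hypothetical skip list: core glibc pieces that are normally not bundled.
SYSTEM_LIBS = ("libc.so", "libdl.so", "libpthread.so", "libm.so", "librt.so")


def libs_to_copy(ldd_output: str):
    """Return {soname: resolved path} for dependencies that are not core system libs."""
    found = {}
    for line in ldd_output.splitlines():
        match = _LDD_RE.fullmatch(line.strip())
        if match is None:
            continue
        name, path = match.groups()
        if not name.startswith(SYSTEM_LIBS):
            found[name] = path
    return found


ldd_output = """\
        linux-vdso.so.1 (0x00007ffdaecbe000)
        libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x00007f03281d9000)
        libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x00007f03281b6000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f0328193000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f032818d000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0327f9b000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f032827d000)
"""
for name, path in libs_to_copy(ldd_output).items():
    print(name, path)
```

With the ldd output from this comment, only the two libnl libraries remain, which matches the conclusion drawn in the thread.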

@tinglvv
Collaborator

tinglvv commented Sep 4, 2025

Good news: the build is passing now after fixing the missing deps. There is a new failure in the test: https://github.com/pytorch/pytorch/actions/runs/17454954118/job/49578127160

+ g++ /pytorch/.ci/pytorch/test_example_code/simple-torch-test.cpp -I/tmp/libtorch/include -I/tmp/libtorch/include/torch/csrc/api/include -std=gnu++17 -L/tmp/libtorch/lib -Wl,-R/tmp/libtorch/lib -Wl,--no-as-needed -ltorch -ltorch_cpu -ltorch_cuda -lc10 -o simple-torch-test
/usr/bin/ld: /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1: undefined reference to `cuFileLoggerErr(char const*, char const*)'
/usr/bin/ld: /tmp/libtorch/lib/libcufile_rdma-5257f22c.so.1: undefined reference to `cuFileLoggerDbg(char const*, char const*)'

@tinglvv
Collaborator

tinglvv commented Sep 5, 2025

CI is passing after disabling CUFILE to unblock. Merging for now.

@tinglvv
Collaborator

tinglvv commented Sep 5, 2025

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 5, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


daisyden pushed a commit to daisyden/pytorch that referenced this pull request Sep 8, 2025
…161916)

Related to pytorch#159779

Adding CUDA 13.0 libtorch builds, followup after pytorch#160956
Removing CUDA 12.9 builds, See pytorch#159980

Pull Request resolved: pytorch#161916
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007

Co-authored-by: Ting Lu <tingl@nvidia.com>
pytorchmergebot pushed a commit that referenced this pull request Oct 11, 2025
Revert part of #161916 to continue building CUDA 12.9 nightly

Pull Request resolved: #163029
Approved by: https://github.com/malfet
pytorchbot pushed a commit that referenced this pull request Oct 14, 2025
Revert part of #161916 to continue building CUDA 12.9 nightly

Pull Request resolved: #163029
Approved by: https://github.com/malfet

(cherry picked from commit 4400c5d)
Camyll pushed a commit that referenced this pull request Oct 16, 2025
* Continue to build nightly CUDA 12.9 for internal (#163029)

Revert part of #161916 to continue building CUDA 12.9 nightly

Pull Request resolved: #163029
Approved by: https://github.com/malfet

(cherry picked from commit 4400c5d)

* Fix lint

Signed-off-by: Huy Do <huydhn@gmail.com>

---------

Signed-off-by: Huy Do <huydhn@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>

Labels

ciflow/binaries (Trigger all binary build and upload jobs on the PR)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
skip-pr-sanity-checks
topic: not user facing (topic category)
