[CD] Add CUDA 13.0 libtorch builds, remove CUDA 12.9 builds #161916
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161916
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures) As of commit b13f017 with merge base 403a3a3.
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorchmergebot rebase -b main

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Successfully rebased; force-pushed from 561a612 to 1ace630.
Force-pushed from 1ace630 to 70cd368.
Starting from CUDA 13.0, the cuFFT binary libcufft_static_nocallback.a has been removed; libcufft_static.a can be used as a replacement. From: https://docs.nvidia.com/cuda/cufft/deprecated-functionality.html
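A minimal sketch of how a build script might handle this, assuming the swap is keyed on the CUDA major version; the variable names (CUDA_VERSION, CUFFT_STATIC_LIB) are illustrative and not the ones used in the actual PyTorch build scripts:

```bash
# Pick the cuFFT static library name based on the CUDA major version.
CUDA_VERSION="${CUDA_VERSION:-13.0}"          # illustrative default
CUDA_MAJOR="${CUDA_VERSION%%.*}"
if [[ "${CUDA_MAJOR}" -ge 13 ]]; then
  # libcufft_static_nocallback.a no longer ships with CUDA 13.0+
  CUFFT_STATIC_LIB="libcufft_static.a"
else
  CUFFT_STATIC_LIB="libcufft_static_nocallback.a"
fi
echo "Using ${CUFFT_STATIC_LIB} for static cuFFT linking"
```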
Force-pushed from 70cd368 to 666f389.
Looks like a new error, cc @tinglvv: could it be that the static libcupti was not compiled with …?
| "libcudart.so.13" | ||
| "libnvrtc.so.13" | ||
| "libcupti.so.13") | ||
| export USE_CUPTI_SO=1 |
Here we are packaging cupti.so anyway, so why are we still statically linking with CUPTI?
cc @malfet
I'm not sure I understand the question. We shouldn't be linking statically with CUPTI if it's being packaged...
Correct, I believe this was an issue: USE_CUPTI_SO was set to 0, hence we statically linked with it.
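A hedged sketch of the intended behavior: when libcupti.so is bundled into the package, build against the shared library rather than linking CUPTI statically. USE_CUPTI_SO is a real PyTorch build flag; BUNDLE_CUPTI_SO is an illustrative stand-in for "we are packaging libcupti.so", not an actual script variable:

```bash
# Decide how to link CUPTI based on whether the shared library is bundled.
if [[ "${BUNDLE_CUPTI_SO:-1}" == "1" ]]; then
  export USE_CUPTI_SO=1   # dynamic link; libcupti.so.13 ships alongside libtorch
else
  export USE_CUPTI_SO=0   # static link; no libcupti.so needed at runtime
fi
```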
@tinglvv new failure:
`__cudaRegisterLinkedBinary_0cc4b128_20_separate_callback_cu_0aba18a7_23160' points to "separate_callback.cu", so the issue is that cufft_static contains a small amount of callback-related device code that was not present in cufft_static_nocallback. Because of this, the build needs to be slightly modified to include a device-linking step; an extra build command fixes the linking. We could also first try turning off static linking for cuFFT.
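A rough sketch of what that extra device-link step could look like, assuming the objects are linked with nvcc; the object/file names here are illustrative only:

```bash
# libcufft_static.a contains device code (for callbacks), so run a device-link
# pass over the application objects plus the static cuFFT libraries first.
nvcc -dlink app_objects.o \
  -L"${CUDA_HOME}/lib64" -lcufft_static -lculibos \
  -o cufft_device_link.o

# cufft_device_link.o is then added to the final host link together with the
# original objects, -lcufft_static, -lculibos, and -lcudart.
```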
Last failure, missing dependencies:
The missing deps are related to RDMA/InfiniBand packages and can be installed through …
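The exact install command is not shown in the thread; purely as an illustration, on an AlmaLinux/CentOS-based manylinux build image the RDMA/InfiniBand userspace libraries typically come from packages like the ones below (package names are assumptions and may differ per distro/version):

```bash
# Common providers of libibverbs/librdmacm on EL-based images.
yum install -y rdma-core libibverbs librdmacm
```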
Two more issues after updating the installation:
libnl-route-3.so.200 and libnl-3.so.200 are also needed; copy them as well.
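A minimal sketch of bundling those libraries with the libtorch binary; the source and destination paths (LIBTORCH_LIB_DIR) are illustrative:

```bash
# Copy the libnl runtime libraries next to the other bundled shared libraries.
LIBTORCH_LIB_DIR="${LIBTORCH_LIB_DIR:-/opt/libtorch/lib}"
cp /usr/lib64/libnl-3.so.200 /usr/lib64/libnl-route-3.so.200 "${LIBTORCH_LIB_DIR}/"
```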
Good news: the build is passing now after fixing the missing deps. New failure in test: https://github.com/pytorch/pytorch/actions/runs/17454954118/job/49578127160
CI is passing after disabling cuFile to unblock. Merging for now.
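For reference, USE_CUFILE is the PyTorch build flag controlling cuFile (GPUDirect Storage) support; setting it to 0 builds without the cuFile dependency. That this is the exact knob flipped here is an assumption based on the comment above:

```bash
# Build without cuFile / GPUDirect Storage support.
export USE_CUFILE=0
```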
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…161916) Related to pytorch#159779. Adding CUDA 13.0 libtorch builds, follow-up after pytorch#160956. Removing CUDA 12.9 builds, see pytorch#159980. Pull Request resolved: pytorch#161916. Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007. Co-authored-by: Ting Lu <tingl@nvidia.com>
Revert part of #161916 to continue building CUDA 12.9 nightly. Pull Request resolved: #163029. Approved by: https://github.com/malfet
* Continue to build nightly CUDA 12.9 for internal (#163029). Revert part of #161916 to continue building CUDA 12.9 nightly. Pull Request resolved: #163029. Approved by: https://github.com/malfet (cherry picked from commit 4400c5d) * Fix lint. Signed-off-by: Huy Do <huydhn@gmail.com> Co-authored-by: Huy Do <huydhn@gmail.com>
Related to #159779
Adding CUDA 13.0 libtorch builds, follow-up after #160956
Removing CUDA 12.9 builds, see #159980
cc @albanD @ptrblck @nWEIdia @malfet