Add CUDA 12.8 manywheel x86 Builds to Binaries Matrix by tinglvv · Pull Request #145792 · pytorch/pytorch · GitHub

@tinglvv
Collaborator

@tinglvv tinglvv commented Jan 27, 2025

#145570

Adding cuda 12.8.0 x86 builds first

TODO: resolve libtorch build failure and add build in #146084

cc @atalman @malfet @ptrblck @nWEIdia

@pytorch-bot

pytorch-bot bot commented Jan 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145792

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 121 Pending

As of commit d9d6492 with merge base 2af8767 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jan 27, 2025
@tinglvv tinglvv added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Jan 27, 2025
@tinglvv tinglvv marked this pull request as ready for review January 28, 2025 18:54
@tinglvv tinglvv requested a review from a team as a code owner January 28, 2025 18:54
@tinglvv
Collaborator Author

tinglvv commented Jan 28, 2025

Will keep the +PTX for nightlies.
Will remove it for 2.7.0 stable release.
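To illustrate the nightly-versus-stable distinction, here is a minimal Python sketch that builds a `TORCH_CUDA_ARCH_LIST` value with or without a `+PTX` suffix on the newest architecture. The helper name `build_arch_list` and the architecture values are assumptions made for illustration; they are not taken from the build scripts touched by this PR.

```python
def build_arch_list(archs, keep_ptx):
    """Join compute capabilities with ';', optionally tagging the last one with +PTX."""
    if not archs:
        raise ValueError("at least one architecture is required")
    entries = list(archs)
    if keep_ptx:
        # +PTX asks nvcc to also embed PTX for this architecture,
        # which newer GPUs can JIT-compile at load time.
        entries[-1] = entries[-1] + "+PTX"
    return ";".join(entries)


# Example values only (hypothetical); the real list lives in the build scripts.
EXAMPLE_ARCHS = ["7.5", "8.0", "8.6", "9.0"]

nightly = build_arch_list(EXAMPLE_ARCHS, keep_ptx=True)
stable = build_arch_list(EXAMPLE_ARCHS, keep_ptx=False)
print(nightly)
print(stable)
```

Dropping `+PTX` trades forward compatibility on future GPUs for a smaller binary, which is presumably why it is treated differently for nightlies and the stable release.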

"nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
Collaborator

We should add another PR updating NCCL
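For readers unfamiliar with the pin format in the snippet under review, the following sketch splits such a string into individual requirements. It assumes the entries are separated by `|` and that each entry has the form `name==version; marker`, as the quoted lines suggest. The function name `split_requirements` is hypothetical, and how the build actually consumes this string is not shown in this thread.

```python
def split_requirements(joined):
    """Split a '|'-joined pin string into (name, version, marker) tuples."""
    parsed = []
    for chunk in joined.split("|"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing separator
        # Split off the environment marker first, since it also contains '=='.
        spec, _, marker = chunk.partition(";")
        name, _, version = spec.strip().partition("==")
        parsed.append((name.strip(), version.strip(), marker.strip()))
    return parsed


# Two of the pins quoted in the review snippet.
PINS = (
    "nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
    "nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
)

requirements = split_requirements(PINS)
versions = {name: version for name, version, _ in requirements}
print(versions["nvidia-nccl-cu12"])
```

A follow-up PR that updates NCCL would change only the version part of the `nvidia-nccl-cu12` entry.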

@Skylion007
Collaborator

This needs someone with S3 access to upload the relevant binaries to the nightly bucket? @albanD @malfet ?

@malfet
Contributor

malfet commented Jan 28, 2025

This needs someone with S3 access to upload the relevant binaries to the nightly bucket? @albanD @malfet ?

@atalman you have taken care of this one already, haven't you?

@atalman
Contributor

atalman commented Jan 29, 2025

Hi @malfet binaries are uploaded to cu128 bucket

Contributor

@atalman atalman left a comment

@tinglvv please fix lint

@tinglvv
Collaborator Author

tinglvv commented Jan 30, 2025

Libtorch Build failure - https://github.com/pytorch/pytorch/actions/runs/13042203635/job/36386381759
CUDAContext.cpp:(.text+0x157): additional relocation overflows omitted from the output
/usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax
collect2: error: ld returned 1 exit status

The binary seems to be too large to link, so we need to trim TORCH_CUDA_ARCH_LIST, following #39968.

Skipping the libtorch wheel addition for now.
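As background on why binary size can trigger this error: GOTPCREL relocations on x86-64 store a signed 32-bit PC-relative offset, so the referenced location must lie within roughly ±2 GiB of the instruction. The sketch below checks that range for hypothetical addresses. It is a simplification that ignores the relocation addend, and the function name `fits_rel32` is made up for illustration.

```python
# Range of a signed 32-bit displacement.
REL32_MIN = -(2**31)
REL32_MAX = 2**31 - 1


def fits_rel32(site, target):
    """Return True if target is reachable from site with a signed 32-bit offset."""
    return REL32_MIN <= target - site <= REL32_MAX


GIB = 2**30

# Hypothetical addresses: one target 1 GiB away, one 3 GiB away.
near_ok = fits_rel32(0x1000, 0x1000 + 1 * GIB)
far_ok = fits_rel32(0x1000, 0x1000 + 3 * GIB)
print(near_ok)
print(far_ok)
```

Every extra architecture in `TORCH_CUDA_ARCH_LIST` adds embedded device code to the library, which is one way the distance between sections can grow past this limit.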

@tinglvv tinglvv changed the title Add CUDA 12.8 Linux Builds to Binaries Matrix Add CUDA 12.8 manywheel x86 Builds to Binaries Matrix Jan 30, 2025
@tinglvv
Collaborator Author

tinglvv commented Jan 30, 2025

Lint keeps getting this failure, unsure of the reason
File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
    main()
File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
    run_cmd_or_die(f"docker exec -t {container_name} /exec")
File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
    raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
RuntimeError: Command docker exec -t d0819bbeba3a6439ccfdbf23e2ed66a4c6521cd53f005452b928484515f6acf2 /exec failed with exit code 5

@Skylion007
Collaborator

To unblock, try passing this flag: #39968 (comment). That might let it link while we figure out the best way to reduce the code size.


cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')

TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
Collaborator

@Skylion007 Skylion007 Jan 30, 2025

We should add --host-linker-script=use-lcs to the TORCH_NVCC_FLAGS at the top of this file, that should fix this issue without changing the CUDA_ARCH_LIST
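The suggestion amounts to appending one flag to an existing space-separated flag string. The file under review is a shell script; the Python sketch below only illustrates the idea. The helper names and the starting value of `TORCH_NVCC_FLAGS` are assumptions, not the script's actual contents.

```python
def append_flag(existing, flag):
    """Return the flag string with `flag` appended, unless it is already present."""
    parts = existing.split()
    if flag not in parts:
        parts.append(flag)
    return " ".join(parts)


def apply_to_env(env, flag):
    """Return a copy of env with the flag added to TORCH_NVCC_FLAGS."""
    updated = dict(env)
    updated["TORCH_NVCC_FLAGS"] = append_flag(env.get("TORCH_NVCC_FLAGS", ""), flag)
    return updated


HLS_FLAG = "--host-linker-script=use-lcs"

# Hypothetical starting value; check the real script for what it actually sets.
env = apply_to_env({"TORCH_NVCC_FLAGS": "-Xfatbin -compress-all"}, HLS_FLAG)
print(env["TORCH_NVCC_FLAGS"])
```

Checking for the flag before appending keeps the helper idempotent, which matters if the build script is sourced more than once.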

@tinglvv
Collaborator Author

tinglvv commented Jan 30, 2025

New Build Failure in libtorch after the ld relink error https://github.com/pytorch/pytorch/actions/runs/13056929705/job/36430236293

[7278/7628] Linking CXX executable bin/example_allreduce
FAILED: bin/example_allreduce 
: && /opt/rh/devtoolset-9/root/usr/bin/c++ -Wno-deprecated-declarations -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -rdynamic     -Wl,--no-as-needed test_cpp_c10d/CMakeFiles/example_allreduce.dir/example/allreduce.cpp.o -o bin/example_allreduce -L/lib/intel64   -L/lib/intel64_win   -L/lib/win-x64 -Wl,-rpath,/lib/intel64:/lib/intel64_win:/lib/win-x64:/pytorch/build/lib  lib/libtorch_cpu.so  lib/libtorch_cuda.so  lib/libc10_cuda.so  /usr/local/cuda/lib64/libcudart_static.a  -ldl  /lib64/librt.so  lib/libprotobuf.a  -pthread  lib/libc10.so  /opt/intel/lib/libmkl_intel_lp64.a  /opt/intel/lib/libmkl_gnu_thread.a  /opt/intel/lib/libmkl_core.a  -fopenmp  -lpthread  /lib64/libm.so  /lib64/libdl.so  -Wl,--no-as-needed,"/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed && /opt/_internal/cpython-3.9.0/lib/python3.9/site-packages/cmake/data/bin/cmake -E __run_co_compile --lwyu="ldd;-u;-r" --source=bin/example_allreduce && :
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/cuda/lib64/libcublasLt.so.12: undefined reference to `__cxa_thread_atexit_impl@GLIBC_2.18'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/cuda/lib64/libcublasLt.so.12: undefined reference to `log2f@GLIBC_2.27'
collect2: error: ld returned 1 exit status

Let me just merge the manywheel changes for now.
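The two `undefined reference` lines name versioned glibc symbols. This suggests that `libcublasLt.so.12` expects a newer glibc than the one in the build image, though that is an inference from the log and is not stated in the thread. A small sketch, with a hypothetical helper `required_glibc`, extracts the highest glibc version mentioned in such linker output:

```python
import re

# Matches e.g.: undefined reference to `log2f@GLIBC_2.27'
SYMBOL_VERSION = re.compile(r"undefined reference to `[^@']+@GLIBC_(\d+)\.(\d+)'")


def required_glibc(log_lines):
    """Return the highest (major, minor) glibc version referenced, or None."""
    versions = []
    for line in log_lines:
        match = SYMBOL_VERSION.search(line)
        if match:
            versions.append((int(match.group(1)), int(match.group(2))))
    return max(versions) if versions else None


# Lines abbreviated from the failure log above.
LOG = [
    "ld: /usr/local/cuda/lib64/libcublasLt.so.12: undefined reference to `__cxa_thread_atexit_impl@GLIBC_2.18'",
    "ld: /usr/local/cuda/lib64/libcublasLt.so.12: undefined reference to `log2f@GLIBC_2.27'",
    "collect2: error: ld returned 1 exit status",
]

needed = required_glibc(LOG)
print(needed)
```

Comparing the result with the glibc version in the libtorch build image would confirm or rule out this explanation.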

@tinglvv
Collaborator Author

tinglvv commented Jan 31, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 31, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@atalman
Contributor

atalman commented Jan 31, 2025

@pytorchmergebot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cu128-binaries onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout cu128-binaries && git pull --rebase)

@atalman
Contributor

atalman commented Jan 31, 2025

@pytorchmergebot merge -f "lint is passing everything else was tested"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

@tinglvv
Collaborator Author

tinglvv commented Jan 31, 2025

Thanks Andrey for merging the PR. Nightly x86 builds should be available from tonight.

@Skylion007
Collaborator

New Build Failure in libtorch after the ld relink error https://github.com/pytorch/pytorch/actions/runs/13056929705/job/36430236293

Let me just merge the manywheel changes for now.

Apparently, we need to pass -Xlinker --script -Xlinker ./lcs to the relink script after generating it with nvcc --host-linker-script gen-lcs

Labels

ciflow/binaries Trigger all binary build and upload jobs on the PR ciflow/trunk Trigger trunk jobs on your pull request Merged open source topic: not user facing topic category

7 participants