Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker #149540

atalman · 2025-03-19T18:30:13Z

1. Use NCCL_VERSION=v2.26.2-1. This fixes the NCCL-related CUDA aarch64 failure seen after landing "nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort" #149351: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443
2. TODO: A follow-up is required to unify NCCL definitions across the x86 and aarch64 builds.
3. Cleanup: Remove older CUDA versions from the aarch64 builds. The CUDA 12.6 builds were removed by "Remove 12.4 x86 builds and 12.6 sbsa builds from nightly" #148895.

pytorch-bot · 2025-03-19T18:30:17Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149540

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 19 Pending, 1 Unrelated Failure

As of commit f865fe7 with merge base 94d761f:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, ephemeral.linux.2xlarge) (gh) (#144480)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

d4l3k · 2025-03-19T18:33:38Z

.ci/docker/common/install_cuda_aarch64.sh

-    popd
-    rm -rf tmp_cusparselt
-}
+NCCL_VERSION=v2.26.2-1


Does this cause any issues for older CUDA versions? For the x86 build we need to use the old NCCL version for CUDA 11.8. cc @kwen2501

This is only for the CUDA aarch64 build. Currently we support only CUDA 12.8 for this build.

Has this NCCL_VERSION gone through distributed CI testing? Does CI have sufficient signal, or would we only know once we merge this?

Oh, this is the aarch64 binary, so we don't have GH100 (multiple of them) in CI yet.

malfet

LGTM, but please add a followup issue to unify NCCL definitions across aarch64 and x86 builds
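
One possible shape for the requested unification is a small shared helper that both install scripts source, so the NCCL tag lives in one place. This is only a sketch: the function name `nccl_tag_for_cuda` and the idea of a shared file are hypothetical and not part of this PR. The only mapping taken from this thread is CUDA 12.8 to v2.26.2-1; other CUDA versions are deliberately rejected rather than guessed.

```shell
#!/usr/bin/env bash
# Hypothetical shared helper (not part of this PR): one place that maps a
# CUDA version to the NCCL git tag to build.
set -euo pipefail

nccl_tag_for_cuda() {
  local cuda_version="$1"
  case "${cuda_version}" in
    # Tag used by this PR for the aarch64 CUDA 12.8 build.
    12.8|12.8.*) echo "v2.26.2-1" ;;
    # Anything else must be added explicitly; no pin is assumed here.
    *) echo "no NCCL tag defined for CUDA ${cuda_version}" >&2; return 1 ;;
  esac
}

# Example use inside an install script; CUDA_VERSION defaults to 12.8 here.
NCCL_VERSION="$(nccl_tag_for_cuda "${CUDA_VERSION:-12.8}")"
echo "NCCL_VERSION=${NCCL_VERSION}"
```

Failing loudly on an unknown CUDA version means an x86 build that still needs an older NCCL cannot pick up the aarch64 pin by accident.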

atalman · 2025-03-19T23:18:29Z

@pytorchmergebot merge -f "lint is green and aarch64 docker build as well"

pytorchmergebot · 2025-03-19T23:19:53Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

atalman · 2025-03-19T23:39:11Z

@pytorchbot cherry-pick --onto release/2.7 --fixes "aarch64 cuda failures with nccl" -c critical

pytorchbot · 2025-03-19T23:44:07Z

Cherry picking #149540

Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x c9de76a1e45e24e45dc00285859457092b7b2a0c returned non-zero exit code 1

Auto-merging .ci/docker/common/install_cuda_aarch64.sh
CONFLICT (content): Merge conflict in .ci/docker/common/install_cuda_aarch64.sh
Auto-merging .github/workflows/build-manywheel-images.yml
error: could not apply c9de76a1e45... Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
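
The manual resolution that the hints above describe can be sketched in a throwaway repository. Everything in this example is illustrative: the file name `install.sh`, the branch name `release`, and the file contents are stand-ins, not the real PyTorch tree. It reproduces a content conflict during `git cherry-pick -x`, keeps the incoming side, and continues the pick.

```shell
#!/usr/bin/env bash
# Self-contained demo of resolving a cherry-pick conflict; all names are
# hypothetical stand-ins for the real repository and branches.
set -euo pipefail

repo="$(mktemp -d)"
cd "${repo}"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Common ancestor of both branches.
echo "NCCL_VERSION=old" > install.sh
git add install.sh
git commit -q -m "base"
main_branch="$(git rev-parse --abbrev-ref HEAD)"

# The release branch diverges on the same line.
git checkout -q -b release
echo "NCCL_VERSION=release-pin" > install.sh
git commit -q -am "release-only change"

# The fix lands on the main branch.
git checkout -q "${main_branch}"
echo "NCCL_VERSION=v2.26.2-1" > install.sh
git commit -q -am "bump NCCL"
fix_sha="$(git rev-parse HEAD)"

# Cherry-pick onto release; this conflicts because both sides edited the line.
git checkout -q release
if ! git cherry-pick -x "${fix_sha}" >/dev/null 2>&1; then
  # During a cherry-pick, "theirs" is the commit being picked.
  git checkout --theirs -- install.sh
  git add install.sh
  # GIT_EDITOR=true accepts the prepared commit message without prompting.
  GIT_EDITOR=true git cherry-pick --continue >/dev/null
fi

result="$(cat install.sh)"
echo "${result}"
```

In a real backport the conflicted file usually needs a hand merge rather than a blanket `--theirs`, since the release branch may carry changes that must be kept.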

Details for Dev Infra team

Raised by workflow job

…12.6 docker (pytorch#149540) 1. Use NCCL_VERSION=v2.26.2-1. Fixes nccl cuda aarch64 related failure we see here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: pytorch#149351 TODO: Followup required to unify NCCL definitions across the x86 and aarch64 builds 3. Cleanup: Remove older CUDA versions for aarch64 builds. CUDA 12.6 builds were removed by: pytorch#148895 Pull Request resolved: pytorch#149540 Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia

… aarch64 cuda 12.6 docker #149540 (#149624) Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540) 1. Use NCCL_VERSION=v2.26.2-1. Fixes nccl cuda aarch64 related failure we see here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: #149351 TODO: Followup required to unify NCCL definitions across the x86 and aarch64 builds 3. Cleanup: Remove older CUDA versions for aarch64 builds. CUDA 12.6 builds were removed by: #148895 Pull Request resolved: #149540 Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia

Modify cuda aarch64 install for cudnn and nccl

b67a042

atalman requested a review from jeffdaily as a code owner March 19, 2025 18:30

pytorch-bot bot added the topic: not user facing topic category label Mar 19, 2025

d4l3k reviewed Mar 19, 2025

atalman changed the title ~~Modify cuda aarch64 install for cudnn and nccl~~ Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker Mar 19, 2025

remove_cuda_126_aarch

12636f8

atalman requested a review from a team as a code owner March 19, 2025 19:58

seemethere approved these changes Mar 19, 2025

malfet approved these changes Mar 19, 2025

atalman mentioned this pull request Mar 19, 2025

Unify nccl versions for x86 and aarch64 builds #149554

Closed

nWEIdia reviewed Mar 19, 2025

atalman commented Mar 19, 2025

Update .ci/docker/common/install_cuda_aarch64.sh

6eb92bf

atalman commented Mar 19, 2025

revert previous commit

f865fe7

atalman commented Mar 19, 2025

nWEIdia approved these changes Mar 19, 2025

pytorchmergebot added the merging label Mar 19, 2025

pytorchmergebot added the Merged label Mar 19, 2025

pytorchmergebot closed this in c9de76a Mar 19, 2025

pytorchmergebot removed the merging label Mar 19, 2025

seemethere mentioned this pull request Mar 20, 2025

ci: Remove mentions and usages of DESIRED_DEVTOOLSET and cxx11 #149443

Closed

atalman mentioned this pull request Mar 20, 2025

[cherry-pick] Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker #149540 #149624

Merged

atalman mentioned this pull request Mar 20, 2025

[v.2.7.0] Release Tracker #149044

Closed
