Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker by atalman · Pull Request #149540 · pytorch/pytorch · GitHub
Conversation

atalman
Contributor

@atalman atalman commented Mar 19, 2025

  1. Use NCCL_VERSION=v2.26.2-1. Fixes the NCCL-related CUDA aarch64 failure seen here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . The failure was seen after landing: nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort #149351
    TODO: A follow-up is required to unify NCCL definitions across the x86 and aarch64 builds.

  2. Cleanup: remove older CUDA versions for aarch64 builds. The CUDA 12.6 builds were removed by: Remove 12.4 x86 builds and 12.6 sbsa builds from nightly #148895
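
As a rough illustration of item 1, the sketch below shows how a pinned NCCL tag might be consumed by an install script. Only the tag value comes from this PR; the helper names `nccl_clone_command` and `nccl_plain_version` are hypothetical, and the assumption that NCCL is fetched by cloning the upstream NVIDIA/nccl repository at that tag is mine, not something this thread confirms. The helper prints the clone command rather than running it, so the sketch needs no network access.

```shell
#!/usr/bin/env bash
set -eu

# Tag taken from the PR description.
NCCL_VERSION=v2.26.2-1

# Hypothetical helper: print (do not run) a shallow clone command for a tag.
# Assumes the source is the upstream NVIDIA/nccl repository.
nccl_clone_command() {
  local tag="$1"
  printf 'git clone -b %s --depth 1 https://github.com/NVIDIA/nccl.git\n' "$tag"
}

# Hypothetical helper: strip the leading "v" and the "-N" packaging suffix,
# leaving the plain version number.
nccl_plain_version() {
  local tag="$1"
  tag="${tag#v}"              # v2.26.2-1 -> 2.26.2-1
  printf '%s\n' "${tag%%-*}"  # 2.26.2-1  -> 2.26.2
}

nccl_clone_command "$NCCL_VERSION"
nccl_plain_version "$NCCL_VERSION"
```

Keeping the tag in a single variable like this is what makes the follow-up in the TODO attractive: if the x86 and aarch64 scripts read the same definition, a version bump cannot land on one architecture only.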

@atalman atalman requested a review from jeffdaily as a code owner March 19, 2025 18:30
@pytorch-bot

pytorch-bot bot commented Mar 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149540

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 19 Pending, 1 Unrelated Failure

As of commit f865fe7 with merge base 94d761f:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Mar 19, 2025
popd
rm -rf tmp_cusparselt
}
NCCL_VERSION=v2.26.2-1
Member

Does this cause any issues for older CUDA versions? For the x86 build, do we still need the old NCCL version for CUDA 11.8? cc @kwen2501

Contributor Author

This applies only to the CUDA aarch64 build, which currently supports only CUDA 12.8.
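
To make the scoping concrete, here is a minimal sketch of how an aarch64 install script could dispatch on the CUDA version and reject anything other than 12.8. The function name `select_aarch64_cuda_installer` and the installer name `install_128` are hypothetical; the thread does not show the actual script's structure.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical dispatcher: map a CUDA version to the name of an installer
# function. Only 12.8 is accepted, matching the comment above.
select_aarch64_cuda_installer() {
  local cuda_version="$1"
  case "$cuda_version" in
    12.8)
      echo "install_128"
      ;;
    *)
      echo "unsupported CUDA version for aarch64: $cuda_version" >&2
      return 1
      ;;
  esac
}

select_aarch64_cuda_installer 12.8
```

With a single supported version, one NCCL pin in this script cannot affect older CUDA toolchains, which is the concern raised in the question.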

Collaborator

Has this NCCL_VERSION gone through distributed CI testing? Does CI provide sufficient signal, or would we only find out after merging this?

Collaborator

Oh, this is aarch64 binary, so we don't have GH100 (multiple of them) in CI yet.

@atalman atalman changed the title from "Modify cuda aarch64 install for cudnn and nccl" to "Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker" Mar 19, 2025
@atalman atalman requested a review from a team as a code owner March 19, 2025 19:58
Contributor

@malfet malfet left a comment

LGTM, but please add a followup issue to unify NCCL definitions across aarch64 and x86 builds

revert previous commit
@atalman
Contributor Author

atalman commented Mar 19, 2025

@pytorchmergebot merge -f "lint is green and aarch64 docker build as well"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

@atalman
Contributor Author

atalman commented Mar 19, 2025

@pytorchbot cherry-pick --onto release/2.7 --fixes "aarch64 cuda failures with nccl" -c critical

@pytorchbot
Collaborator

Cherry picking #149540

Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x c9de76a1e45e24e45dc00285859457092b7b2a0c returned non-zero exit code 1

Auto-merging .ci/docker/common/install_cuda_aarch64.sh
CONFLICT (content): Merge conflict in .ci/docker/common/install_cuda_aarch64.sh
Auto-merging .github/workflows/build-manywheel-images.yml
error: could not apply c9de76a1e45... Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team Raised by workflow job
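
The recovery flow that the git hints describe can be tried in a throwaway repository. The sketch below is illustrative only: the file name `install.sh`, the branch names, the version strings, and the function name `demo_cherry_pick_conflict` are all made up and are not the real PyTorch files or the real conflict. It creates a conflict on a single line, resolves it by taking the cherry-picked value, and finishes with `git cherry-pick --continue`.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical demo: reproduce and resolve a cherry-pick conflict in a
# temporary repository. The body runs in a subshell so "cd" does not leak.
demo_cherry_pick_conflict() (
  repo="$(mktemp -d)"
  cd "$repo"
  git init -q .
  git config user.name "demo"
  git config user.email "demo@example.com"
  git config commit.gpgsign false

  # Common ancestor of both branches.
  echo "NCCL_VERSION=v0.0.1" > install.sh
  git add install.sh
  git commit -q -m "base"
  git branch -M main
  git branch release

  # The fix lands on main.
  echo "NCCL_VERSION=v0.0.3" > install.sh
  git commit -q -am "bump version on main"
  fix="$(git rev-parse HEAD)"

  # The release branch changed the same line, so the pick will conflict.
  git checkout -q release
  echo "NCCL_VERSION=v0.0.2" > install.sh
  git commit -q -am "diverge on release"

  if ! git cherry-pick -x "$fix" >/dev/null 2>&1; then
    # Resolve by hand (here: take the cherry-picked value), stage the file,
    # then continue, as the hints in the bot output suggest.
    echo "NCCL_VERSION=v0.0.3" > install.sh
    git add install.sh
    GIT_EDITOR=true git cherry-pick --continue >/dev/null
  fi

  result="$(cat install.sh)"
  cd /
  rm -rf "$repo"
  echo "$result"
)

demo_cherry_pick_conflict
```

In the real case the resolution is a judgment call about which lines of `.ci/docker/common/install_cuda_aarch64.sh` belong on the release branch, so the manual step would replace the single `echo` used here.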

atalman added a commit to atalman/pytorch that referenced this pull request Mar 20, 2025
…12.6 docker (pytorch#149540)

1. Use NCCL_VERSION=v2.26.2-1. Fixes the NCCL-related CUDA aarch64 failure seen here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . The failure was seen after landing: pytorch#149351
TODO: A follow-up is required to unify NCCL definitions across the x86 and aarch64 builds.

2. Cleanup: remove older CUDA versions for aarch64 builds. The CUDA 12.6 builds were removed by: pytorch#148895
Pull Request resolved: pytorch#149540
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
atalman added a commit to atalman/pytorch that referenced this pull request Mar 26, 2025
…12.6 docker (pytorch#149540)
malfet pushed a commit that referenced this pull request Mar 26, 2025
… aarch64 cuda 12.6 docker #149540 (#149624)

Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540)
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
…12.6 docker (pytorch#149540)