Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker #149540
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149540
Note: Links to docs will display an error until the docs builds have been completed.
⏳ 19 Pending, 1 Unrelated Failure
As of commit f865fe7 with merge base 94d761f:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
    popd
    rm -rf tmp_cusparselt
}
NCCL_VERSION=v2.26.2-1
Does this cause any issues for older CUDA versions? For the x86 build, do we need to use the old NCCL version for CUDA 11.8? cc @kwen2501
This applies only to the CUDA aarch64 build, which currently supports only CUDA 12.8.
Has this NCCL_VERSION gone through distributed CI testing? Does CI give sufficient signal, or would we only know once we merge this?
Oh, this is aarch64 binary, so we don't have GH100 (multiple of them) in CI yet.
LGTM, but please add a followup issue to unify NCCL definitions across aarch64 and x86 builds
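One way the requested unification could look is a single helper that both the x86 and aarch64 install scripts source, so the NCCL tag is defined in one place. This is a minimal sketch under stated assumptions: the function name `nccl_version_for` and the override variable `NCCL_VERSION_CUDA_11_8` are hypothetical and do not come from this PR, and no older NCCL version number is hard-coded because the thread does not state one.

```shell
# Hypothetical shared helper (not part of this PR).
# Default tag: the version this PR pins for the aarch64 CUDA build.
DEFAULT_NCCL_VERSION="v2.26.2-1"

# Print the NCCL tag to build for a given CUDA version.
# CUDA 11.8 may need an older NCCL on x86; that value is supplied by the
# caller through NCCL_VERSION_CUDA_11_8 (assumed name) instead of being
# guessed here. Every other CUDA version gets the default.
nccl_version_for() {
  case "$1" in
    11.8) echo "${NCCL_VERSION_CUDA_11_8:-$DEFAULT_NCCL_VERSION}" ;;
    *)    echo "$DEFAULT_NCCL_VERSION" ;;
  esac
}
```

An install script would then call `NCCL_VERSION=$(nccl_version_for "$CUDA_VERSION")` rather than assigning the tag inline, where `CUDA_VERSION` is likewise an assumed variable name.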
revert previous commit
@pytorchmergebot merge -f "lint is green and aarch64 docker build as well"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
@pytorchbot cherry-pick --onto release/2.7 --fixes "aarch64 cuda failures with nccl" -c critical
Cherry picking #149540
…12.6 docker (pytorch#149540)
1. Use NCCL_VERSION=v2.26.2-1. Fixes the NCCL-related CUDA aarch64 failure seen after landing pytorch#149351: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443
TODO: Follow-up required to unify NCCL definitions across the x86 and aarch64 builds.
3. Cleanup: remove older CUDA versions for aarch64 builds. The CUDA 12.6 builds were removed by pytorch#148895.
Pull Request resolved: pytorch#149540
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
… aarch64 cuda 12.6 docker #149540 (#149624)
Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540)
1. Use NCCL_VERSION=v2.26.2-1. Fixes the NCCL-related CUDA aarch64 failure seen after landing #149351: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443
TODO: Follow-up required to unify NCCL definitions across the x86 and aarch64 builds.
3. Cleanup: remove older CUDA versions for aarch64 builds. The CUDA 12.6 builds were removed by #148895.
Pull Request resolved: #149540
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
Use NCCL_VERSION=v2.26.2-1. This fixes the NCCL-related CUDA aarch64 failure that appeared after landing "nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort" (#149351). The failing job is here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443
TODO: Follow-up required to unify NCCL definitions across the x86 and aarch64 builds.
Cleanup: remove older CUDA versions for aarch64 builds. The CUDA 12.6 builds were removed by "Remove 12.4 x86 builds and 12.6 sbsa builds from nightly" (#148895).
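With the older CUDA versions removed, the aarch64 install path only has to handle the versions that are still built. A minimal sketch of a fail-fast guard follows. The function name `check_aarch64_cuda` and the variable `SUPPORTED_AARCH64_CUDA` are hypothetical and not taken from the PR, and the supported list reflects only the review comment stating that the aarch64 build currently supports CUDA 12.8.

```shell
# Hypothetical guard (not part of this PR): reject CUDA versions that the
# aarch64 build no longer covers before any download or install starts.
SUPPORTED_AARCH64_CUDA="12.8"   # space-separated list; 12.8 per the review thread

check_aarch64_cuda() {
  # Succeed if the requested version matches any entry in the list.
  for supported in $SUPPORTED_AARCH64_CUDA; do
    if [ "$1" = "$supported" ]; then
      return 0
    fi
  done
  echo "CUDA $1 is not built for aarch64 (supported: $SUPPORTED_AARCH64_CUDA)" >&2
  return 1
}
```

A script could call `check_aarch64_cuda "$CUDA_VERSION" || exit 1` at the top, where `CUDA_VERSION` is again an assumed variable name, so that a stale 12.6 invocation fails with a clear message instead of partway through the install.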