nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort by d4l3k · Pull Request #149351 · pytorch/pytorch · GitHub

Conversation

@d4l3k (Member) commented Mar 17, 2025

Fixes #149153

Yaml generated from:

```
python .github/scripts/generate_ci_workflows.py
```

Test plan:

Repro in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7

```
rm -rf third_party/nccl
python setup.py develop
```
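
For reviewers who don't open the gist: a minimal sketch of the teardown pattern from #149153, assuming two GPUs and a standard torchrun launch. This is illustrative only; the gist above is the authoritative repro.

```
# Sketch only -- the linked gist is the real repro. Launch with:
#   torchrun --nproc_per_node=2 repro_sketch.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl")

    # First collective forces lazy creation of the NCCL communicator.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)

    if rank != 0:
        # Rank 0 skips this collective, so the kernel enqueued here can
        # never complete and teardown has to abort the communicator.
        dist.all_reduce(t)

    # Teardown can reach ncclCommAbort; with NCCL 2.25.1-1 this could
    # hang (see #149153), while 2.26.2 should return promptly.
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```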

@d4l3k d4l3k requested review from atalman and kwen2501 March 17, 2025 21:53
@d4l3k d4l3k requested review from a team and jeffdaily as code owners March 17, 2025 21:53
@pytorch-bot bot commented Mar 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149351

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 6b79c80 with merge base 8bc7bd9:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kwen2501 (Contributor) left a comment

LGTM! Thanks!
I will let @atalman and @malfet confirm!

@atalman (Contributor) left a comment

lgtm

@atalman atalman added this to the 2.7.0 milestone Mar 17, 2025
@malfet (Contributor) commented Mar 17, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 17, 2025
@pytorchmergebot (Collaborator)
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot (Collaborator)
Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@malfet (Contributor) commented Mar 18, 2025

@pytorchbot merge -i

@pytorchmergebot (Collaborator)
Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, ephemeral.linux.2xlarge), pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@atalman (Contributor) commented Mar 19, 2025

Hi @d4l3k, looks like this causes a failure with aarch64 CUDA builds: https://github.com/pytorch/pytorch/actions/runs/13937741027/job/39008788394#step:14:4197

Errors:

```
In file included from include/core.h:34,
                 from bootstrap.cc:8:
bootstrap.cc: In function ‘ncclResult_t netIsend(ncclNet_t*, void*, void*, int, void*, int, void**, int*)’:
bootstrap.cc:156:25: error: too many arguments to function
  156 |     NCCLCHECK(net->isend(sendComm, data, (size_t)size, tag, dataHandle, NULL, sendReq));
      |               ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/checks.h:123:22: note: in definition of macro ‘NCCLCHECK’
  123 |   ncclResult_t RES = call; \
      |                      ^~~~
bootstrap.cc: In function ‘ncclResult_t netIrecv(ncclNet_t*, void*, void*, int, void*, int, void**, int*)’:
bootstrap.cc:171:46: error: cannot convert ‘size_t*’ {aka ‘long unsigned int*’} to ‘int*’ in argument passing
  171 |     NCCLCHECK(net->irecv(recvComm, 1, &data, &size64, &tag, &dataHandle, NULL, recvReq));
      |                                              ^~~~~~~
      |                                              |
      |                                              size_t* {aka long unsigned int*}
include/checks.h:123:22: note: in definition of macro ‘NCCLCHECK’
  123 |   ncclResult_t RES = call; \
      |                      ^~~~
bootstrap.cc: In function ‘ncclResult_t netRingConnect(ncclNet_t*, bootstrapListen_t*, char*, void**, ncclNetDeviceHandle_t**, void**, ncclNetDeviceHandle_t**, volatile uint32_t*)’:
bootstrap.cc:543:53: error: cannot convert ‘char*’ to ‘void**’ in argument passing
  543 |       NCCLCHECK(net->connect(listen->net.dev, NULL, peerHandle, sendComm, sendDevHandle));
```
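
Both errors have the shape of compiling a source file written against a newer net-plugin vtable while the headers in the tree still declare the older one, consistent with the aarch64 docker pinning an older NCCL (see #149540 below). A hedged sketch of the mechanism; the struct names and signatures here are illustrative, not NCCL's actual definitions:

```
// Illustrative only -- not NCCL's real definitions. Shows why a call site
// written for a newer function-pointer signature stops compiling against
// an older struct declaration ("too many arguments to function").
#include <cstddef>

// Older-style net vtable: isend takes an int size and six arguments.
struct NetOld {
  int (*isend)(void* comm, void* data, int size, int tag,
               void* mhandle, void** request);
};

// Newer-style net vtable: size widened to size_t, plus an extra handle.
struct NetNew {
  int (*isend)(void* comm, void* data, std::size_t size, int tag,
               void* mhandle, void* phandle, void** request);
};

// bootstrap.cc above calls the seven-argument shape...
int netIsendSketch(NetNew* net, void* comm, void* data, std::size_t size,
                   int tag, void* mhandle, void** request) {
  return net->isend(comm, data, size, tag, mhandle, nullptr, request);
}
// ...so if the headers in scope still declare the NetOld shape, the same
// call no longer type-checks, which is exactly what the log reports.
```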

pytorchmergebot pushed a commit that referenced this pull request Mar 19, 2025
Observing aarch64 failure in nightly:
https://github.com/pytorch/pytorch/actions/runs/13917778961/job/38943911228

Similar to: pytorch/vision#8982

```
2025-03-18T08:44:58.4128744Z Repairing Wheel with AuditWheel
2025-03-18T08:44:58.5440988Z INFO:auditwheel.main_repair:Repairing torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl
2025-03-18T08:45:20.3393288Z Traceback (most recent call last):
2025-03-18T08:45:20.3393732Z   File "/opt/python/cp39-cp39/bin/auditwheel", line 8, in <module>
2025-03-18T08:45:20.3394115Z     sys.exit(main())
2025-03-18T08:45:20.3394559Z   File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main.py", line 53, in main
2025-03-18T08:45:20.3395064Z     result: int | None = args.func(args, p)
2025-03-18T08:45:20.3395626Z   File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main_repair.py", line 203, in execute
2025-03-18T08:45:20.3396163Z     out_wheel = repair_wheel(
2025-03-18T08:45:20.3396657Z   File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/repair.py", line 84, in repair_wheel
2025-03-18T08:45:20.3397184Z     raise ValueError(msg)
2025-03-18T08:45:20.3397620Z ValueError: Cannot repair wheel, because required library "libarm_compute.so" could not be located
2025-03-18T08:45:20.3678843Z Traceback (most recent call last):
2025-03-18T08:45:20.3679267Z   File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 236, in <module>
2025-03-18T08:45:20.3680988Z     pytorch_wheel_name = complete_wheel("/pytorch/")
2025-03-18T08:45:20.3681449Z   File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 141, in complete_wheel
2025-03-18T08:45:20.3681976Z     check_call(["auditwheel", "repair", f"dist/{wheel_name}"], cwd=folder)
2025-03-18T08:45:20.3682860Z   File "/opt/python/cp39-cp39/lib/python3.9/subprocess.py", line 373, in check_call
2025-03-18T08:45:20.3683308Z     raise CalledProcessError(retcode, cmd)
2025-03-18T08:45:20.3684034Z subprocess.CalledProcessError: Command '['auditwheel', 'repair', 'dist/torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl']' returned non-zero exit status 1.
2025-03-18T08:45:20.3790063Z ##[error]Process completed with exit code 1.
2025-03-18T08:45:20.3862012Z ##[group]Run pytorch/test-infra/.github/actions/teardown-linux@main
2025-03-18T08:45:20.3862448Z with:
```

Please note aarch64 CUDA failures are related to: #149351
Pull Request resolved: #149471
Approved by: https://github.com/malfet
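
For context on the quoted log: `auditwheel repair` fails this way whenever a shared library the wheel links against is not discoverable at repair time, and the usual remedy is to expose it on the loader path first. A hedged sketch follows; the ACL build directory is an assumption for illustration, not the CI's actual layout.

```
# Sketch: make libarm_compute.so discoverable before repairing the wheel.
# /acl/build is an assumed example location, not the CI's actual path.
export LD_LIBRARY_PATH="/acl/build:${LD_LIBRARY_PATH}"
auditwheel repair dist/torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl
```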
pytorchbot pushed a commit that referenced this pull request Mar 19, 2025
Observing aarch64 failure in nightly (same commit message and auditwheel log as quoted above).

Please note aarch64 CUDA failures are related to: #149351
Pull Request resolved: #149471
Approved by: https://github.com/malfet

(cherry picked from commit 4df66e0)
atalman pushed a commit to atalman/pytorch that referenced this pull request Mar 19, 2025
(Same commit message as this PR's description above.)

Pull Request resolved: pytorch#149351
Approved by: https://github.com/kwen2501, https://github.com/atalman, https://github.com/malfet
pytorchmergebot pushed a commit that referenced this pull request Mar 19, 2025
…12.6 docker (#149540)

1. Use NCCL_VERSION=v2.26.2-1. Fixes the NCCL CUDA aarch64 failure, seen here after landing #149351: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443
TODO: Follow-up required to unify NCCL definitions across the x86 and aarch64 builds.

3. Cleanup: remove older CUDA versions for aarch64 builds. CUDA 12.6 was removed by: #148895
Pull Request resolved: #149540
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
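
As a sketch, the kind of pin that commit describes in a docker install script; NCCL_VERSION matches the commit message, while the clone-and-build flow below just follows NCCL's documented `make src.build` and is not the actual script.

```
# Hedged sketch of an NCCL source pin in a CI docker install script.
# NCCL_VERSION matches the commit message; the rest is illustrative.
NCCL_VERSION=v2.26.2-1
git clone --depth 1 --branch "${NCCL_VERSION}" https://github.com/NVIDIA/nccl.git
make -C nccl -j src.build
```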
atalman added a commit to atalman/pytorch that referenced this pull request Mar 20, 2025
(Same commit message as #149540 above.)
```
@@ -1 +1 @@
-v2.25.1-1
+v2.26.2-1
```
@Skylion007 (Collaborator) left a comment

Ugh, we forgot to update some of the common docker install sh scripts to the latest NCCL version too?

@d4l3k (Member, Author) commented

@Skylion007 sorry about that! Do you have code pointers? I can send another PR

Also, do you have any repro instructions for triggering those, so I don't miss anything again?
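
One low-tech sweep that might catch stale pins after a bump (the searched paths and pattern are guesses, not an official check):

```
# Sketch: look for leftover references to the old NCCL version.
# Paths are guesses, not an exhaustive or official list.
grep -rn '2\.25\.1' .ci .github tools 2>/dev/null
```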

@d4l3k (Member, Author) commented

@Skylion007 will this fix it? #149778

atalman added a commit to atalman/pytorch that referenced this pull request Mar 26, 2025
(Same commit message as #149540 above.)
malfet pushed a commit that referenced this pull request Mar 26, 2025
…149351) (#149546)

* nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort (#149351)

(Same commit message as this PR's description above.)

Pull Request resolved: #149351
Approved by: https://github.com/kwen2501, https://github.com/atalman, https://github.com/malfet

* fixed_regenerations

---------

Co-authored-by: Tristan Rice <rice@fn.lc>
malfet pushed a commit that referenced this pull request Mar 26, 2025
… aarch64 cuda 12.6 docker #149540 (#149624)

Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540)

(Same commit message as #149540 above.)
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
(Same commit message as #149540 above.)

Labels

ciflow/inductor, ciflow/trunk, Merged, topic: not user facing


Development

Successfully merging this pull request may close these issues.

ProcessGroupNCCL: ncclCommAbort hangs with NCCL 2.25.1-1

6 participants