nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort by d4l3k · Pull Request #149351 · pytorch/pytorch · GitHub

Conversation

@d4l3k (Member) commented Mar 17, 2025

Fixes #149153

Yaml generated from:

```
python .github/scripts/generate_ci_workflows.py
```

Test plan:

Repro in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7

```
rm -rf third_party/nccl
python setup.py develop
```
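
For reviewers who don't open the gist: a minimal sketch of the teardown pattern from #149153, assuming two GPUs and a standard torchrun launch. This is illustrative only; the gist above is the authoritative repro.

```
# Sketch only -- the linked gist is the real repro. Launch with:
#   torchrun --nproc_per_node=2 repro_sketch.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl")

    # First collective forces lazy creation of the NCCL communicator.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)

    if rank != 0:
        # Rank 0 skips this collective, so the kernel enqueued here can
        # never complete and teardown has to abort the communicator.
        dist.all_reduce(t)

    # Teardown can reach ncclCommAbort; with NCCL 2.25.1-1 this could
    # hang (see #149153), while 2.26.2 should return promptly.
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```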

@d4l3k d4l3k requested review from atalman and kwen2501 March 17, 2025 21:53
@d4l3k d4l3k requested review from a team and jeffdaily as code owners March 17, 2025 21:53
@pytorch-bot bot commented Mar 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149351

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 6b79c80 with merge base 8bc7bd9:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kwen2501 (Contributor) left a comment

LGTM! Thanks!
I will let @atalman and @malfet confirm!

@atalman (Contributor) left a comment

lgtm

@atalman atalman added this to the 2.7.0 milestone Mar 17, 2025
@malfet (Contributor) commented Mar 17, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 17, 2025
@pytorchmergebot (Collaborator)
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot (Collaborator)
Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@malfet (Contributor) commented Mar 18, 2025

@pytorchbot merge -i

@pytorchmergebot (Collaborator)
Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, ephemeral.linux.2xlarge), pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@atalman (Contributor) commented Mar 19, 2025

Hi @d4l3k, looks like this causes a failure with aarch64 CUDA builds: https://github.com/pytorch/pytorch/actions/runs/13937741027/job/39008788394#step:14:4197

Errors:

```
In file included from include/core.h:34,
                 from bootstrap.cc:8:
bootstrap.cc: In function ‘ncclResult_t netIsend(ncclNet_t*, void*, void*, int, void*, int, void**, int*)’:
bootstrap.cc:156:25: error: too many arguments to function
  156 |     NCCLCHECK(net->isend(sendComm, data, (size_t)size, tag, dataHandle, NULL, sendReq));
      |               ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/checks.h:123:22: note: in definition of macro ‘NCCLCHECK’
  123 |   ncclResult_t RES = call; \
      |                      ^~~~
bootstrap.cc: In function ‘ncclResult_t netIrecv(ncclNet_t*, void*, void*, int, void*, int, void**, int*)’:
bootstrap.cc:171:46: error: cannot convert ‘size_t*’ {aka ‘long unsigned int*’} to ‘int*’ in argument passing
  171 |     NCCLCHECK(net->irecv(recvComm, 1, &data, &size64, &tag, &dataHandle, NULL, recvReq));
      |                                              ^~~~~~~
      |                                              |
      |                                              size_t* {aka long unsigned int*}
include/checks.h:123:22: note: in definition of macro ‘NCCLCHECK’
  123 |   ncclResult_t RES = call; \
      |                      ^~~~
bootstrap.cc: In function ‘ncclResult_t netRingConnect(ncclNet_t*, bootstrapListen_t*, char*, void**, ncclNetDeviceHandle_t**, void**, ncclNetDeviceHandle_t**, volatile uint32_t*)’:
bootstrap.cc:543:53: error: cannot convert ‘char*’ to ‘void**’ in argument passing
  543 |       NCCLCHECK(net->connect(listen->net.dev, NULL, peerHandle, sendComm, sendDevHandle));
```
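
Both errors have the shape of compiling a source file written against a newer net-plugin vtable while the headers in the tree still declare the older one, consistent with the aarch64 docker pinning an older NCCL (see #149540 below). A hedged sketch of the mechanism; the struct names and signatures here are illustrative, not NCCL's actual definitions:

```
// Illustrative only -- not NCCL's real definitions. Shows why a call site
// written for a newer function-pointer signature stops compiling against
// an older struct declaration ("too many arguments to function").
#include <cstddef>

// Older-style net vtable: isend takes an int size and six arguments.
struct NetOld {
  int (*isend)(void* comm, void* data, int size, int tag,
               void* mhandle, void** request);
};

// Newer-style net vtable: size widened to size_t, plus an extra handle.
struct NetNew {
  int (*isend)(void* comm, void* data, std::size_t size, int tag,
               void* mhandle, void* phandle, void** request);
};

// bootstrap.cc above calls the seven-argument shape...
int netIsendSketch(NetNew* net, void* comm, void* data, std::size_t size,
                   int tag, void* mhandle, void** request) {
  return net->isend(comm, data, size, tag, mhandle, nullptr, request);
}
// ...so if the headers in scope still declare the NetOld shape, the same
// call no longer type-checks, which is exactly what the log reports.
```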

pytorchmergebot pushed a commit that referenced this pull request Mar 19, 2025
Observing aarch64 failure in nightly:
https://github.com/pytorch/pytorch/actions/runs/13917778961/job/38943911228

Similar to: pytorch/vision#8982

```
2025-03-18T08:44:58.4128744Z Repairing Wheel with AuditWheel
2025-03-18T08:44:58.5440988Z INFO:auditwheel.main_repair:Repairing torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl
2025-03-18T08:45:20.3393288Z Traceback (most recent call last):
2025-03-18T08:45:20.3393732Z   File "/opt/python/cp39-cp39/bin/auditwheel", line 8, in <module>
2025-03-18T08:45:20.3394115Z     sys.exit(main())
2025-03-18T08:45:20.3394559Z   File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main.py", line 53, in main
2025-03-18T08:45:20.3395064Z     result: int | None = args.func(args, p)
2025-03-18T08:45:20.3395626Z   File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main_repair.py", line 203, in execute
2025-03-18T08:45:20.3396163Z     out_wheel = repair_wheel(
2025-03-18T08:45:20.3396657Z   File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/repair.py", line 84, in repair_wheel
2025-03-18T08:45:20.3397184Z     raise ValueError(msg)
2025-03-18T08:45:20.3397620Z ValueError: Cannot repair wheel, because required library "libarm_compute.so" could not be located
2025-03-18T08:45:20.3678843Z Traceback (most recent call last):
2025-03-18T08:45:20.3679267Z   File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 236, in <module>
2025-03-18T08:45:20.3680988Z     pytorch_wheel_name = complete_wheel("/pytorch/")
2025-03-18T08:45:20.3681449Z   File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 141, in complete_wheel
2025-03-18T08:45:20.3681976Z     check_call(["auditwheel", "repair", f"dist/{wheel_name}"], cwd=folder)
2025-03-18T08:45:20.3682860Z   File "/opt/python/cp39-cp39/lib/python3.9/subprocess.py", line 373, in check_call
2025-03-18T08:45:20.3683308Z     raise CalledProcessError(retcode, cmd)
2025-03-18T08:45:20.3684034Z subprocess.CalledProcessError: Command '['auditwheel', 'repair', 'dist/torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl']' returned non-zero exit status 1.
2025-03-18T08:45:20.3790063Z ##[error]Process completed with exit code 1.
2025-03-18T08:45:20.3862012Z ##[group]Run pytorch/test-infra/.github/actions/teardown-linux@main
2025-03-18T08:45:20.3862448Z with:
```

Please note aarch64 CUDA failures are related to: #149351
Pull Request resolved: #149471
Approved by: https://github.com/malfet
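
For context on the quoted log: `auditwheel repair` fails this way whenever a shared library the wheel links against is not discoverable at repair time, and the usual remedy is to expose it on the loader path first. A hedged sketch follows; the ACL build directory is an assumption for illustration, not the CI's actual layout.

```
# Sketch: make libarm_compute.so discoverable before repairing the wheel.
# /acl/build is an assumed example location, not the CI's actual path.
export LD_LIBRARY_PATH="/acl/build:${LD_LIBRARY_PATH}"
auditwheel repair dist/torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl
```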
pytorchbot pushed a commit that referenced this pull request Mar 19, 2025
Observing aarch64 failure in nightly (same commit message and auditwheel log as quoted above).

Please note aarch64 CUDA failures are related to: #149351
Pull Request resolved: #149471
Approved by: https://github.com/malfet

(cherry picked from commit 4df66e0)
atalman pushed a commit to atalman/pytorch that referenced this pull request Mar 19, 2025
(Same commit message as this PR's description above.)

Pull Request resolved: pytorch#149351
Approved by: https://github.com/kwen2501, https://github.com/atalman, https://github.com/malfet
pytorchmergebot pushed a commit that referenced this pull request Mar 19, 2025
…12.6 docker (#149540)

1. Use NCCL_VERSION=v2.26.2-1. Fixes the NCCL CUDA aarch64 failure, seen here after landing #149351: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443
TODO: Follow-up required to unify NCCL definitions across the x86 and aarch64 builds.

3. Cleanup: remove older CUDA versions for aarch64 builds. CUDA 12.6 was removed by: #148895
Pull Request resolved: #149540
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
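
As a sketch, the kind of pin that commit describes in a docker install script; NCCL_VERSION matches the commit message, while the clone-and-build flow below just follows NCCL's documented `make src.build` and is not the actual script.

```
# Hedged sketch of an NCCL source pin in a CI docker install script.
# NCCL_VERSION matches the commit message; the rest is illustrative.
NCCL_VERSION=v2.26.2-1
git clone --depth 1 --branch "${NCCL_VERSION}" https://github.com/NVIDIA/nccl.git
make -C nccl -j src.build
```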
atalman added a commit to atalman/pytorch that referenced this pull request Mar 20, 2025
(Same commit message as #149540 above.)
```
@@ -1 +1 @@
-v2.25.1-1
+v2.26.2-1
```
@Skylion007 (Collaborator) left a comment

Ugh, we forgot to update some of the common docker install sh scripts to the latest NCCL version too?

@d4l3k (Member, Author) commented

@Skylion007 sorry about that! Do you have code pointers? I can send another PR

Also, do you have any repro instructions for triggering those, so I don't miss anything again?
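
One low-tech sweep that might catch stale pins after a bump (the searched paths and pattern are guesses, not an official check):

```
# Sketch: look for leftover references to the old NCCL version.
# Paths are guesses, not an exhaustive or official list.
grep -rn '2\.25\.1' .ci .github tools 2>/dev/null
```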

@d4l3k (Member, Author) commented

@Skylion007 will this fix it? #149778

atalman added a commit to atalman/pytorch that referenced this pull request Mar 26, 2025
(Same commit message as #149540 above.)
malfet pushed a commit that referenced this pull request Mar 26, 2025
…149351) (#149546)

* nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort (#149351)

(Same commit message as this PR's description above.)

Pull Request resolved: #149351
Approved by: https://github.com/kwen2501, https://github.com/atalman, https://github.com/malfet

* fixed_regenerations

---------

Co-authored-by: Tristan Rice <rice@fn.lc>
malfet pushed a commit that referenced this pull request Mar 26, 2025
… aarch64 cuda 12.6 docker #149540 (#149624)

Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540)

(Same commit message as #149540 above.)
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
(Same commit message as #149540 above.)

Labels

ciflow/inductor, ciflow/trunk, Merged, topic: not user facing


Development

Successfully merging this pull request may close these issues.

ProcessGroupNCCL: ncclCommAbort hangs with NCCL 2.25.1-1

6 participants