gloo: fix building system gloo with CUDA/HIP by nlbrown2 · Pull Request #146637 · pytorch/pytorch · GitHub

@nlbrown2 nlbrown2 commented Feb 6, 2025

Fix incorrect linking of Gloo's libraries when building with system Gloo. Previously, either Gloo's native library or Gloo's CUDA library was linked. However, Gloo has since changed so that every user of Gloo must link the native library and can optionally link the CUDA or HIP library for Gloo + CUDA/HIP support.
This had already been updated for building and linking with vendored Gloo, but not for system Gloo.
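
To make the before/after behaviour concrete, here is a minimal Python model of the link-set selection described above. It is only an illustrative sketch, not the CMake logic changed in this PR: the function names are hypothetical, and the library names `gloo`, `gloo_cuda`, and `gloo_hip` are assumed for illustration.

```python
from typing import List


def gloo_link_libs_before(use_cuda: bool) -> List[str]:
    """Old system-Gloo behaviour as described above: one library or the other."""
    # Assumed library names; the native library is dropped when CUDA is on.
    return ["gloo_cuda"] if use_cuda else ["gloo"]


def gloo_link_libs_after(use_cuda: bool = False, use_rocm: bool = False) -> List[str]:
    """Fixed behaviour: the native library always, plus at most one GPU library."""
    libs = ["gloo"]  # the native library is required by every Gloo user
    if use_cuda:
        libs.append("gloo_cuda")  # optional CUDA support
    elif use_rocm:
        libs.append("gloo_hip")  # optional HIP support
    return libs


if __name__ == "__main__":
    print("before, CUDA build:", gloo_link_libs_before(use_cuda=True))
    print("after, CUDA build: ", gloo_link_libs_after(use_cuda=True))
    print("after, HIP build:  ", gloo_link_libs_after(use_rocm=True))
    print("after, CPU build:  ", gloo_link_libs_after())
```

In this model, the old CUDA case omits the native library entirely, which is the kind of omission that would leave references to native Gloo symbols unresolved at link time.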

Fixes: #146239

Reported-by: Adam J Stewart <ajstewart426@gmail.com>

cc @malfet @seemethere @ptrblck @msaroufim @eqy @jerryzh168

Signed-off-by: Nathan Brown <nathan.brown@arm.com>

pytorch-bot bot commented Feb 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146637

Note: Links to docs will display an error until the docs builds have been completed.

❌ 14 New Failures, 7 Cancelled Jobs

As of commit e9bfe6f with merge base 5d81bc3:

NEW FAILURES - The following jobs have failed:

  • pull / linux-jammy-cpu-py3.10-gcc11-bazel-test / build-and-test (default, 1, 1, lf.linux.4xlarge)
    Final attempt failed. Child_process exited with error code 1
  • pull / linux-jammy-cuda12.8-cudnn9-py3.9-clang12 / build
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/reuse-old-whl'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-cuda12.8-py3.10-gcc11 / build
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/reuse-old-whl'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3_9-clang9-xla / build
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/reuse-old-whl'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.10-clang18-asan / build
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/reuse-old-whl'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.13-clang12 / build
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/reuse-old-whl'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-clang12 / build
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/reuse-old-whl'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-clang12-onnx / build
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/reuse-old-whl'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-gcc11 / build
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/reuse-old-whl'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-gcc11-mobile-lightweight-dispatch-build / build
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/reuse-old-whl'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-gcc11-no-ops / build
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/reuse-old-whl'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-gcc11-pch / build
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/reuse-old-whl'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-rocm-py3.10 / build
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/reuse-old-whl'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-xpu-2025.1-py3.9 / build
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/reuse-old-whl'. Did you forget to run actions/checkout before running your local action?

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


nlbrown2 commented Feb 6, 2025

@pytorchbot label "topic: not user facing"

pytorch-bot bot added the "topic: not user facing" label Feb 6, 2025
cpuhrsch requested a review from malfet February 8, 2025 01:39
cpuhrsch added the "module: build", "module: cuda", and "triaged" labels Feb 8, 2025
nlbrown2 commented

Hello,
Any updates on this PR? Any desired changes?

Thanks,
Nathan

adamjstewart commented

Pinging for review

github-actions bot commented

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions bot added the "Stale" label Jun 15, 2025
adamjstewart commented

Waiting on @malfet or others for review

msaroufim requested a review from d4l3k June 15, 2025 15:54
github-actions bot closed this Jul 15, 2025
adamjstewart commented

Still waiting for review...

malfet reopened this Aug 6, 2025
malfet added the "release notes: build" and "topic: bug fixes" labels and removed the "topic: not user facing" label Aug 6, 2025

malfet commented Aug 6, 2025

@pytorchbot merge -f "CI was green before"

pytorchmergebot commented

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.


markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025

Pull Request resolved: pytorch#146637
Approved by: https://github.com/malfet

Labels

Merged, module: build, module: cuda, open source, release notes: build, Stale, topic: bug fixes, triaged

Development

Successfully merging this pull request may close these issues.

torch_shm_manager: undefined reference to gloo

6 participants