Labels
module: nccl (Problems related to nccl support), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
🐛 Describe the bug
Please refer to this PR: NVIDIA/nccl#1112
We have noticed a significant performance reduction with the NVLSTree algorithm when running the nccl-tests all-reduce benchmark on a cluster of 8 AWS P5 nodes.
nccl-tests 4GB all-reduce output:
                                                    out-of-place                          in-place
       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 4294967296    1073741824     float     sum      -1    81146   52.93  104.20    N/A     80945   53.06  104.46    N/A
After the fix is applied:
                                                    out-of-place                          in-place
       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 4294967296    1073741824     float     sum      -1    30222  142.11  279.78    N/A     30309  141.70  278.98    N/A
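As a sanity check on these numbers: nccl-tests computes algorithm bandwidth as message size divided by elapsed time, and for all-reduce scales it by 2*(n-1)/n to get bus bandwidth. A small sketch, assuming 64 total ranks (8 P5 nodes x 8 GPUs each, which is an assumption, not stated in the logs):

```python
# Sanity-check the nccl-tests numbers above.
# Assumption: 64 ranks total (8 AWS P5 nodes x 8 GPUs each).

def algbw_gbs(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9

def busbw_gbs(algbw: float, n_ranks: int) -> float:
    """nccl-tests bus bandwidth for all-reduce: algbw * 2*(n-1)/n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks

size = 4294967296  # 4 GiB message

print(round(algbw_gbs(size, 81146), 2))  # before the fix -> 52.93
print(round(busbw_gbs(52.93, 64), 2))    # -> ~104.21 (reported as 104.20)
print(round(busbw_gbs(142.11, 64), 2))   # after the fix -> 279.78
```

The before/after bus bandwidths are internally consistent with the reported times, so the regression is in the collective itself, not a measurement artifact.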
Possible solutions:
We build NCCL from source since this PR: pytorch/builder#1670. Hence a cherry-pick of the NVIDIA/nccl#1112 fix would be needed.
We would need to use an NCCL branch with the cherry-pick included here: https://github.com/pytorch/builder/blob/main/common/install_cuda.sh#L48
@ptrblck Can you recommend minimal repro steps for this issue so we can confirm the issue and the resolution?
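Not a confirmed repro, but a plausible minimal sketch using nccl-tests directly, matching the 4 GB run above. The hostfile, process layout, and nccl-tests path are placeholders; `NCCL_ALGO` is pinned so the NVLSTree path is actually exercised rather than left to the tuner:

```shell
# Hypothetical repro sketch (hostfile and binary path are placeholders).
# Build nccl-tests against the same NCCL that the PyTorch build under test uses,
# then run a 4 GB all-reduce across the 8 P5 nodes with the algorithm pinned.
mpirun -np 64 -N 8 --hostfile hosts.txt \
    -x NCCL_ALGO=NVLSTree \
    -x NCCL_DEBUG=INFO \
    ./build/all_reduce_perf -b 4G -e 4G -f 2 -g 1
# Compare busbw against the same command run with the cherry-picked NCCL;
# the tables above suggest ~104 GB/s before the fix vs ~280 GB/s after.
```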
Versions
2.2.0
nightly