Labels
module: nccl (Problems related to nccl support), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
🐛 Describe the bug
Please refer to this PR: NVIDIA/nccl#1112
We have noticed a significant performance reduction with the NVLSTree algorithm when running the nccl-tests all-reduce benchmark on a cluster of 8 AWS P5 nodes.
nccl-tests 4GB all-reduce output:
                                                    out-of-place                          in-place
       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 4294967296    1073741824     float     sum      -1    81146   52.93  104.20    N/A     80945   53.06  104.46    N/A
After the fix is applied:
                                                    out-of-place                          in-place
       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 4294967296    1073741824     float     sum      -1    30222  142.11  279.78    N/A     30309  141.70  278.98    N/A
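As a sanity check on these numbers: nccl-tests computes algorithm bandwidth as message size divided by elapsed time, and for all-reduce scales it by 2*(n-1)/n to get bus bandwidth. A small sketch, assuming 64 total ranks (8 P5 nodes x 8 GPUs each, which is an assumption, not stated in the logs):

```python
# Sanity-check the nccl-tests numbers above.
# Assumption: 64 ranks total (8 AWS P5 nodes x 8 GPUs each).

def algbw_gbs(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9

def busbw_gbs(algbw: float, n_ranks: int) -> float:
    """nccl-tests bus bandwidth for all-reduce: algbw * 2*(n-1)/n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks

size = 4294967296  # 4 GiB message

print(round(algbw_gbs(size, 81146), 2))  # before the fix -> 52.93
print(round(busbw_gbs(52.93, 64), 2))    # -> ~104.21 (reported as 104.20)
print(round(busbw_gbs(142.11, 64), 2))   # after the fix -> 279.78
```

The before/after bus bandwidths are internally consistent with the reported times, so the regression is in the collective itself, not a measurement artifact.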
Possible solutions:
We build NCCL from source since this PR: pytorch/builder#1670. Hence a cherry-pick of the NVIDIA/nccl#1112 fix would be needed.
We would need to use an NCCL branch with the cherry-pick included here: https://github.com/pytorch/builder/blob/main/common/install_cuda.sh#L48
@ptrblck Can you recommend minimal repro steps for this issue so we can confirm the issue and the resolution?
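Not a confirmed repro, but a plausible minimal sketch using nccl-tests directly, matching the 4 GB run above. The hostfile, process layout, and nccl-tests path are placeholders; `NCCL_ALGO` is pinned so the NVLSTree path is actually exercised rather than left to the tuner:

```shell
# Hypothetical repro sketch (hostfile and binary path are placeholders).
# Build nccl-tests against the same NCCL that the PyTorch build under test uses,
# then run a 4 GB all-reduce across the 8 P5 nodes with the algorithm pinned.
mpirun -np 64 -N 8 --hostfile hosts.txt \
    -x NCCL_ALGO=NVLSTree \
    -x NCCL_DEBUG=INFO \
    ./build/all_reduce_perf -b 4G -e 4G -f 2 -g 1
# Compare busbw against the same command run with the cherry-picked NCCL;
# the tables above suggest ~104 GB/s before the fix vs ~280 GB/s after.
```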
Versions
2.2.0
nightly