[Forward Fix][PGNCCL] Add define guard for NCCL_SPLIT_NOCOLOR #138488

kwen2501 · 2024-10-21T18:15:31Z

Stack from ghstack (oldest at bottom):

Forward fix for build issue introduced by #137855:

In file included from fbcode/caffe2/torch/csrc/distributed/c10d/NCCLUtils.cpp:2:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp:508:21: error: use of undeclared identifier 'NCCL_SPLIT_NOCOLOR'
  508 |     int split_color{NCCL_SPLIT_NOCOLOR - 1};
      |                     ^
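
The kind of guard named in the title can be sketched as follows. This is a minimal illustration, not the actual patch: it assumes that NCCL headers which ship `ncclCommSplit` define `NCCL_SPLIT_NOCOLOR` as `-1`, and `OptionsSketch` and `isReservedNoColor` are hypothetical names standing in for the real code in `ProcessGroupNCCL.hpp`.

```cpp
#include <cassert>

// Assumption: newer nccl.h versions define NCCL_SPLIT_NOCOLOR as -1.
// This sketch does not include nccl.h, so the fallback below takes effect.
#ifndef NCCL_SPLIT_NOCOLOR
#define NCCL_SPLIT_NOCOLOR -1
#endif

// Hypothetical stand-in for the options struct in ProcessGroupNCCL.hpp.
struct OptionsSketch {
  // Same initializer as the line that failed to compile above.
  int split_color{NCCL_SPLIT_NOCOLOR - 1};
};

// Hypothetical helper: true if `color` is the reserved "no color" value.
inline bool isReservedNoColor(int color) {
  return color == NCCL_SPLIT_NOCOLOR;
}
```

With the fallback in place, the member initializer compiles whether or not the included NCCL header provides the macro.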

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]

pytorch-bot · 2024-10-21T18:15:34Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138488


✅ No Failures

As of commit 7ed82a8 with merge base 195d0a6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fduwjj · 2024-10-21T18:18:32Z

maybe next time, you want to import into fbcode and wait for internal CI as well?

fduwjj · 2024-10-21T18:20:37Z

We do need this release note tag. Since it is not user facing, it won't get mentioned in the release note.

kwen2501 · 2024-10-21T18:23:32Z

@pytorchbot merge

pytorchmergebot · 2024-10-21T18:25:03Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


- Added default value for `nccl_nonblocking_timeout` (30 mins, previous: -1).
- Reuse C10D_CHECK_TIMEOUT in other CHECK macros

Pull Request resolved: #138374
Approved by: https://github.com/eqy
ghstack dependencies: #137855, #138488
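
A timeout-bounded wait of the kind this commit message refers to might look like the sketch below. It is a plain function, not the `C10D_CHECK_TIMEOUT` macro itself, and `kDefaultNonblockingTimeout` and `waitUntilReady` are hypothetical names; only the 30-minute default comes from the text above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical constant mirroring the 30-minute default named above.
constexpr std::chrono::seconds kDefaultNonblockingTimeout{30 * 60};

// Polls `isReady` until it returns true or `timeout` elapses.
// Throws std::runtime_error if the deadline passes first.
inline void waitUntilReady(
    const std::function<bool()>& isReady,
    std::chrono::milliseconds timeout = kDefaultNonblockingTimeout,
    std::chrono::milliseconds pollInterval = std::chrono::milliseconds(1)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!isReady()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      throw std::runtime_error("timed out waiting for non-blocking call");
    }
    std::this_thread::sleep_for(pollInterval);
  }
}
```

A finite default means a call that never completes eventually surfaces as an error instead of hanging, which an unbounded default would allow.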

Previously we only waited for the comm to become ready after its initialization. That's not enough: other NCCL APIs can also leave the comm InProgress, e.g. P2P calls, commSplit, commFinalize, etc. Therefore, we ensure the comm is ready every "next time" we need to access ncclComm. The place to add such a gatekeeper is `getNcclComm`.

Pull Request resolved: #138384
Approved by: https://github.com/shuqiangzhang, https://github.com/fduwjj
ghstack dependencies: #137855, #138488, #138374
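
The gatekeeper idea can be illustrated with a small stand-alone sketch. Everything here is hypothetical (`FakeComm`, `CommWrapper`, `getComm`, and the poll-count limit); it only shows the pattern of checking readiness on every access, not the real `getNcclComm`.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical states, loosely modeled on an "in progress" vs "ready" comm.
enum class CommState { InProgress, Ready };

// Hypothetical stand-in for a raw communicator handle.
struct FakeComm {
  int pollsUntilReady = 0;

  // Each poll moves the comm one step closer to Ready.
  CommState poll() {
    if (pollsUntilReady > 0) {
      --pollsUntilReady;
      return CommState::InProgress;
    }
    return CommState::Ready;
  }

  // Simulates an operation (e.g. a split) that leaves the comm InProgress.
  void markInProgress(int polls) { pollsUntilReady = polls; }
};

class CommWrapper {
 public:
  explicit CommWrapper(FakeComm comm) : comm_(comm) {}

  // Gatekeeper: every access waits for readiness, not only the first one.
  FakeComm& getComm(int maxPolls = 1000) {
    for (int i = 0; i < maxPolls; ++i) {
      if (comm_.poll() == CommState::Ready) {
        return comm_;
      }
    }
    throw std::runtime_error("comm not ready");
  }

 private:
  FakeComm comm_;
};
```

Putting the check in the single accessor means callers cannot obtain the handle while an earlier operation is still pending.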

### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.

![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. neither passed in by the user nor set via env -- `ProcessGroupNCCL` can apply its own preferred logic. And torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking (this is handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?
PR #137544 tried it. Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode, because the very next CPU call is a collective, and we will block there waiting for the comm to be ready. The effect is the same as blocking init, with no "opening" for overlap, unlike eager mode.

Pull Request resolved: #138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384
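
The "dangling setting" logic described above could be sketched as a small resolution function. The function name and the precedence of the user option over the environment setting are assumptions made for illustration; the text only says the default applies when neither is set.

```cpp
#include <cassert>
#include <optional>

// Hypothetical resolution of the non-blocking setting.
// Assumed precedence: explicit user option, then environment, then a
// default that depends on the init mode (non-blocking only for eager init).
inline bool resolveUseNonblocking(
    std::optional<bool> userOption,
    std::optional<bool> envSetting,
    bool eagerInit) {
  if (userOption.has_value()) {
    return *userOption;
  }
  if (envSetting.has_value()) {
    return *envSetting;
  }
  // Dangling setting: pick the preferred default for this init mode.
  return eagerInit;
}
```

For example, `resolveUseNonblocking(std::nullopt, std::nullopt, true)` selects non-blocking mode for eager init, while the same call with `eagerInit` set to `false` keeps lazy init blocking.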

Forward fix for build issue introduced by #137855:
```
In file included from fbcode/caffe2/torch/csrc/distributed/c10d/NCCLUtils.cpp:2:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp:508:21: error: use of undeclared identifier 'NCCL_SPLIT_NOCOLOR'
  508 |     int split_color{NCCL_SPLIT_NOCOLOR - 1};
      |                     ^
```

Pull Request resolved: #138488
Approved by: https://github.com/fduwjj
ghstack dependencies: #137855


[Forward Fix][PGNCCL] Add define guard for NCCL_SPLIT_NOCOLOR

7ed82a8

[ghstack-poisoned]

kwen2501 mentioned this pull request Oct 19, 2024

[PGNCCL] Add default value for nccl_nonblocking_timeout #138374

Closed

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Oct 21, 2024

This was referenced Oct 21, 2024

[PGNCCL] Ensure comm is ready before all accesses #138384

Closed

[PGNCCL] Enable non-blocking API mode by default #137544

Closed

kwen2501 requested review from fduwjj and shuqiangzhang October 21, 2024 18:16

fduwjj approved these changes Oct 21, 2024


kwen2501 added the topic: not user facing label and removed the release notes: distributed (c10d) label Oct 21, 2024

fduwjj added the release notes: distributed (c10d) label Oct 21, 2024

kwen2501 removed the release notes: distributed (c10d) label Oct 21, 2024

fduwjj added the release notes: distributed (c10d) label Oct 21, 2024

pytorch-bot bot added the ciflow/trunk label Oct 21, 2024

pytorchmergebot added the merging label Oct 21, 2024

pytorchmergebot added the Merged label Oct 21, 2024

pytorchmergebot closed this in 364340c Oct 21, 2024

pytorchmergebot removed the merging label Oct 21, 2024

kwen2501 mentioned this pull request Oct 21, 2024

[PGNCCL] Use non-blocking mode by default in eager init #138527

Closed

wdvr mentioned this pull request Oct 22, 2024

[c10d] Fix color value for comm split being negative #137855

Closed

kwen2501 mentioned this pull request Oct 22, 2024

[Distributed] Add more tests to use eager init #138644

Closed

kwen2501 mentioned this pull request Oct 24, 2024

[PGNCCL] Fix P2P data corruption in non-blocking mode #138860

Closed

github-actions bot deleted the gh/kwen2501/80/head branch November 21, 2024 02:08
