[c10d] Fix color value for comm split being negative by kwen2501 · Pull Request #137855 · pytorch/pytorch · GitHub

Conversation


@kwen2501 kwen2501 commented Oct 13, 2024

Stack from ghstack (oldest at bottom):

Fixes #137856.

### Issue 1

Today under `ProcessGroupNCCL::Options`, color is declared as:
```
    int64_t split_color{0};
```
When this variable is passed to `ncclCommSplit`, which accepts `int`, the value may overflow and become negative, as in #137856. But the NCCL API only accepts non-negative colors (or `NCCL_SPLIT_NOCOLOR`).
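To make the failure mode concrete, here is a minimal standalone sketch (not the actual call site) of the narrowing that happens at the `ncclCommSplit` boundary:
```
#include <cstdint>
#include <iostream>

int main() {
  // A hypothetical hash-derived color that exceeds INT_MAX.
  int64_t split_color = (1LL << 31) + 5; // 2147483653
  // ncclCommSplit takes an int, so the value is narrowed at the call site:
  int color = static_cast<int>(split_color);
  std::cout << color << std::endl; // -2147483643 on typical two's-complement platforms
  return 0;
}
```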

But that's not all.

### Issue 2

`split_color` is pybind'ed to the Python frontend. If we just change it from `int64_t` to `int` in C++, pybind will complain:
```
[rank0]: TypeError: (): incompatible function arguments. The following argument types are supported:
[rank0]:     1. (self: torch._C._distributed_c10d.ProcessGroupNCCL.Options, arg0: int) -> None
```
This is because Python `int` represents a wider range than C++ `int`, so we cannot pass hash values -- which are potentially big ints -- from Python to C++. This PR instead takes the hash value modulo `c_int`'s max value.
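In sketch form, the reduction looks like this (the actual fix computes the color on the Python side before it crosses pybind; this C++ snippet only illustrates the arithmetic):
```
#include <climits>
#include <cstdint>

// Reduce an arbitrary 64-bit hash to a non-negative value that fits in a C int.
int hashToColor(int64_t h) {
  int64_t m = h % INT_MAX; // in (-INT_MAX, INT_MAX); C++ % keeps the dividend's sign
  if (m < 0) {
    m += INT_MAX; // make it non-negative, as NCCL requires
  }
  return static_cast<int>(m);
}
```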

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o


pytorch-bot bot commented Oct 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137855

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 1af064d with merge base 195d0a6:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added labels oncall: distributed, release notes: distributed (c10d) on Oct 13, 2024
@github-actions github-actions bot left a comment

Please commit the suggested changes from pytorch's linter.

pytorchmergebot pushed a commit that referenced this pull request Oct 15, 2024
Some UCC tests became unstable recently, with or without the M60 to T4 upgrade.
See for example: #137855 (without upgrade), #137161 (with upgrade).
So I am extracting the disablement from #137161 here.

Failure signature:
```
RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:496] [Rank 0][ProcessGroupUCC-0][READY]failed to post triggered collective, error code -6: Unhandled error, system error code 0
```

Earlier discussed here:
https://github.com/pytorch/pytorch/pull/137161/files#r1797353294

Cc: @Aidyn-A @eqy
Pull Request resolved: #137932
Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/eqy
// comm ptr is valid. Therefore we add a manual wait here for safety.
// TODO: remove this wait after NCCL fix the semantics.
while (!comm->ncclComm_) {
  C10D_SCHED_SLEEP();
Contributor

hm, possible infinite sleep?

@kwen2501 kwen2501 (Contributor Author) Oct 18, 2024

Good call. Can I defer a jailbreak to the PR on top, where we have a definite timeout value?
Today the default value is -1, which means no timeout.
I agree the deferral is not nice, but it would take quite some work to pull part of the top PR down here.

Contributor Author

Addressed in #138374
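For reference, a bounded version of the wait could look like the sketch below; `commReady` and `timeout` are placeholders, not the real c10d API, and the actual timeout handling landed in #138374:
```
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Sketch: wait for the comm to become ready, with a deadline instead of
// spinning forever.
void waitForComm(const std::function<bool()>& commReady,
                 std::chrono::milliseconds timeout) {
  auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!commReady()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      throw std::runtime_error("timed out waiting for ncclComm to become ready");
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}
```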

// * NCCL_SPLIT_NOCOLOR (-1): not in group;
// * NCCL_SPLIT_NOCOLOR - 1: uninitialized.
// [Note 1]: the type must be `int` instead of `int64_t` because NCCL API
// accepts int. Otherwise, an imlicit conversion may happen at the API call
Contributor

typo implicit

Contributor Author

Will fix in later PR.

Contributor Author

Addressed in #138374

else:
    pg_name = str(_world.group_count)
    _world.group_count += 1
    # TODO: why is group count incremented only in the else path?
Contributor

also where is _world.pg_names updated? It doesn't look like it gets updated when we create a new name here, but you seem to rely on its length as a proxy for 'group_count'

Contributor Author

It is updated here:

_world.pg_names[pg] = group_name

@kwen2501 kwen2501 changed the title from "[c10d] Fix color value for comm split" to "[c10d] Fix color value for comm split being negative" on Oct 16, 2024
@wconstab wconstab left a comment

Thanks for adding the timeout. I think the timeout is too long, but I am not sure what value is OK.

@kwen2501 (Contributor Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label on Oct 18, 2024
@kwen2501 (Contributor Author)

@pytorchbot merge -i "The failures does not seem related"


pytorch-bot bot commented Oct 19, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: The failures does not seem related

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@kwen2501 (Contributor Author)

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-docs / build-docs-python-false, trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 5, 5, lf.linux.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


pytorchmergebot pushed a commit that referenced this pull request Oct 21, 2024
Forward fix for build issue introduced by #137855:
```
In file included from fbcode/caffe2/torch/csrc/distributed/c10d/NCCLUtils.cpp:2:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp:508:21: error: use of undeclared identifier 'NCCL_SPLIT_NOCOLOR'
  508 |     int split_color{NCCL_SPLIT_NOCOLOR - 1};
      |                     ^
```

Pull Request resolved: #138488
Approved by: https://github.com/fduwjj
ghstack dependencies: #137855
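The usual shape of such a forward fix is a fallback definition for builds against an NCCL version that predates the symbol (a sketch of the idea, not necessarily the exact change in #138488):
```
// In a header that may be compiled against older NCCL:
#ifndef NCCL_SPLIT_NOCOLOR
#define NCCL_SPLIT_NOCOLOR (-1) // matches newer NCCL's "not in group" value
#endif

// Now this member initializer compiles regardless of NCCL version:
// int split_color{NCCL_SPLIT_NOCOLOR - 1}; // "uninitialized" sentinel
```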

wdvr commented Oct 22, 2024

@pytorchmergebot --help


pytorch-bot bot commented Oct 22, 2024

PyTorchBot Help

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

In order to invoke the bot on your PR, include a line that starts with
@pytorchbot anywhere in a comment. That line will form the command; no
multi-line commands are allowed. Some commands may be used on issues as specified below.

Example:
    Some extra context, blah blah, wow this PR looks awesome

    @pytorchbot merge

optional arguments:
  -h, --help            Show this help message and exit.

command:
  {merge,revert,rebase,label,drci,cherry-pick,close}
    merge               Merge a PR
    revert              Revert a PR
    rebase              Rebase a PR
    label               Add label to a PR
    drci                Update Dr. CI
    cherry-pick         Cherry pick a PR onto a release branch
    close               Close a PR

Merge

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Merge an accepted PR, subject to the rules in .github/merge_rules.json.
By default, this will wait for all required checks (lint, pull) to succeed before merging.

optional arguments:
  -f MESSAGE, --force MESSAGE
                        Merge without checking anything. This requires a reason for auditting purpose, for example:
                        @pytorchbot merge -f 'Minor update to fix lint. Expecting all PR tests to pass'
                        
                        Please use `-f` as last resort, prefer `--ignore-current` to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
  -i, --ignore-current  Merge while ignoring the currently failing jobs.  Behaves like -f if there are no pending jobs.
  -ic                   Old flag for --ignore-current. Deprecated in favor of -i.
  -r [{viable/strict,main}], --rebase [{viable/strict,main}]
                        Rebase the PR to re run checks before merging.  Accepts viable/strict or main as branch options and will default to viable/strict if not specified.

Revert

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Revert a merged PR. This requires that you are a Meta employee.

Example:
  @pytorchbot revert -m="This is breaking tests on trunk. hud.pytorch.org/" -c=nosignal

optional arguments:
  -m MESSAGE, --message MESSAGE
                        The reason you are reverting, will be put in the commit message. Must be longer than 3 words.
  -c {nosignal,ignoredsignal,landrace,weird,ghfirst}, --classification {nosignal,ignoredsignal,landrace,weird,ghfirst}
                        A machine-friendly classification of the revert reason.

Rebase

usage: @pytorchbot rebase [-s | -b BRANCH]

Rebase a PR. Rebasing defaults to the stable viable/strict branch of pytorch.
Repeat contributor may use this command to rebase their PR.

optional arguments:
  -s, --stable          [DEPRECATED] Rebase onto viable/strict
  -b BRANCH, --branch BRANCH
                        Branch you would like to rebase to

Label

usage: @pytorchbot label labels [labels ...]

Adds label to a PR or Issue [Can be used on Issues]

positional arguments:
  labels  Labels to add to given Pull Request or Issue [Can be used on Issues]

Dr CI

usage: @pytorchbot drci 

Update Dr. CI. Updates the Dr. CI comment on the PR in case it's gotten out of sync with actual CI results.

cherry-pick

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Cherry pick a pull request onto a release branch for inclusion in a release

optional arguments:
  --onto ONTO           Branch you would like to cherry pick onto (Example: release/2.1)
  --fixes FIXES         Link to the issue that your PR fixes (Example: https://github.com/pytorch/pytorch/issues/110666)
  -c {regression,critical,fixnewfeature,docs,release}, --classification {regression,critical,fixnewfeature,docs,release}
                        A machine-friendly classification of the cherry-pick reason.

Close

usage: @pytorchbot close

Close a PR [Can be used on issues]


wdvr commented Oct 22, 2024

@pytorchmergebot revert -m 'reverting since this is failing internal tests, maybe a NCCL version mismatch between oss and fbcode?' -c nosignal


@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)

Reverting PR 137855 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit fecd370ea1e9dce838bf2db3441146c543f1b2b6 returned non-zero exit code 1

Auto-merging torch/csrc/distributed/c10d/NCCLUtils.cpp
Auto-merging torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Auto-merging torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp
CONFLICT (content): Merge conflict in torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp
Auto-merging torch/distributed/distributed_c10d.py
error: could not revert fecd370ea1... [c10d] Fix color value for comm split being negative (#137855)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config advice.mergeConflict false"


wdvr commented Oct 22, 2024

Ignore above; forward fix added in #138488.

pytorchmergebot pushed a commit that referenced this pull request Oct 22, 2024
- Added default value for `nccl_nonblocking_timeout` (30 mins, previous: -1).
- Reuse C10D_CHECK_TIMEOUT in other CHECK macros

Pull Request resolved: #138374
Approved by: https://github.com/eqy
ghstack dependencies: #137855, #138488
pytorchmergebot pushed a commit that referenced this pull request Oct 23, 2024
Previously we only waited for the comm to become ready after its initialization.
That's not enough: other NCCL APIs can also put the comm in an InProgress state, e.g. P2P calls, commSplit, commFinalize, etc.
Therefore, we ensure the comm is ready every "next time" we need to access ncclComm.
The place to add such a gatekeeper is `getNcclComm`.

Pull Request resolved: #138384
Approved by: https://github.com/shuqiangzhang, https://github.com/fduwjj
ghstack dependencies: #137855, #138488, #138374
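In sketch form, the gatekeeper pattern described above looks like this (types and helper names are stand-ins for illustration, not the actual #138384 implementation):
```
#include <mutex>

using ncclComm_t = void*; // stand-in for the real NCCL handle

class NCCLCommSketch {
 public:
  // Gatekeeper: every access waits until no NCCL call is in progress on
  // this comm (init, commSplit, P2P, commFinalize, ...).
  ncclComm_t getNcclComm() {
    std::unique_lock<std::mutex> lock(mutex_);
    waitReady();
    return ncclComm_;
  }

 private:
  void waitReady() {
    // Assumed helper: poll the comm's async state (e.g. via
    // ncclCommGetAsyncError) until it leaves "in progress", bounded by a
    // timeout such as nccl_nonblocking_timeout (see #138374).
  }
  ncclComm_t ncclComm_{nullptr};
  std::mutex mutex_;
};
```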
pytorchmergebot pushed a commit that referenced this pull request Oct 23, 2024
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. neither passed in by the user nor set via env -- `ProcessGroupNCCL` can apply its preferred logic. And torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking (this is handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?
PR #137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode: the very next CPU call is a collective, and we will block there waiting for the comm to be ready, so it has the same effect as blocking init, with no "opening" compared to eager mode.

Pull Request resolved: #138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384
@github-actions github-actions bot deleted the gh/kwen2501/74/head branch November 21, 2024 02:08

Labels

- ciflow/trunk (Trigger trunk jobs on your pull request)
- Merged
- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- release notes: distributed (c10d) (release notes category)
