[c10d] fix sequence numbers for coalesced operations #135132
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135132
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures as of commit 3031918 with merge base 66db61f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 57086fe to d0d413c (compare).
Does the approach in this PR assume that we can put a ++ in every place that we start coalescing?
I think this fails because user code can also start a coalescing group (e.g. from the Python side), and that path will not bump the counter.
I was assuming that, from the Python side, a coalescing group would eventually call the ProcessGroupNCCL::collectiveCoalesced function.
In that function we already do seqCollective_++.
Does the Python side not eventually call the function mentioned above?
I should mention that the intent here is to fix two use cases: all_gather with different outputTensor sizes (which automatically turns on coalescing) and reduce_scatter with different inputTensor_ sizes.
Whenever Python calls a coalesced operation, the call (if I read the code correctly) makes it to the collectiveCoalesced function.
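For concreteness, a minimal sketch of the two user-level calls being discussed, assuming a NCCL process group is already initialized and each rank has its own GPU (tensor shapes are illustrative):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has run and one GPU per rank.
rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device("cuda", rank)

# all_gather where each rank contributes a tensor of a different size:
# the unequal output sizes send this down the coalesced path internally.
inp = torch.ones(rank + 1, device=device)
outs = [torch.empty(r + 1, device=device) for r in range(world_size)]
dist.all_gather(outs, inp)

# reduce_scatter where the per-rank input chunks have different sizes:
# likewise handled as a coalesced operation.
ins = [torch.ones(r + 1, device=device) for r in range(world_size)]
out = torch.empty(rank + 1, device=device)
dist.reduce_scatter(out, ins)
```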
I changed the code to increment seqCollective_ in the ProcessGroupNCCL::collective operation once per coalesced collective. This way it should cover all API calls into this area.
> Does the python side not eventually call the mentioned function above?
Unfortunately there are two paths from Python. Some of the Python coalescing starts by calling 'start', then issues normal coalesced calls, then calls 'end':
https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L2274
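A rough sketch of that start/end path, assuming the private _coalescing_manager helper at the linked line keeps roughly its current shape (it is not a public API and may change):

```python
import torch
import torch.distributed as dist
from torch.distributed.distributed_c10d import _coalescing_manager

# Assumes an initialized NCCL process group and one GPU per rank.
rank = dist.get_rank()
device = torch.device("cuda", rank)

# On entry the group calls its "start coalescing" hook; the collectives
# below are issued as ordinary calls; on exit "end coalescing" flushes
# them as one batched launch. An increment placed only inside
# ProcessGroupNCCL::collectiveCoalesced would not run on this path.
with _coalescing_manager(device=device):
    dist.all_reduce(torch.ones(4, device=device))
    dist.all_reduce(torch.ones(8, device=device))
```

The other path is the one in the earlier example, where a single collective with unequal sizes goes straight into the coalesced C++ entry point.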
> increment seqCollective_ in ProcessGroupNCCL::collective operation once per coalesced collective
IIRC this is actually a change in semantics. We need to document what the semantics for the seq numbers are. One rationale is that we want one seq number per actual 'work' object == one actual GPU kernel. Since coalescing groups logical operations together into one actual GPU kernel, we also create only one 'work' object that we enqueue into the watchdog for monitoring. It may be confusing to increment the seq num for each collective inside the coalescing group.
For P2P + flight recorder, I remember confronting this and adding a new 'op id' or something, which does increment on every op but is separate from the seq number. The FR could then use the actual seq num to match whole coalescing groups with each other.
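A toy model of that split (illustrative Python only, not the c10d implementation): one sequence number per enqueued work object / GPU kernel, plus a separate op id that advances for every logical call, including calls inside a coalescing group.

```python
class SeqModel:
    """Toy counter model of the scheme described above (not c10d code)."""

    def __init__(self):
        self.collective_seq = 0  # one per work object == one launched kernel
        self.op_id = 0           # one per logical op, even when coalesced

    def standalone_collective(self):
        self.op_id += 1
        self.collective_seq += 1          # standalone op gets its own work/kernel
        return self.collective_seq, self.op_id

    def coalesced_group(self, num_ops):
        for _ in range(num_ops):
            self.op_id += 1               # every logical op advances the op id
        self.collective_seq += 1          # the whole group shares one work/kernel
        return self.collective_seq, self.op_id


m = SeqModel()
print(m.standalone_collective())  # (1, 1)
print(m.coalesced_group(3))       # (2, 4): one seq bump covers three ops
```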
Force-pushed from 3cf699b to 8663689 (compare).
Force-pushed from 57fe706 to 18d13d6 (compare).
Nice!
input_sizes = op_sizes[seq % ops_per_repeat]
profiling_name = "nccl:recv 0<-1" if self.rank == 0 else "nccl:send 1->0"
self.assertEqual(t["entries"][seq]["profiling_name"], profiling_name)
# we don't increment collective_seq_id for p2p ops.
Hmm, one more possible wrinkle (since I am clearly trying to be annoying here): we actually do use the 'collective' communicator when doing coalesced p2ps. In other words, I think it might have been intentional to bump the collective seq for coalesced p2p. Someone check me on this?
Ref: we get a special NCCL stream and NCCL comm for p2p ops, but IIRC we use the normal stream/comm for coalesced p2p, the same as we would for collectives.
Given this, what makes the most sense from a logging point of view? Clearly it will confuse users if we increment the collective seq when we do p2p. On the other hand, if we are trying to do trace analysis to debug a hang, it might be important to note that this coalesced p2p happened sequentially after a collective on the same NCCL comm.
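For reference, the coalesced p2p path under discussion is what batch_isend_irecv exercises from Python; a minimal sketch assuming two ranks and an initialized NCCL group:

```python
import torch
import torch.distributed as dist

# Assumes world_size == 2, an initialized NCCL process group, and one
# GPU per rank.
rank = dist.get_rank()
device = torch.device("cuda", rank)
peer = 1 - rank

send_buf = torch.full((4,), float(rank), device=device)
recv_buf = torch.empty(4, device=device)

# Both ops are batched into a single coalesced p2p launch; per the
# recollection above, this batch reportedly runs on the regular
# "collective" communicator/stream rather than the dedicated p2p one.
ops = [
    dist.P2POp(dist.isend, send_buf, peer),
    dist.P2POp(dist.irecv, recv_buf, peer),
]
for req in dist.batch_isend_irecv(ops):
    req.wait()
```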
Summary: We were erroneously incrementing seqCollective_ for p2p operations. Fixes issue #134833. Test Plan: unit tests (TODO: add more unit tests).
Force-pushed from 18d13d6 to 3031918 (compare).
Hi @c-p-i-o, is this PR ready to merge? Thanks!
I was trying to address @wconstab's comment above.
If the 'collective' communicator hangs when doing a coalesced p2p, we might falsely infer that the hang was due to a collective instead of a coalesced p2p. I'm thinking of introducing (another) flag that says whether the collective is a coalesced op or not.
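A rough sketch of how trace analysis could use such a flag; everything here is hypothetical: the is_coalesced and is_p2p field names (and the entry layout) are placeholders, not existing flight recorder fields.

```python
def classify_incomplete_entry(entries):
    """Attribute a hang to the last incomplete trace entry.

    `entries` is a list of dicts shaped like a flight recorder dump;
    the "state", "is_p2p", and "is_coalesced" keys are placeholders
    for whatever the dump actually records.
    """
    for entry in reversed(entries):
        if entry.get("state") != "completed":
            if entry.get("is_coalesced") and entry.get("is_p2p"):
                return "coalesced p2p batch", entry
            return "collective", entry
    return "none incomplete", None
```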
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Merge failed. Reason: 1 job has failed; first few of them are: linux-binary-libtorch-pre-cxx11 / libtorch-cpu-shared-with-deps-pre-cxx11-test / test.
@pytorchbot merge -i "Unrelated failure" here: inux-binary-libtorch-pre-cxx11 / libtorch-cpu-shared-with-deps-pre-cxx11-test / test |
Merge started. Your change will be merged while ignoring the following 1 check: linux-binary-libtorch-pre-cxx11 / libtorch-cpu-shared-with-deps-pre-cxx11-test / test.
Merge failed. Reason: 1 job has failed; first few of them are: linux-binary-manywheel / manywheel-py3_9-cuda12_4-test / test.
@pytorchbot merge -i Failures unrelated to the change. linux-binary-manywheel / manywheel-py3_9-cuda12_4-test / test |
Merge started. Your change will be merged while ignoring the following 5 checks: linux-binary-manywheel / manywheel-py3_9-cuda12_4-test / test, linux-binary-manywheel / manywheel-py3_9-cuda12_1-test / test, linux-binary-manywheel / manywheel-py3_9-cuda11_8-test / test, linux-binary-libtorch-pre-cxx11 / libtorch-cpu-shared-with-deps-pre-cxx11-test / test, trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / build.
Summary: Fix the sequence number in the execution trace dump so that a collective/p2p op can be matched with its wait during execution trace replay.

`ProcessGroupNCCL` has two sequence number counters, `seqCollective_` and `seqP2P_`:
https://github.com/pytorch/pytorch/blob/b18ba9419e7062acbd49bef5c388e1b1d6a170dc/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp#L1188-L1191

However, `WorkNCCL` only has one sequence number member, `seq_`:
https://github.com/pytorch/pytorch/blob/b18ba9419e7062acbd49bef5c388e1b1d6a170dc/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp#L387

We need to match collective and p2p ops with their waits separately. facebookresearch/param@29b5a46

Depends on: #135132

Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Differential Revision:
Pull Request resolved: #134578
Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o
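The matching idea described in that follow-up, sketched in Python (field names are placeholders, not the actual execution trace schema): because collectives and p2p ops keep independent counters, each launch has to be paired with its wait within its own (process group, is_p2p) stream.

```python
from collections import defaultdict

def pair_launches_with_waits(records):
    """Pair each launch record with its wait record.

    `records` is a list of dicts with placeholder keys: "kind" is either
    "launch" or "wait", plus "pg", "is_p2p", and "seq". Collective and
    p2p records are matched separately since they use separate counters.
    """
    pending = defaultdict(list)
    pairs = []
    for rec in records:
        key = (rec["pg"], rec["is_p2p"], rec["seq"])
        if rec["kind"] == "launch":
            pending[key].append(rec)
        elif rec["kind"] == "wait" and pending[key]:
            pairs.append((pending[key].pop(0), rec))
    return pairs
```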
Summary:
We were erroneously incrementing seqCollective_ for p2p operations.
Fixes issue #134833
Test Plan:
Unit tests.
TODO: add more unit tests
Reviewers:
Subscribers:
Tasks:
Tags:
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k