[c10d] Pass avoidRecordStreams into collective() function #112195
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/112195. Note: links to docs will display an error until the docs builds have been completed. No failures as of commit 4f0a652 with merge base 1b702b1. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
This makes sense to me!
It looks like this broke correctness, per the unit tests :/
The sharded param is created by the `pre_unshard` stream, then used by the `unshard` stream, then by PG-NCCL's stream. A `record_stream` call must be added between `pre_unshard` and `unshard` if PG-NCCL stops doing this favor.
We discussed the FSDP change offline. FSDP was missing a record_stream call for correctness and was relying on ProcessGroupNCCL's own record_stream call, which is not technically part of the API's contract.
In other words, this change to ProcessGroupNCCL can lead to silent correctness failures, but it would only be for existing code that depended on unspecified behavior.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…2195) Even after PR pytorch#111431, the `collective(...)` function still uses the underscore-suffixed member `avoidRecordStreams_` internally and does not respect each collective call's preference, since `avoidRecordStreams_` is controlled only by an environment variable. As a fix, we pass `avoidRecordStreams` into the `collective()` function. Pull Request resolved: pytorch#112195 Approved by: https://github.com/awgu