-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Remove USE_CUDA from c10d reducer/logger #59562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
💊 CI failures summary and remediationsAs of commit 314b298 (more details on the Dr. CI page):
5 failures not recognized by patterns:
This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions to the (internal) Dr. CI Users group. |
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love the change, it makes codes much cleaner, thanks!
before landing, would you please run a benchmark for your PR and verify the measured times are correct?
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
| @@ -0,0 +1,91 @@ | |||
| #include <c10d/reducer.hpp> | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this module cuda_timer?
i wonder if it's possible to implement just one timer (instead of CPU and GPU timers) with c10::Events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The class is called CudaTimer. Unless you're talking about the file?
I chose to call the file reducer_cuda because it contains the CUDA-related stuff for the reducer.cpp file. If I saw a timer_cuda.cpp file I would expect there to be also a timer.cpp file (or timer_cpu.cpp) but that's not the case.
i wonder if it's possible to implement just one timer (instead of CPU and GPU timers) with c10::Events?
There's several obstacles to this:
- c10::Event doesn't have an
elapsed_timemethod (though we could add it but it's unclear how to make it device-agnostic) - even if we did so, c10::Events of "CPU" type currently behave like no-ops, thus they're unable to provide timing information, which means we'd still need two separate branches for CPU and CUDA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
Do you feel like it'd be worth while fixing c10::Event so it aligns with CUDAEvent to move further to "CUDA agnostic" code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding timer: do you feel that timer is only going to be used in the context of reducer, e.g. maybe timing collectives or even something related to RPC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me ask @ezyang: do you think we could/should add an elapsed_time method to c10::Event? (like the one on CUDAEvent)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I have enough information right now to say. Under what situations would you exercise this code with CPU events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ezyang I don't think you can have c10::Events with a CPU device? (All event-related methods of the CPU DeviceGuardImpl throw). Which actually means that we'd still need two separate codepaths for CPU and non-CPU. My question was whether elapsed_time is a "generic enough" method to elevate to c10::Event.
Anyway, in the end I won't be doing this change here to avoid blocking the stack, I'll consider it for later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding timer: do you feel that timer is only going to be used in the context of reducer, e.g. maybe timing collectives or even something related to RPC?
In its current version, Timer is very much specific to the reducer since it hardcodes the 5 exact types of events it can be used for (forward, backward_comm, ...). If we'll ever make Timer more generic we can move it to a better location at that time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question was whether elapsed_time is a "generic enough" method to elevate to c10::Event.
OK. Well, the way I would think about this problem is that the more API surface you put on a virtual interface, the harder it is to bring up new implementations of the interface. This isn't a terribly pressing concern right now, because CUDA events are the only real game in town, but it would be more difficulty later. My general inclination is, as you've done here, keep things specialized if there is only one user, and let people who are later trying to generalize make the call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG
Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)! [ghstack-poisoned]
|
This pull request has been merged in cbcae46. |
Stack from ghstack:
Differential Revision: D28931378
NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!