Remove USE_CUDA from c10d reducer/logger by lw · Pull Request #59562 · pytorch/pytorch · GitHub

Conversation

@lw (Contributor) commented Jun 7, 2021

Stack from ghstack:

Differential Revision: [D28931378](https://our.internmc.facebook.com/intern/diff/D28931378/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28931378/)!

[ghstack-poisoned]
@facebook-github-bot (Contributor) commented Jun 7, 2021

💊 CI failures summary and remediations

As of commit 314b298 (more details on the Dr. CI page):


  • 5/5 failures introduced in this PR

5 failures not recognized by patterns:

| Job | Step | Action |
| --- | --- | --- |
| CircleCI docker-pytorch-linux-bionic-rocm4.1-py3.6 | Unknown | 🔁 rerun |
| CircleCI docker-pytorch-linux-bionic-rocm3.9-py3.6 | Unknown | 🔁 rerun |
| CircleCI pytorch_ios_12_0_0_x86_64_build | Unknown | 🔁 rerun |
| CircleCI docker-pytorch-linux-bionic-cuda10.2-cudnn7-py3.6-clang9 | Unknown | 🔁 rerun |
| CircleCI Build Error | Config Processing Error (Don't rerun) | 🔁 rerun |


lw added 15 commits June 7, 2021 08:15
@zhaojuanmao (Contributor) left a comment

Love the change, it makes the code much cleaner, thanks!

Before landing, would you please run a benchmark for your PR and verify the measured times are correct?

lw added 2 commits June 10, 2021 02:01
Review thread on the new file (hunk `@@ -0,0 +1,91 @@`, which begins with `#include <c10d/reducer.hpp>`):
Contributor

Call this module cuda_timer?

I wonder if it's possible to implement just one timer (instead of separate CPU and GPU timers) with c10::Events?

Contributor Author

The class is called CudaTimer. Unless you're talking about the file?

I chose to call the file reducer_cuda because it contains the CUDA-related parts of reducer.cpp. If I saw a timer_cuda.cpp file I would expect there to also be a timer.cpp file (or timer_cpu.cpp), but that's not the case here.

> I wonder if it's possible to implement just one timer (instead of separate CPU and GPU timers) with c10::Events?

There are several obstacles to this (see the sketch below for the CUDA-specific pieces):

  • c10::Event doesn't have an elapsed_time method (we could add one, but it's unclear how to make it device-agnostic)
  • even if we did, c10::Events of the "CPU" type currently behave like no-ops, so they can't provide timing information, which means we'd still need two separate branches for CPU and CUDA anyway
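For context, here is a minimal sketch of what a CUDA-only timer built on at::cuda::CUDAEvent looks like. This is illustrative only, not the exact code added by this PR; the class and method names are made up.

```cpp
// Illustrative sketch only -- not the PR's reducer code. It shows the
// CUDA-specific pieces such a timer needs: timing-enabled events and
// elapsed_time(), neither of which c10::Event exposes today.
#include <ATen/cuda/CUDAEvent.h>
#include <c10/core/Device.h>
#include <c10/cuda/CUDAGuard.h>

class CudaTimerSketch {
  // cudaEventDefault enables timing (the default CUDAEvent flags disable it).
  at::cuda::CUDAEvent start_{cudaEventDefault};
  at::cuda::CUDAEvent end_{cudaEventDefault};
  c10::Device device_;

 public:
  explicit CudaTimerSketch(c10::Device device) : device_(device) {}

  void recordStart() {
    c10::cuda::CUDAGuard guard(device_);
    start_.record();  // records on the current CUDA stream
  }

  void recordEnd() {
    c10::cuda::CUDAGuard guard(device_);
    end_.record();
  }

  // Elapsed time in milliseconds; both events must have completed
  // (e.g. after synchronizing) before this is queried.
  float elapsedMs() {
    c10::cuda::CUDAGuard guard(device_);
    return start_.elapsed_time(end_);
  }
};
```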

Contributor

Thanks.

Do you feel it would be worthwhile to fix c10::Event so it aligns with CUDAEvent, to move further toward "CUDA-agnostic" code?

Contributor

Regarding the timer: do you feel that it is only going to be used in the context of the reducer, or could it also be used elsewhere, e.g. for timing collectives or even something related to RPC?

Contributor Author

Let me ask @ezyang: do you think we could/should add an elapsed_time method to c10::Event? (like the one on CUDAEvent)

Contributor

I don't think I have enough information right now to say. Under what situations would you exercise this code with CPU events?

Contributor Author

@ezyang I don't think you can have c10::Events with a CPU device? (All event-related methods of the CPU DeviceGuardImpl throw.) Which means we'd still need two separate codepaths for CPU and non-CPU. My question was whether elapsed_time is a "generic enough" method to elevate to c10::Event.

Anyway, I won't be making this change here, to avoid blocking the stack; I'll consider it for later.
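For illustration, the CPU-side codepath would boil down to plain wall-clock timing, roughly like the sketch below (hypothetical names, not code from this PR), which is why a single c10::Event-based timer wouldn't remove the branching.

```cpp
// Hypothetical sketch of the host-side branch: since CPU events provide no
// timing through c10::Event, a CPU timer falls back to std::chrono.
#include <chrono>
#include <cstdint>

class CpuTimerSketch {
  using clock = std::chrono::steady_clock;
  clock::time_point start_{};
  clock::time_point end_{};

 public:
  void recordStart() { start_ = clock::now(); }
  void recordEnd() { end_ = clock::now(); }

  // Elapsed nanoseconds between the recorded start and end timestamps.
  int64_t elapsedNs() const {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end_ - start_)
        .count();
  }
};
```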

Contributor Author

> Regarding the timer: do you feel that it is only going to be used in the context of the reducer, or could it also be used elsewhere, e.g. for timing collectives or even something related to RPC?

In its current version, Timer is very much specific to the reducer, since it hardcodes the 5 exact types of events it can be used for (forward, backward_comm, ...). If we ever make Timer more generic, we can move it to a better location at that time. A rough sketch of its shape is below.
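A hedged sketch of the reducer-specific Timer interface described above; the class, member, and enum names here are approximations for illustration, not necessarily the exact identifiers in reducer.hpp.

```cpp
// Approximate sketch of the reducer's Timer interface: the event kinds are
// hardcoded to the DDP phases the reducer logs, which is why it lives next
// to the reducer rather than as a generic timing utility.
#include <cstdint>
#include <c10/util/Optional.h>

class ReducerTimerSketch {
 public:
  enum class Event : uint8_t {
    kForwardStart,
    kBackwardComputeStart,
    kBackwardComputeEnd,
    kBackwardCommStart,
    kBackwardCommEnd,
  };

  virtual ~ReducerTimerSketch() = default;

  // Mark that the given phase boundary was reached "now" (a host clock for
  // a CPU implementation, a CUDA event for a CUDA implementation).
  virtual void record(Event event) = 0;

  // Elapsed nanoseconds between two recorded events, or nullopt if either
  // event was not recorded or timing is unavailable.
  virtual c10::optional<int64_t> measureDifference(Event start, Event end) = 0;
};
```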

Contributor

> My question was whether elapsed_time is a "generic enough" method to elevate to c10::Event.

OK. Well, the way I would think about this problem is that the more API surface you put on a virtual interface, the harder it is to bring up new implementations of that interface. This isn't a terribly pressing concern right now, because CUDA events are the only real game in town, but it would become more difficult later. My general inclination is to keep things specialized while there is only one user, as you've done here, and let whoever later tries to generalize make the call.

@agolynski (Contributor) left a comment

LG

@facebook-github-bot (Contributor)

This pull request has been merged in cbcae46.

@facebook-github-bot deleted the gh/lw/210/head branch June 14, 2021 14:17

Labels

cla signed · Merged · oncall: distributed
