Make CUDA serde support for TP agent pluggable #59376
Conversation
This is an experiment. The end goal is to separate the CUDA-specific aspects of the TensorPipe agent so that they can be plugged "on top" of the CPU-only parts. This will then allow moving the TP agent to libtorch (because libtorch is split into a CPU and a CUDA part; right now the agent lives in libtorch_python), although unfortunately other conditions also need to be met for this to happen.

The only instance where we had CPU and CUDA logic within the same code, guarded by `#ifdef USE_CUDA`, is the serialization/deserialization code. I'm thus introducing a sort-of registry in order to "decentralize" it. It's not a c10::Registry, because that would be overkill (it uses an unordered_map with strings as keys): here we can just use an array with integers as "keys".

Differential Revision: [D28796428](https://our.internmc.facebook.com/intern/diff/D28796428/)
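For readers skimming the diff, here is a minimal self-contained sketch of the kind of array-based registry described above. Only `TensorpipeDeviceTypeConverter` and `device_type_converter_registry` match names in the actual PR; the `DeviceType` enum shown here and the `registerDeviceTypeConverter`/`getDeviceTypeConverter` helpers are simplified stand-ins, not the real c10 types or the exact API added by this change.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for c10::DeviceType, included only to keep the sketch
// self-contained; the real enum has many more entries.
enum class DeviceType : int8_t { CPU = 0, CUDA = 1, COMPILE_TIME_MAX_DEVICE_TYPES = 2 };

// Abstract converter type; its actual interface is defined in the PR.
class TensorpipeDeviceTypeConverter;

// One slot per device type; the integer enum values act as the "keys".
// Each slot is an atomic pointer so that concurrent lookups and a late
// registration (e.g. when the CUDA library is loaded lazily) stay race-free.
// NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
std::array<
    std::atomic<const TensorpipeDeviceTypeConverter*>,
    static_cast<std::size_t>(DeviceType::COMPILE_TIME_MAX_DEVICE_TYPES)>
    device_type_converter_registry;

// Hypothetical helpers illustrating how registration and lookup could work.
void registerDeviceTypeConverter(
    DeviceType type,
    const TensorpipeDeviceTypeConverter* converter) {
  device_type_converter_registry[static_cast<std::size_t>(type)].store(converter);
}

const TensorpipeDeviceTypeConverter* getDeviceTypeConverter(DeviceType type) {
  return device_type_converter_registry[static_cast<std::size_t>(type)].load();
}
```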
💊 CI failures summary and remediations

As of commit c306fb3 (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI.
Codecov Report
@@              Coverage Diff               @@
##           gh/lw/206/base   #59376      +/-   ##
==================================================
+ Coverage           76.11%   76.22%   +0.10%
==================================================
  Files                2041     2041
  Lines              203801   203821      +20
==================================================
+ Hits               155131   155367     +236
+ Misses              48670    48454     -216
    std::atomic<const TensorpipeDeviceTypeConverter*>,
    static_cast<size_t>(DeviceType::COMPILE_TIME_MAX_DEVICE_TYPES)>
    // NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
    device_type_converter_registry;
Curious: since this is statically initialized and done in order, why does it still need to be atomic? What type of concurrency are we expecting?
It was done this way in the code I copied this from, and this was the original motivation:
pytorch/c10/core/impl/DeviceGuardImplInterface.h
Lines 270 to 275 in ff15d93
// The registry is NON-owning.  Each stored pointer is std::atomic so
// that under all interleavings of registry calls the structure is
// race-free.  This doesn't cost us anything on reads in X86.  (An
// unsynchronized implementation probably is OK too, but I didn't want
// to prove that we never read from device_guard_impl_registry at the
// same time some registration is occurring.  Shiver.)
I didn't want to copy all the comments/docstrings, so I just mentioned where this code was copied from, hoping that readers would go check there in case of questions. Does that make sense?
Yep, that makes sense to me.
> An unsynchronized implementation probably is OK too, but I didn't want to prove that we never read from device_guard_impl_registry at the same time some registration is occurring.

I still want to understand our expectations. Would I be correct to assume that, in our case, "an unsynchronized implementation is also OK, as TensorPipeAgent is the only owner and all initialization is done in order; the only reason we add atomic here is that it doesn't cost us anything on reads on x86"?
I must admit I'm not sure. I suspect the main "failure" scenario is the following: a program starts and loads libtorch (i.e., the CPU-only version), because that's what happens by default. Later on, one thread creates a CUDA tensor for the first time, which triggers libtorch_cuda to be loaded, since it's lazy-loaded when needed. At that point the CUDA DeviceTypeConverter is created and inserted into the array, but there could already be other threads trying to read from that array. This is unlikely, but technically possible I guess.
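To illustrate that scenario concretely, here is a small standalone sketch (the names are placeholders, not the PR's code) of a reader racing with a registration that happens only when the CUDA library gets loaded. With `std::atomic` slots every interleaving is well defined, whereas plain pointers would make this a data race.

```cpp
#include <atomic>
#include <thread>

struct TensorpipeDeviceTypeConverter {};  // placeholder for the abstract base

// Two slots (CPU, CUDA), mirroring the registry above.
std::atomic<const TensorpipeDeviceTypeConverter*> registry[2];

const TensorpipeDeviceTypeConverter cuda_converter{};

int main() {
  // Thread A: simulates libtorch_cuda being loaded lazily, which is when the
  // CUDA converter gets registered.
  std::thread writer([] { registry[1].store(&cuda_converter); });

  // Thread B: a reader that may run before, during, or after registration.
  std::thread reader([] {
    const TensorpipeDeviceTypeConverter* c = registry[1].load();
    // c is either nullptr (not registered yet) or a valid pointer; no
    // interleaving results in undefined behavior.
    (void)c;
  });

  writer.join();
  reader.join();
  return 0;
}
```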
class TensorpipeDeviceTypeConverter {
 public:
  // Ideally we'd want this to also return a tensorpipe::Message::Tensor object
  // but we cannot forward-declare that class (because it's nested), and we
It's indeed a little weird to add tensors to the message and then use TORCH_INTERNAL_ASSERT to make sure that the tensor was indeed inserted.
Do we expect other files to access this TensorpipeDeviceTypeConverter? If not, maybe it could be completely implemented in the .cpp file, directly using tensorpipe::Message::Tensor there?
I should have made it clearer, but yes: in the next PR (#59377) I am moving the CUDA subclass of this to a separate file, hence the abstract base class must be defined in a header.
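For illustration, here is a rough sketch of what that split could look like: the abstract base class sits in a header visible to both the CPU-only and the CUDA parts, while a CUDA subclass lives in a separately compiled file and registers itself when that library is loaded. Everything except the name `TensorpipeDeviceTypeConverter` (the subclass name, the commented-out registration helper, and the file layout) is an assumption, not the actual code of #59377.

```cpp
// --- tensorpipe_device_type_converter.h (hypothetical layout) ---
// The abstract base class must live in a header so that a subclass can be
// defined in a different translation unit / library.
class TensorpipeDeviceTypeConverter {
 public:
  virtual ~TensorpipeDeviceTypeConverter() = default;
  // The serialization/deserialization hooks are declared here in the real PR;
  // their signatures are omitted in this sketch.
};

// --- cuda_device_type_converter.cpp (would live in the CUDA library) ---
class CudaTensorpipeDeviceTypeConverter : public TensorpipeDeviceTypeConverter {
  // CUDA-specific serialization/deserialization logic would go here.
};

namespace {
// A static registration object inserts the CUDA converter into the registry
// as a side effect of the CUDA library being loaded.
struct Registrar {
  Registrar() {
    static CudaTensorpipeDeviceTypeConverter converter;
    // registerDeviceTypeConverter(DeviceType::CUDA, &converter);  // hypothetical helper
    (void)converter;
  }
} registrar;
}  // namespace
```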
This pull request has been merged in 83ba71a.