Make CUDA serde support for TP agent pluggable by lw · Pull Request #59376 · pytorch/pytorch · GitHub

Conversation

@lw (Contributor) commented Jun 3, 2021

Stack from ghstack:

This is an experiment. The end goal is to separate the CUDA-specific aspects of the TensorPipe agent so that they can be plugged "on top" of the CPU-only parts. This will then allow moving the TP agent to libtorch (because libtorch is split into a CPU part and a CUDA part, whereas right now the agent lives in libtorch_python), although unfortunately other conditions also need to be met for this to happen.

The only place where we had CPU and CUDA logic within the same code, guarded by `#ifdef USE_CUDA`, is the serialization/deserialization code. I'm thus introducing a sort-of registry in order to decentralize it. It's not a c10::Registry, because that would be overkill (it uses an unordered_map with strings as keys): here we can just use an array with integer device types as keys.

Differential Revision: [D28796428](https://our.internmc.facebook.com/intern/diff/D28796428/)
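To make the idea concrete, here is a minimal sketch of such an array-backed registry. The `TensorpipeDeviceTypeConverter` class and `device_type_converter_registry` array mirror names that appear in the diff below; the simplified `DeviceType` enum and the `registerConverter`/`getConverter` helpers are illustrative assumptions, not the PR's actual API.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <stdexcept>

// Simplified stand-in for c10::DeviceType; the real code sizes the array with
// DeviceType::COMPILE_TIME_MAX_DEVICE_TYPES.
enum class DeviceType : std::size_t { CPU = 0, CUDA = 1, COMPILE_TIME_MAX_DEVICE_TYPES = 2 };

// Abstract converter; each backend (CPU, CUDA, ...) supplies its own subclass.
class TensorpipeDeviceTypeConverter {
 public:
  virtual ~TensorpipeDeviceTypeConverter() = default;
};

// The "registry" is just a fixed-size array indexed by the device type.
// Atomic slots keep lookups race-free even if a registration happens late.
std::array<
    std::atomic<const TensorpipeDeviceTypeConverter*>,
    static_cast<std::size_t>(DeviceType::COMPILE_TIME_MAX_DEVICE_TYPES)>
    device_type_converter_registry{};

void registerConverter(DeviceType type, const TensorpipeDeviceTypeConverter* converter) {
  device_type_converter_registry[static_cast<std::size_t>(type)].store(converter);
}

const TensorpipeDeviceTypeConverter* getConverter(DeviceType type) {
  const auto* converter =
      device_type_converter_registry[static_cast<std::size_t>(type)].load();
  if (converter == nullptr) {
    throw std::runtime_error("no converter registered for this device type");
  }
  return converter;
}

// Example usage: register a (dummy) CPU converter and look it up.
class CpuConverter final : public TensorpipeDeviceTypeConverter {};

int main() {
  static const CpuConverter cpu_converter{};
  registerConverter(DeviceType::CPU, &cpu_converter);
  getConverter(DeviceType::CPU);  // succeeds; DeviceType::CUDA would throw here
}
```

Compared to a string-keyed c10::Registry backed by an unordered_map, a lookup here is just an array index plus an atomic load, which fits the small, fixed set of device types.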
@facebook-github-bot (Contributor) commented Jun 3, 2021

💊 CI failures summary and remediations

As of commit c306fb3 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.

lw added 6 commits June 3, 2021 08:04
lw added 2 commits June 10, 2021 06:45
@lw requested a review from cbalioglu as a code owner June 10, 2021 16:26
lw added a commit that referenced this pull request Jun 10, 2021
Pull Request resolved: #59376

ghstack-source-id: 131081298

Differential Revision: [D28796428](https://our.internmc.facebook.com/intern/diff/D28796428/)
lw added 2 commits June 11, 2021 00:23
@codecov (bot) commented Jun 11, 2021

Codecov Report

Merging #59376 (8eddbba) into gh/lw/206/base (0a6dcf9) will increase coverage by 0.10%.
The diff coverage is 90.00%.

❗ Current head 8eddbba differs from pull request most recent head c306fb3. Consider uploading reports for the commit c306fb3 to get more accurate results.

@@                Coverage Diff                 @@
##           gh/lw/206/base   #59376      +/-   ##
==================================================
+ Coverage           76.11%   76.22%   +0.10%     
==================================================
  Files                2041     2041              
  Lines              203801   203821      +20     
==================================================
+ Hits               155131   155367     +236     
+ Misses              48670    48454     -216     

std::atomic<const TensorpipeDeviceTypeConverter*>,
static_cast<size_t>(DeviceType::COMPILE_TIME_MAX_DEVICE_TYPES)>
// NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
device_type_converter_registry;
Contributor

Curious: since this is statically initialized and done in order, why does it still need to be atomic? What type of concurrency are we expecting?

Contributor Author

It was done this way in the code I copied this from, and this was the original motivation:

// The registry is NON-owning. Each stored pointer is std::atomic so
// that under all interleavings of registry calls the structure is
// race-free. This doesn't cost us anything on reads in X86. (An
// unsynchronized implementation probably is OK too, but I didn't want
// to prove that we never read from device_guard_impl_registry at the
// same time some registration is occurring. Shiver.)

I didn't want to copy all the comments/docstrings, so I just mentioned where this code was copied from, hoping that readers would go check there in case of questions. Does that make sense?

Contributor

Yep, that makes sense to me.

An unsynchronized implementation probably is OK too, but I didn't want to prove that we never read from device_guard_impl_registry at the same time some registration is occurring.

I still want to understand our expectations. Would I be correct to assume that in our case it amounts to: "An unsynchronized implementation is also OK, as TensorPipeAgent is the only owner and all initialization is done in order; the only reason we add atomic here is that it doesn't cost us anything on reads on x86"?

Contributor Author

I must admit I'm not sure. I suspect the main "failure" scenario is the following: a program starts and loads libtorch (i.e., the CPU-only version), because that's what happens by default. Later on, one thread creates a CUDA tensor for the first time, which triggers libtorch_cuda to be loaded, since it's lazy-loaded when needed. At this point the CUDA DeviceTypeConverter is created and inserted into that array. However, there could already be other threads trying to read from that array. This is unlikely, but technically possible, I guess.
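As a rough illustration of that interleaving (hypothetical, heavily simplified code, not the actual agent): one thread reads the CUDA slot of the array while another thread, standing in for the lazily loaded libtorch_cuda, registers the converter. With std::atomic slots this interleaving is well-defined; with plain pointers it would be a data race.

```cpp
#include <array>
#include <atomic>
#include <iostream>
#include <thread>

struct Converter {};

// Two slots: index 0 = CPU, index 1 = CUDA (simplified).
std::array<std::atomic<const Converter*>, 2> registry{};

int main() {
  // Reader thread: deserialization code probing the CUDA slot.
  std::thread reader([] {
    while (registry[1].load() == nullptr) {
      // Spin until some other thread registers the CUDA converter.
    }
    std::cout << "CUDA converter became visible to the reader\n";
  });

  // Writer thread: stands in for libtorch_cuda being lazily loaded and
  // registering its converter while the program is already running.
  static const Converter cuda_converter{};
  std::thread writer([] { registry[1].store(&cuda_converter); });

  reader.join();
  writer.join();
}
```

Whether that interleaving can actually occur in practice is exactly the open question above; the atomic just makes it harmless either way.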

class TensorpipeDeviceTypeConverter {
 public:
  // Ideally we'd want this to also return a tensorpipe::Message::Tensor object
  // but we cannot forward-declare that class (because it's nested), and we
Contributor

It's indeed a little weird to add tensors to the message and then use TORCH_INTERNAL_ASSERT to make sure that the tensor was indeed inserted.

Do we expect other files to access this TensorpipeDeviceTypeConverter? If not, maybe it could be completely implemented in the .cpp file, using tensorpipe::Message::Tensor directly there?

Contributor Author

I should have made it clearer, but yes: in the next PR (#59377) I am moving the CUDA subclass of this to a separate file, hence the abstract base class must be defined in a header.
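For context, a rough sketch of that layering, under assumed names (the real interface has serialization hooks rather than a `backendName()` method): the abstract base class is declared in a header visible to both libraries, while the CUDA subclass and its static registration would live in a translation unit compiled only into the CUDA library. Everything is collapsed into one file here so the sketch is self-contained.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <iostream>

// --- would live in the shared header (visible to CPU-only and CUDA builds) ---
class DeviceTypeConverter {
 public:
  virtual ~DeviceTypeConverter() = default;
  virtual const char* backendName() const = 0;  // placeholder for the real serde hooks
};

std::array<std::atomic<const DeviceTypeConverter*>, 2> converter_registry{};

void registerConverter(std::size_t device_type, const DeviceTypeConverter* converter) {
  converter_registry[device_type].store(converter);
}

// --- would live in a CUDA-only .cpp file, built into the CUDA library ---
class CudaConverter final : public DeviceTypeConverter {
 public:
  const char* backendName() const override { return "cuda"; }
};

static const CudaConverter cuda_converter{};
// Static initializer: registration happens as a side effect of loading the library.
static const bool cuda_registered = (registerConverter(1, &cuda_converter), true);

// --- CPU-only code can now dispatch through the registry without #ifdef USE_CUDA ---
int main() {
  if (const auto* converter = converter_registry[1].load()) {
    std::cout << "CUDA slot handled by: " << converter->backendName() << "\n";
  } else {
    std::cout << "no CUDA converter registered (CPU-only build)\n";
  }
}
```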


@facebook-github-bot (Contributor)
This pull request has been merged in 83ba71a.

@facebook-github-bot deleted the gh/lw/206/head branch June 18, 2021 14:15

Labels

cla signed · Merged · oncall: distributed

3 participants