Make CUDA serde support for TP agent pluggable #59376
Conversation
This is an experiment. The end goal is to separate the CUDA-specific aspects of the TensorPipe agent so that they can be plugged "on top" of the CPU-only parts. This will then allow moving the TP agent to libtorch (because libtorch is split into a CPU and a CUDA part; right now the agent lives in libtorch_python), although unfortunately other conditions also need to be met for this to happen.

The only instance where we had CPU and CUDA logic within the same code, guarded by `#ifdef USE_CUDA`, is the serialization/deserialization code. I'm thus introducing a sort-of registry in order to "decentralize" it. It's not a c10::Registry, because that would be overkill (it uses an unordered_map with strings as keys): here we can just use an array with integers as "keys".

Differential Revision: [D28796428](https://our.internmc.facebook.com/intern/diff/D28796428/)
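For readers skimming the diff, here is a minimal self-contained sketch of the kind of array-based registry described above. Only `TensorpipeDeviceTypeConverter` and `device_type_converter_registry` match names in the actual PR; the `DeviceType` enum shown here and the `registerDeviceTypeConverter`/`getDeviceTypeConverter` helpers are simplified stand-ins, not the real c10 types or the exact API added by this change.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for c10::DeviceType, included only to keep the sketch
// self-contained; the real enum has many more entries.
enum class DeviceType : int8_t { CPU = 0, CUDA = 1, COMPILE_TIME_MAX_DEVICE_TYPES = 2 };

// Abstract converter type; its actual interface is defined in the PR.
class TensorpipeDeviceTypeConverter;

// One slot per device type; the integer enum values act as the "keys".
// Each slot is an atomic pointer so that concurrent lookups and a late
// registration (e.g. when the CUDA library is loaded lazily) stay race-free.
// NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
std::array<
    std::atomic<const TensorpipeDeviceTypeConverter*>,
    static_cast<std::size_t>(DeviceType::COMPILE_TIME_MAX_DEVICE_TYPES)>
    device_type_converter_registry;

// Hypothetical helpers illustrating how registration and lookup could work.
void registerDeviceTypeConverter(
    DeviceType type,
    const TensorpipeDeviceTypeConverter* converter) {
  device_type_converter_registry[static_cast<std::size_t>(type)].store(converter);
}

const TensorpipeDeviceTypeConverter* getDeviceTypeConverter(DeviceType type) {
  return device_type_converter_registry[static_cast<std::size_t>(type)].load();
}
```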
💊 CI failures summary and remediations

As of commit c306fb3 (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI.
Codecov Report
@@              Coverage Diff               @@
##           gh/lw/206/base   #59376      +/-   ##
==================================================
+ Coverage           76.11%   76.22%   +0.10%
==================================================
  Files                2041     2041
  Lines              203801   203821      +20
==================================================
+ Hits               155131   155367     +236
+ Misses              48670    48454     -216
    std::atomic<const TensorpipeDeviceTypeConverter*>,
    static_cast<size_t>(DeviceType::COMPILE_TIME_MAX_DEVICE_TYPES)>
    // NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
    device_type_converter_registry;
Curious: since this is statically initialized and done in order, why does it still need to be atomic? What type of concurrency are we expecting?
It was done this way in the code I copied this from, and this was the original motivation:
pytorch/c10/core/impl/DeviceGuardImplInterface.h
Lines 270 to 275 in ff15d93
// The registry is NON-owning.  Each stored pointer is std::atomic so
// that under all interleavings of registry calls the structure is
// race-free.  This doesn't cost us anything on reads in X86.  (An
// unsynchronized implementation probably is OK too, but I didn't want
// to prove that we never read from device_guard_impl_registry at the
// same time some registration is occurring.  Shiver.)
I didn't want to copy all the comments/docstrings, so I just mentioned where this code was copied from, hoping that readers would go check there in case of questions. Does that make sense?
Yep, that makes sense to me.
> An unsynchronized implementation probably is OK too, but I didn't want to prove that we never read from device_guard_impl_registry at the same time some registration is occurring.

I still want to understand our expectations. Would I be correct to assume that, in our case, "an unsynchronized implementation is also OK, as TensorPipeAgent is the only owner and all initialization is done in order; the only reason we add atomic here is that it doesn't cost us anything on reads on x86"?
I must admit I'm not sure. I suspect the main "failure" scenario is the following: a program starts and loads libtorch (i.e., the CPU-only version), because that's what happens by default. Later on, one thread creates a CUDA tensor for the first time, which triggers libtorch_cuda to be loaded, since it's lazy-loaded when needed. At that point the CUDA DeviceTypeConverter is created and inserted into the array, but there could already be other threads trying to read from that array. This is unlikely, but technically possible I guess.
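To illustrate that scenario concretely, here is a small standalone sketch (the names are placeholders, not the PR's code) of a reader racing with a registration that happens only when the CUDA library gets loaded. With `std::atomic` slots every interleaving is well defined, whereas plain pointers would make this a data race.

```cpp
#include <atomic>
#include <thread>

struct TensorpipeDeviceTypeConverter {};  // placeholder for the abstract base

// Two slots (CPU, CUDA), mirroring the registry above.
std::atomic<const TensorpipeDeviceTypeConverter*> registry[2];

const TensorpipeDeviceTypeConverter cuda_converter{};

int main() {
  // Thread A: simulates libtorch_cuda being loaded lazily, which is when the
  // CUDA converter gets registered.
  std::thread writer([] { registry[1].store(&cuda_converter); });

  // Thread B: a reader that may run before, during, or after registration.
  std::thread reader([] {
    const TensorpipeDeviceTypeConverter* c = registry[1].load();
    // c is either nullptr (not registered yet) or a valid pointer; no
    // interleaving results in undefined behavior.
    (void)c;
  });

  writer.join();
  reader.join();
  return 0;
}
```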
class TensorpipeDeviceTypeConverter {
 public:
  // Ideally we'd want this to also return a tensorpipe::Message::Tensor object
  // but we cannot forward-declare that class (because it's nested), and we
It's indeed a little weird to add tensors to the message and then use TORCH_INTERNAL_ASSERT to make sure that the tensor was indeed inserted.
Do we expect other files to access this TensorpipeDeviceTypeConverter? If not, maybe it could be completely implemented in the .cpp file, directly using tensorpipe::Message::Tensor there?
I should have made it clearer, but yes: in the next PR (#59377) I am moving the CUDA subclass of this to a separate file, hence the abstract base class must be defined in a header.
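For illustration, here is a rough sketch of what that split could look like: the abstract base class sits in a header visible to both the CPU-only and the CUDA parts, while a CUDA subclass lives in a separately compiled file and registers itself when that library is loaded. Everything except the name `TensorpipeDeviceTypeConverter` (the subclass name, the commented-out registration helper, and the file layout) is an assumption, not the actual code of #59377.

```cpp
// --- tensorpipe_device_type_converter.h (hypothetical layout) ---
// The abstract base class must live in a header so that a subclass can be
// defined in a different translation unit / library.
class TensorpipeDeviceTypeConverter {
 public:
  virtual ~TensorpipeDeviceTypeConverter() = default;
  // The serialization/deserialization hooks are declared here in the real PR;
  // their signatures are omitted in this sketch.
};

// --- cuda_device_type_converter.cpp (would live in the CUDA library) ---
class CudaTensorpipeDeviceTypeConverter : public TensorpipeDeviceTypeConverter {
  // CUDA-specific serialization/deserialization logic would go here.
};

namespace {
// A static registration object inserts the CUDA converter into the registry
// as a side effect of the CUDA library being loaded.
struct Registrar {
  Registrar() {
    static CudaTensorpipeDeviceTypeConverter converter;
    // registerDeviceTypeConverter(DeviceType::CUDA, &converter);  // hypothetical helper
    (void)converter;
  }
} registrar;
}  // namespace
```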
This pull request has been merged in 83ba71a.