Add beginnings of torch::stable::accelerator #159679
Conversation
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/159679. Note: links to docs will display an error until the docs builds have completed. ❌ 1 new failure as of commit 5e8e158 with merge base 34ec5ed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Attention! One of the PyTorch C-stable API files was changed. You MUST NOT change existing function declarations in this file, as this header defines a stable C ABI. If you need to change the signature of a function, introduce a new v2 version of the function and modify code generation to target the new version.
using DeviceIndex = int8_t;
using StreamId = int64_t;
class DeviceGuard {
Something I'm not sure about: do the copy/move semantics need to match the real device guard, given that this is just a wrapper holding a std::unique_ptr to DeviceGuardOpaque?
pytorch/c10/core/DeviceGuard.h, lines 37 to 46 in 444e238:

~DeviceGuard() = default;
/// Copy is disallowed
DeviceGuard(const DeviceGuard&) = delete;
DeviceGuard& operator=(const DeviceGuard&) = delete;
/// Move is disallowed, as DeviceGuard does not have an uninitialized state,
/// which is required for moves on types with nontrivial destructors.
DeviceGuard(DeviceGuard&& other) = delete;
DeviceGuard& operator=(DeviceGuard&& other) = delete;
It doesn't need to match the copy/move semantics of the existing DeviceGuard. For example, I agree with deleting the default constructor for this API. We can disallow copy and move if that's easiest to maintain anyway.
Adds:
- `torch::stable::accelerator::DeviceGuard`: holds a `std::unique_ptr` to `DeviceGuardOpaque`, copied from https://github.com/pytorch/pytorch/blob/50eac811a68e63e96ad56c11c983bfe298a0bb8a/torch/csrc/inductor/aoti_runtime/utils_cuda.h#L30-L46
  - constructor `DeviceGuard(DeviceIndex)` (this matches AOTI but differs from the actual c10 DeviceGuard constructor, which takes a device)
  - `set_index(DeviceIndex)`
- `torch::stable::accelerator::Stream`: holds a `std::shared_ptr` to `StreamOpaque`
  - constructor `Stream(StreamHandle stream)` (similar to `torch::stable::Tensor`)
  - `id() -> StreamId`
- `getCurrentStream(DeviceIndex device_index) -> stable::accelerator::Stream`
FYI @guangyey @EikanWang in case you have some feedback here.
Thanks.
if torch.cuda.is_available():
    extra_compile_args["cxx"].append("-DUSE_CUDA")
    extension = CUDAExtension
Hmmm, maybe this would be a good call to move the CUDA stuff into its own C++ file and build it separately.
Hmmm, I don't see the issue with the current approach, so gonna keep it as is unless there's something specific you're concerned about! :)
No issues yet, but if we have more CUDA-only code in the future it may be a good way to split the ever-growing kernel.cpp.
AOTITorchError aoti_torch_create_device_guard(
I'm guessing this code is inspired by https://github.com/pytorch/pytorch/blob/main/torch/csrc/inductor/aoti_torch/shim_cuda.cpp#L8?
c10::Stream* stream_ptr = new c10::Stream(stream);
*ret_stream = reinterpret_cast<StreamHandle>(stream_ptr);
Creating a new stream on the heap is going to leak memory; we should be able to assign the stream from line 1642 into the pointer directly (though I'm not sure about the cast semantics).
Ahhh, after reading the rest of the code, I see what the user flow is intended to be. That said, I'm not sure we want the semantics of get_current_stream to be that it creates a new stream on the heap (this is likely not expected from the user's POV), and if we do end up sticking with this, we need to loudly document it.
I think we do not want to mess with the memory of the stream at all...cc @albanD for thoughts
Hmm looks like @albanD had no thoughts :b
I'm not sure what to do here, this one looks different from
pytorch/torch/csrc/inductor/aoti_torch/shim_cuda.cpp, lines 49 to 55 in 5f5f508:

AOTITorchError aoti_torch_get_current_cuda_stream(
    int32_t device_index,
    void** ret_stream) {
  AOTI_TORCH_CONVERT_EXCEPTION_TO_ERROR_CODE({
    *(cudaStream_t*)(ret_stream) = at::cuda::getCurrentCUDAStream(device_index);
  });
}
as that one just directly returns the raw stream. However, the actual getCurrentStream returns a c10::Stream, so I'm trying to match that semantic in the stable ABI (so in the future, users can add other stream methods on a stream object):
pytorch/aten/src/ATen/DeviceAccelerator.cpp, lines 106 to 110 in 2ee22e4:

c10::Stream getCurrentStream(c10::DeviceIndex device_index) {
  const auto device_type = getAccelerator(true).value();
  c10::impl::VirtualGuardImpl impl(device_type);
  return impl.getStream({device_type, device_index});
}
I think at::accelerator::getCurrentStream returns the c10::Stream by value, which is why we need to create a new object on the heap if we want to return it to the caller and transfer ownership to the stable::Stream.
I can add a comment for sure, but I'd be curious how else we could tweak this in a way that's more user/memory friendly. Would it make more sense to just return the .id() directly?
// Construct a stable::Stream from a StreamHandle
// Steals ownership from the StreamHandle
explicit Stream(StreamHandle stream)
Ah, I see, so the expected use case is:
- user calls getCurrentStream which creates a Stream
- then they can call id on it
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 4 checks: pull / linux-jammy-py3.9-clang12 / test (default, 4, 5, linux.4xlarge), pull / linux-jammy-py3.9-gcc11 / test (default, 4, 5, linux.2xlarge), inductor / unit-test / linux-jammy-cpu-py3.9-gcc11-inductor / test (inductor_amx, 1, 2, linux.8xlarge.amx), inductor / unit-test / linux-jammy-cpu-py3.9-gcc11-inductor / test (inductor_avx2, 1, 2, linux.10xlarge.avx2).
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
torch/csrc/stable/accelerator.h (outdated):

@@ -0,0 +1,71 @@
#pragma once

#include <torch/csrc/inductor/aoti_runtime/utils.h>
note to self to remove when rebasing and use TORCH_ERROR_CODE_CHECK instead
Starting merge as part of PR stack under #160453
Pull Request resolved: #160453. Approved by: https://github.com/janeyx99. ghstack dependencies: #159679.
Pull Request resolved: #159679. Approved by: https://github.com/guangyey, https://github.com/janeyx99.