Fix `UntypedStorage.resize_` to keep same CUDA device index by kurtamohler · Pull Request #113386 · pytorch/pytorch · GitHub

Conversation

@kurtamohler
Collaborator

@kurtamohler commented Nov 9, 2023

Fixes #113300

cc @ptrblck

BC-breaking note

Before this PR, UntypedStorage.resize_() moved CUDA storage data to the current CUDA device index, torch.cuda.current_device(). After this PR, UntypedStorage.resize_() keeps the data on the same device index it was on before, regardless of the current device index.
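
For illustration, the difference looks roughly like this on a machine with at least two CUDA devices (a minimal sketch, not a test from this PR):

import torch

# Storage that lives on a non-current CUDA device.
storage = torch.empty(8, device="cuda:1").untyped_storage()

torch.cuda.set_device("cuda:0")
storage.resize_(64)

# Before this PR: the resized data could end up on cuda:0 (the current device).
# After this PR: it stays on cuda:1.
print(storage.device)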

@kurtamohler added the module: cuda and release notes: cuda labels Nov 9, 2023
@pytorch-bot

pytorch-bot bot commented Nov 9, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113386

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 78ffe49 with merge base e53da90:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Collaborator

@albanD left a comment


Thanks!

@kurtamohler
Collaborator Author

@pytorchbot merge

@pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Nov 9, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@gchanan
Contributor

gchanan commented Nov 13, 2023

@albanD @kurtamohler worth a BC breaking note?

@kurtamohler added the topic: bc breaking label Nov 13, 2023
@kurtamohler
Collaborator Author

Yeah, probably a good idea. I added a note.

Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
@kevint324
Contributor

kevint324 commented May 15, 2024

Hi @kurtamohler @albanD

Sorry to discuss this on the closed PR. I can open another issue if you think it's a better place to discuss it. Please let me know.

Could you explain a little bit why c10::cuda::CUDAGuard guard(device.index()); is needed here?

We are looking into some performance issues and found quite a few device Set/Get operation pairs introduced by device guards scattered all over the code path.

For instance, it seems resize_bytes_cuda here assumes allocator->allocate(size_bytes); should be guarded.
Also, resize_impl_cuda_ inserts a device guard by default.

Along the path: resize_impl_cuda_ (first device guard) -> maybe_resize_storage_cuda -> resize_bytes_cuda (nested device guard)

Is there any principle on device guard usage? It seems:

  1. All functions declared with TORCH_CUDA_CPP_API should be guarded?
  2. It's always the caller's responsibility to guard the callee. Thus nested device guards are inevitable, since the callee has no idea what the caller did and might set up a device guard again.

And last, if resize_bytes_cuda is guarded anyway, is it necessary to have a device guard here?

Also, do you think it's necessary to handle/optimize nested device guards?
TensorFlow (now XLA) uses TLS to manage the depth of nested guards, which can save some calls to the device layer.
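
Roughly, the TLS idea is something like this (a hypothetical Python sketch for illustration, not TensorFlow's or PyTorch's actual code):

import threading
import torch

_tls = threading.local()

class CachedDeviceGuard:
    # Caches the "current device" in thread-local storage, so nested guards
    # compare against the cached value instead of calling into the device
    # layer (cudaGetDevice) again. Assumes nothing else changes the device
    # behind the cache's back.
    def __init__(self, device_index):
        self.target = device_index

    def __enter__(self):
        if getattr(_tls, "depth", 0) == 0:
            # Only the outermost guard in this thread queries the device layer.
            _tls.current = torch.cuda.current_device()
        _tls.depth = getattr(_tls, "depth", 0) + 1
        self.prev = _tls.current
        if self.target != _tls.current:
            torch.cuda.set_device(self.target)
            _tls.current = self.target
        return self

    def __exit__(self, *exc):
        if self.prev != _tls.current:
            torch.cuda.set_device(self.prev)
            _tls.current = self.prev
        _tls.depth -= 1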

Thanks
Kevin

@kurtamohler
Collaborator Author

kurtamohler commented May 15, 2024

Hey @kevint324, I'm not sure about all the answers to the questions you asked, but below are my responses to some of them.

Could you explain a little bit why c10::cuda::CUDAGuard guard(device.index()); is needed here?

The allocator will use whichever device index is currently set. We need to make sure that the device index is set to the same device index that the data of the StorageImpl given to resize_bytes_cuda is currently allocated on--otherwise, the storage's data may switch to a different device. We want to keep the storage on whichever device the user originally said they wanted it to be on.
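
As a rough Python analogy of the idea (hypothetical helper name; the real code is C++ in Resize.cpp and swaps the raw data pointer in place):

import torch

def resize_storage_bytes_(storage, new_nbytes):
    # c10's CUDA caching allocator allocates on whichever device is *current*
    # (allocate() takes no device argument), so switch to the device the
    # storage already lives on for the duration of the allocation. This is
    # what CUDAGuard guard(device.index()) accomplishes in resize_bytes_cuda.
    with torch.cuda.device(storage.device):
        # "cuda" with no index means the current device, which the guard
        # above has pinned to the storage's own device, not the caller's.
        new_storage = torch.UntypedStorage(new_nbytes, device="cuda")
    # (Copying the old bytes over and swapping the data pointer is omitted.)
    return new_storage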

Along the path: resize_impl_cuda_ (first device guard) -> maybe_resize_storage_cuda -> resize_bytes_cuda (nested device guard)

Even if the guard in resize_bytes_cuda is potentially redundant within a call to resize_impl_cuda_(..., device_guard = true), that is not the only place where resize_bytes_cuda is called. It is also called here and here, and I don't think either of those call sites sets the device guard beforehand.

It's always the caller's responsibility to guard the callee.

I'm not sure I would agree that that is always true, unless I'm misunderstanding what you mean. If we were to move the responsibility of setting the device guard to the caller of resize_bytes_cuda, then all of its call sites would need to add CUDAGuard guard(storage_impl->device().index());. In order to keep the interface developer-friendly, resize_bytes_cuda should be responsible for changing the device index when needed.

Also, do you think it's necessary to handle/optimize nested device guards?

It would certainly be ideal to avoid more cudaGetDevice() and cudaSetDevice() calls than we really need. But pragmatically, I'm wondering how significant the performance costs of the extra cudaGetDevice() and cudaSetDevice() calls really are.

Because of this check, CUDAGuard will only call cudaSetDevice() if the current device index differs from the new one. So within resize_impl_cuda_(..., device_guard = true), unless I'm mistaken, cudaSetDevice() can only be called twice, and only if the storage has a different device index than the current setting: once upon creation of the first guard, and once upon destruction to restore the original setting. The nested guard won't call cudaSetDevice(). But there are a handful of cudaGetDevice() calls; I'm not sure exactly how many, but I think it would be at least 4. However, if resize_bytes_cuda is responsible for changing the device index when needed, I'm not sure what better solution there is than just using CUDAGuard.
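
Ignoring the C++ details, the guard behaves roughly like this (a Python sketch, not the actual c10 implementation):

import torch

class DeviceGuardSketch:
    # Mimics c10::cuda::CUDAGuard: record the current device on entry, switch
    # only if the requested index differs, and restore on exit.
    def __init__(self, device_index):
        self.target = device_index

    def __enter__(self):
        self.original = torch.cuda.current_device()  # cudaGetDevice
        if self.target != self.original:
            torch.cuda.set_device(self.target)       # cudaSetDevice, only if needed
        return self

    def __exit__(self, *exc):
        if self.target != self.original:
            torch.cuda.set_device(self.original)     # restore, only if needed

So nesting a second guard for the same index only costs the extra current_device() query, not another set_device().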

@kevint324
Contributor

Hi @kurtamohler

Sorry, my bad. I didn't make myself clear. You are right, cudaSetDevice is only called when necessary. What I meant was the pair of cudaGetDevice calls introduced by the ctor and dtor of the device guard.

As for the cost, each cudaGetDevice costs only around 100 nanoseconds on a modern CPU, which is totally negligible. However, we noticed 13 cudaGetDevice calls during a plain torch::abs call, and it seems 10 of them are introduced by device guards. The host latency becomes a bit more noticeable with a multiplier of 10.

Here is a simple test.

import torch

# warm-up call
x = -torch.ones((10, 10, 10), dtype=torch.float, device="cuda")
y = torch.abs(x)

# start gdb here and set a breakpoint on cudaGetDevice
z = torch.abs(y)

And here are the callstacks of the 13 calls to cudaGetDevice.
cudaGetDevice_calls.txt

@kevint324
Contributor

kevint324 commented May 16, 2024

@kurtamohler

I'm trying to understand why there are so many device guards in PyTorch compared to TensorFlow.

Here are my current understandings (please forgive me if I am totally wrong):

Q: When must a programmer insert a device guard?

A: A device guard must be inserted when a piece of code relies on cudaCurrentDevice to work properly and it might be called from a potentially unguarded scope.

The fundamental difference between TensorFlow (TF) and PyTorch (PT) is that TF hides worker threads from the user, binding them to a device context at a very high level. In contrast, PyTorch is much more flexible, allowing threads to switch between devices and perform device-related operations at any moment. For example, in the resize_bytes_cuda case, it doesn’t make sense in the TensorFlow world for a thread bound to device A to resize a tensor on device B.

I think it's harder for a PyTorch developer. For instance, it’s difficult to determine the exact purpose and scope of the device guard inserted here due to the large guarded region.

Does self->dtype().itemsize(); need to be guarded to function? If not, why is it in the guarded region? Or maybe at::detail::computeStorageNbytesContiguous must be guarded? You wouldn’t know until you check the detailed implementation.

maybe_resize_storage_cuda -> resize_bytes_cuda clearly should be guarded, as you mentioned. But after this PR, a device guard was inserted. Is the guard in resize_impl_cuda_ still necessary? If yes, what exactly is it guarding now?

I'm just continually wondering about these things :p

Thanks for the time and really appreciate your reply

@kurtamohler
Collaborator Author

Your concerns are outside the scope of this PR, so I'd recommend opening an issue

@kevint324
Contributor

Sure. Thank you.

pytorchmergebot pushed a commit that referenced this pull request Jun 6, 2024
In #113386 a device guard was [inserted](https://github.com/pytorch/pytorch/pull/113386/files#diff-2691af3a999b3a8f4a0f635aabcd8edf0ffeda501edfa9366648e8a89de12a90R30).

The newly inserted device guard has a clear and more confined scope.
It's hard to tell the exact purpose and scope of the [old device guard](https://github.com/kurtamohler/pytorch/blob/78ffe49a3fcd3ddc4f9f98500ccd3cbdee22a029/aten/src/ATen/native/cuda/Resize.h#L41).

Removing the guard has a negligible positive performance impact and makes the code more understandable.

Thanks

Pull Request resolved: #126498
Approved by: https://github.com/eqy, https://github.com/lezcano
TharinduRusira pushed a commit to TharinduRusira/pytorch that referenced this pull request Jun 14, 2024