Fix `UntypedStorage.resize_` to keep same CUDA device index by kurtamohler · Pull Request #113386 · pytorch/pytorch · GitHub

Conversation

@kurtamohler
Collaborator

@kurtamohler commented Nov 9, 2023

Fixes #113300

cc @ptrblck

BC-breaking note

Before this PR, UntypedStorage.resize_() moved CUDA storage data to the current CUDA device index, torch.cuda.current_device(). After this PR, UntypedStorage.resize_() keeps the data on the same device index it was on before, regardless of the current device index.
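
For illustration, the difference looks roughly like this on a machine with at least two CUDA devices (a minimal sketch, not a test from this PR):

import torch

# Storage that lives on a non-current CUDA device.
storage = torch.empty(8, device="cuda:1").untyped_storage()

torch.cuda.set_device("cuda:0")
storage.resize_(64)

# Before this PR: the resized data could end up on cuda:0 (the current device).
# After this PR: it stays on cuda:1.
print(storage.device)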

@kurtamohler added the module: cuda and release notes: cuda labels Nov 9, 2023
@pytorch-bot

pytorch-bot bot commented Nov 9, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113386

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 78ffe49 with merge base e53da90:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Collaborator

@albanD left a comment


Thanks!

@kurtamohler
Collaborator Author

@pytorchbot merge

@pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Nov 9, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@gchanan
Contributor

gchanan commented Nov 13, 2023

@albanD @kurtamohler worth a BC breaking note?

@kurtamohler added the topic: bc breaking label Nov 13, 2023
@kurtamohler
Collaborator Author

Yeah, probably a good idea. I added a note.

Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
@kevint324
Contributor

kevint324 commented May 15, 2024

Hi @kurtamohler @albanD

Sorry to discuss this on the closed PR. I can open another issue if you think it's a better place to discuss it. Please let me know.

Could you explain a little bit why c10::cuda::CUDAGuard guard(device.index()); is needed here?

We are looking into some performance issues and found quite a few device Set/Get operation pairs introduced by device guards scattered all over the code path.

For instance, it seems resize_bytes_cuda here assumes allocator->allocate(size_bytes); should be guarded.
Also, resize_impl_cuda_ inserts a device guard by default.

Along the path: resize_impl_cuda_ (first device guard) -> maybe_resize_storage_cuda -> resize_bytes_cuda (nested device guard)

Is there any principle on device guard usage? It seems:

  1. All functions declared with TORCH_CUDA_CPP_API should be guarded?
  2. It's always the caller's responsibility to guard the callee. Thus nested device guards are inevitable, since the callee has no idea what the caller did and might set up a device guard again.

And last, if resize_bytes_cuda is guarded anyway, is it necessary to have a device guard here?

Also, do you think it's necessary to handle/optimize nested device guards?
TensorFlow (now XLA) uses TLS to manage the depth of nested guards, which can save some calls to the device layer.
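
Roughly, the TLS idea is something like this (a hypothetical Python sketch for illustration, not TensorFlow's or PyTorch's actual code):

import threading
import torch

_tls = threading.local()

class CachedDeviceGuard:
    # Caches the "current device" in thread-local storage, so nested guards
    # compare against the cached value instead of calling into the device
    # layer (cudaGetDevice) again. Assumes nothing else changes the device
    # behind the cache's back.
    def __init__(self, device_index):
        self.target = device_index

    def __enter__(self):
        if getattr(_tls, "depth", 0) == 0:
            # Only the outermost guard in this thread queries the device layer.
            _tls.current = torch.cuda.current_device()
        _tls.depth = getattr(_tls, "depth", 0) + 1
        self.prev = _tls.current
        if self.target != _tls.current:
            torch.cuda.set_device(self.target)
            _tls.current = self.target
        return self

    def __exit__(self, *exc):
        if self.prev != _tls.current:
            torch.cuda.set_device(self.prev)
            _tls.current = self.prev
        _tls.depth -= 1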

Thanks
Kevin

@kurtamohler
Collaborator Author

kurtamohler commented May 15, 2024

Hey @kevint324, I'm not sure about all the answers to the questions you asked, but below are my responses to some of them.

Could you explain a little bit why c10::cuda::CUDAGuard guard(device.index()); is needed here?

The allocator will use whichever device index is currently set. We need to make sure that the device index is set to the same device index that the data of the StorageImpl given to resize_bytes_cuda is currently allocated on--otherwise, the storage's data may switch to a different device. We want to keep the storage on whichever device the user originally said they wanted it to be on.
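
As a rough Python analogy of the idea (hypothetical helper name; the real code is C++ in Resize.cpp and swaps the raw data pointer in place):

import torch

def resize_storage_bytes_(storage, new_nbytes):
    # c10's CUDA caching allocator allocates on whichever device is *current*
    # (allocate() takes no device argument), so switch to the device the
    # storage already lives on for the duration of the allocation. This is
    # what CUDAGuard guard(device.index()) accomplishes in resize_bytes_cuda.
    with torch.cuda.device(storage.device):
        # "cuda" with no index means the current device, which the guard
        # above has pinned to the storage's own device, not the caller's.
        new_storage = torch.UntypedStorage(new_nbytes, device="cuda")
    # (Copying the old bytes over and swapping the data pointer is omitted.)
    return new_storage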

Along the path: resize_impl_cuda_ (first device guard) -> maybe_resize_storage_cuda -> resize_bytes_cuda (nested device guard)

Even if the guard in resize_bytes_cuda is potentially redundant within a call to resize_impl_cuda_(..., device_guard = true), that is not the only place where resize_bytes_cuda is called. It is also called here and here, and I don't think either of those call sites sets the device guard beforehand.

It's always the caller's responsibility to guard the callee.

I'm not sure I would agree that that is always true, unless I'm misunderstanding what you mean. If we were to move the responsibility of setting the device guard to the caller of resize_bytes_cuda, then all of its call sites would need to add CUDAGuard guard(storage_impl->device().index());. In order to keep the interface developer-friendly, resize_bytes_cuda should be responsible for changing the device index when needed.

Also, do you think it's necessary to handle/optimize nested device guards?

It would certainly be ideal to avoid more cudaGetDevice() and cudaSetDevice() calls than we really need. But pragmatically, I'm wondering how significant the performance costs of the extra cudaGetDevice() and cudaSetDevice() calls really are.

Because of this check, CUDAGuard will only call cudaSetDevice() if the current device index differs from the new one. So within resize_impl_cuda_(..., device_guard = true), unless I'm mistaken, cudaSetDevice() can only be called twice, and only if the storage has a different device index than the current setting: once upon creation of the first guard, and once upon destruction to restore the original setting. The nested guard won't call cudaSetDevice(). But there are a handful of cudaGetDevice() calls; I'm not sure exactly how many, but I think it would be at least 4. However, if resize_bytes_cuda is responsible for changing the device index when needed, I'm not sure what better solution there is than just using CUDAGuard.
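
Ignoring the C++ details, the guard behaves roughly like this (a Python sketch, not the actual c10 implementation):

import torch

class DeviceGuardSketch:
    # Mimics c10::cuda::CUDAGuard: record the current device on entry, switch
    # only if the requested index differs, and restore on exit.
    def __init__(self, device_index):
        self.target = device_index

    def __enter__(self):
        self.original = torch.cuda.current_device()  # cudaGetDevice
        if self.target != self.original:
            torch.cuda.set_device(self.target)       # cudaSetDevice, only if needed
        return self

    def __exit__(self, *exc):
        if self.target != self.original:
            torch.cuda.set_device(self.original)     # restore, only if needed

So nesting a second guard for the same index only costs the extra current_device() query, not another set_device().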

@kevint324
Contributor

Hi @kurtamohler

Sorry, my bad. I didn't make myself clear. You are right, cudaSetDevice is only called when necessary. What I meant was the pair of cudaGetDevice calls introduced by the ctor and dtor of the device guard.

As for the cost, each cudaGetDevice costs only around 100 nanoseconds on a modern CPU, which is totally negligible. However, we noticed 13 cudaGetDevice calls during a plain torch::abs call, and it seems 10 of them are introduced by device guards. The host latency becomes a bit more noticeable with a multiplier of 10.

Here is a simple test.

import torch

# warm-up call
x = -torch.ones((10, 10, 10), dtype=torch.float, device="cuda")
y = torch.abs(x)

# start gdb here and set a breakpoint on cudaGetDevice
z = torch.abs(y)

And here are the callstacks of the 13 calls to cudaGetDevice.
cudaGetDevice_calls.txt

@kevint324
Contributor

kevint324 commented May 16, 2024

@kurtamohler

I'm trying to understand why there are so many device guards in PyTorch compared to TensorFlow.

Here are my current understandings (please forgive me if I am totally wrong):

Q: When must a programmer insert a device guard?

A: A device guard must be inserted when a piece of code relies on cudaCurrentDevice to work properly and it might be called from a potentially unguarded scope.

The fundamental difference between TensorFlow (TF) and PyTorch (PT) is that TF hides worker threads from the user, binding them to a device context at a very high level. In contrast, PyTorch is much more flexible, allowing threads to switch between devices and perform device-related operations at any moment. For example, in the resize_bytes_cuda case, it doesn’t make sense in the TensorFlow world for a thread bound to device A to resize a tensor on device B.

I think it's harder for a PyTorch developer. For instance, it’s difficult to determine the exact purpose and scope of the device guard inserted here due to the large guarded region.

Does self->dtype().itemsize(); need to be guarded to function? If not, why is it in the guarded region? Or maybe at::detail::computeStorageNbytesContiguous must be guarded? You wouldn’t know until you check the detailed implementation.

maybe_resize_storage_cuda -> resize_bytes_cuda clearly should be guarded, as you mentioned. But after this PR, a device guard was inserted. Is the guard in resize_impl_cuda_ still necessary? If yes, what exactly is it guarding now?

I'm just continually wondering about these things :p

Thanks for the time and really appreciate your reply

@kurtamohler
Collaborator Author

Your concerns are outside the scope of this PR, so I'd recommend opening an issue

@kevint324
Contributor

Sure. Thank you.

pytorchmergebot pushed a commit that referenced this pull request Jun 6, 2024
In #113386 a device guard was [inserted](https://github.com/pytorch/pytorch/pull/113386/files#diff-2691af3a999b3a8f4a0f635aabcd8edf0ffeda501edfa9366648e8a89de12a90R30).

The newly inserted device guard has a clear and more confined scope.
It's hard to tell the exact purpose and scope of the [old device guard](https://github.com/kurtamohler/pytorch/blob/78ffe49a3fcd3ddc4f9f98500ccd3cbdee22a029/aten/src/ATen/native/cuda/Resize.h#L41).

Removing the guard has a negligible positive performance impact and makes the code more understandable.

Thanks

Pull Request resolved: #126498
Approved by: https://github.com/eqy, https://github.com/lezcano
TharinduRusira pushed a commit to TharinduRusira/pytorch that referenced this pull request Jun 14, 2024