-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Fix UntypedStorage.resize_
to keep same CUDA device index
#113386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113386
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 78ffe49 with merge base e53da90 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
@albanD @kurtamohler worth a BC breaking note? |
Yeah probably a good idea. I added a note |
…113386) Fixes pytorch#113300 Pull Request resolved: pytorch#113386 Approved by: https://github.com/albanD
Sorry to discuss this on the closed PR. I can open another issue if you think it's a better place to discuss it. Please let me know. Could you explain a little bit that why is We are looking into some performance issue and found quite some device Set/Get operation pairs which are introduced by device guards which are scattered all over the code path. For instance, it seems Along the path, Is there any pricinple on device guard usage? It seems
And last, if Also do you think it's necessary to handle/optimize nested device guard? Thanks |
Hey @kevint324, I'm not sure about all the answers to the questions you asked, but below are my responses to some of them.
The allocator will use whichever device index is currently set. We need to make sure that the device index is set to the same device index that the data of the
If the guard in
I'm not sure I would agree that that is always true, unless I'm misunderstanding what you mean. If we were to move the responsibility of setting the device guard to the caller of
It would certainly be ideal to avoid more Because of this check |
Hi @kurtamohler Sorry, my bad. I didn't make my self clear. You are right. cudaSetDevice is only called when necessary. What I meant was the cudaGetDevice/cudaGetDevice pair introduced by the ctor and dtor of the device guard. As for the cost, each cudaGetDevice costs just around ~100 nanoseconds on a modern CPU which is totally negligible. However we noticed 13 cudaGetDevice call during a plain torch::abs call and seems 10 of them are introduced by device guards. The host latency becomes a little bit more noticeable with a multiplier of 10. Here is the simple test.
And here are the callstacks of the 13 calls to cudaGetDevice. |
I'm trying to understand why there are so many device guards in PyTorch compared to TensorFlow. Here are my current understandings (please forgive me if I am totally wrong): Q: When must a programmer insert a device guard? A: A device guard must be inserted when a piece of code relies on cudaCurrentDevice to work properly and it might be called from a potentially unguarded scope. The fundamental difference between TensorFlow (TF) and PyTorch (PT) is that TF hides worker threads from the user, binding them to a device context at a very high level. In contrast, PyTorch is much more flexible, allowing threads to switch between devices and perform device-related operations at any moment. For example, in the resize_bytes_cuda case, it doesn’t make sense in the TensorFlow world for a thread bound to device A to resize a tensor on device B. I think it's harder for a PyTorch developer. For instance, it’s difficult to determine the exact purpose and scope of the device guard inserted here due to the large guarded region. Does self->dtype().itemsize(); need to be guarded to function? If not, why is it in the guarded region? Or maybe at::detail::computeStorageNbytesContiguous must be guarded? You wouldn’t know until you check the detailed implementation. maybe_resize_storage_cuda -> resize_bytes_cuda clearly should be guarded, as you mentioned. But after this PR, a device guard was inserted. Is the guard in resize_impl_cuda_ still necessary? If yes, what exactly is it guarding now? I'm just continually wondering about these things :p Thanks for the time and really appreciate your reply |
Your concerns are outside the scope of this PR, so I'd recommend opening an issue |
Sure. Thank you. |
In #113386 a device guard was [inserted](https://github.com/pytorch/pytorch/pull/113386/files#diff-2691af3a999b3a8f4a0f635aabcd8edf0ffeda501edfa9366648e8a89de12a90R30). The new inserted device guarded has a clear and more confined guarded scope. And it's hard to tell the exact purpose and scope of the [old device guard](https://github.com/kurtamohler/pytorch/blob/78ffe49a3fcd3ddc4f9f98500ccd3cbdee22a029/aten/src/ATen/native/cuda/Resize.h#L41). Removing the guard has negligible positive performance impact and make the code more understandable. Thanks Pull Request resolved: #126498 Approved by: https://github.com/eqy, https://github.com/lezcano
In pytorch#113386 a device guard was [inserted](https://github.com/pytorch/pytorch/pull/113386/files#diff-2691af3a999b3a8f4a0f635aabcd8edf0ffeda501edfa9366648e8a89de12a90R30). The new inserted device guarded has a clear and more confined guarded scope. And it's hard to tell the exact purpose and scope of the [old device guard](https://github.com/kurtamohler/pytorch/blob/78ffe49a3fcd3ddc4f9f98500ccd3cbdee22a029/aten/src/ATen/native/cuda/Resize.h#L41). Removing the guard has negligible positive performance impact and make the code more understandable. Thanks Pull Request resolved: pytorch#126498 Approved by: https://github.com/eqy, https://github.com/lezcano
Fixes #113300
cc @ptrblck
BC-breaking note
Before this PR,
UntypedStorage.resize_()
used to move CUDA storage data to the current CUDA device index,torch.cuda.current_device()
. After this PR,UntypedStorage.resize_()
keeps the data on the same device index that it was on before, regardless of the current device index.