Rearrange dimensions for pointwise operations for better performance. #4174
Conversation
In existing code, pointwise operations on transposed tensors process data "column by column", resulting in poor performance. The worst case happens when all operands are transposed tensors. This change tries to "un-transpose" tensors in such a case, so that memory access patterns are as sequential as possible.
For example, on my GTX 1080, a simple addition where all operands are transposed used to be noticeably slower. After this change, both operations take about the same duration (~208 us). Unfortunately it does not help when only one of the operands is transposed (if only A (the output) is transposed: 776 us; if only B is transposed: 377 us). In theory I could rearrange dimensions to prefer the output being in the "correct" order (i.e., turn the 776 us case into 377 us), but I don't know if it would trash performance on other machines/operations, so I decided to tackle only the most obvious (and safe) case.
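To make the access-pattern point concrete, here is a small, hypothetical Python helper (not part of the PR) that lists the linear memory offsets a row-major traversal touches for a given sizes/strides pair. A transposed layout produces strided offsets, while the "un-transposed" dimension order visits memory sequentially:

```python
def visit_order(sizes, strides):
    """Return the linear memory offsets touched when iterating a
    strided tensor in logical (row-major) order. Hypothetical helper
    for illustration only; not part of the PR."""
    offsets = []

    def rec(dim, base):
        if dim == len(sizes):
            offsets.append(base)
            return
        for i in range(sizes[dim]):
            rec(dim + 1, base + i * strides[dim])

    rec(0, 0)
    return offsets

# A 2x3 view of a transposed buffer jumps around in memory:
print(visit_order([2, 3], [1, 2]))  # [0, 2, 4, 1, 3, 5]
# Swapping the dims ("un-transposing") makes the walk sequential:
print(visit_order([3, 2], [2, 1]))  # [0, 1, 2, 3, 4, 5]
```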
I think this also addresses #4010.
@pytorchbot test this please
Hi all, is someone looking at this, or is there something I should do to get it reviewed? Thanks!
Wow, nice catch. This looks good, but I'd like someone else to verify this patch as well, just to make sure everything is fine.
My only comment is that it would be nice to add a more detailed description of the algorithm before the function. A high-level view is that if you have k inputs (up to 3 here), you can view sizes[i] and strides[i] as k-element tuples. Now, define a greater-than relation on those tuples as:
u > v iff:
- for all i, u[i] >= v[i] (u is never worse for access pattern)
- there exists i, u[i] > v[i] (u is better for access pattern somewhere)
Then, what this function does is basically sort the dimensions according to this newly defined relation on their stride tuples (and permute sizes accordingly).
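The sort described above can be sketched in Python. `rearrange_dims` is a hypothetical model of the C++ helper (its name and shape are assumptions, not the actual THC code): it does pairwise swaps whenever one dimension's stride tuple dominates another's under the partial order above:

```python
def rearrange_dims(sizes, stride_lists):
    """Reorder dimensions so that stride tuples are decreasing wherever
    the partial order makes them comparable. `sizes` is the common
    shape; `stride_lists` holds one stride tuple per operand.
    Hypothetical Python model of the PR's C++ helper."""
    ndim = len(sizes)
    order = list(range(ndim))
    for i in range(ndim):
        for j in range(i + 1, ndim):
            u = [s[order[j]] for s in stride_lists]
            v = [s[order[i]] for s in stride_lists]
            # Swap when dim j dominates dim i: every operand's stride
            # is >= and at least one is strictly greater.
            if all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v)):
                order[i], order[j] = order[j], order[i]
    new_sizes = [sizes[d] for d in order]
    new_strides = [[s[d] for d in order] for s in stride_lists]
    return new_sizes, new_strides

# Two transposed 3x4 operands: both get "un-transposed" together.
print(rearrange_dims([3, 4], [[1, 3], [1, 3]]))
# ([4, 3], [[3, 1], [3, 1]])
# One contiguous, one transposed: tuples are incomparable, so the
# order is left alone (matching the 776/377 us cases in the PR).
print(rearrange_dims([3, 4], [[4, 1], [1, 3]]))
# ([3, 4], [[4, 1], [1, 3]])
```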
Thanks for the review! Updated documentation as suggested.
Thanks a lot, @yongjik!
@yongjik I wanted to check whether this fixes #4010, but it still shows the old behavior:

```python
a = torch.ones(5, 1, 1)
print(a.expand(len(a), 10, 10).storage().size())
# 5
print(a.expand(len(a), 10, 10).mul(5).storage().size())
# 500
```

I thought this PR would rearrange, then collapse dims (like the zero strides in this example), perform the op only on the small tensor, and then introduce the zero strides back: https://github.com/yongjik/pytorch/blob/22796bd9cac51ba64adca240508284cb3d49a5e4/aten/src/THC/THCApply.cuh#L305-L306 But the example results in the old behavior, and a larger tensor is allocated for the result. Is this the expected behavior for the scope of this PR? @soumith I remember our discussion, but it seemed that this would solve it.
My PR should not change any visible behavior. It is strictly a performance optimization: it merely shuffles the order in which CUDA kernels visit each element of a tensor. (For that matter, collapseDims() does not change the tensor's shape either: it merely tries to find a more efficient way of iterating over the elements for the particular CUDA kernel launch.) I tested your example code above, and it seems a new tensor storage is already created with the "expanded" dimension before THC_pointwiseApply2 is even called.
So my PR can't help with #4010.
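Since collapseDims() comes up here, a minimal Python sketch of the idea may help (a hypothetical model, not the actual THC code): adjacent dimensions are merged when their elements sit back-to-back in memory, which reduces per-element index arithmetic inside the kernel without changing the tensor's logical shape:

```python
def collapse_dims(sizes, strides):
    """Merge adjacent dimensions that are contiguous in memory so a
    kernel can iterate over fewer, larger dimensions. Hypothetical
    Python model of the collapseDims() idea; the tensor's own shape
    is never changed, only the iteration pattern."""
    new_sizes = [sizes[0]]
    new_strides = [strides[0]]
    for size, stride in zip(sizes[1:], strides[1:]):
        # A dim folds into the previous one exactly when the previous
        # stride equals stride * size (elements are laid out back-to-back).
        if new_strides[-1] == stride * size:
            new_sizes[-1] *= size
            new_strides[-1] = stride
        else:
            new_sizes.append(size)
            new_strides.append(stride)
    return new_sizes, new_strides

# A contiguous 2x3x4 tensor collapses to a single flat dimension:
print(collapse_dims([2, 3, 4], [12, 4, 1]))  # ([24], [1])
# A transposed 4x3 view cannot be collapsed:
print(collapse_dims([4, 3], [1, 4]))  # ([4, 3], [1, 4])
```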
@yongjik Got it. Thanks for the clarification.
…#4174)

* Rearrange dimensions for pointwise operations for better performance.

  In existing code, pointwise operations on transposed tensors process data "column by column", resulting in poor performance. The worst case happens when all operands are transposed tensors. This change tries to "un-transpose" tensors in such a case, so that memory access patterns are as sequential as possible.

* More explanation on what rearrangeDims() does.
* Fixed a very important (and stupid) typo.