Rearrange dimensions for pointwise operations for better performance. by yongjik · Pull Request #4174 · pytorch/pytorch · GitHub

Conversation

@yongjik
Contributor

@yongjik yongjik commented Dec 14, 2017

In existing code, pointwise operations on transposed tensors process data
"column by column", resulting in poor performance. The worst case happens when
all operands are transposed tensors.

This change tries to "un-transpose" tensors in such a case, so that memory
access patterns are as sequential as possible.

@yongjik
Contributor Author

yongjik commented Dec 14, 2017

For example, on my GTX 1080, a simple addition A += B on two 2048*2048 float tensors takes ~208 us. However, if I transpose both of them (A = A.t(); B = B.t()), the same addition takes ~2380(!!) us.

After this change, both operations take about the same duration (~208 us).

Unfortunately it does not help when only one of them is transposed. (If only A (the output) is transposed: 776 us; if only B is transposed: 377 us.) In theory I could rearrange dimensions to prefer the output being in the "correct" order (i.e., turn 776 us into 377 us), but I don't know whether it would hurt performance on other machines/operations, so I decided to tackle only the most obvious (and safe) case.
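The slowdown comes from the memory access pattern: transposing a 2-D tensor just swaps its strides, so iterating in logical row-major order jumps across memory. A minimal numpy sketch (illustrative only, not the CUDA benchmark above) makes the stride flip visible:

```python
import numpy as np

# A row-major 2048x2048 float32 tensor: adjacent elements within a row
# are 4 bytes apart; adjacent rows are 8192 bytes apart.
a = np.zeros((2048, 2048), dtype=np.float32)
print(a.strides)    # (8192, 4)

# Transposing swaps the strides; no data moves.
at = a.T
print(at.strides)   # (4, 8192)

# A kernel walking `at` in logical row-major order now jumps 8 KB between
# consecutive elements, defeating coalesced access. When all operands share
# this layout, the dimensions can simply be walked in the other order.
```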

@vadimkantorov
Contributor

I think this also addresses #4010

@ezyang
Contributor

ezyang commented Dec 14, 2017

@pytorchbot test this please

@yongjik
Contributor Author

yongjik commented Dec 16, 2017

Hi all,

Is someone looking at this, or is there something I should do to get it reviewed?

Thanks!

Contributor

@apaszke apaszke left a comment


Wow, nice catch. This looks good, but I'd like someone else to verify this patch as well, just to make sure everything is fine.


My only comment is that it would be nice to add a more detailed description of the algorithm before the function. A high-level view is that if you have k operands (up to 3 here), you can view sizes[i] and strides[i] as k-element tuples. Now, define a greater-than relation on those tuples as

u > v iff:

  • for all i, u[i] >= v[i] (u is never worse for access pattern)
  • there exists i, u[i] > v[i] (u is better for access pattern somewhere)

Then what this function does is basically a sort, according to this newly defined relation, on tuples of strides (transposing sizes accordingly).
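The sort described above can be sketched in Python; the names `dominates` and `rearrange_dims` are hypothetical and not the actual identifiers in THCApply.cuh:

```python
def dominates(u, v):
    """u > v iff u is elementwise >= v and strictly greater somewhere."""
    return all(ui >= vi for ui, vi in zip(u, v)) and \
           any(ui > vi for ui, vi in zip(u, v))

def rearrange_dims(sizes, strides_list):
    """sizes: per-dimension sizes; strides_list: one stride tuple per operand.
    Returns permuted sizes and strides so that dimensions whose stride tuples
    dominate (larger strides in every operand) come first, leaving the
    smallest-stride dimension innermost for sequential access."""
    ndim = len(sizes)
    # For each dimension, collect its stride across all k operands.
    keys = [tuple(s[d] for s in strides_list) for d in range(ndim)]
    order = list(range(ndim))
    # Selection sort on the partial order defined by dominates().
    for i in range(ndim):
        for j in range(i + 1, ndim):
            if dominates(keys[order[j]], keys[order[i]]):
                order[i], order[j] = order[j], order[i]
    new_sizes = [sizes[d] for d in order]
    new_strides = [tuple(s[d] for d in order) for s in strides_list]
    return new_sizes, new_strides
```

With both operands transposed, e.g. strides (1, 2048) for each, the two dimensions swap so every operand ends up with strides (2048, 1), i.e. row-major iteration again; already-contiguous inputs are left untouched.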

@yongjik
Contributor Author

yongjik commented Dec 16, 2017

Thanks for the review! Updated documentation as suggested.

@soumith soumith merged commit 5c46427 into pytorch:master Dec 18, 2017
@soumith
Member

soumith commented Dec 18, 2017

thanks a lot @yongjik !

@vadimkantorov
Contributor

vadimkantorov commented Dec 18, 2017

@yongjik I wanted to check whether this fixes #4010, but the old behavior remains:

a = torch.ones(5, 1, 1)
print(a.expand(len(a), 10, 10).storage().size())
# 5
print(a.expand(len(a), 10, 10).mul(5).storage().size())
# 500

I thought this PR would rearrange, then collapse dims (like the zero strides in this example), and perform the op only on the small tensor, and then introduce the zero strides back: https://github.com/yongjik/pytorch/blob/22796bd9cac51ba64adca240508284cb3d49a5e4/aten/src/THC/THCApply.cuh#L305-L306

But the example results in the old behavior, and a larger tensor is allocated for the result. Is it the expected behavior for the scope of this PR?
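For context on why the storage grows, the zero-stride trick behind expand can be shown with numpy's broadcast_to, which behaves analogously (a sketch, not the PyTorch code path):

```python
import numpy as np

a = np.ones((5, 1, 1))                 # 5 elements, float64
b = np.broadcast_to(a, (5, 10, 10))    # zero-stride view, no copy
print(b.strides)     # (8, 0, 0): the expanded dims have stride 0
print(a.size)        # 5: still backed by just 5 elements

# Any out-of-place op on `b` materializes the full result:
print((b * 5).size)  # 500, matching the PyTorch behavior shown above
```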

@soumith I remember our discussion, but it seemed that this solves it

@yongjik
Contributor Author

yongjik commented Dec 19, 2017

My PR should not change any visible behavior. It is strictly a performance optimization: it merely shuffles the order in which CUDA kernels visit each element of a tensor. (For that matter, collapseDims() does not change the tensor's shape either: it merely tries to find a more efficient way of iterating over the elements for that particular CUDA kernel launch.)

I tested your example code above and it seems a new tensor storage is already created with the "expanded" dimension before THC_pointwiseApply2 is even called:

THCTensor_(resizeAs)(state, self_, src_);

So my PR can't help with #4010.
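For reference, a collapseDims-style pass only fuses adjacent dimensions that happen to be contiguous in memory, so iteration gets flatter without the shape (or any zero strides) changing. A hypothetical Python sketch of the core idea:

```python
def collapse_dims(sizes, strides):
    """Fuse adjacent dims where the outer dim's stride equals the inner
    dim's stride times its size, i.e. the two dims are one contiguous run."""
    out_sizes, out_strides = [sizes[0]], [strides[0]]
    for sz, st in zip(sizes[1:], strides[1:]):
        if out_strides[-1] == st * sz:   # contiguous with previous dim: fuse
            out_sizes[-1] *= sz
            out_strides[-1] = st
        else:                            # gap (or zero stride): keep separate
            out_sizes.append(sz)
            out_strides.append(st)
    return out_sizes, out_strides
```

The real collapseDims in THC is more involved; this only shows the fusion test, which is purely about the iteration order for one kernel launch, not the tensor's visible shape.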

@vadimkantorov
Contributor

@yongjik Got it. Thanks for the clarification.

@yongjik yongjik deleted the test1 branch December 24, 2017 23:11
@soumith soumith added the 0.3.1 label Feb 4, 2018
soumith pushed a commit that referenced this pull request Feb 7, 2018
…#4174)

* Rearrange dimensions for pointwise operations for better performance.

In existing code, pointwise operations on transposed tensors process data
"column by column", resulting in poor performance.  The worst case happens when
all operands are transposed tensors.

This change tries to "un-transpose" tensors in such a case, so that memory
access patterns are as sequential as possible.

* More explanation on what rearrangeDims() does.

* Fixed a very important (and stupid) typo.
