Rearrange dimensions for pointwise operations for better performance. by yongjik · Pull Request #4174 · pytorch/pytorch · GitHub

Conversation

@yongjik
Contributor

@yongjik yongjik commented Dec 14, 2017

In existing code, pointwise operations on transposed tensors process data
"column by column", resulting in poor performance. The worst case happens when
all operands are transposed tensors.

This change tries to "un-transpose" tensors in such a case, so that memory
access patterns are as sequential as possible.

@yongjik
Contributor Author

yongjik commented Dec 14, 2017

For example, on my GTX 1080, a simple addition A += B on two 2048*2048 float tensors takes ~208 us. However, if I transpose both of them (A = A.t(); B = B.t()), the same addition takes ~2380(!!) us.

After this change, both operations take about the same duration (~208 us).

Unfortunately it does not help when only one of them is transposed. (If only A (the output) is transposed: 776 us; if only B is transposed: 377 us.) In theory I could rearrange dimensions to prefer the output being in the "correct" order (i.e., turn 776 us into 377 us), but I don't know whether it would hurt performance on other machines/operations, so I decided to tackle only the most obvious (and safe) case.
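The slowdown comes from the memory access pattern: transposing a 2-D tensor just swaps its strides, so iterating in logical row-major order jumps across memory. A minimal numpy sketch (illustrative only, not the CUDA benchmark above) makes the stride flip visible:

```python
import numpy as np

# A row-major 2048x2048 float32 tensor: adjacent elements within a row
# are 4 bytes apart; adjacent rows are 8192 bytes apart.
a = np.zeros((2048, 2048), dtype=np.float32)
print(a.strides)    # (8192, 4)

# Transposing swaps the strides; no data moves.
at = a.T
print(at.strides)   # (4, 8192)

# A kernel walking `at` in logical row-major order now jumps 8 KB between
# consecutive elements, defeating coalesced access. When all operands share
# this layout, the dimensions can simply be walked in the other order.
```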

@vadimkantorov
Contributor

I think this also addresses #4010

@ezyang
Contributor

ezyang commented Dec 14, 2017

@pytorchbot test this please

@yongjik
Contributor Author

yongjik commented Dec 16, 2017

Hi all,

Is someone looking at this, or is there something I should do to get it reviewed?

Thanks!

Contributor

@apaszke apaszke left a comment


Wow, nice catch. This looks good, but I'd like someone else to verify this patch as well, just to make sure everything is fine.


My only comment is that it would be nice to add a more detailed description of the algorithm before the function. A high-level view is that if you have k operands (up to 3 here), you can view sizes[i] and strides[i] as k-element tuples. Now, define a greater-than relation on those tuples as

u > v iff:

  • for all i, u[i] >= v[i] (u is never worse for access pattern)
  • there exists i, u[i] > v[i] (u is better for access pattern somewhere)

Then what this function does is basically a sort, according to this newly defined relation, on tuples of strides (transposing sizes accordingly).
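The sort described above can be sketched in Python; the names `dominates` and `rearrange_dims` are hypothetical and not the actual identifiers in THCApply.cuh:

```python
def dominates(u, v):
    """u > v iff u is elementwise >= v and strictly greater somewhere."""
    return all(ui >= vi for ui, vi in zip(u, v)) and \
           any(ui > vi for ui, vi in zip(u, v))

def rearrange_dims(sizes, strides_list):
    """sizes: per-dimension sizes; strides_list: one stride tuple per operand.
    Returns permuted sizes and strides so that dimensions whose stride tuples
    dominate (larger strides in every operand) come first, leaving the
    smallest-stride dimension innermost for sequential access."""
    ndim = len(sizes)
    # For each dimension, collect its stride across all k operands.
    keys = [tuple(s[d] for s in strides_list) for d in range(ndim)]
    order = list(range(ndim))
    # Selection sort on the partial order defined by dominates().
    for i in range(ndim):
        for j in range(i + 1, ndim):
            if dominates(keys[order[j]], keys[order[i]]):
                order[i], order[j] = order[j], order[i]
    new_sizes = [sizes[d] for d in order]
    new_strides = [tuple(s[d] for d in order) for s in strides_list]
    return new_sizes, new_strides
```

With both operands transposed, e.g. strides (1, 2048) for each, the two dimensions swap so every operand ends up with strides (2048, 1), i.e. row-major iteration again; already-contiguous inputs are left untouched.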

@yongjik
Contributor Author

yongjik commented Dec 16, 2017

Thanks for the review! Updated documentation as suggested.

@soumith soumith merged commit 5c46427 into pytorch:master Dec 18, 2017
@soumith
Member

soumith commented Dec 18, 2017

thanks a lot @yongjik !

@vadimkantorov
Contributor

vadimkantorov commented Dec 18, 2017

@yongjik I wanted to check whether this fixes #4010, but the old behavior remains:

a = torch.ones(5, 1, 1)
print(a.expand(len(a), 10, 10).storage().size())
# 5
print(a.expand(len(a), 10, 10).mul(5).storage().size())
# 500

I thought this PR would rearrange, then collapse dims (like the zero strides in this example), and perform the op only on the small tensor, and then introduce the zero strides back: https://github.com/yongjik/pytorch/blob/22796bd9cac51ba64adca240508284cb3d49a5e4/aten/src/THC/THCApply.cuh#L305-L306

But the example results in the old behavior, and a larger tensor is allocated for the result. Is it the expected behavior for the scope of this PR?
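For context on why the storage grows, the zero-stride trick behind expand can be shown with numpy's broadcast_to, which behaves analogously (a sketch, not the PyTorch code path):

```python
import numpy as np

a = np.ones((5, 1, 1))                 # 5 elements, float64
b = np.broadcast_to(a, (5, 10, 10))    # zero-stride view, no copy
print(b.strides)     # (8, 0, 0): the expanded dims have stride 0
print(a.size)        # 5: still backed by just 5 elements

# Any out-of-place op on `b` materializes the full result:
print((b * 5).size)  # 500, matching the PyTorch behavior shown above
```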

@soumith I remember our discussion, but it seemed that this solves it

@yongjik
Contributor Author

yongjik commented Dec 19, 2017

My PR should not change any visible behavior. It is strictly a performance optimization: it merely shuffles the order in which CUDA kernels visit each element of a tensor. (For that matter, collapseDims() does not change the tensor's shape either: it merely tries to find a more efficient way of iterating over the elements for that particular CUDA kernel launch.)

I tested your example code above and it seems a new tensor storage is already created with the "expanded" dimension before THC_pointwiseApply2 is even called:

THCTensor_(resizeAs)(state, self_, src_);

So my PR can't help with #4010.
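For reference, a collapseDims-style pass only fuses adjacent dimensions that happen to be contiguous in memory, so iteration gets flatter without the shape (or any zero strides) changing. A hypothetical Python sketch of the core idea:

```python
def collapse_dims(sizes, strides):
    """Fuse adjacent dims where the outer dim's stride equals the inner
    dim's stride times its size, i.e. the two dims are one contiguous run."""
    out_sizes, out_strides = [sizes[0]], [strides[0]]
    for sz, st in zip(sizes[1:], strides[1:]):
        if out_strides[-1] == st * sz:   # contiguous with previous dim: fuse
            out_sizes[-1] *= sz
            out_strides[-1] = st
        else:                            # gap (or zero stride): keep separate
            out_sizes.append(sz)
            out_strides.append(st)
    return out_sizes, out_strides
```

The real collapseDims in THC is more involved; this only shows the fusion test, which is purely about the iteration order for one kernel launch, not the tensor's visible shape.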

@vadimkantorov
Contributor

@yongjik Got it. Thanks for the clarification.

@yongjik yongjik deleted the test1 branch December 24, 2017 23:11
@soumith soumith added the 0.3.1 label Feb 4, 2018
soumith pushed a commit that referenced this pull request Feb 7, 2018
…#4174)

* Rearrange dimensions for pointwise operations for better performance.

In existing code, pointwise operations on transposed tensors process data
"column by column", resulting in poor performance.  The worst case happens when
all operands are transposed tensors.

This change tries to "un-transpose" tensors in such a case, so that memory
access patterns are as sequential as possible.

* More explanation on what rearrangeDims() does.

* Fixed a very important (and stupid) typo.
