Migrate nonzero from TH to ATen (CPU) #58811
💊 CI failures summary and remediations. As of commit 815dad5 (more details on the Dr. CI page):
🕵️ 2 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
```cpp
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
    kHalf, kBFloat16, kBool, self.scalar_type(), "nonzero_count_cpu", [&] {
      at::parallel_for(0, iter.numel(), internal::GRAIN_SIZE, [&] (int64_t begin, int64_t end) {
```
at::parallel_for adds some non-trivial overhead; it's typically beneficial to skip it when the size is < GRAIN_SIZE.
I've changed at::parallel_for so it calls the lambda directly without entering the parallel region in that case. I hope that's acceptable, as it's much cleaner on the caller's end.
Yeah, that's good. Interesting, ParallelNative and ParallelTBB already had this shortcut.
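For readers following along, a minimal sketch of the early-exit behavior being discussed; this is illustrative only, not ATen's actual `at::parallel_for`, and the `GRAIN_SIZE` constant here is just a stand-in:

```cpp
// Illustrative sketch, not ATen's real at::parallel_for: when the range fits
// in a single grain, call the lambda directly instead of entering the
// parallel region, so small inputs pay no thread-pool overhead.
#include <cstdint>

namespace sketch {

constexpr int64_t GRAIN_SIZE = 32768;  // stand-in for at::internal::GRAIN_SIZE

template <typename F>
void parallel_for(int64_t begin, int64_t end, int64_t grain_size, const F& f) {
  if (begin >= end) {
    return;
  }
  if (end - begin < grain_size) {
    f(begin, end);  // serial fast path: no parallel region
    return;
  }
  // ... otherwise split [begin, end) into grain-sized chunks and dispatch
  // them to the thread pool (omitted here) ...
  f(begin, end);
}

}  // namespace sketch
```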
```cpp
// Convert thread-local counts to cumulative sum
for (size_t i = 1; i < thread_count_nonzero.size(); ++i) {
  thread_count_nonzero[i] += thread_count_nonzero[i - 1];
```
is DimVector initialized to 0? Otherwise thread_count_nonzero[0] is uninitialized?
Yes, DimVector is like std::vector in that it will always zero-initialize the array.
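To illustrate the point in a self-contained way (using `std::vector` instead of `DimVector`; the `num_threads + 1` sizing here is an assumption for the example, not the PR's exact code): a sized constructor value-initializes its elements, so the first entry is a well-defined 0 before the cumulative sum runs.

```cpp
// Self-contained illustration, not the PR code: sized construction
// value-initializes, so entry 0 is a well-defined zero and the prefix-sum
// loop turns per-thread counts into per-thread output offsets.
#include <cassert>
#include <cstdint>
#include <vector>

int main() {
  const int num_threads = 8;  // hypothetical thread count
  std::vector<int64_t> thread_count_nonzero(num_threads + 1);  // all zeros
  assert(thread_count_nonzero[0] == 0);

  // Pretend each thread t stored its count at [t + 1], then accumulate:
  for (size_t i = 1; i < thread_count_nonzero.size(); ++i) {
    thread_count_nonzero[i] += thread_count_nonzero[i - 1];
  }
  return 0;
}
```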
```cpp
const auto self_sizes = self.sizes();
const auto total_nonzero = thread_count_nonzero.back();
const int64_t ndim = self_sizes.size();
resize_output(result, {total_nonzero, ndim});
```
Currently the contiguity of the tensor returned by nonzero differs between CUDA and CPU, see #46224. It would be good to fix it here, but that would require the result tensor to be ({ndim, total_nonzero}).t(). Proper handling of result if it was passed in by the user is also a bit annoying.
This was easier than I expected. It turns out resize_output returns true if the tensor was resized, so I just re-stride the output in that case.
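Roughly the shape of that fix (a sketch with assumed names, not the PR's exact code): only when `resize_output` reports that it actually (re)allocated the output do we swap in column-major strides, so a user-provided `out` tensor keeps its own layout.

```cpp
// Sketch only (assumed names): give a freshly resized result the transposed
// layout so CPU matches CUDA's nonzero output (see gh-46224), equivalent to
// allocating {ndim, total_nonzero} and returning its .t() view.
#include <ATen/ATen.h>
#include <ATen/native/Resize.h>

void fixup_nonzero_result(at::Tensor& result, int64_t total_nonzero, int64_t ndim) {
  if (at::native::resize_output(result, {total_nonzero, ndim})) {
    result.as_strided_({total_nonzero, ndim}, {1, total_nonzero});
  }
}
```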
```cpp
TORCH_INTERNAL_ASSERT(begin == thread_begin[tid]);

// +1 faster than additional condition check inside loop
c10::SmallVector<int64_t, 33> sizes(ndim + 1, -1);
```
why 33?
It means no allocation is done unless ndim > 32. With DimVector's default static size that would be ndim > 4 which is more likely to happen.
While nominally we don't have a limit on the number of dimensions, de-facto on GPU it's 25 (MAX_DIMS in OffsetCalculator.cuh), so supporting more than that doesn't make sense
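For context on the trade-off being discussed, an illustrative snippet (the "ndim > 4" figure follows from DimVector's inline capacity of 5):

```cpp
// Illustration of the inline-storage trade-off: SmallVector<T, N> only
// heap-allocates once it outgrows N inline elements, so N = 33 covers
// ndim + 1 for any input with up to 32 dimensions without an allocation,
// whereas at::DimVector (SmallVector<int64_t, 5>) would allocate for any
// ndim > 4.
#include <cstdint>
#include <c10/util/SmallVector.h>

void make_sizes(int64_t ndim) {
  c10::SmallVector<int64_t, 33> sizes(ndim + 1, -1);  // no heap alloc if ndim <= 32
  (void)sizes;
}
```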
```cpp
const auto in_stride = strides[0];
const auto out_stride1 = out_accessor.stride(1);
const auto out_stride0 = out_accessor.stride(0) - ndim * out_stride1;
const auto ndim = out_accessor.size(1);
```
why do you need this ndim? It's the same as the captured one?
I've added a comment. The local variables allow the compiler to keep them in registers; otherwise its alias analysis isn't good enough to realize the int64_t pointer we're writing through won't alias them.
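A generic illustration of that pattern (not the PR's actual kernel; all names here are made up for the example): copying loop-invariant values into locals before the hot loop lets the compiler keep them in registers, since stores through an `int64_t*` could otherwise be assumed to alias `int64_t` values reachable through the accessor.

```cpp
// Hoisting loop-invariant values into locals so the compiler need not reload
// them after every store through `out` (which it must otherwise assume may
// alias them).
#include <cstdint>

void write_indices(int64_t* out, const int64_t* src_strides,
                   int64_t ndim_outer, int64_t count) {
  const int64_t ndim = ndim_outer;        // local copy, stays in a register
  const int64_t stride = src_strides[0];  // likewise

  for (int64_t i = 0; i < count; ++i) {
    for (int64_t d = 0; d < ndim; ++d) {
      out[i * ndim + d] = i * stride + d;  // stand-in for the real index math
    }
  }
}
```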
```cpp
int64_t* out = out_ptr;

for (int64_t i = 0; i < n2; ++i) {
  const char* ptr = data[0] + i * strides[1];
```
So for a 1-d tensor strides[1] will never be used, as i will only be 0. We should think about how we can make it possible to not pack strides to 2*ntensors(), but I agree that for now it's unsafe (re: other TI PRs).
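For reference, a simplified sketch of the two-level loop shape being discussed (assumed convention, not TensorIterator's exact API): with one operand, `strides[0]` is the inner-dimension stride and `strides[1]` the outer one, so a 1-D iteration has `n2 == 1` and never reads `strides[1]`.

```cpp
// Simplified loop2d-style iteration over a single operand's raw bytes.
#include <cstdint>

void loop2d(char* const* data, const int64_t* strides, int64_t n1, int64_t n2) {
  for (int64_t i = 0; i < n2; ++i) {
    const char* ptr = data[0] + i * strides[1];  // outer-dimension stride
    for (int64_t j = 0; j < n1; ++j) {
      // ... inspect the element at `ptr` here ...
      ptr += strides[0];  // inner-dimension stride
    }
  }
}
```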
Can you please run the benchmarks with the latest version?
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Here are the updated benchmark numbers:
@peterbell10 android tests are failing with:
Is there a list of "selected" kernels somewhere? I'm adding new kernels, so I assume those need to be added to whatever list.
I've run the suggested tracer updates internally, let's see if it fixes it.
Sorry, that suggestion still didn't work. I'm working with the mobile team to try to get it resolved.
Sorry @peterbell10, the XLA error is real (probably related to the new dispatch for count_nonzero), reverting.
This pull request has been reverted by 657b75d.
Summary: Resubmit of #58811, Closes gh-24745

The existing PR (gh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue; but it breaks down for example with channels last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm. As part of this, I wrote a custom `count_nonzero` that gives per-thread counts, which is necessary to write the outputs in the right location.

| Shape      |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2150 us |            551 us |
| 128,128,32 | 1250 us |          1020 us |            197 us |
| 64,128,32  |  581 us |           495 us |             99 us |
| 32,128,32  |  292 us |           255 us |             83 us |
| 16,128,32  |  147 us |           126 us |             75 us |
| 8,128,32   |   75 us |            65 us |             65 us |
| 4,128,32   |   39 us |            33 us |             33 us |
| 2,128,32   |   20 us |            18 us |             18 us |
| 1,128,32   |   11 us |             9 us |              9 us |

Pull Request resolved: #59149
Reviewed By: mruberry
Differential Revision: D28817466
Pulled By: ngimel
fbshipit-source-id: f08f6c003c339368fd53dabd28e9ada9e59de732
Summary: Closes pytorchgh-24745

The existing PR (pytorchgh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue; but it breaks down for example with channels last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm. As part of this, I wrote a custom `count_nonzero` that gives per-thread counts, which is necessary to write the outputs in the right location.

| Shape      |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2220 us |            496 us |
| 128,128,32 | 1250 us |           976 us |            175 us |
| 64,128,32  |  581 us |           486 us |             88 us |
| 32,128,32  |  292 us |           245 us |             80 us |
| 16,128,32  |  147 us |           120 us |             71 us |
| 8,128,32   |   75 us |            61 us |             61 us |
| 4,128,32   |   39 us |            32 us |             32 us |
| 2,128,32   |   20 us |            17 us |             17 us |
| 1,128,32   |   11 us |             9 us |              9 us |

Pull Request resolved: pytorch#58811
Reviewed By: anjali411
Differential Revision: D28700259
Pulled By: ngimel
fbshipit-source-id: 9b279ca7c36d8e348b7e5e4be0dd159e05aee159
Closes gh-24745
The existing PR (gh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue; but it breaks down for example with channels last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm. As part of this, I wrote a custom `count_nonzero` that gives per-thread counts, which is necessary to write the outputs in the right location.
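To make the two-pass strategy described above concrete, here is a rough, self-contained sketch that uses plain `std::thread` and a 1-D float input in place of `at::parallel_for` and `TensorIterator`; the function name, the flat input, and the chunking scheme are illustrative assumptions, not the PR's actual kernel.

```cpp
// Two-pass parallel nonzero sketch: count per thread, prefix-sum the counts
// into output offsets, then let each thread write its slice independently.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

std::vector<int64_t> nonzero_indices(const std::vector<float>& input,
                                     int num_threads) {
  const int64_t n = static_cast<int64_t>(input.size());
  const int64_t chunk = (n + num_threads - 1) / num_threads;

  // Pass 1: each thread counts the nonzeros in its own slice.
  std::vector<int64_t> counts(num_threads + 1, 0);
  {
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
      workers.emplace_back([&, t] {
        const int64_t begin = std::min<int64_t>(int64_t{t} * chunk, n);
        const int64_t end = std::min<int64_t>(begin + chunk, n);
        int64_t c = 0;
        for (int64_t i = begin; i < end; ++i) c += (input[i] != 0.0f);
        counts[t + 1] = c;
      });
    }
    for (auto& w : workers) w.join();
  }

  // Prefix sum: counts[t] becomes thread t's starting offset in the output.
  std::partial_sum(counts.begin(), counts.end(), counts.begin());

  // Pass 2: each thread writes indices starting at its own offset, so the
  // output needs no synchronization.
  std::vector<int64_t> out(counts.back());
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      const int64_t begin = std::min<int64_t>(int64_t{t} * chunk, n);
      const int64_t end = std::min<int64_t>(begin + chunk, n);
      int64_t pos = counts[t];
      for (int64_t i = begin; i < end; ++i) {
        if (input[i] != 0.0f) out[pos++] = i;
      }
    });
  }
  for (auto& w : workers) w.join();
  return out;
}
```

The key design point mirrored from the PR description is that the counting pass is what makes lock-free parallel writes possible: once each thread knows exactly how many nonzeros precede its slice, it can write its results directly into the right region of the output.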