Migrate nonzero from TH to ATen (CPU) by peterbell10 · Pull Request #58811 · pytorch/pytorch · GitHub

Conversation

@peterbell10
Collaborator

Closes gh-24745

The existing PR (gh-50655) has stalled because `TensorIterator` doesn't guarantee the same iteration order as `TH_TENSOR_APPLY`. For contiguous tensors this isn't an issue, but it breaks down, for example, with the channels-last memory format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm. As part of this, I wrote a custom `count_nonzero` that produces per-thread counts, which is necessary for writing each thread's outputs at the correct location (see the sketch after the table below).

|    Shape   |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2220 us |            496 us |
| 128,128,32 | 1250 us |           976 us |            175 us |
|  64,128,32 |  581 us |           486 us |             88 us |
|  32,128,32 |  292 us |           245 us |             80 us |
|  16,128,32 |  147 us |           120 us |             71 us |
|  8,128,32  |   75 us |            61 us |             61 us |
|  4,128,32  |   39 us |            32 us |             32 us |
|  2,128,32  |   20 us |            17 us |             17 us |
|  1,128,32  |   11 us |             9 us |              9 us |
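
For context, here is a minimal standalone sketch of that two-pass strategy for a 1-d input, using plain C++ and std::thread rather than at::parallel_for; the function `nonzero_1d` is illustrative and not the actual ATen code.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Returns the indices of the nonzero elements of `self`, in order.
// Assumes num_threads >= 1.
std::vector<int64_t> nonzero_1d(const std::vector<float>& self, int num_threads) {
  const int64_t n = static_cast<int64_t>(self.size());
  const int64_t chunk = (n + num_threads - 1) / num_threads;

  // Pass 1: each thread counts the nonzeros in its own chunk.
  std::vector<int64_t> counts(num_threads, 0);
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      const int64_t begin = t * chunk;
      const int64_t end = std::min(begin + chunk, n);
      for (int64_t i = begin; i < end; ++i) {
        counts[t] += (self[i] != 0.0f);
      }
    });
  }
  for (auto& w : workers) w.join();
  workers.clear();

  // Exclusive prefix sum of the counts gives each thread's write offset.
  std::vector<int64_t> offsets(num_threads + 1, 0);
  for (int t = 0; t < num_threads; ++t) {
    offsets[t + 1] = offsets[t] + counts[t];
  }

  // Pass 2: each thread writes its indices starting at its own offset, so no
  // synchronization is needed on the output buffer.
  std::vector<int64_t> out(offsets[num_threads]);
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      const int64_t begin = t * chunk;
      const int64_t end = std::min(begin + chunk, n);
      int64_t* dst = out.data() + offsets[t];
      for (int64_t i = begin; i < end; ++i) {
        if (self[i] != 0.0f) {
          *dst++ = i;
        }
      }
    });
  }
  for (auto& w : workers) w.join();
  return out;
}
```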

@facebook-github-bot
Contributor

facebook-github-bot commented May 23, 2021

💊 CI failures summary and remediations

As of commit 815dad5 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/2)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .github/scripts/generate_ci_workflows.py
Auto-merging .github/scripts/generate_ci_workflows.py
CONFLICT (add/add): Merge conflict in .github/scale-config.yml
Auto-merging .github/scale-config.yml
CONFLICT (add/add): Merge conflict in .circleci/scripts/binary_linux_test.sh
Auto-merging .circleci/scripts/binary_linux_test.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/windows_build_definitions.py
Auto-merging .circleci/cimodel/data/windows_build_definitions.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (2/2)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .github/scripts/generate_ci_workflows.py
Auto-merging .github/scripts/generate_ci_workflows.py
CONFLICT (add/add): Merge conflict in .github/scale-config.yml
Auto-merging .github/scale-config.yml
CONFLICT (add/add): Merge conflict in .circleci/scripts/binary_linux_test.sh
Auto-merging .circleci/scripts/binary_linux_test.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/windows_build_definitions.py
Auto-merging .circleci/cimodel/data/windows_build_definitions.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1



@peterbell10 peterbell10 force-pushed the nonzero-aten branch 4 times, most recently from 170601d to 87cf46c on May 23, 2021 17:44
@zou3519 zou3519 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on May 24, 2021

AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
    kHalf, kBFloat16, kBool, self.scalar_type(), "nonzero_count_cpu", [&] {
      at::parallel_for(0, iter.numel(), internal::GRAIN_SIZE, [&](int64_t begin, int64_t end) {
Collaborator

at::parallel_for adds some non-trivial overhead; it's typically beneficial to skip it when the size is < GRAIN_SIZE.

Collaborator Author

I've changed at::parallel_for so it calls the lambda directly without entering the parallel region in that case. I hope that's acceptable, as it's much cleaner on the caller's end.
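
A minimal sketch of the kind of guard being described, under a simplified signature; this is not ATen's actual at::parallel_for, which also takes the configured thread count and pool into account.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Simplified sketch: when the range is no larger than the grain size, call the
// body inline instead of entering the parallel region at all.
void parallel_for_sketch(int64_t begin, int64_t end, int64_t grain_size,
                         const std::function<void(int64_t, int64_t)>& body) {
  if (begin >= end) {
    return;
  }
  // Serial fallback: no thread-pool overhead. (Non-positive grain_size is
  // treated as "run serially" here purely for simplicity.)
  if (grain_size <= 0 || (end - begin) <= grain_size) {
    body(begin, end);
    return;
  }
  // Otherwise split the range into grain-sized chunks and run them in threads.
  std::vector<std::thread> workers;
  for (int64_t chunk_begin = begin; chunk_begin < end; chunk_begin += grain_size) {
    const int64_t chunk_end = std::min(chunk_begin + grain_size, end);
    workers.emplace_back(body, chunk_begin, chunk_end);
  }
  for (auto& w : workers) {
    w.join();
  }
}
```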

Collaborator

Yeah, that's good. Interesting, parallelNative and ParallelTBB already had this shortcut.


// Convert thread-local counts to cumulative sum
for (size_t i = 1; i < thread_count_nonzero.size(); ++i) {
thread_count_nonzero[i] += thread_count_nonzero[i - 1];
Collaborator

is DimVector initialized to 0? Otherwise thread_count_nonzero[0] is uninitialized?

Collaborator Author

Yes, DimVector is like std::vector in that it will always zero-initialize the array.
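
For reference, the std::vector behaviour the comparison refers to: sized construction value-initializes its elements, so integer counts start at zero.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

int main() {
  // Sized construction value-initializes every element, so all counts are 0.
  std::vector<int64_t> thread_count_nonzero(8);
  for (const int64_t count : thread_count_nonzero) {
    assert(count == 0);
  }
  return 0;
}
```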

const auto self_sizes = self.sizes();
const auto total_nonzero = thread_count_nonzero.back();
const int64_t ndim = self_sizes.size();
resize_output(result, {total_nonzero, ndim});
Collaborator

Currently the contiguity of the tensor returned by nonzero differs between CUDA and CPU, see #46224. It would be good to fix that here, but it would require the result tensor to be ({ndim, total_nonzero}).t(). Properly handling result when it is passed in by the user is also a bit annoying.

Collaborator Author

This was easier than I expected. It turns out resize_output returns true if the tensor was resized, so I just re-stride the output in that case.
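
A sketch of the layout trick under discussion (illustrative, with a hypothetical helper name, not the exact code in the PR): allocate the index tensor as {ndim, n} and return the transposed view, so the {n, ndim} result shares the memory layout produced by the CUDA nonzero kernel.

```cpp
#include <ATen/ATen.h>

// Illustrative helper (hypothetical name): build an {n, ndim} long tensor whose
// columns are contiguous in memory, matching the CUDA nonzero output layout.
at::Tensor make_transposed_result(int64_t total_nonzero, int64_t ndim) {
  // Allocate as {ndim, total_nonzero}, then hand back the transposed view.
  return at::empty({ndim, total_nonzero}, at::kLong).t();
}
```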

TORCH_INTERNAL_ASSERT(begin == thread_begin[tid]);

// +1 faster than additional condition check inside loop
c10::SmallVector<int64_t, 33> sizes(ndim + 1, -1);
Collaborator

why 33?

Collaborator Author

It means no allocation is done unless ndim > 32. With DimVector's default static size that would be ndim > 4, which is more likely to happen.

Collaborator

While nominally we don't have a limit on the number of dimensions, de-facto on GPU it's 25 (MAX_DIMS in OffsetCalculator.cuh), so supporting more than that doesn't make sense

const auto in_stride = strides[0];
const auto out_stride1 = out_accessor.stride(1);
const auto out_stride0 = out_accessor.stride(0) - ndim * out_stride1;
const auto ndim = out_accessor.size(1);
Collaborator

why do you need this ndim? It's the same as the captured one?

Collaborator Author

I've added a comment. The local variables allow the compiler to keep them in registers; otherwise its alias analysis isn't good enough to realize that the int64_t pointer we're writing through won't alias them.
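
To illustrate the pattern (not the PR's exact kernel): copying metadata such as ndim and the sizes into locals before the hot loop lets the compiler keep them in registers, since it otherwise has to assume the int64_t stores through the output pointer might alias them.

```cpp
#include <cstdint>

// Illustrative only: write the ndim coordinates of a linear index into `out`.
// `sizes_in` points at int64_t metadata that could, as far as the compiler
// knows, alias `out`; copying it into locals avoids reloads inside the loop.
// Assumes ndim_in <= 32.
void write_coordinates(int64_t* out, const int64_t* sizes_in, int64_t ndim_in,
                       int64_t linear_index) {
  const int64_t ndim = ndim_in;   // local copy, kept in a register
  int64_t sizes[32];              // local copy of the (small) size metadata
  for (int64_t d = 0; d < ndim; ++d) {
    sizes[d] = sizes_in[d];
  }
  // Decompose the linear index; the last dimension varies fastest.
  for (int64_t d = ndim - 1; d >= 0; --d) {
    out[d] = linear_index % sizes[d];
    linear_index /= sizes[d];
  }
}
```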

int64_t* out = out_ptr;

for (int64_t i = 0; i < n2; ++i) {
const char* ptr = data[0] + i * strides[1];
Collaborator

So for a 1-d tensor strides[1] will never be used, as i will only ever be 0. We should think about how to make it possible not to pack strides to 2 * ntensors(), but I agree that for now it's unsafe (re: other TI PRs).

@ezyang ezyang removed their request for review May 25, 2021 03:09
Collaborator

@ngimel ngimel left a comment

Can you please run the benchmarks with the latest version?


@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@peterbell10
Collaborator Author

Here are the updated benchmark numbers:

|    Shape   |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2150 us |            551 us |
| 128,128,32 | 1250 us |          1020 us |            197 us |
|  64,128,32 |  581 us |           495 us |             99 us |
|  32,128,32 |  292 us |           255 us |             83 us |
|  16,128,32 |  147 us |           126 us |             75 us |
|  8,128,32  |   75 us |            65 us |             65 us |
|  4,128,32  |   39 us |            33 us |             33 us |
|  2,128,32  |   20 us |            18 us |             18 us |
|  1,128,32  |   11 us |             9 us |              9 us |

@ngimel
Collaborator

ngimel commented May 26, 2021

@peterbell10 android tests are failing with

Test Case ModelRunner
* Running ModelRunner.basic
***** Failure in unknown file
C++ exception with description "dtype 'Bool' not selected for kernel tag "nonzero_count_cpu"
  

@peterbell10
Collaborator Author

Is there a list of "selected" kernels somewhere? I'm adding new kernels, so I assume those need to be added to whatever list.

@ngimel
Collaborator

ngimel commented May 26, 2021

I've run the suggested tracer updates internally; let's see if that fixes it.

@ngimel
Collaborator

ngimel commented May 27, 2021

Sorry, that suggestion still didn't work. I'm working with the mobile team to try to get it resolved.

@facebook-github-bot
Contributor

@ngimel merged this pull request in 95b1bc1.

@ngimel
Collaborator

ngimel commented May 28, 2021

Sorry @peterbell10, the XLA error is real (probably related to the new dispatch for count_nonzero), reverting.

@facebook-github-bot
Contributor

This pull request has been reverted by 657b75d.

facebook-github-bot pushed a commit that referenced this pull request Jun 2, 2021
Summary:
Resubmit of #58811, Closes gh-24745

The existing PR (gh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue; but it breaks down for example with channels last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm.  As part of this, I wrote a custom `count_nonzero` that gives per-thread counts which is necessary to write the outputs in the right location.

|    Shape   |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2150 us |            551 us |
| 128,128,32 | 1250 us |          1020 us |            197 us |
|  64,128,32 |  581 us |           495 us |             99 us |
|  32,128,32 |  292 us |           255 us |             83 us |
|  16,128,32 |  147 us |           126 us |             75 us |
|  8,128,32  |   75 us |            65 us |             65 us |
|  4,128,32  |   39 us |            33 us |             33 us |
|  2,128,32  |   20 us |            18 us |             18 us |
|  1,128,32  |   11 us |             9 us |              9 us |

Pull Request resolved: #59149

Reviewed By: mruberry

Differential Revision: D28817466

Pulled By: ngimel

fbshipit-source-id: f08f6c003c339368fd53dabd28e9ada9e59de732
deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
Closes pytorchgh-24745

The existing PR (pytorchgh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue; but it breaks down for example with channels last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm.  As part of this, I wrote a custom `count_nonzero` that gives per-thread counts which is necessary to write the outputs in the right location.

|    Shape   |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2220 us |            496 us |
| 128,128,32 | 1250 us |           976 us |            175 us |
|  64,128,32 |  581 us |           486 us |             88 us |
|  32,128,32 |  292 us |           245 us |             80 us |
|  16,128,32 |  147 us |           120 us |             71 us |
|  8,128,32  |   75 us |            61 us |             61 us |
|  4,128,32  |   39 us |            32 us |             32 us |
|  2,128,32  |   20 us |            17 us |             17 us |
|  1,128,32  |   11 us |             9 us |              9 us |

Pull Request resolved: pytorch#58811

Reviewed By: anjali411

Differential Revision: D28700259

Pulled By: ngimel

fbshipit-source-id: 9b279ca7c36d8e348b7e5e4be0dd159e05aee159
deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021

Labels

cla signed, Merged, open source, Reverted, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


Development

Successfully merging this pull request may close these issues.

Migrate nonzero from the TH to Aten (CPU)

4 participants