Migrate nonzero from TH to ATen (CPU) by peterbell10 · Pull Request #59149 · pytorch/pytorch · GitHub

Conversation

@peterbell10
Collaborator

Resubmit of #58811, Closes gh-24745

The existing PR (gh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue, but it breaks down, for example, with channels-last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.
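Roughly, the flag is meant to be set when building the iterator, as in the sketch below. This is illustrative only: the surrounding builder calls follow the existing `TensorIteratorConfig` API, and the include path and the kernel's actual setup may differ.

```cpp
#include <ATen/ATen.h>
#include <ATen/TensorIterator.h>  // include path is approximate

// Illustrative only: build an iterator that visits elements in logical
// (row-major) order, regardless of the inputs' memory format.
at::TensorIterator make_linear_iter(const at::Tensor& out, const at::Tensor& self) {
  return at::TensorIteratorConfig()
      .set_check_mem_overlap(false)
      .check_all_same_dtype(false)   // output dtype differs from the input's
      .add_output(out)
      .add_input(self)
      .enforce_linear_iteration()    // the new flag: disable dim reordering
      .build();
}
```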

This PR also significantly improves performance by adding multithreading support to the algorithm. As part of this, I wrote a custom `count_nonzero` that gives per-thread counts, which is necessary to write the outputs in the right locations (a simplified sketch of the scheme follows the benchmark table below).

|    Shape   |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2150 us |            551 us |
| 128,128,32 | 1250 us |          1020 us |            197 us |
|  64,128,32 |  581 us |           495 us |             99 us |
|  32,128,32 |  292 us |           255 us |             83 us |
|  16,128,32 |  147 us |           126 us |             75 us |
|  8,128,32  |   75 us |            65 us |             65 us |
|  4,128,32  |   39 us |            33 us |             33 us |
|  2,128,32  |   20 us |            18 us |             18 us |
|  1,128,32  |   11 us |             9 us |              9 us |
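For intuition, here is a simplified, self-contained sketch of the count-then-write scheme over a flat array. It is not the actual kernel (which works through `TensorIterator` and keeps genuinely per-thread counts); this version counts per fixed-size chunk for simplicity.

```cpp
#include <ATen/Parallel.h>
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<int64_t> nonzero_indices_sketch(const int64_t* data, int64_t n) {
  constexpr int64_t chunk = 2048;
  const int64_t num_chunks = (n + chunk - 1) / chunk;

  // Pass 1: count the nonzero elements in each fixed-size chunk.
  std::vector<int64_t> counts(num_chunks, 0);
  at::parallel_for(0, num_chunks, 1, [&](int64_t cb, int64_t ce) {
    for (int64_t c = cb; c < ce; ++c) {
      const int64_t stop = std::min(n, (c + 1) * chunk);
      int64_t cnt = 0;
      for (int64_t i = c * chunk; i < stop; ++i) {
        cnt += (data[i] != 0);
      }
      counts[c] = cnt;
    }
  });

  // Exclusive prefix sum: each chunk's write offset into the output.
  std::vector<int64_t> offsets(num_chunks + 1, 0);
  for (int64_t c = 0; c < num_chunks; ++c) {
    offsets[c + 1] = offsets[c] + counts[c];
  }

  // Pass 2: each chunk writes its indices starting at its own offset,
  // so parallel writers never race on output positions.
  std::vector<int64_t> out(offsets[num_chunks]);
  at::parallel_for(0, num_chunks, 1, [&](int64_t cb, int64_t ce) {
    for (int64_t c = cb; c < ce; ++c) {
      const int64_t stop = std::min(n, (c + 1) * chunk);
      int64_t pos = offsets[c];
      for (int64_t i = c * chunk; i < stop; ++i) {
        if (data[i] != 0) {
          out[pos++] = i;
        }
      }
    }
  });
  return out;
}
```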

@facebook-github-bot
Contributor

facebook-github-bot commented May 28, 2021

💊 CI failures summary and remediations

As of commit beac439 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test2 (1/1)

Step: "Run tests"

Jun 01 20:49:54 SUMMARY: UndefinedBehaviorSanit.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in
Jun 01 20:49:54     #7 0x562ae2c7f51b in PyEval_EvalCode /tmp/build/80754af9/python_1614113050744/work/Python/ceval.c:731
Jun 01 20:49:54     #8 0x562ae2cff5e3 in run_mod /tmp/build/80754af9/python_1614113050744/work/Python/pythonrun.c:1025
Jun 01 20:49:54     #9 0x562ae2cff67c in PyRun_StringFlags /tmp/build/80754af9/python_1614113050744/work/Python/pythonrun.c:949
Jun 01 20:49:54     #10 0x562ae2cff6de in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1614113050744/work/Python/pythonrun.c:445
Jun 01 20:49:54     #11 0x562ae2d034e2 in run_command /tmp/build/80754af9/python_1614113050744/work/Modules/main.c:301
Jun 01 20:49:54     #12 0x562ae2d034e2 in Py_Main /tmp/build/80754af9/python_1614113050744/work/Modules/main.c:749
Jun 01 20:49:54     #13 0x562ae2bcdb0d in main /tmp/build/80754af9/python_1614113050744/work/Programs/python.c:69
Jun 01 20:49:54     #14 0x7f89f268683f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291
Jun 01 20:49:54     #15 0x562ae2cacd6f in _start /home/rdonnelly/mc/conda-bld/compilers_linux-64_1534865402226/work/.build/src/glibc-2.12.2/csu/../sysdeps/x86_64/elf/start.S:103
Jun 01 20:49:54 
Jun 01 20:49:54 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in 
Jun 01 20:49:54 + retcode=1
Jun 01 20:49:54 + set -e
Jun 01 20:49:54 + return 1
Jun 01 20:49:54 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX-* ]]
Jun 01 20:49:54 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX2-* ]]
Jun 01 20:49:54 + IN_PULL_REQUEST=https://github.com/pytorch/pytorch/pull/59149
Jun 01 20:49:54 + '[' -n https://github.com/pytorch/pytorch/pull/59149 ']'
Jun 01 20:49:54 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 != *coverage* ]]
Jun 01 20:49:54 ++ mktemp
Jun 01 20:49:54 + DETERMINE_FROM=/tmp/tmp.hPJGM1OgZ5

1 failure not recognized by patterns:

| Job | Step |
|-----|------|
| GitHub Actions Windows CI (pytorch-win-vs2019-cpu-py3) / render_test_results | Download PyTorch Test Reports |

1 job timed out:

  • pytorch_linux_xenial_py3_clang5_asan_test2


@peterbell10
Collaborator Author

@JackCaoG any idea why these changes might be triggering XLA to segfault?

@ngimel
Collaborator

ngimel commented May 29, 2021

If xla's implementation internally used count_nonzero, then changes to its dispatch could affect it, but I'm just guessing.

@ngimel
Collaborator

ngimel commented May 30, 2021

Funny thing is, in pytorch's indexing implementation, when the nonzero output is contiguous, discontiguous slices end up being used as indices:

result.emplace_back(nonzero.select(1, j));

This PR fixes that and actually makes those slices contiguous. But I don't know what the XLA implementation is doing.
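A small standalone illustration of that point (not the indexing code itself):

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  auto x = at::eye(3);                // nonzeros on the diagonal
  auto nz = at::nonzero(x);           // shape [3, 2], row-major contiguous
  auto col = nz.select(1, 0);         // one "column" of indices, stride {2}
  std::cout << col.is_contiguous() << "\n";  // prints 0: the slice is discontiguous
  // If nonzero returned a column-major result instead, the same select(1, j)
  // slices would be contiguous, which is what the indexing code wants.
  return 0;
}
```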

@ngimel ngimel added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Jun 1, 2021
@peterbell10
Collaborator Author

@ngimel would it be reasonable to leave the output as contiguous for now and try to figure out the XLA issues in another PR?

@ngimel
Collaborator

ngimel commented Jun 1, 2021

@peterbell10 yes, let's do that.
@ailzhang, @JackCaoG we need nonzero to return a discontiguous tensor so that the behavior is consistent with CUDA, and so that indexing and sparse operations that depend on nonzero can get contiguous indices after the nonzero return is chunked on the last dimension. XLA doesn't have strides, so it segfaults on attempts to allocate a discontiguous output tensor. What can we do to achieve the behavior we want on the CPU, while still keeping XLA happy?

@ezyang ezyang removed their request for review June 1, 2021 17:37
@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in 99f2000.

@ailzhang
Contributor

ailzhang commented Jun 2, 2021

@peterbell10 Thanks a lot for separating the change! If you send a PR to return non-contiguous output and cc me on it, I can take a look and see if there's anything we can do to unblock. Thanks!!

deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
Resubmit of pytorch#58811, Closes pytorchgh-24745

The existing PR (pytorchgh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue; but it breaks down for example with channels last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm.  As part of this, I wrote a custom `count_nonzero` that gives per-thread counts which is necessary to write the outputs in the right location.

|    Shape   |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2150 us |            551 us |
| 128,128,32 | 1250 us |          1020 us |            197 us |
|  64,128,32 |  581 us |           495 us |             99 us |
|  32,128,32 |  292 us |           255 us |             83 us |
|  16,128,32 |  147 us |           126 us |             75 us |
|  8,128,32  |   75 us |            65 us |             65 us |
|  4,128,32  |   39 us |            33 us |             33 us |
|  2,128,32  |   20 us |            18 us |             18 us |
|  1,128,32  |   11 us |             9 us |              9 us |

Pull Request resolved: pytorch#59149

Reviewed By: mruberry

Differential Revision: D28817466

Pulled By: ngimel

fbshipit-source-id: f08f6c003c339368fd53dabd28e9ada9e59de732
@leslie-fang-intel
Collaborator

Hi @peterbell10, thanks for the PR. I have one question: what will happen if I invoke the new nonzero in C++ under OpenMP?
For example:

#pragma omp parallel for simd schedule(static) if (omp_get_max_threads() > 1 && !omp_in_parallel())
for (int i = 0; i < bs; i++) {
    at::Tensor mask_index = at::nonzero(score > 0.05).squeeze(1);
}

@ngimel
Collaborator

ngimel commented Jun 15, 2021

You cannot do this; omp parallel is the same in spirit as parallel_for, see #56794.

@leslie-fang-intel
Collaborator

Thanks for the comment.
Here is the case: I have a customized operation which uses at::nonzero

void my_customized_op() {
  at::nonzero();
}

at::parallel_for(0, numel, internal::GRAIN_SIZE, [&] (int64_t begin, int64_t end) {
  my_customized_op();
});

OpenMP can't be nested, so it's expected that at::nonzero runs single-threaded in this case.
But I suspect that in the current implementation, at::nonzero will still create the DimVector with an unexpected num_threads size, since it doesn't check whether it is running in a nested OpenMP environment.

const auto num_threads = at::get_num_threads();
DimVector thread_begin(num_threads, -1);

And I am not sure what the behavior of at::nonzero is in this case. I suspect I got a different result, but I may need more debugging effort.

@ngimel
Collaborator

ngimel commented Jun 15, 2021

According to the linked issue, you cannot use tensor operations inside parallel_for; your example is not valid.

@leslie-fang-intel
Collaborator

leslie-fang-intel commented Jun 15, 2021

Sorry for the confusion. Actually, I am using OpenMP for my_customized_op:

void my_customized_op() {
  at::nonzero();
}

#pragma omp parallel for simd schedule(static) if (omp_get_max_threads() > 1 && !omp_in_parallel())
for (int i = 0; i < bs; i++) {
    my_customized_op();
}

This worked previously, and I suspect the results changed once we started using the parallelized version of at::nonzero.


@ngimel Thanks for the explanation. Since omp parallel is the same in spirit as parallel_for, we can't do any tensor operations inside omp parallel either.

@peterbell10
Collaborator Author

Is your question theoretical, or are you actually getting errors and/or wrong results?

Inside an existing omp parallel region, parallel_for should run single-threaded. The single-threaded case is known to work, since that's already used for all inputs with fewer than ~32000 elements. So, in theory, I don't think there should be anything wrong here.

As for the why: the DimVector is zero-initialized, so at the end it just sums the list [num_nonzero, 0, 0, ...], and there should be no problem at all.

@leslie-fang-intel
Collaborator

Hi @peterbell10, thanks for the comment. I do get errors in some cases, and I have uploaded the UT to reproduce them here. You can see that the output of nonzero differs between the single-thread calculation and the two-thread calculation.

With some debugging, here are some clues: if I run the UT with 2 threads, then in the second thread (tid 1), the output of at::get_num_threads() is 1 instead of 2 from the second loop onward.

const auto num_threads = at::get_num_threads();

Then the output write goes out of bounds, since tid is 1 but the size of thread_begin is also 1 now:
thread_begin[tid] = begin;
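In other words, the failure mode looks roughly like the sketch below (a hedged illustration only, with std::vector standing in for the DimVector from the quoted snippets; not the actual kernel code):

```cpp
#include <ATen/Parallel.h>
#include <cstdint>
#include <vector>

void nested_nonzero_sketch(int64_t begin) {
  const auto num_threads = at::get_num_threads();      // 1 inside the nested OpenMP region
  std::vector<int64_t> thread_begin(num_threads, -1);  // so thread_begin has size 1
  const auto tid = at::get_thread_num();               // but tid can still be 1, inherited
                                                       // from the outer omp parallel region
  thread_begin[tid] = begin;                           // out-of-bounds write when tid == 1
}
```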

@leslie-fang-intel
Collaborator

leslie-fang-intel commented Jun 17, 2021

Here are some thoughts on what I see in the UT with 2 threads. We are using lazy_init_num_threads:

at::internal::lazy_init_num_threads();

which initializes the number of threads the first time parallel_for is hit inside a given thread. So here are the cases:

  • For the main thread, tid=0:

    It starts running the parallel_for inside the UT => hits lazy_init_num_threads inside parallel_for and initializes the number of threads to 2 => runs the non_zero op => creates thread_begin with size 2 => hits parallel_for again inside the non_zero op, but this time lazy_init_num_threads is skipped, since init is a thread_local variable:

    thread_local bool init = false;

    So for the main thread, any time you invoke at::get_num_threads(), it returns 2.

  • For the second thread, tid=1:
    It does not do what the main thread did => it enters the non_zero op directly => creates thread_begin with size 2 => hits parallel_for again inside non_zero and runs lazy_init_num_threads, but now the number of threads is 1, since we are inside a nested OpenMP region (here we have MKL installed):

    omp_set_num_threads(mkl_get_max_threads());

    => starts running non_zero in the second loop => creates thread_begin with size 1, but the tid of the current thread is still 1.

From some tests, if I move lazy_init_num_threads after

if (!use_parallel) {
  f(begin, end);
  return;
}

it can work around the issue we see.
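Roughly the following shape (not the actual ATen ParallelOpenMP source; use_parallel stands in for the existing "range is large enough and we are not already in a parallel region" check):

```cpp
#include <ATen/Parallel.h>
#include <cstdint>

template <class F>
void parallel_for_sketch(int64_t begin, int64_t end, int64_t grain_size, const F& f) {
  if (begin >= end) {
    return;
  }
  const bool use_parallel =
      (end - begin) > grain_size && !at::in_parallel_region();
  if (!use_parallel) {
    f(begin, end);  // run inline; no thread bookkeeping needed
    return;
  }
  at::internal::lazy_init_num_threads();  // moved after the early return
  // ... launch the omp parallel region and invoke f on each chunk ...
}
```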

CC @jgong5 for the discussion here.

peterbell10 added a commit that referenced this pull request Jun 17, 2021
Fixes #59149 (comment)

`parallel_for` will call the function directly if it would have run on only a
single thread anyway. This is great for performance, but causes an issue in
nested parallel regions because `get_thread_num` will reflect the parent
parallel region instead of the current `parallel_for` call.

I fix this by using a `thread_local` variable for the current thread id and
manually setting it before each call to the user-provided function.

[ghstack-poisoned]
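A minimal sketch of that idea (not the actual ATen change; the guard name and namespace here are hypothetical):

```cpp
#include <cstdint>

namespace sketch {

// Thread id as seen by get_thread_num(), owned by parallel_for rather than
// inherited from any enclosing omp parallel region.
thread_local int this_thread_id = 0;

int get_thread_num() { return this_thread_id; }

// RAII guard: set the id before calling the user-provided function,
// restore it afterwards.
struct ThreadIdGuard {
  explicit ThreadIdGuard(int id) : old_id_(this_thread_id) { this_thread_id = id; }
  ~ThreadIdGuard() { this_thread_id = old_id_; }
 private:
  int old_id_;
};

template <class F>
void invoke_chunk(int chunk_id, int64_t begin, int64_t end, const F& f) {
  ThreadIdGuard guard(chunk_id);  // manually set before each call
  f(begin, end);
}

}  // namespace sketch
```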
peterbell10 added a commit that referenced this pull request Jun 22, 2021
Fixes #59149 (comment)

`parallel_for` will call the function directly if it would have run on only a
single thread anyway. This is great for performance, but causes an issue in
nested parallel regions because `get_thread_num` will reflect the parent
parallel region instead of the current `parallel_for` call.

I fix this by using a `thread_local` variable for the current thread id and
manually setting it before each call to the user-provided function.

Differential Revision: [D29287816](https://our.internmc.facebook.com/intern/diff/D29287816)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Jun 23, 2021
Summary:
Pull Request resolved: #60183

Fixes #59149 (comment)

`parallel_for` will call the function directly if it would have run on only a
single thread anyway. This is great for performance, but causes an issue in
nested parallel regions because `get_thread_num` will reflect the parent
parallel region instead of the current `parallel_for` call.

I fix this by using a `thread_local` variable for the current thread id and
manually setting it before each call to the user-provided function.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29287816

Pulled By: ngimel

fbshipit-source-id: 777f771a0900750c7f22eb1dd185d84d19282108

Labels

cla signed, Merged, open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migrate nonzero from the TH to Aten (CPU)

5 participants