Migrate nonzero from TH to ATen (CPU) by peterbell10 · Pull Request #59149 · pytorch/pytorch · GitHub

Conversation

@peterbell10
Collaborator

Resubmit of #58811, Closes gh-24745

The existing PR (gh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue, but it breaks down, for example, with channels-last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.
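Roughly, the flag is meant to be set when building the iterator, as in the sketch below. This is illustrative only: the surrounding builder calls follow the existing `TensorIteratorConfig` API, and the include path and the kernel's actual setup may differ.

```cpp
#include <ATen/ATen.h>
#include <ATen/TensorIterator.h>  // include path is approximate

// Illustrative only: build an iterator that visits elements in logical
// (row-major) order, regardless of the inputs' memory format.
at::TensorIterator make_linear_iter(const at::Tensor& out, const at::Tensor& self) {
  return at::TensorIteratorConfig()
      .set_check_mem_overlap(false)
      .check_all_same_dtype(false)   // output dtype differs from the input's
      .add_output(out)
      .add_input(self)
      .enforce_linear_iteration()    // the new flag: disable dim reordering
      .build();
}
```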

This PR also significantly improves performance by adding multithreading support to the algorithm. As part of this, I wrote a custom `count_nonzero` that gives per-thread counts, which is necessary to write the outputs in the right locations (a simplified sketch of the scheme follows the benchmark table below).

|    Shape   |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2150 us |            551 us |
| 128,128,32 | 1250 us |          1020 us |            197 us |
|  64,128,32 |  581 us |           495 us |             99 us |
|  32,128,32 |  292 us |           255 us |             83 us |
|  16,128,32 |  147 us |           126 us |             75 us |
|  8,128,32  |   75 us |            65 us |             65 us |
|  4,128,32  |   39 us |            33 us |             33 us |
|  2,128,32  |   20 us |            18 us |             18 us |
|  1,128,32  |   11 us |             9 us |              9 us |
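For intuition, here is a simplified, self-contained sketch of the count-then-write scheme over a flat array. It is not the actual kernel (which works through `TensorIterator` and keeps genuinely per-thread counts); this version counts per fixed-size chunk for simplicity.

```cpp
#include <ATen/Parallel.h>
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<int64_t> nonzero_indices_sketch(const int64_t* data, int64_t n) {
  constexpr int64_t chunk = 2048;
  const int64_t num_chunks = (n + chunk - 1) / chunk;

  // Pass 1: count the nonzero elements in each fixed-size chunk.
  std::vector<int64_t> counts(num_chunks, 0);
  at::parallel_for(0, num_chunks, 1, [&](int64_t cb, int64_t ce) {
    for (int64_t c = cb; c < ce; ++c) {
      const int64_t stop = std::min(n, (c + 1) * chunk);
      int64_t cnt = 0;
      for (int64_t i = c * chunk; i < stop; ++i) {
        cnt += (data[i] != 0);
      }
      counts[c] = cnt;
    }
  });

  // Exclusive prefix sum: each chunk's write offset into the output.
  std::vector<int64_t> offsets(num_chunks + 1, 0);
  for (int64_t c = 0; c < num_chunks; ++c) {
    offsets[c + 1] = offsets[c] + counts[c];
  }

  // Pass 2: each chunk writes its indices starting at its own offset,
  // so parallel writers never race on output positions.
  std::vector<int64_t> out(offsets[num_chunks]);
  at::parallel_for(0, num_chunks, 1, [&](int64_t cb, int64_t ce) {
    for (int64_t c = cb; c < ce; ++c) {
      const int64_t stop = std::min(n, (c + 1) * chunk);
      int64_t pos = offsets[c];
      for (int64_t i = c * chunk; i < stop; ++i) {
        if (data[i] != 0) {
          out[pos++] = i;
        }
      }
    }
  });
  return out;
}
```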

@facebook-github-bot
Contributor

facebook-github-bot commented May 28, 2021

💊 CI failures summary and remediations

As of commit beac439 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test2 (1/1)

Step: "Run tests"

Jun 01 20:49:54 SUMMARY: UndefinedBehaviorSanit.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in
Jun 01 20:49:54     #7 0x562ae2c7f51b in PyEval_EvalCode /tmp/build/80754af9/python_1614113050744/work/Python/ceval.c:731
Jun 01 20:49:54     #8 0x562ae2cff5e3 in run_mod /tmp/build/80754af9/python_1614113050744/work/Python/pythonrun.c:1025
Jun 01 20:49:54     #9 0x562ae2cff67c in PyRun_StringFlags /tmp/build/80754af9/python_1614113050744/work/Python/pythonrun.c:949
Jun 01 20:49:54     #10 0x562ae2cff6de in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1614113050744/work/Python/pythonrun.c:445
Jun 01 20:49:54     #11 0x562ae2d034e2 in run_command /tmp/build/80754af9/python_1614113050744/work/Modules/main.c:301
Jun 01 20:49:54     #12 0x562ae2d034e2 in Py_Main /tmp/build/80754af9/python_1614113050744/work/Modules/main.c:749
Jun 01 20:49:54     #13 0x562ae2bcdb0d in main /tmp/build/80754af9/python_1614113050744/work/Programs/python.c:69
Jun 01 20:49:54     #14 0x7f89f268683f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291
Jun 01 20:49:54     #15 0x562ae2cacd6f in _start /home/rdonnelly/mc/conda-bld/compilers_linux-64_1534865402226/work/.build/src/glibc-2.12.2/csu/../sysdeps/x86_64/elf/start.S:103
Jun 01 20:49:54 
Jun 01 20:49:54 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in 
Jun 01 20:49:54 + retcode=1
Jun 01 20:49:54 + set -e
Jun 01 20:49:54 + return 1
Jun 01 20:49:54 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX-* ]]
Jun 01 20:49:54 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX2-* ]]
Jun 01 20:49:54 + IN_PULL_REQUEST=https://github.com/pytorch/pytorch/pull/59149
Jun 01 20:49:54 + '[' -n https://github.com/pytorch/pytorch/pull/59149 ']'
Jun 01 20:49:54 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 != *coverage* ]]
Jun 01 20:49:54 ++ mktemp
Jun 01 20:49:54 + DETERMINE_FROM=/tmp/tmp.hPJGM1OgZ5

1 failure not recognized by patterns:

| Job | Step |
|-----|------|
| GitHub Actions Windows CI (pytorch-win-vs2019-cpu-py3) / render_test_results | Download PyTorch Test Reports |

1 job timed out:

  • pytorch_linux_xenial_py3_clang5_asan_test2


@peterbell10
Collaborator Author

@JackCaoG any idea why these changes might be triggering XLA to segfault?

@ngimel
Collaborator

ngimel commented May 29, 2021

If xla's implementation internally used count_nonzero, then changes to its dispatch could affect it, but I'm just guessing.

@ngimel
Collaborator

ngimel commented May 30, 2021

Funny thing is, in pytorch's indexing implementation, when the nonzero output is contiguous, discontiguous slices end up being used as indices:

result.emplace_back(nonzero.select(1, j));

This PR fixes that and actually makes those slices contiguous. But I don't know what the XLA implementation is doing.
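A small standalone illustration of that point (not the indexing code itself):

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  auto x = at::eye(3);                // nonzeros on the diagonal
  auto nz = at::nonzero(x);           // shape [3, 2], row-major contiguous
  auto col = nz.select(1, 0);         // one "column" of indices, stride {2}
  std::cout << col.is_contiguous() << "\n";  // prints 0: the slice is discontiguous
  // If nonzero returned a column-major result instead, the same select(1, j)
  // slices would be contiguous, which is what the indexing code wants.
  return 0;
}
```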

@ngimel ngimel added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Jun 1, 2021
@peterbell10
Collaborator Author

@ngimel would it be reasonable to leave the output as contiguous for now and try to figure out the XLA issues in another PR?

@ngimel
Collaborator

ngimel commented Jun 1, 2021

@peterbell10 yes, let's do that.
@ailzhang, @JackCaoG we need nonzero to return a discontiguous tensor so that the behavior is consistent with CUDA, and so that indexing and sparse operations that depend on nonzero can get contiguous indices after the nonzero return is chunked on the last dimension. XLA doesn't have strides, so it segfaults on attempts to allocate a discontiguous output tensor. What can we do to achieve the behavior we want on the CPU, while still keeping XLA happy?

@ezyang ezyang removed their request for review June 1, 2021 17:37
@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in 99f2000.

@ailzhang
Contributor

ailzhang commented Jun 2, 2021

@peterbell10 Thanks a lot for separating the change! If you send a PR to return non-contiguous output and cc me on it, I can take a look and see if there's anything we can do to unblock. Thanks!!

deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
Resubmit of pytorch#58811, Closes pytorchgh-24745

The existing PR (pytorchgh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue; but it breaks down for example with channels last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm.  As part of this, I wrote a custom `count_nonzero` that gives per-thread counts which is necessary to write the outputs in the right location.

|    Shape   |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2150 us |            551 us |
| 128,128,32 | 1250 us |          1020 us |            197 us |
|  64,128,32 |  581 us |           495 us |             99 us |
|  32,128,32 |  292 us |           255 us |             83 us |
|  16,128,32 |  147 us |           126 us |             75 us |
|  8,128,32  |   75 us |            65 us |             65 us |
|  4,128,32  |   39 us |            33 us |             33 us |
|  2,128,32  |   20 us |            18 us |             18 us |
|  1,128,32  |   11 us |             9 us |              9 us |

Pull Request resolved: pytorch#59149

Reviewed By: mruberry

Differential Revision: D28817466

Pulled By: ngimel

fbshipit-source-id: f08f6c003c339368fd53dabd28e9ada9e59de732
@leslie-fang-intel
Collaborator

Hi @peterbell10, thanks for the PR. I have one question: what will happen if I invoke the new nonzero in C++ under OpenMP?
For example:

#pragma omp parallel for simd schedule(static) if (omp_get_max_threads() > 1 && !omp_in_parallel())
for (int i = 0; i < bs; i++) {
    at::Tensor mask_index = at::nonzero(score > 0.05).squeeze(1);
}

@ngimel
Collaborator

ngimel commented Jun 15, 2021

You cannot do this; omp parallel is the same in spirit as parallel_for, see #56794.

@leslie-fang-intel
Collaborator

Thanks for the comment.
Here is the case: I have a customized operation which uses at::nonzero

void my_customized_op() {
  at::nonzero();
}

at::parallel_for(0, numel, internal::GRAIN_SIZE, [&] (int64_t begin, int64_t end) {
  my_customized_op();
});

OpenMP can't be nested, so it's expected that at::nonzero runs single-threaded in this case.
But I suspect that in the current implementation, at::nonzero will still create the DimVector with an unexpected num_threads size, since it doesn't check whether it is running in a nested OpenMP environment.

const auto num_threads = at::get_num_threads();
DimVector thread_begin(num_threads, -1);

And I am not sure what the behavior of at::nonzero is in this case. I suspect I got a different result, but I may need more debugging effort.

@ngimel
Collaborator

ngimel commented Jun 15, 2021

According to the linked issue, you cannot use tensor operations inside parallel_for; your example is not valid.

@leslie-fang-intel
Collaborator

leslie-fang-intel commented Jun 15, 2021

Sorry for the confusion. Actually, I am using OpenMP for my_customized_op:

void my_customized_op() {
  at::nonzero();
}

#pragma omp parallel for simd schedule(static) if (omp_get_max_threads() > 1 && !omp_in_parallel())
for (int i = 0; i < bs; i++) {
    my_customized_op();
}

This worked previously, and I suspect the results changed once we started using the parallelized version of at::nonzero.


@ngimel Thanks for the explanation. Since omp parallel is the same in spirit as parallel_for, we can't do any tensor operations inside omp parallel either.

@peterbell10
Collaborator Author

Is your question theoretical, or are you actually getting errors and/or wrong results?

Inside an existing omp parallel region, parallel_for should run single-threaded. The single-threaded case is known to work, since that's already used for all inputs with fewer than ~32000 elements. So, in theory, I don't think there should be anything wrong here.

As for the why: the DimVector is zero-initialized, so at the end it just sums the list [num_nonzero, 0, 0, ...], and there should be no problem at all.

@leslie-fang-intel
Collaborator

Hi @peterbell10, thanks for the comment. I do get errors in some cases, and I have uploaded the UT to reproduce them here. You can see that the output of nonzero differs between the single-thread calculation and the two-thread calculation.

With some debugging, here are some clues: if I run the UT with 2 threads, then in the second thread (tid 1), the output of at::get_num_threads() is 1 instead of 2 from the second loop onward.

const auto num_threads = at::get_num_threads();

Then the output write goes out of bounds, since tid is 1 but the size of thread_begin is also 1 now:
thread_begin[tid] = begin;
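In other words, the failure mode looks roughly like the sketch below (a hedged illustration only, with std::vector standing in for the DimVector from the quoted snippets; not the actual kernel code):

```cpp
#include <ATen/Parallel.h>
#include <cstdint>
#include <vector>

void nested_nonzero_sketch(int64_t begin) {
  const auto num_threads = at::get_num_threads();      // 1 inside the nested OpenMP region
  std::vector<int64_t> thread_begin(num_threads, -1);  // so thread_begin has size 1
  const auto tid = at::get_thread_num();               // but tid can still be 1, inherited
                                                       // from the outer omp parallel region
  thread_begin[tid] = begin;                           // out-of-bounds write when tid == 1
}
```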

@leslie-fang-intel
Collaborator

leslie-fang-intel commented Jun 17, 2021

Here are some thoughts on what I see in the UT with 2 threads. We are using lazy_init_num_threads:

at::internal::lazy_init_num_threads();

which initializes the number of threads the first time parallel_for is hit inside a given thread. So here are the cases:

  • For the main thread, tid=0:

    It starts running the parallel_for inside the UT => hits lazy_init_num_threads inside parallel_for and initializes the number of threads to 2 => runs the non_zero op => creates thread_begin with size 2 => hits parallel_for again inside the non_zero op, but this time lazy_init_num_threads is skipped, since init is a thread_local variable:

    thread_local bool init = false;

    So for the main thread, any time you invoke at::get_num_threads(), it returns 2.

  • For the second thread, tid=1:
    It does not do what the main thread did => it enters the non_zero op directly => creates thread_begin with size 2 => hits parallel_for again inside non_zero and runs lazy_init_num_threads, but now the number of threads is 1, since we are inside a nested OpenMP region (here we have MKL installed):

    omp_set_num_threads(mkl_get_max_threads());

    => starts running non_zero in the second loop => creates thread_begin with size 1, but the tid of the current thread is still 1.

From some tests, if I move lazy_init_num_threads after

if (!use_parallel) {
  f(begin, end);
  return;
}

it can work around the issue we see.
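Roughly the following shape (not the actual ATen ParallelOpenMP source; use_parallel stands in for the existing "range is large enough and we are not already in a parallel region" check):

```cpp
#include <ATen/Parallel.h>
#include <cstdint>

template <class F>
void parallel_for_sketch(int64_t begin, int64_t end, int64_t grain_size, const F& f) {
  if (begin >= end) {
    return;
  }
  const bool use_parallel =
      (end - begin) > grain_size && !at::in_parallel_region();
  if (!use_parallel) {
    f(begin, end);  // run inline; no thread bookkeeping needed
    return;
  }
  at::internal::lazy_init_num_threads();  // moved after the early return
  // ... launch the omp parallel region and invoke f on each chunk ...
}
```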

CC @jgong5 for the discussion here.

peterbell10 added a commit that referenced this pull request Jun 17, 2021
Fixes #59149 (comment)

`parallel_for` will call the function directly if it would have run on only a
single thread anyway. This is great for performance, but causes an issue in
nested parallel regions because `get_thread_num` will reflect the parent
parallel region instead of the current `parallel_for` call.

I fix this by using a `thread_local` variable for the current thread id and
manually setting it before each call to the user-provided function.

[ghstack-poisoned]
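A minimal sketch of that idea (not the actual ATen change; the guard name and namespace here are hypothetical):

```cpp
#include <cstdint>

namespace sketch {

// Thread id as seen by get_thread_num(), owned by parallel_for rather than
// inherited from any enclosing omp parallel region.
thread_local int this_thread_id = 0;

int get_thread_num() { return this_thread_id; }

// RAII guard: set the id before calling the user-provided function,
// restore it afterwards.
struct ThreadIdGuard {
  explicit ThreadIdGuard(int id) : old_id_(this_thread_id) { this_thread_id = id; }
  ~ThreadIdGuard() { this_thread_id = old_id_; }
 private:
  int old_id_;
};

template <class F>
void invoke_chunk(int chunk_id, int64_t begin, int64_t end, const F& f) {
  ThreadIdGuard guard(chunk_id);  // manually set before each call
  f(begin, end);
}

}  // namespace sketch
```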
peterbell10 added a commit that referenced this pull request Jun 22, 2021
Fixes #59149 (comment)

`parallel_for` will call the function directly if it would have run on only a
single thread anyway. This is great for performance, but causes an issue in
nested parallel regions because `get_thread_num` will reflect the parent
parallel region instead of the current `parallel_for` call.

I fix this by using a `thread_local` variable for the current thread id and
manually setting it before each call to the user-provided function.

Differential Revision: [D29287816](https://our.internmc.facebook.com/intern/diff/D29287816)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Jun 23, 2021
Summary:
Pull Request resolved: #60183

Fixes #59149 (comment)

`parallel_for` will call the function directly if it would have run on only a
single thread anyway. This is great for performance, but causes an issue in
nested parallel regions because `get_thread_num` will reflect the parent
parallel region instead of the current `parallel_for` call.

I fix this by using a `thread_local` variable for the current thread id and
manually setting it before each call to the user-provided function.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29287816

Pulled By: ngimel

fbshipit-source-id: 777f771a0900750c7f22eb1dd185d84d19282108

Labels

cla signed, Merged, open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migrate nonzero from the TH to Aten (CPU)

5 participants