Migrate nonzero from TH to ATen (CPU) #59149
Conversation
💊 CI failures summary and remediations
As of commit beac439 (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns
The following CI failures do not appear to be due to upstream breakages:

| Job | Step | Action |
|---|---|---|
| | Download PyTorch Test Reports | 🔁 rerun |

1 job timed out:
pytorch_linux_xenial_py3_clang5_asan_test2
|
@JackCaoG any idea why these changes might be triggering XLA to segfault? |
|
If XLA's implementation internally used `count_nonzero`, then changes to its dispatch could affect it, but I'm just guessing. |
Force-pushed from 6278223 to 3997d29
|
Funny thing is, in pytorch's indexing implementation, when
|
|
@ngimel would it be reasonable to leave the output as contiguous for now and try to figure out the XLA issues in another PR? |
|
@peterbell10 yes, let's do that. |
|
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
|
@peterbell10 Thanks a lot for separating the change! If you send a PR to return non-contiguous output and cc me on it, I can take a look and see if there's anything we can do to unblock. Thanks!! |
Summary: Resubmit of pytorch#58811, Closes pytorchgh-24745

The existing PR (pytorchgh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue; but it breaks down for example with channels last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm. As part of this, I wrote a custom `count_nonzero` that gives per-thread counts which is necessary to write the outputs in the right location.

| Shape | Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us | 2150 us | 551 us |
| 128,128,32 | 1250 us | 1020 us | 197 us |
| 64,128,32 | 581 us | 495 us | 99 us |
| 32,128,32 | 292 us | 255 us | 83 us |
| 16,128,32 | 147 us | 126 us | 75 us |
| 8,128,32 | 75 us | 65 us | 65 us |
| 4,128,32 | 39 us | 33 us | 33 us |
| 2,128,32 | 20 us | 18 us | 18 us |
| 1,128,32 | 11 us | 9 us | 9 us |

Pull Request resolved: pytorch#59149
Reviewed By: mruberry
Differential Revision: D28817466
Pulled By: ngimel
fbshipit-source-id: f08f6c003c339368fd53dabd28e9ada9e59de732
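To make the per-thread counting strategy above concrete, here is a minimal standalone sketch of the two-pass approach (count per chunk, exclusive scan into write offsets, then write). This is illustrative only and not the actual ATen kernel; `nonzero_1d` and the serial loop over chunks are assumptions made for brevity, where the real code would use `at::parallel_for`:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Simplified 1-D illustration of the parallel nonzero strategy: each thread
// counts its nonzeros, an exclusive scan turns the counts into write offsets,
// and a second pass writes indices into disjoint slices of the output.
std::vector<int64_t> nonzero_1d(const std::vector<float>& data, int num_threads) {
  const int64_t n = static_cast<int64_t>(data.size());
  const int64_t chunk = (n + num_threads - 1) / num_threads;

  // Pass 1: per-thread counts (each iteration would be one parallel task).
  std::vector<int64_t> counts(num_threads, 0);
  for (int t = 0; t < num_threads; ++t) {
    const int64_t begin = std::min<int64_t>(t * chunk, n);
    const int64_t end = std::min<int64_t>(begin + chunk, n);
    for (int64_t i = begin; i < end; ++i)
      counts[t] += (data[i] != 0.0f);
  }

  // Exclusive scan gives each thread its starting position in the output.
  std::vector<int64_t> offsets(num_threads, 0);
  std::partial_sum(counts.begin(), counts.end() - 1, offsets.begin() + 1);

  // Pass 2: each thread writes into its own disjoint region.
  std::vector<int64_t> out(offsets.back() + counts.back());
  for (int t = 0; t < num_threads; ++t) {
    const int64_t begin = std::min<int64_t>(t * chunk, n);
    const int64_t end = std::min<int64_t>(begin + chunk, n);
    int64_t pos = offsets[t];
    for (int64_t i = begin; i < end; ++i)
      if (data[i] != 0.0f) out[pos++] = i;
  }
  return out;
}
```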
|
Hi @peterbell10. Thanks for the PR. I have one question: what will happen if I invoke the new nonzero in C++ with OpenMP? |
|
You cannot do this, |
|
Thanks for the comment. OpenMP can't be nested, so it's expected that (pytorch/aten/src/ATen/native/TensorAdvancedIndexing.cpp, lines 1549 to 1550 in 1c502d1)
And I am not sure: in this case, what's the behavior of |
|
According to the linked issue, you cannot use tensor operations inside |
|
Sorry for the confusion. Actually, I am using It worked previously, and I suspect the results changed once we use the version of parallelized @ngimel Thanks for the explanation. Since |
|
Is your question theoretical, or are you actually getting errors and/or wrong results? Inside an existing As for why, the |
|
Hi @peterbell10, thanks for the comment. I do get errors in some cases, and I have uploaded the UT to reproduce them here. You can find a different output of With some debugging, here are some clues: if I run the UT with 2 threads, then in the second thread, whose tid is 1, the output of
Then the output write would be out of bounds, since both tid and the size of thread_begin are 1 now.
|
|
Here are some thoughts on what I see in the UT with 2 threads: since we are using (pytorch/aten/src/ATen/ParallelOpenMP.h, line 26 in 59b1003)
It will initialize the number of threads the first time parallel_for is hit inside a specific thread. So here are the cases:
With some tests, if I move (pytorch/aten/src/ATen/ParallelOpenMP.h, lines 31 to 34 in 59b1003)
CC @jgong5 for the discussion here. |
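A minimal standalone OpenMP example (not PyTorch code, written only to illustrate the behaviour being debugged above): when the inner "parallel" helper degenerates to a plain serial call inside an outer parallel region, `omp_get_thread_num()` still reports the outer region's thread id rather than 0, so using it to index per-thread buffers sized for the inner call can go out of range:

```cpp
#include <cstdio>
#include <omp.h>

// Illustrative helper: imagine this is a parallel_for that took the
// single-thread fast path and just called the user function directly.
void inner_serial_work() {
  std::printf("inner sees thread id %d\n", omp_get_thread_num());
}

int main() {
  #pragma omp parallel num_threads(2)
  {
    inner_serial_work();  // prints 0 and 1 (in some order), not 0 twice
  }
  return 0;
}
```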
Fixes #59149 (comment) `parallel_for` will call the function directly if it would have run on only a single thread anyway. This is great for performance, but causes an issue in nested parallel regions because `get_thread_num` will reflect the parent parallel region instead of the current `parallel_for` call. I fix this by using a `thread_local` variable for the current thread id and manually setting it before each call to the user-provided function. [ghstack-poisoned]
Summary: Pull Request resolved: #60183

Fixes #59149 (comment)

`parallel_for` will call the function directly if it would have run on only a single thread anyway. This is great for performance, but causes an issue in nested parallel regions because `get_thread_num` will reflect the parent parallel region instead of the current `parallel_for` call. I fix this by using a `thread_local` variable for the current thread id and manually setting it before each call to the user-provided function.

Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D29287816
Pulled By: ngimel
fbshipit-source-id: 777f771a0900750c7f22eb1dd185d84d19282108
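A minimal sketch of the fix described in that commit message, assuming a simplified `parallel_for` shape; the names in namespace `sketch` are hypothetical and this is not the actual ATen code:

```cpp
#include <cstdint>

namespace sketch {

// Logical thread id tracked per OS thread, independent of any enclosing
// OpenMP region, so the serial fast path always reports id 0.
thread_local int current_thread_id = 0;

inline int get_thread_num() { return current_thread_id; }

template <typename F>
void parallel_for(int64_t begin, int64_t end, int64_t grain_size, const F& f) {
  if (end - begin <= grain_size) {
    current_thread_id = 0;  // force a consistent id on the single-thread path
    f(begin, end);
    return;
  }
  // ... dispatch chunks to worker threads, setting current_thread_id to each
  // worker's logical index before invoking f on its range ...
}

}  // namespace sketch
```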
Resubmit of #58811, Closes gh-24745
The existing PR (gh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue, but it breaks down, for example, with the channels-last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm. As part of this, I wrote a custom `count_nonzero` that gives per-thread counts, which is necessary to write the outputs in the right location.
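As a usage-level illustration of the non-contiguous behaviour described above (a hypothetical check, not part of the PR's test suite), one could verify from C++ that a channels-last input yields the same indices as its contiguous counterpart:

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  // Boolean mask in default (contiguous) layout and the same data in
  // channels-last layout; nonzero must return row-major ordered indices
  // for both, so the outputs should match exactly.
  at::Tensor mask = at::randn({2, 3, 4, 4}) > 0.5;
  at::Tensor mask_nhwc = mask.contiguous(at::MemoryFormat::ChannelsLast);

  std::cout << "layout-independent: "
            << at::nonzero(mask).equal(at::nonzero(mask_nhwc)) << "\n";
  return 0;
}
```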