Implement std for multiple dimensions on CPU devices.
#14535
Conversation
Performance summary

Tested on a tensor with 1 billion elements and 3 dimensions on a powerful, highly multi-core Linux machine.

parallelized: All operations (e.g., `t.std(1)`) that could be done in the old code are now several times faster. All new operations (e.g., `t.std((0, 2))`) are significantly faster than the NumPy equivalents. `t.std((0, 1, 2))`, a new operation, is logically equivalent to the old `t.std()`, but faster.

serial: The above comment about old operations now being faster still holds, but `t.std((t1, ..., tn))` is now a few times slower than `t.std()`. If this turns out to be important, we can special-case that to use the old algorithm.
Explanation

The approach is to create a new method, `TensorIterator::foreach_reduced_elt`, valid for `TensorIterator`s that represent a dimension reduction. This method calls a supplied function for each element in the output, supplying it with the input elements that correspond to that output.
Given that primitive, we can implement reductions like the following pseudocode:
If there is more than one output element:
```
PARALLEL FOR EACH element IN output:
accumulator = identity
SERIAL FOR EACH data_point IN element.corresponding_input:
accumulator.update(data_point)
element = accumulator.to_output()
```
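As a concrete illustration of this pattern, here is a small, self-contained sketch in plain C++ with OpenMP. It computes a per-row sum of a row-major matrix rather than `std`, and it uses OpenMP directly instead of the `TensorIterator` machinery in this PR, so it only shows the shape of the loop, not the actual implementation.
```
// Sketch of "parallelize over output elements": one output element per row,
// each reduced serially by its own accumulator. Illustrative only.
#include <cstdio>
#include <vector>

int main() {
  const int rows = 4, cols = 1000;
  std::vector<double> input(rows * cols, 1.0);
  std::vector<double> output(rows, 0.0);

  // PARALLEL FOR EACH element IN output:
  #pragma omp parallel for
  for (int r = 0; r < rows; ++r) {
    double accumulator = 0.0;  // identity
    // SERIAL FOR EACH data_point IN element.corresponding_input:
    for (int c = 0; c < cols; ++c) {
      accumulator += input[r * cols + c];  // accumulator.update(data_point)
    }
    output[r] = accumulator;  // element = accumulator.to_output()
  }

  std::printf("output[0] = %f\n", output[0]);  // expect 1000.000000
  return 0;
}
```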
If there is only one output element, we still want to parallelize, so we
do so along the *input* instead:
```
accumulators[n_threads]
PARALLEL FOR EACH input_chunk IN input.chunks():
accumulators[thread_num()] = identity
SERIAL FOR EACH data_point IN input_chunk:
accumulators[thread_num()].update_with_data(data_point)
accumulator = identity
SERIAL FOR EACH acc in accumulators:
accumulator.update_with_other_accumulator(acc)
output_element = accumulator.to_output()
```
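The same caveat applies here: the following is a self-contained C++/OpenMP sketch of the single-output pattern (again a plain sum, not `std`, and not this PR's code). Each thread folds its chunk of the input into its own accumulator, and the per-thread accumulators are merged serially at the end. It needs to be built with OpenMP enabled (e.g., `-fopenmp`).
```
// Sketch of "parallelize over the input": per-thread partial accumulators,
// merged serially into the single output element. Illustrative only.
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
  const int n = 1000000;
  std::vector<double> input(n, 1.0);

  // accumulators[n_threads]
  std::vector<double> accumulators(omp_get_max_threads(), 0.0);

  // PARALLEL FOR EACH input_chunk IN input.chunks():
  #pragma omp parallel
  {
    double local = 0.0;  // accumulators[thread_num()] = identity
    // SERIAL FOR EACH data_point IN input_chunk:
    #pragma omp for
    for (int i = 0; i < n; ++i) {
      local += input[i];  // update_with_data(data_point)
    }
    accumulators[omp_get_thread_num()] = local;
  }

  double accumulator = 0.0;  // identity
  // SERIAL FOR EACH acc in accumulators:
  for (double acc : accumulators) {
    accumulator += acc;  // update_with_other_accumulator(acc)
  }

  // output_element = accumulator.to_output()
  std::printf("sum = %f\n", accumulator);  // expect 1000000.000000
  return 0;
}
```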
Note that accumulators and data points do not have to be the same type
in general, since it might be necessary to track arbitrary amounts of
data at intermediate stages.
For example, for `std`, we use a parallel version of Welford's algorithm, which requires us to track the mean, second moment, and number of elements, so the accumulator type for `std` contains three pieces of data.
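For concreteness, here is a minimal, self-contained sketch of such an accumulator: Welford's serial update plus the standard formula for merging two partial accumulators (the Chan et al. parallel variant). The struct and method names are made up for this sketch; it is not the accumulator type actually added in this PR.
```
// Welford accumulator: three pieces of state (count, mean, and M2, the sum of
// squared deviations from the mean), a serial update, and a merge step.
#include <cmath>
#include <cstdio>
#include <initializer_list>

struct WelfordAcc {
  double n = 0, mean = 0, m2 = 0;

  // Fold in one data point (serial Welford update).
  void update(double x) {
    n += 1;
    double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean);
  }

  // Merge another partial accumulator into this one.
  void merge(const WelfordAcc& o) {
    if (o.n == 0) return;
    double delta = o.mean - mean;
    double total = n + o.n;
    mean += delta * o.n / total;
    m2 += o.m2 + delta * delta * n * o.n / total;
    n = total;
  }

  // Sample (Bessel-corrected) standard deviation; requires n > 1.
  double std() const { return std::sqrt(m2 / (n - 1)); }
};

int main() {
  WelfordAcc a, b;
  for (double x : {1.0, 2.0, 3.0}) a.update(x);
  for (double x : {4.0, 5.0, 6.0}) b.update(x);
  a.merge(b);                           // combine two partial results
  std::printf("std = %f\n", a.std());   // std of {1,...,6}, about 1.870829
  return 0;
}
```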
@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
check_sum_dim(self._make_tensors((50, 50, 50)), 2)
check_sum_dim(self._make_tensors((50, 50, 50)), (1, 2))
check_sum_dim(self._make_tensors((50, 50, 50)), (1, -1))
for sizes, dim in DIM_TEST_SCENARIOS:
The reason for the unrolled pattern in the previous code is that it makes it easier to figure out which test case broke in a test failure. With the for-loop, the stack trace doesn't give any information about the failing data.
Pull Request resolved: pytorch/pytorch#14535
Differential Revision: D13283887
Pulled By: umanwizard
fbshipit-source-id: 8586b7bf00bf9f663c55d6f8323301e257f5ec3f
Summary: This is the CUDA version of #14535. It refactors Reduce.cuh to allow more general classes of reductions to be performed -- we no longer assume that the temporary data returned during reduction is just one scalar, and instead allow an arbitrary accumulate type. We also allow 64-bit indexing when necessary, since in general we will no longer be able to accumulate directly in the output. (In the cases when we can, we continue to split the tensors until they can be addressed with 32 bits, as before.) As an initial use case, we implement `std` in multiple dimensions.
Pull Request resolved: #14990
Differential Revision: D13405097
Pulled By: umanwizard
fbshipit-source-id: a56c24dc2fd5326d417632089bd3f5c4f9f0d2cb
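To make the "arbitrary accumulate type" idea concrete, here is a small, self-contained C++ sketch of the kind of op such a generalized reduction framework can accept: a `reduce` step that folds one value into a partial accumulator, a `combine` step that merges two partial accumulators (e.g., from different threads or blocks), and a `project` step that converts the accumulator into the output scalar. The accumulator here is a (sum, count) pair for a mean reduction, so it is not a single scalar of the output type. The names and interface are illustrative; they are not necessarily the exact template interface of `Reduce.cuh`.
```
// Generalized reduction op sketch: the accumulate type (MeanAcc) differs from
// both the input type and the output type, and only `project` produces output.
#include <cstdint>
#include <cstdio>
#include <vector>

struct MeanAcc {
  double sum;
  int64_t count;  // 64-bit, since we can no longer accumulate directly in the output
};

struct MeanOps {
  // Fold one input value into a partial accumulator.
  static MeanAcc reduce(MeanAcc a, double x) {
    return MeanAcc{a.sum + x, a.count + 1};
  }
  // Merge two partial accumulators.
  static MeanAcc combine(MeanAcc a, MeanAcc b) {
    return MeanAcc{a.sum + b.sum, a.count + b.count};
  }
  // Convert the final accumulator into the output scalar.
  static double project(MeanAcc a) {
    return a.sum / static_cast<double>(a.count);
  }
};

int main() {
  std::vector<double> data{1, 2, 3, 4};
  MeanAcc acc{0.0, 0};
  for (double x : data) acc = MeanOps::reduce(acc, x);
  std::printf("mean = %f\n", MeanOps::project(acc));  // 2.500000
  return 0;
}
```
A Welford-style `std` op fits the same three-hook shape, with the three-field accumulator described for the CPU implementation above.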