- 
                Notifications
    You must be signed in to change notification settings 
- Fork 25.7k
          Implement std for multiple dimensions on CPU devices.
          #14535
        
          New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tested on a tensor with 1 billion elements and 3 dimensions on a powerful, highly
multi-core Linux machine.
parallelized: All operations (e.g., `t.std(1)`) that could be done in the old code are now several times faster. All
new operations (e.g., `t.std((0,2))` are significantly faster than the NumPy equivalents.
`t.std((0, 1, 2))`, a new operation, is logically equivalent to the
old `t.std()`, but faster.
serial: The above comment about old operationos now being faster still
holds, but `t.std((t1, ..., tn))` is now a few
times slower than `t.std()`. If this turns out to be important, we can
special-case that to use the old algorithm.
The approach is to create a new method, `TensorIterator::foreach_reduced_elt`,
valid for `TensorIterator`s that represent a dimension reduction. This
method calls a supplied function for each element in the output,
supplying it with the input elements that correspond to that output.
Given that primitive, we can implement reductions like the following pseudocode:
If there is more than one output element:
```
PARALLEL FOR EACH element IN output:
    accumulator = identity
    SERIAL FOR EACH data_point IN element.corresponding_input:
        accumulator.update(data_point)
    element = accumulator.to_output()
```
If there is only one output element, we still want to parallelize, so we
do so along the *input* instead:
```
accumulators[n_threads]
PARALLEL FOR EACH input_chunk IN input.chunks():
    accumulators[thread_num()] = identity
    SERIAL FOR EACH data_point IN input_chunk:
        accumulators[thread_num()].update_with_data(data_point)
accumulator = identity
SERIAL FOR EACH acc in accumulators:
    accumulator.update_with_other_accumulator(acc)
output_element = accumulator.to_output()
```
Note that accumulators and data points do not have to be the same type
in general, since it might be necessary to track arbitrary amounts of
data at intermediate stages.
For example, for `std`, we use a parallel version of Welford's
algorithm, which requies us to track the mean, second moment, and number
of elements, so the accumulator type for `std` contains three pieces of
data.
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
| check_sum_dim(self._make_tensors((50, 50, 50)), 2) | ||
| check_sum_dim(self._make_tensors((50, 50, 50)), (1, 2)) | ||
| check_sum_dim(self._make_tensors((50, 50, 50)), (1, -1)) | ||
| for sizes, dim in DIM_TEST_SCENARIOS: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for the unrolled pattern in the previous code is that it makes it easier to figure out which test case broke in a test failure. With the for-loop, the stack trace doesn't give any information about the failing data.
| } | ||
| } | ||
|  | ||
| void TensorIterator::narrow_all(int start_dim, IntList starts) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
Tested on a tensor with 1 billion elements and 3 dimensions on a powerful, highly
multi-core Linux machine.
parallelized: All operations (e.g., `t.std(1)`) that could be done in the old code are now several times faster. All
new operations (e.g., `t.std((0,2))` are significantly faster than the NumPy equivalents.
`t.std((0, 1, 2))`, a new operation, is logically equivalent to the
old `t.std()`, but faster.
serial: The above comment about old operationos now being faster still
holds, but `t.std((t1, ..., tn))` is now a few
times slower than `t.std()`. If this turns out to be important, we can
special-case that to use the old algorithm.
The approach is to create a new method, `TensorIterator::foreach_reduced_elt`,
valid for `TensorIterator`s that represent a dimension reduction. This
method calls a supplied function for each element in the output,
supplying it with the input elements that correspond to that output.
Given that primitive, we can implement reductions like the following pseudocode:
If there is more than one output element:
```
PARALLEL FOR EACH element IN output:
    accumulator = identity
    SERIAL FOR EACH data_point IN element.corresponding_input:
        accumulator.update(data_point)
    element = accumulator.to_output()
```
If there is only one output element, we still want to parallelize, so we
do so along the *input* instead:
```
accumulators[n_threads]
PARALLEL FOR EACH input_chunk IN input.chunks():
    accumulators[thread_num()] = identity
    SERIAL FOR EACH data_point IN input_chunk:
        accumulators[thread_num()].update_with_data(data_point)
accumulator = identity
SERIAL FOR EACH acc in accumulators:
    accumulator.update_with_other_accumulator(acc)
output_element = accumulator.to_output()
```
Note that accumulators and data points do not have to be the same type
in general, since it might be necessary to track arbitrary amounts of
data at intermediate stages.
For example, for `std`, we use a parallel version of Welford's
algorithm, which requies us to track the mean, second moment, and number
of elements, so the accumulator type for `std` contains three pieces of
data.
Pull Request resolved: pytorch/pytorch#14535
Differential Revision: D13283887
Pulled By: umanwizard
fbshipit-source-id: 8586b7bf00bf9f663c55d6f8323301e257f5ec3f
    Summary: This is the CUDA version of #14535 . It refactors Reduce.cuh to allow more general classes of reductions to be performed -- we no longer assume that the temporary data returned during reduction is just one scalar, and instead allow an arbitrary accumulate type. We also allow 64-bit indexing when necessary, since in general we will no longer be able to accumulate directly in the output. (In the cases when we can, we continue to split the tensors until they can be addressed with 32-bits, as before). As an initial use-case, we implement `std` in multiple dimensions. Pull Request resolved: #14990 Differential Revision: D13405097 Pulled By: umanwizard fbshipit-source-id: a56c24dc2fd5326d417632089bd3f5c4f9f0d2cb
Performance summary
Tested on a tensor with 1 billion elements and 3 dimensions on a powerful, highly
multi-core Linux machine.
parallelized: All operations (e.g.,
t.std(1)) that could be done in the old code are now several times faster. Allnew operations (e.g.,
t.std((0,2))are significantly faster than the NumPy equivalents.t.std((0, 1, 2)), a new operation, is logically equivalent to theold
t.std(), but faster.serial: The above comment about old operationos now being faster still
holds, but
t.std((t1, ..., tn))is now a fewtimes slower than
t.std(). If this turns out to be important, we canspecial-case that to use the old algorithm.
Explanation
The approach is to create a new method,
TensorIterator::foreach_reduced_elt,valid for
TensorIterators that represent a dimension reduction. Thismethod calls a supplied function for each element in the output,
supplying it with the input elements that correspond to that output.
Given that primitive, we can implement reductions like the following pseudocode:
If there is more than one output element:
If there is only one output element, we still want to parallelize, so we
do so along the input instead:
Note that accumulators and data points do not have to be the same type
in general, since it might be necessary to track arbitrary amounts of
data at intermediate stages.
For example, for
std, we use a parallel version of Welford'salgorithm, which requies us to track the mean, second moment, and number
of elements, so the accumulator type for
stdcontains three pieces ofdata.