Get more fusion after autodiff uses SumToSize by t-vi · Pull Request #14957 · pytorch/pytorch · GitHub

Conversation

@t-vi (Collaborator) commented Dec 9, 2018

Here is a fresh attempt at getting some fusion back in autodiff-generated graphs in the presence of SumToSize.

  • The sum-to-size operator is now aten::_grad_sum_to_size to allow symbolic script differentiation (which in turn would need to use this in place of sum_to_size to signal that it strictly operates on gradients). This is also used in the autodiff code, replacing prim::SumToSize.
  • _grad_sum_to_size is now fusable; cats - which are fused afterwards thanks to Adam's simplification of the code - are only fused if there is no _grad_sum_to_size in the fusion group.
  • I push the _grad_sum_to_size out of the fusion group when compiling and record the desired summations in the KernelSpec. The reasoning is the following (see the sketch after this list):
    • As the autodiff is a repeated application of the chain rule, we always have the pattern grad_in = mm(A, grad_out), with A often diagonal for cases interesting to the fuser, whence it is grad_in = a * grad_out (a pointwise multiplication). We know that only grad_out may have AutodiffGradSumToSize applied, so we can commute AutodiffGradSumToSize with the mul (and div and neg are of similar origin).
    • For type_as the gradient might just be providing the type, so we simply skip SumToSize there.
    • add (which was inserted as prim::AutogradAdd) adds gradients when the forward pass used the same value in several places. This addition is non-broadcasting, so we know that the two arguments have the same sizes as inputs - which is good, because we don't have to do bookkeeping for the two parts.
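A minimal sketch of the commutation argument above in plain PyTorch (not from the PR; x, grad_out and the factor a are invented for illustration). In this example a is constant along the dimension that the summation reduces over, so summing before or after the pointwise mul gives the same result, which is what allows the summation to be pushed out of the fusion group:

    import torch

    # grad_out has the broadcast shape; x had the smaller shape in the forward pass.
    x = torch.randn(3, 1)
    grad_out = torch.randn(3, 4)
    a = torch.randn(3, 1)  # "diagonal" chain-rule factor, broadcast like x

    # sum_to_size semantics: reduce the gradient back to the input's size
    # by summing over the broadcast dimensions (here dim 1).
    inside = grad_out.sum_to_size(x.shape) * a     # summation applied before the mul
    outside = (grad_out * a).sum_to_size(x.shape)  # summation pushed past the mul
    print(torch.allclose(inside, outside))         # True: the two forms agree here

If a varied along the summed dimension, the two forms would generally differ; this only illustrates a case where the commutation clearly holds.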

Details:

  • During fusion, the Tensor arguments are always kept as the first parameters of the fusion group to accommodate indexing assumptions in the fuser.
  • The rewriting of the fusion group to record the necessary output transformation and eliminate _grad_sum_to_size from the fusion group is now in the fuser compile step.
  • In the execution step, the arguments are split into Tensor / non-Tensor and the non-tensor args are mostly forgotten about, except for doing sum_to_size at the end (see the sketch after this list). This should be improved if/when we fuse non-constant scalar arguments.
  • In a number of places in the fuser, the non-Tensor arguments to the fusion group needed to be ignored.
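A minimal Python sketch of the execution-step split described in the last two bullets (the real logic lives in the C++ fuser; run_fusion_group, output_sum_specs and run_fused_kernel are hypothetical names used only for illustration):

    import torch

    def run_fusion_group(args, output_sum_specs, run_fused_kernel):
        # Tensor arguments come first, matching the fuser's indexing assumptions;
        # the trailing non-tensor arguments are only the recorded target sizes.
        tensor_args = [a for a in args if isinstance(a, torch.Tensor)]
        other_args = [a for a in args if not isinstance(a, torch.Tensor)]

        outputs = run_fused_kernel(tensor_args)

        # Apply the _grad_sum_to_size that was squeezed out of the kernel to the
        # corresponding outputs, after the fused kernel has run.
        results = []
        for out, spec in zip(outputs, output_sum_specs):
            if spec is None:
                results.append(out)  # no summation was recorded for this output
            else:
                results.append(out.sum_to_size(other_args[spec]))
        return results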

Thank you, @apaszke, for the insightful discussion. All bad ideas and errors are my own.

@facebook-github-bot added the oncall: jit label Dec 9, 2018
@apaszke (Contributor) left a comment

I think the removal of SumToSize nodes is happening way too early. Basically, you shouldn’t think of FusionGroups as graphs that have already been fused and will conform to those semantics, but as graphs that are eligible to be fused. That means that we still want to preserve the original semantics of the code, because it might turn out that our fusion guesses were wrong, and we will end up running a deoptimized version of the original code. Instead, we should allow putting them in FusionGroups and simply remove them right before a kernel is compiled (once we know that the fusion is valid, etc.).

Finally, marking those nodes as fusible is a bad idea, because the only reason to put them in a fusion group is that you are certain it will help you perform more fusions. That should be checked and processed similarly to how we deal with rearranging chunk nodes.

@t-vi (Collaborator, Author) commented Dec 10, 2018

Thanks for your comments Adam!

  • I'll rename the prim::GradSumToSize.
  • So I'll move the graph rewriting into the fuser codegen.
  • For "when to fuse SumToSize", would it be OK to put them into the fusion group if there aren't any FusedConcat nodes in there?
    This would mean that we might end up with SumToSize at the top of the fusion group which we would undo before the fusion, but I'm a bit weary that GraphFuser.run will get considerably more complicated if we split out the scan phase as done for chunk.

@t-vi (Collaborator, Author) commented Dec 13, 2018

Hmh. I need to rebase.
So I think I'm not fusing sumtosize any more when concat is close.
I'm not as sure about the "when to relocate sumtosize": If I move that into kernel generation, is it still safe to move the sumtosize to outside the fusion group? I'll try that next, but I'm still a bit sceptical about it.

apaszke and others added 19 commits December 30, 2018 18:19
We don't support reductions yet, but simply decomposing batch_norm
into a kernel that computes the stats, and then fusing everything else
with ReLU and the following pointwise ops, provides nice speedups.

Note that this is only limited to inference mode for now, because we
don't support convolutions and batch norm in AD, so the fuser isn't
applied to those parts.
That makes the definition of a "fusable node" much simpler,
as we don't need to keep considering whether something has to be an
"exit node" at every step. The fuser now tries to maximize the
pointwise fusions first, and proceeds to prepending chunks and appending
concats only once a fixed point is reached.

This patch not only makes the fuser much simpler to reason about,
it also makes it significantly easier to implement features like
SumToSize fusion, which improves the performance of derivative graphs.
@t-vi (Collaborator, Author) commented Jan 17, 2019

Thanks, @ngimel, for raising this. I'll see how to fix that. My understanding is that we would need to deduplicate it for the kernel, but not for the _grad_sum_to_size application after running the fused kernel. That in turn means we have different outputs for the fused kernel vs. the fusion group.

t-vi added 3 commits January 19, 2019 11:02
After squeezing out the _grad_sum_to_size during kernel compilation,
we may end up with duplicate outputs.
For example, the backward of
    def fn1(x,y,z):
        a = x+y+z
        return torch.sigmoid(a)
has that.
Thank you @ngimel for noting and providing the example!
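To make the duplication concrete, here is an illustration in plain PyTorch (shapes invented for the example): with broadcasting inputs, the gradients for x, y and z are all _grad_sum_to_size applied to the same intermediate gradient, so once the summations are squeezed out of the fused kernel, the kernel would otherwise return the same tensor several times.

    import torch

    def fn1(x, y, z):
        a = x + y + z
        return torch.sigmoid(a)

    x = torch.randn(4, 1, requires_grad=True)
    y = torch.randn(1, 4, requires_grad=True)
    z = torch.randn(4, 4, requires_grad=True)
    out = fn1(x, y, z)

    # The gradient flowing into a is the same tensor for all three inputs;
    # only the trailing sum_to_size differs per input.
    grad_a = out * (1 - out)  # d sigmoid(a)/da, with an all-ones incoming gradient
    gx, gy, gz = torch.autograd.grad(out, (x, y, z), torch.ones_like(out))
    print(torch.allclose(gx, grad_a.sum_to_size(x.shape)))  # True
    print(torch.allclose(gy, grad_a.sum_to_size(y.shape)))  # True
    print(torch.allclose(gz, grad_a))                       # True: z was not broadcast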
@t-vi (Collaborator, Author) commented Jan 19, 2019

So I added the output deduplication in the fuser and a test using @ngimel's example (thanks again!).

@apaszke (Contributor) left a comment

I think this should still be improved before we land. First, it can fail when one encounters some non-basic fused nodes, and there's no fallback in that case. Also, it would be good to improve the quality of fusability checks for aten::_grad_sum_to_size to avoid fusing them unnecessarily.

      return false;
    - return node->kind() == prim::FusionGroup || isSimpleMap(node);
    + return node->kind() == prim::FusionGroup ||
    +     node->kind() == aten::_grad_sum_to_size || isSimpleMap(node);
Contributor:

I'd really like us to avoid putting _grad_sum_to_size nodes in fusion groups unnecessarily. We should either have stronger checks for them (that adding them would in fact help us fuse more), or we should have a postprocessing pass that e.g. will move them out of the group if they are applied to inputs, or create outputs.

@t-vi (Collaborator, Author) commented Jan 26, 2019

Thanks for going through this!

So postprocessing or checking if the input of _grad_sum_to_size is fusible? Do you have a preference? (Edit: Would just checking isFusable(node->inputs()[0]->node()) be enough?)
Would the "create outputs" comment be mitigated by the deduplication in the fuser itself?
Personally, I would envision that one would move cases where the sumtosizes of the outputs are "ascending" (i.e. you can sort the dimensions in a way that every tensor only has the summations last) into the kernel itself.

@t-vi (Collaborator, Author):

So I now do this. This probably means we do not fuse some cases where we would like to - e.g. the milstm backward could be such a case - but we are doing better than before, and maybe we are just being conservative.

@t-vi (Collaborator, Author) commented Jan 27, 2019

@apaszke I think I acted on your comments (though I have the feeling that there might be a nicer way to cover the two uses of trackSingleGradSumToSizeToOutputs, but I don't really have an idea how).

@apaszke (Contributor) left a comment

Looks great! Some minor comments, but should be good to land

      return at::nullopt;
    }
    if (producer->node()->kind() == aten::_grad_sum_to_size &&
        consumer->kind() == prim::FusionGroup) {
Contributor:

Can we assert that consumer is a fusion group? If I understand correctly this is the only case possible by design, and having just a check like this means that the case would simply get skipped silently if that assumption ever stopped holding.

@t-vi (Collaborator, Author) commented Jan 30, 2019

I think the case where the consumer isn't a fusion group (i.e. a new fusion group will be started with a _grad_sum_to_size for the output) is legitimate; in that case we would just have the empty set to check.
The main alternative I see would be to refuse to fuse here, but that would only result in us having two _grad_sum_to_sizes in a row during execution (one pushed from inside the fusion group and one that we didn't fuse because it was at the output).

@t-vi (Collaborator, Author) commented Jan 30, 2019

So the three CI failures appear to be unrelated to my changes (flake8 in onnx/symbolic.py, which I don't think I touch; Python too old to download mypy; UtilsNMSTest.GPUEqualsCPUCorrectnessTest, which is perhaps the one I know the least about).

@facebook-github-bot (Contributor) left a comment

@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jan 31, 2019
Pull Request resolved: pytorch/pytorch#14957

Differential Revision: D13888173

Pulled By: zou3519

fbshipit-source-id: 071992c876e8b845f2b3e6329ae03a835d39a0ea
@apaszke (Contributor) commented Feb 1, 2019

This is exciting! I'll have to rerun my benchmarks now!
