Fixes error when too many parameters are passed to fused cuda kernel by royju · Pull Request #18063 · pytorch/pytorch · GitHub

Conversation

@royju
Contributor

@royju royju commented Mar 15, 2019

Bug fix for #15043, where a large fusion in the JIT produces a kernel whose number of arguments exceeds the limit allowed by nvrtc on a CUDA device.
The fix is to check the number of arguments before a cuda kernel is generated. If the number exceeds the limit, take the runFallBack() path.
Add a reduced test from the original issue to keep the test time low. The test would fail without this fix.
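For readers skimming the thread, here is a minimal sketch of the guard being described, assuming a 128-argument limit and hypothetical names (fusion_kernel_args_limit, decideFusionPath); the actual change lives in the JIT fuser and executor, and the exact limit depends on nvrtc and the device.

#include <cstddef>

// Assumed limit; nvrtc rejects fused kernels with too many parameters.
constexpr std::size_t fusion_kernel_args_limit = 128;

enum class FusionDecision { CompileFusedKernel, RunFallback };

// Count every flattened input and output the fused kernel would take; if the
// total is over the limit, take the fallback (interpreter) path instead of
// generating a CUDA kernel that nvrtc would reject.
FusionDecision decideFusionPath(std::size_t flat_inputs, std::size_t flat_outputs) {
  if (flat_inputs + flat_outputs > fusion_kernel_args_limit) {
    return FusionDecision::RunFallback;
  }
  return FusionDecision::CompileFusedKernel;
}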

@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Mar 15, 2019
Contributor

@apaszke apaszke left a comment


We should avoid generating fusion groups this large altogether. This patch fixes the bug but silently disables an optimization in a case which is legitimate.


class TestJit(JitTestCase):
    @unittest.skipIf(not RUN_CUDA, "requires CUDA")
    def test_large_nbr_kernel_args(self):
Contributor

This test will be very slow. Can we just prepare a graph that has ~200 ops that would normally get fused perfectly, but we can't because that's too much for a single kernel?

Contributor Author

Thanks for the comment. The number of kernel arguments and the number of ops in a fusion group can be related, but not always, and the cause of this bug is the former. While we may want to limit the number of ops in a fusion group, I consider that a separate issue to address. In fact, in the original test from the issue, the offending kernel has only about 10 ops in the fusion group, but because of a FusedConcat it has a large number of live-ins, which exceeds the limit. Hence, we need to constrain the # of arguments, not the # of ops, for this specific issue.
It's true that the fix gives up fusion on the entire FusionGroup if the number of args is over the limit. One could argue for limiting the FusionGroup by the # of args in GraphFuser; in fact, I did prototype that. But the trade-off isn't always straightforward. First, since this limit comes from the CUDA path, a change in GraphFuser could affect other devices. In addition, if one stops growing a FusionGroup by estimating the number of possible arguments during fusion, further fusion could actually bring the number of live-in/out arguments in the FusionGroup back below the limit. To keep this PR simple, I didn't continue down that route. As I have learned, GraphFuser is sometimes so aggressive that we bail out to the fallback path in the Executor more often than we'd like. When we address the aggressiveness of GraphFuser (including constraining the # of ops), we will reduce the chance of giving up an entire FusionGroup by taking a preventive measure (as opposed to a reactive one).
Lastly, on reducing the test case: I have already largely cut down the test and testing time from the original test. I tried again just now, but haven't made much progress. The challenge is that I need to maintain the number of arguments (> 130) while also making sure the GraphFuser fuses it. Currently, the body of each iteration is
b = input[i] * 2
output.append(b)
To keep each iteration generating one new arg to a later FusedConcat, I keep the indexing form "input[i]", which is translated to 3 ops (1 select op and 2 constants: the axis and the unrolled index). To trigger fusion, I need to add a pointwise op (one mul op). 4 ops * 130 unrolled iterations adds up to more than a couple of hundred ops. This test does take more time than some of the really small tests, but it doesn't seem totally out of line, and it's skipped on the non-CUDA path. I haven't found a good way to further reduce the test, probably because I'm still new to PyTorch, but I'm open to any suggestions.

Contributor

Ok, my bad. The test does look somewhat simple, but please don't call it GRU in this case.

Also, does your test actually even trigger the fuser? You never run the traced function, so the code probably doesn't even get compiled.

Finally, regarding the trade-offs, I think those are relatively simple. I doubt there are other devices where fusing more than 128 arguments into a single kernel would be beneficial, and emitting many kernels is always superior to emitting no kernels.

Contributor Author

Right. It's no longer the original GRU in the issue. I had a clarifying comment, but I will rename the class to avoid confusion.

Line 607, traced_gru = torch.jit.trace(gru, (input)), calls jit trace, so compilation, including the fuser, does get invoked.

Ok. I can re-introduce the logic to estimate and track the # of arguments during fusion and stop growing if the limit is exceeded. I'm at GTC much of the week, and will get back to this later this week. Thanks.

@royju
Contributor Author

royju commented Mar 22, 2019

Revision done according to the previous comments: the class in the test is renamed, and the # of kernel arguments is now estimated during fusion, bailing out if the limit may be exceeded.
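A rough sketch of that estimate, with plain counts standing in for JIT node inputs/outputs; the helper name and the 128 limit are assumptions that mirror the check quoted later in this thread.

#include <cstddef>

constexpr std::size_t fusion_kernel_args_limit = 128;  // assumed value

// Conservative estimate used while growing a fusion group: if the group's
// current live-ins/outs plus the candidate producer's inputs/outputs already
// exceed the limit, stop fusing rather than failing later in nvrtc.
bool estimateFitsArgLimit(std::size_t group_inputs, std::size_t group_outputs,
                          std::size_t producer_inputs, std::size_t producer_outputs) {
  return group_inputs + group_outputs + producer_inputs + producer_outputs <=
         fusion_kernel_args_limit;
}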

@royju
Contributor Author

royju commented Mar 22, 2019

ci/circleci: binary_linux_conda_2.7_cpu_build — Your tests failed on CircleCI

Mar 22 01:52:03 /opt/conda/conda-bld/pytorch-nightly-cpu_1553219249856/work/third_party/ideep/mkl-dnn/src/cpu/ref_rnn.cpp:979:50: error: ‘void cblas_sgemm_free(float*)’ is deprecated (declared at /opt/conda/conda-bld/pytorch-nightly-cpu_1553219249856/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/include/mkl_cblas.h:804) [-Werror=deprecated-declarations]
Mar 22 01:52:03 cblas_sgemm_free(weights(i, j, k));
Mar 22 01:52:03 ^

The above test failure is not caused by this commit. It also failed for other devs, and it appears the pytorch/pytorch repo already has this issue.

any_fused = true;
auto maybe_group = tryFuse(fused_cat, input);
if( !maybe_group ) {
  continue;
Collaborator

at this point any_fused is already set to true, even if no concat fusion happens, which will throw off subsequent checks.

Contributor Author

Thanks. Moved "any_fused = true;" down so it is set only after all the early bail-outs are done.
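A small, self-contained illustration of the reordering described here; the Node stand-in and the injected tryFuse callback are placeholders, and the thread later moves the check into canFuseWithConcat instead.

#include <vector>

struct Node {};  // stand-in for a JIT IR node

// The control-flow point being fixed: any_fused is recorded only after
// tryFuse actually succeeds, so a failed attempt (early bail-out) no longer
// throws off the checks that follow the loop.
bool fuseConcatInputs(Node* fused_cat, const std::vector<Node*>& inputs,
                      Node* (*tryFuse)(Node* consumer, Node* producer)) {
  bool any_fused = false;
  for (Node* input : inputs) {
    Node* maybe_group = tryFuse(fused_cat, input);
    if (!maybe_group) {
      continue;  // nothing was fused for this input
    }
    any_fused = true;
  }
  return any_fused;
}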

Contributor

Please adjust canFuseWithConcat instead. The point here was to make sure that this function is sufficient for tryFuse to succeed, which is checked by the assert below.

((before_check->inputs().size() + before_check->outputs().size() +
  producer->node()->inputs().size() + producer->node()->outputs().size())
 <= fusion_kernel_args_limit);

Collaborator

This seems unnecessary, as the same check will be performed a few lines later in tryFuse. Also, if you are trying to fuse concat with something other than a FusionGroup, this function will return true (line 1092), which does not look right.

Contributor Author

Remove this check and let the one in tryFuse() take care of this.

Remove the redundant check in canFuseWithConcat(); the check in tryFuse() covers it.
test/test_jit.py Outdated
def __init__(self, input_size, seq_len):
    super(Recurrence, self).__init__()
    self.input_size = input_size
    self.batch_first = True
Collaborator

you don't need batch_first, seq_len and input_size as module attributes for this test

Contributor Author

Done.


# Main loop
output = []
for i in range(self.seq_len):
Collaborator

can use input.size(0) instead of seq_len

Contributor Author

Using input.size(0) leads to the following warning. Since this is a test, it should be ok. Change done.
test_large_nbr_kernel_args (__main__.TestJit) ... test/test_jit.py:581: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
for i in range(input.size(0)):

Contributor Author

Put seq_len back as a module attribute to avoid the warning.

  // but this is checked later
  return isFusable(node->inputs()[0]->node());
}
if( (node->inputs().size() + node->outputs().size()) >
Collaborator

This check is not needed, because tryFuse will be called everywhere after isFusable, so only the check in tryFuse needs to be kept.

Contributor Author

This check is inside isFusableMap(), which is called by isFusable() and in one other place, tryToMoveChunk(). In the latter case, there is no immediate call to tryFuse() inside the function. It looks like we should keep the guard, no?

Collaborator

tryToMoveChunk does not make any fusions by itself, and thus should not blacklist any nodes. Any changes to a FusionGroup are caused by tryFuse, so checking (or estimating, as the case may be) the number of inputs/outputs only in tryFuse is better.

Contributor Author

Ok. Thanks for checking. This check in isFusableMap() is removed. Just pushed a new commit. CI is starting.

if (!node->is_constant(attr::dim))
  return false;

Node* list_construct = node->namedInput(attr::tensors)->node();
Collaborator

list_construct is the same as tensors_node a few lines below; you can move the tensors_node line here and reuse it.

Contributor Author

Done. Will push a commit soon after testing. Thanks.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

  continue;
}
any_fused = true;
AT_ASSERT(maybe_group && maybe_group == fused_cat);
Contributor

The first part of this assert is meaningless now

Contributor Author

Ok. The revision is done in the latest commit. A limit check is placed in canFuseWithConcat() to restore the original code sequence around the tryFuse() call in fuseConcats().
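Sketched with plain counts instead of real IR nodes (helper name and limit value are assumptions), the concat-side guard amounts to the following.

#include <cstddef>

constexpr std::size_t fusion_kernel_args_limit = 128;  // assumed value

// Every tensor feeding the concat's ListConstruct becomes a kernel argument,
// plus the concat output itself, so the concat-side guard can refuse the
// fusion up front when that total would exceed the limit.
bool concatWithinArgLimit(std::size_t list_construct_inputs, std::size_t concat_outputs) {
  return list_construct_inputs + concat_outputs <= fusion_kernel_args_limit;
}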

return false;

auto tensors_node = node->namedInput(attr::tensors)->node();
if( (tensors_node->inputs().size() + node->outputs().size()) >
Contributor

FWIW there's always only a single output in here, so we don't have to account for that

Contributor Author

Since an output could still be an argument to the fused kernel generated later, I didn't change this part in the latest commit.

Contributor

BTW we should assert that this never happens, because now we have guards for it. If it does, then someone messed up somewhere.

Contributor Author

Would you suggest a specific place and assert to add?

Contributor

Right here. This is the condition that should never happen in this very place


Place a limit check in canFuseWithConcat() to restore the original code sequence around tryFuse() call in fuseConcats().
Contributor

@facebook-github-bot facebook-github-bot left a comment

@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@apaszke apaszke left a comment

Please just change the condition inside the fuser (not the fuser pass) to be a hard assertion.

@royju
Contributor Author

royju commented Apr 8, 2019

@pytorchbot retest this please.

// Have checked the limit at graph_fuser. Assert nothing else changing that.
AT_ASSERT((flat_inputs.size() + flat_outputs.size()) <=
          fusion_kernel_args_limit);

Collaborator

Now that this has come full circle and compileKernel can no longer return nullopt, can you please change the return type of this function and remove the nullopt handling logic in the executor?

Contributor Author

I was thinking of keeping the more flexible interface that can handle the nullopt case, but I can remove it if so desired.

Contributor Author

The limit check in compileKernel() has been turned into an assert, and the possible nullopt return value from compileKernel() has also been removed in the latest commits.
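A hedged sketch of that change, with placeholder types standing in for the real fuser classes; the signature below is illustrative, not the actual compileKernel API.

#include <cassert>
#include <cstddef>
#include <vector>

struct FusedKernel {};  // placeholder for the real compiled-kernel type

constexpr std::size_t fusion_kernel_args_limit = 128;  // assumed value

// Before: a nullopt return signalled "too many arguments" and the executor
// had to handle it. After: graph_fuser already enforces the limit, so the
// compiler asserts the invariant and always returns a kernel.
FusedKernel compileKernel(const std::vector<float*>& flat_inputs,
                          const std::vector<float*>& flat_outputs) {
  assert(flat_inputs.size() + flat_outputs.size() <= fusion_kernel_args_limit);
  return FusedKernel{};
}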

@soumith
Member

soumith commented Apr 9, 2019

@pytorchbot rebase this please

Contributor

@facebook-github-bot facebook-github-bot left a comment

@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@soumith merged this pull request in a9a29dd.

zhangguanheng66 pushed a commit to zhangguanheng66/pytorch that referenced this pull request May 6, 2019
…ytorch#18063)

Summary:
Bug fix for pytorch#15043, where a large fusion in JIT with a large number of kernel arguments, which exceeds the limit allowed by nvrtc on a cuda device.
  The fix is to check the number of arguments before a cuda kernel is generated. If the number exceeds the limit, take the runFallBack() path.
  Add a reduced test from the original issue to keep the test time low. The test would fail without this fix.
Pull Request resolved: pytorch#18063

Differential Revision: D14691401

Pulled By: soumith

fbshipit-source-id: b98829bc89ed7724e91eda82ae3a5a1151af721a