Improve embedding_bag add kernel by jamesr66a · Pull Request #19329 · pytorch/pytorch · GitHub

Conversation

@jamesr66a (Collaborator) commented Apr 17, 2019

This PR makes embedding_bag call into the high-performance EmbeddingLookup function from Caffe2 (C2) for add mode. That kernel is highly optimized and dispatches dynamically based on cpuid. The PR also elides creating the offset2bag tensor on the fast path; that operation was contributing a significant portion of the operator's runtime.
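For context, here is a minimal scalar sketch of the kind of kernel the fast path delegates to. The parameter shape loosely mirrors caffe2/perfkernels/embedding_lookup.h (embedding dim, bag count, index count, table, indices, per-bag lengths, optional per-index weights), but the function name and scalar body are illustrative stand-ins, not the optimized perfkernels code:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Scalar stand-in mirroring the shape of caffe2::EmbeddingLookup as discussed
// in this thread (block_size = embedding dim, one length per bag, optional
// per-index weights). The real perfkernels version is generated per-ISA and
// selected at runtime via cpuid; this loop is only a reference.
void embedding_lookup_ref(
    int64_t block_size,      // embedding dimension
    int64_t output_size,     // number of bags
    int64_t index_size,      // total number of indices
    const float* input,      // [data_size, block_size] embedding table
    const int64_t* indices,  // [index_size]
    const int* lengths,      // [output_size], indices per bag
    const float* weights,    // optional per-index scale, may be nullptr
    float* out) {            // [output_size, block_size]
  int64_t pos = 0;
  for (int64_t bag = 0; bag < output_size; ++bag) {
    float* out_row = out + bag * block_size;
    for (int64_t d = 0; d < block_size; ++d) out_row[d] = 0.f;
    for (int i = 0; i < lengths[bag] && pos < index_size; ++i, ++pos) {
      const float* in_row = input + indices[pos] * block_size;
      const float w = weights ? weights[pos] : 1.f;
      for (int64_t d = 0; d < block_size; ++d) out_row[d] += w * in_row[d];
    }
  }
}

int main() {
  // 4 embeddings of dim 3, two bags: {0, 2} and {1, 3, 3}.
  std::vector<float> table = {0, 0, 0,  1, 1, 1,  2, 2, 2,  3, 3, 3};
  std::vector<int64_t> indices = {0, 2, 1, 3, 3};
  std::vector<int> lengths = {2, 3};
  std::vector<float> out(2 * 3);
  embedding_lookup_ref(3, 2, 5, table.data(), indices.data(), lengths.data(),
                       nullptr, out.data());
  std::printf("bag0: %.0f, bag1: %.0f\n", out[0], out[3]);  // 2 and 7
  return 0;
}
```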

Benchmark script (with a hack to make jit not DCE/CSE): https://gist.github.com/jamesr66a/73baed47400dcf2221bad996c4b57782

=== Baseline (master) ===

time_per_iter 8.726580142974853e-05
GB/s 4.588097438402839

=== Test (this PR) ===

time_per_iter 2.5180697441101074e-05
GB/s 15.900433295643158

@facebook-github-bot (Contributor) left a comment

@jamesr66a has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jamesr66a jamesr66a changed the title Don't use AXPY for embedding_bag add [WIP] Improve embedding_bag add kernel Apr 17, 2019
@jamesr66a jamesr66a force-pushed the no_axpy branch 4 times, most recently from 51acc47 to f9558a1 Compare April 17, 2019 22:25
@facebook-github-bot (Contributor) left a comment

@jamesr66a has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jamesr66a jamesr66a force-pushed the no_axpy branch 2 times, most recently from 02bf05d to 0659677 Compare April 18, 2019 03:50
@facebook-github-bot (Contributor) left a comment

@jamesr66a has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Collaborator

wait, you don't use lengths here at all, do you?

Collaborator Author

This is a WIP. I'm trying to get the FBCode build disaster fixed before I start fixing the kernel implementation.

Collaborator

Maybe add a TODO to just change the underlying kernel to take offsets, not lengths?
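For the next reader, a small self-contained sketch of the offsets-to-lengths conversion this implies (the helper name and code are illustrative, not from this PR); it also covers the empty-offsets case raised later in the review:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper (not code from this PR): embedding_bag receives bag
// start offsets, while the perfkernels lookup consumes per-bag lengths, so
// the fast path has to convert between the two representations.
std::vector<int> offsets_to_lengths(const std::vector<int64_t>& offsets,
                                    int64_t index_size) {
  std::vector<int> lengths(offsets.size());
  for (size_t i = 0; i + 1 < offsets.size(); ++i) {
    lengths[i] = static_cast<int>(offsets[i + 1] - offsets[i]);
  }
  if (!offsets.empty()) {  // empty batch: no bags, nothing to fill in
    lengths.back() = static_cast<int>(index_size - offsets.back());
  }
  return lengths;
}

int main() {
  auto lengths = offsets_to_lengths({0, 2, 5}, 7);
  std::printf("%d %d %d\n", lengths[0], lengths[1], lengths[2]);  // 2 3 2
  return 0;
}
```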

Collaborator

EmbeddingLookup supports a weighted version too (see the weights argument), so let's call it here.

@jamesr66a jamesr66a changed the title [WIP] Improve embedding_bag add kernel Improve embedding_bag add kernel Apr 18, 2019
Collaborator

It assumes that offsets is non-empty. That's probably usually the case, but it would be better to handle an empty batch too (it seems we'd need to fix functional.py as well).

Collaborator

You can probably save even more compute by skipping the make_offset2bag vector in the case we specialize here. Otherwise we compute data derived from offsets twice.

At the very least, add a TODO here so that the next person knows about it.

Collaborator Author

I think I'm going to do that. make_offset2bag takes a ton of runtime in my tests, and if we can get rid of it we can probably squeeze more perf out of this. Thanks for the tip.
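For context on what is being elided, a rough illustrative sketch (not the ATen implementation) of an offset2bag mapping and why building it costs an extra pass:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative only (not the ATen implementation): offset2bag maps each index
// position to the bag it belongs to, e.g. offsets = [0, 2, 5] over 7 indices
// gives [0, 0, 1, 1, 1, 2, 2]. Building it is an extra O(num_indices) pass
// plus memory traffic, which the perfkernels fast path avoids by walking bags
// via per-bag lengths directly.
std::vector<int64_t> make_offset2bag_sketch(const std::vector<int64_t>& offsets,
                                            int64_t num_indices) {
  std::vector<int64_t> offset2bag(num_indices, 0);
  for (size_t bag = 0; bag < offsets.size(); ++bag) {
    int64_t end = (bag + 1 < offsets.size()) ? offsets[bag + 1] : num_indices;
    for (int64_t i = offsets[bag]; i < end; ++i) {
      offset2bag[i] = static_cast<int64_t>(bag);
    }
  }
  return offset2bag;
}

int main() {
  for (int64_t b : make_offset2bag_sketch({0, 2, 5}, 7)) {
    std::printf("%lld ", static_cast<long long>(b));  // 0 0 1 1 1 2 2
  }
  std::printf("\n");
  return 0;
}
```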

Summary:
This was actually getting pretty poor throughput with respect to memory bandwidth. I used this test to measure the memory bandwidth specifically for the AXPY call: https://gist.github.com/jamesr66a/b27ff9ecbe036eed5ec310c0a3cc53c5

And I got ~8 GB/s before this change, but ~14 GB/s after this change.

This seems to speed up the operator overall by around 1.3x (benchmark: https://gist.github.com/jamesr66a/c533817c334d0be432720ef5e54a4166):

== Before ==

time_per_iter 0.0001298875093460083
GB/s 3.082544287868467

== After ==

time_per_iter 0.00010104801654815674
GB/s 3.9623142905451076

The large difference between the local BW increase and the full-op BW increase likely indicates significant time is being spent elsewhere in the op, so I will investigate that.

EDIT: I updated this PR to include a call into caffe2/perfkernels. This is the progression:

== Before ==
time_per_iter 8.983819484710693e-05
GB/s 4.456723564864611

== After (no AXPY) ==
time_per_iter 7.19951868057251e-05
GB/s 5.56126065872172

== After (perfkernels) ==
time_per_iter 5.6699180603027346e-05
GB/s 7.061548257694262

== After (perfkernels, no grad) ==
time_per_iter 4.388842582702637e-05
GB/s 9.122769670026413
Pull Request resolved: pytorch#19329

Differential Revision: D14969630

fbshipit-source-id: 359b48eac218463d4ff13bdf22d31c70bf35281d
self._test_EmbeddingBag(False, 'sum', True)
self._test_EmbeddingBag(False, 'mean', True)
for dtype in [torch.double, torch.float]:
# TODO: figure out why backward on float breaks
Contributor

I don't think it's "broken" on backward; it's just that the precision on the test needs to be adjusted because the gradients are so large. If this hypothesis is true, then test_embedding_bag should fail for mode=sum but not fail for mode={max,mean}.

Collaborator Author

@zou3519 IME it fails for all of sum, max, and mean.

@zou3519 (Contributor) commented Apr 19, 2019

What is the magnitude of the gradient for "max" and "mean"? If it is on the order of 1, then there are definitely precision issues.
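For what it's worth, a tiny self-contained illustration of the precision hypothesis (generic float32 behavior, not taken from the test itself):

```cpp
#include <cstdio>

// Generic float32 behavior: once accumulated values get large, float32
// (~7 significant decimal digits) can no longer represent small increments,
// so absolute tolerances tuned for small gradients start failing.
int main() {
  float big = 16777216.0f;    // 2^24
  float bumped = big + 1.0f;  // rounds back to 2^24 in float32
  std::printf("big + 1 == big ? %s\n", bumped == big ? "yes" : "no");
  return 0;
}
```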

    const Tensor& offsets) {
  int64_t ddim = src.size(1);
  auto* scale_data = scale.data<float>();
  auto select_indices_data = select_indices.data<int64_t>();
Contributor

Do we actually assert that these data types match somewhere?

Collaborator Author

@cpuhrsch I think this does it: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/EmbeddingBag.cpp#L214

Also there's a PR out to fix that location: #19432


#include <TH/THBlasUtils.h>

#include <caffe2/perfkernels/embedding_lookup.h>
Contributor

How does the dispatch work? Keep in mind that this file isn't compiled with AVX.

Collaborator

It's a similar mechanism to aten/DISPATCH; it does the right thing with different compiler options per variant and runtime cpuid dispatch between them: https://github.com/pytorch/pytorch/blob/master/caffe2/perfkernels/common.h

We should bring them together, but that's for another day.
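For readers unfamiliar with that file, a minimal sketch of the general pattern (illustrative names and a trivial kernel, not the actual macros in caffe2/perfkernels/common.h): each ISA variant is compiled in its own translation unit with matching flags, and a thin wrapper picks one at runtime from CPU features:

```cpp
#include <cstdio>

// Scalar fallback, built without any AVX flags.
static void scale_base(const float* x, float* y, int n, float a) {
  for (int i = 0; i < n; ++i) y[i] = a * x[i];
}

// In the real layout this would live in a separate file built with
// -mavx2 -mfma; it reuses the scalar body here so the sketch stays
// self-contained and compilable anywhere.
static void scale_avx2(const float* x, float* y, int n, float a) {
  scale_base(x, y, n, a);
}

// Thin wrapper: choose the best available variant at runtime.
static void scale_dispatch(const float* x, float* y, int n, float a) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  if (__builtin_cpu_supports("avx2")) return scale_avx2(x, y, n, a);
#endif
  scale_base(x, y, n, a);
}

int main() {
  float x[4] = {1, 2, 3, 4}, y[4];
  scale_dispatch(x, y, 4, 2.0f);
  std::printf("%.1f %.1f %.1f %.1f\n", y[0], y[1], y[2], y[3]);
  return 0;
}
```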

@soumith (Member) commented Apr 19, 2019

This also calls caffe2::EmbeddingLookup for the GPU. Is the perf of the Caffe2 kernel better than or equal to ours on GPU? (I remember we optimized the GPU kernel quite a bit on the ATen side.)

@cpuhrsch (Contributor)
@jamesr66a, is it worthwhile to try a few more input shapes and thread counts for the benchmark?

@jamesr66a (Collaborator, Author)
@soumith does this actually get called on the GPU? I thought the only call sites for index_select_add and index_select_scale_add were in _embedding_bag_cpu.

@soumith (Member) commented Apr 19, 2019

@jamesr66a great, it was my bad not to check that.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Apr 20, 2019
@jamesr66a merged this pull request in d17c22d.

zhangguanheng66 pushed a commit to zhangguanheng66/pytorch that referenced this pull request May 6, 2019