Migrate renorm to ATen (CPU and CUDA) #59108
Conversation
💊 CI failures summary and remediations — As of commit 3f2843a (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚 This comment was automatically generated by Dr. CI.
Great work, @peterbell10!
on @ezyang's behalf, thank you for making it structured!
For fp16, the norm has a good chance of overflowing. In the previous implementation the result was kept in AccT; perhaps it makes sense to do that here too. Annoyingly, I think our fused implementation of half norm producing float outputs is broken, so this would have to cast the inputs to float first, but it's safer.
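For illustration, a minimal sketch of the overflow concern (assumed values, not the PR's kernel):

```python
import torch

# Squaring a moderately large fp16 value already overflows, so accumulating
# the norm in fp16 would too; accumulating in fp32 (AccT) avoids it.
x = torch.full((8,), 300.0, dtype=torch.float16)
print((x * x).sum())                        # inf -- the fp16 intermediate overflows
print(torch.linalg.vector_norm(x.float()))  # ~848.5 -- computed in fp32 instead
```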
Done. cuda-half is now around 5 us slower for small sizes, but at the larger size it's still actually faster than the THC kernel.
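For reference, a rough sketch of how such microsecond timings might be gathered with `torch.utils.benchmark`; not necessarily the script behind the numbers quoted here:

```python
import torch
from torch.utils.benchmark import Timer

# Requires a CUDA device; shape and dtype chosen to mirror the cuda-half case.
x = torch.randn(50, 50, 50, device="cuda", dtype=torch.half)
t = Timer("torch.renorm(x, 2, 0, 1.0)", globals={"x": x, "torch": torch})
print(t.blocked_autorange())  # reports median time per call
```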
is this include needed?
Yes, it was previously included indirectly from THCTensorMathReduce.cuh, but I've removed the include from that header.
iter.common_dtype cannot be kHalf, right? It would be kFloat since the norm is float.
Thanks!
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@peterbell10 I fixed vector_norm to not do explicit casting in #59134, so after it's merged hopefully perf will be back.
The Windows build error is real.
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@peterbell10 internal builds are failing with:
I don't know much about how the mobile builds work. Is it possible that
I've tried adding it to tools/build_variables.bzl, let's see if it fixes the failures.
This pull request has been reverted by afdfd22.
Summary: Resubmit of #59108, closes #24754, closes #24616

This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns the norm into a normalization factor, then multiply the original tensor using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.

#### Benchmarks (CPU):

| Shape        | Dim | Before  | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

#### Benchmarks (CUDA)

| Shape        | Dim | Before  | After   |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |

Pull Request resolved: #59250
Reviewed By: mruberry
Differential Revision: D28820359
Pulled By: ngimel
fbshipit-source-id: 572486adabac8135d52a9b8700f9d145c2a4ed45
Summary: Closes pytorch#24754, closes pytorch#24616, closes pytorch#50874

This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns the norm into a normalization factor, then multiply the original tensor using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.

#### Benchmarks (CPU):

| Shape        | Dim | Before  | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

#### Benchmarks (CUDA)

| Shape        | Dim | Before  | After   |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |

Pull Request resolved: pytorch#59108
Reviewed By: mrshenli
Differential Revision: D28767060
Pulled By: ngimel
fbshipit-source-id: 93dcbe5483f71cc6a6444fbd5b1aa1f29975d857
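For illustration, here is a rough Python sketch of the approach the summary describes: compute the per-sub-tensor norms with `linalg_vector_norm`, turn each norm into a normalization factor, then apply a single broadcasted multiply. The `eps` guard below is an assumption for numerical safety, not necessarily the exact value the kernel uses.

```python
import torch

def renorm_sketch(x, p, dim, maxnorm, eps=1e-7):
    # Reduce over every dim except `dim`, keeping dims for broadcasting.
    reduce_dims = [d for d in range(x.dim()) if d != dim]
    norms = torch.linalg.vector_norm(x, ord=p, dim=reduce_dims, keepdim=True)
    # Shrink sub-tensors whose norm exceeds maxnorm; leave the rest untouched.
    factor = torch.where(norms > maxnorm, maxnorm / (norms + eps),
                         torch.ones_like(norms))
    # One broadcasted multiply applies the normalization factors.
    return x * factor

x = torch.randn(50, 50, 50)
print(torch.allclose(renorm_sketch(x, 2, 0, 1.0), torch.renorm(x, 2, 0, 1.0)))
```

Expressing the scaling as a single broadcasted `mul` is presumably what lets the existing elementwise machinery handle vectorization and parallelism, which is where the speedup comes from.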
Closes #24754, closes #24616, closes #50874
This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns the norm into a normalization factor, then multiply the original tensor using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.

#### Benchmarks (CPU):

| Shape        | Dim | Before  | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

#### Benchmarks (CUDA)

| Shape        | Dim | Before  | After   |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |
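For context, a small usage example of the operator being migrated (illustrative values):

```python
import torch

# Cap the 2-norm of each sub-tensor along dim 0 (each row here) at maxnorm = 1.0.
x = torch.randn(5, 10) * 3
y = torch.renorm(x, p=2, dim=0, maxnorm=1.0)
print(torch.linalg.vector_norm(y, ord=2, dim=1))  # every entry is <= 1.0
```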