Migrate renorm to ATen (CPU and CUDA) by peterbell10 · Pull Request #59108 · pytorch/pytorch · GitHub

Conversation

@peterbell10 (Collaborator)

Closes #24754, closes #24616, closes #50874

This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns the norm into a normalization factor, then multiply the original tensor by it using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.
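In Python terms, the approach is roughly the following sketch (`renorm_sketch` is illustrative, not the actual kernel; the `1e-7` guard is an assumption about how the kernel avoids division by zero):

```python
import torch

def renorm_sketch(t, p, dim, maxnorm):
    # Norm of each sub-tensor along `dim`: reduce over every other dimension
    reduce_dims = [d for d in range(t.dim()) if d != dim]
    norms = torch.linalg.vector_norm(t, ord=p, dim=reduce_dims, keepdim=True)
    # Turn the norm into a normalization factor: shrink sub-tensors whose
    # norm exceeds maxnorm, leave the rest untouched (the 1e-7 guard is assumed)
    factor = torch.where(norms > maxnorm, maxnorm / (norms + 1e-7),
                         torch.ones_like(norms))
    return t * factor  # normal broadcasted mul

t = torch.randn(50, 50, 50)
assert torch.allclose(renorm_sketch(t, 2, 0, 1.0), torch.renorm(t, 2, 0, 1.0))
```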

Benchmarks (CPU):

|     Shape    | Dim |  Before | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

Benchmarks (CUDA):

|     Shape    | Dim |  Before |   After |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |

@facebook-github-bot (Contributor) commented May 27, 2021

💊 CI failures summary and remediations

As of commit 3f2843a (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI.

@ngimel ngimel left a comment

Great work, @peterbell10!

Collaborator

on @ezyang's behalf, thank you for making it structured!

Collaborator

For fp16, the norm has a good chance of overflowing. In the previous implementation the result was kept in AccT; perhaps it makes sense to do that here too. Annoyingly, I think our fused implementation of half norm producing float outputs is broken, so this would have to cast the inputs to float first, but it's safer.
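A small illustrative example of the overflow (not from the PR):

```python
import torch

x = torch.full((1000,), 256.0, dtype=torch.float16)
# 256**2 = 65536 exceeds the fp16 max (65504), so squaring already overflows
naive = (x * x).sum().sqrt()                 # inf
# Accumulating in float32 (the AccT) keeps the intermediate finite
safe = (x.float() * x.float()).sum().sqrt()  # ~8095.4
print(naive.item(), safe.item())
```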

Collaborator Author

Done. cuda-half is now around 5 us slower for small sizes, but at the larger size it's still faster than the THC kernel.

Collaborator

is this include needed?

Collaborator Author

Yes, it was previously included indirectly from THCTensorMathReduce.cuh but I've removed the include from that header.

Collaborator

`iter.common_dtype` cannot be `kHalf`, right? It would be `kFloat` since the norm is float.
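For reference, type promotion gives a float common dtype whenever a half tensor meets a float norm tensor:

```python
import torch

half_input = torch.ones(3, dtype=torch.float16)
float_norm = torch.ones(3, dtype=torch.float32)
print(torch.result_type(half_input, float_norm))  # torch.float32
```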

@ngimel ngimel left a comment

Thanks!

@facebook-github-bot (Contributor)

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel (Collaborator) commented May 28, 2021

@peterbell10 I fixed vector_norm to not do explicit casting in #59134, so after it's merged hopefully perf will be back.

@ngimel (Collaborator) commented May 28, 2021

Windows build error is real.

@facebook-github-bot (Contributor)

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel (Collaborator) commented May 30, 2021

@peterbell10 internal builds are failing with:

```
stderr: ld.lld: error: undefined symbol: at::native::DispatchStub<void (*)(at::TensorIteratorBase&, double), at::native::renorm_scale_factor_stub>::DEFAULT
>>> referenced by SmallVector.h:0 (buck-out/gen/fe3a39b8/xplat/caffe2/c10/c10Android#header-mode-symlink-tree-with-header-map,headers/c10/util/SmallVector.h:0)
>>>               buck-out/gen/fe3a39b8/fbandroid/apps/fb4a/fastparse/flavors/fbandroid_debug_pt_code_gen_libAndroid#android-armv7,compile-pic-Normalization.cpp.oc5b093a4/aten/src/ATen/native/Normalization.cpp.o:(at::native::structured_renorm_out::impl(at::Tensor const&, c10::Scalar const&, long long, c10::Scalar const&, at::Tensor const&))
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
```

@peterbell10 (Collaborator Author)

I don't know much about how the mobile builds work. Is it possible that RenormKernel.cpp just needs to be added to the build definition?

@ngimel (Collaborator) commented Jun 1, 2021

I've tried adding it to tools/build_variables.bzl, let's see if it fixes the failures.
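For context, a hypothetical sketch of that kind of change; the list name and the kernel's exact path are assumptions, not the actual diff:

```python
# tools/build_variables.bzl -- hypothetical sketch, not the actual diff.
# Both the list name and the path below are assumptions for illustration.
aten_native_source_list = [
    # ... existing ATen native sources ...
    "aten/src/ATen/native/cpu/RenormKernel.cpp",
]
```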

@facebook-github-bot (Contributor)

@ngimel merged this pull request in 74ec508.

@facebook-github-bot (Contributor)

This pull request has been reverted by afdfd22.

facebook-github-bot pushed a commit that referenced this pull request Jun 3, 2021
Summary:
Resubmit of #59108, closes #24754, closes #24616

This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns  the norm into a normalization factor, then multiply the original tensor using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.

#### Benchmarks (CPU):
|     Shape    | Dim |  Before | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

#### Benchmarks (CUDA)
|     Shape    | Dim |  Before |   After |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |

Pull Request resolved: #59250

Reviewed By: mruberry

Differential Revision: D28820359

Pulled By: ngimel

fbshipit-source-id: 572486adabac8135d52a9b8700f9d145c2a4ed45
deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
Closes pytorch#24754, closes pytorch#24616, closes pytorch#50874

(Body and benchmark tables identical to the PR description above.)

Pull Request resolved: pytorch#59108

Reviewed By: mrshenli

Differential Revision: D28767060

Pulled By: ngimel

fbshipit-source-id: 93dcbe5483f71cc6a6444fbd5b1aa1f29975d857
deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
Resubmit of pytorch#59108, closes pytorch#24754, closes pytorch#24616

(Body and benchmark tables identical to the resubmit commit above.)

Pull Request resolved: pytorch#59250

Reviewed By: mruberry

Differential Revision: D28820359

Pulled By: ngimel

fbshipit-source-id: 572486adabac8135d52a9b8700f9d145c2a4ed45


Development

Successfully merging this pull request may close these issues.

Migrate renorm and renorm_ from the TH to Aten (CPU)
Migrate renorm and renorm_ from the TH to Aten (CUDA)
