Migrate renorm to ATen (CPU and CUDA) #59108
Conversation
💊 CI failures summary and remediations — As of commit 3f2843a (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚 This comment was automatically generated by Dr. CI.
Great work, @peterbell10!
on @ezyang's behalf, thank you for making it structured!
For fp16, the norm has a good chance of overflowing. In the previous implementation the result was kept in AccT; perhaps it makes sense to do that here too. Annoyingly, I think our fused implementation of half norm producing float outputs is broken, so this would have to cast the inputs to float first, but it's safer.
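For illustration, a minimal sketch of the overflow concern (assumed values, not the PR's kernel):

```python
import torch

# Squaring a moderately large fp16 value already overflows, so accumulating
# the norm in fp16 would too; accumulating in fp32 (AccT) avoids it.
x = torch.full((8,), 300.0, dtype=torch.float16)
print((x * x).sum())                        # inf -- the fp16 intermediate overflows
print(torch.linalg.vector_norm(x.float()))  # ~848.5 -- computed in fp32 instead
```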
Done. cuda-half is now around 5 us slower for small sizes, but at the larger size it's still actually faster than the THC kernel.
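For reference, a rough sketch of how such microsecond timings might be gathered with `torch.utils.benchmark`; not necessarily the script behind the numbers quoted here:

```python
import torch
from torch.utils.benchmark import Timer

# Requires a CUDA device; shape and dtype chosen to mirror the cuda-half case.
x = torch.randn(50, 50, 50, device="cuda", dtype=torch.half)
t = Timer("torch.renorm(x, 2, 0, 1.0)", globals={"x": x, "torch": torch})
print(t.blocked_autorange())  # reports median time per call
```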
is this include needed?
Yes, it was previously included indirectly from THCTensorMathReduce.cuh, but I've removed the include from that header.
iter.common_dtype cannot be kHalf, right? It would be kFloat since the norm is float.
Thanks!
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@peterbell10 I fixed vector_norm to not do explicit casting in #59134, so after it's merged hopefully perf will be back.
The Windows build error is real.
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@peterbell10 internal builds are failing with:
I don't know much about how the mobile builds work. Is it possible that
I've tried adding it to tools/build_variables.bzl, let's see if it fixes the failures.
This pull request has been reverted by afdfd22.
Summary: Resubmit of #59108, closes #24754, closes #24616

This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns the norm into a normalization factor, then multiply the original tensor using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.

#### Benchmarks (CPU):

| Shape        | Dim | Before  | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

#### Benchmarks (CUDA)

| Shape        | Dim | Before  | After   |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |

Pull Request resolved: #59250
Reviewed By: mruberry
Differential Revision: D28820359
Pulled By: ngimel
fbshipit-source-id: 572486adabac8135d52a9b8700f9d145c2a4ed45
Summary: Closes pytorch#24754, closes pytorch#24616, closes pytorch#50874

This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns the norm into a normalization factor, then multiply the original tensor using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.

#### Benchmarks (CPU):

| Shape        | Dim | Before  | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

#### Benchmarks (CUDA)

| Shape        | Dim | Before  | After   |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |

Pull Request resolved: pytorch#59108
Reviewed By: mrshenli
Differential Revision: D28767060
Pulled By: ngimel
fbshipit-source-id: 93dcbe5483f71cc6a6444fbd5b1aa1f29975d857
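For illustration, here is a rough Python sketch of the approach the summary describes: compute the per-sub-tensor norms with `linalg_vector_norm`, turn each norm into a normalization factor, then apply a single broadcasted multiply. The `eps` guard below is an assumption for numerical safety, not necessarily the exact value the kernel uses.

```python
import torch

def renorm_sketch(x, p, dim, maxnorm, eps=1e-7):
    # Reduce over every dim except `dim`, keeping dims for broadcasting.
    reduce_dims = [d for d in range(x.dim()) if d != dim]
    norms = torch.linalg.vector_norm(x, ord=p, dim=reduce_dims, keepdim=True)
    # Shrink sub-tensors whose norm exceeds maxnorm; leave the rest untouched.
    factor = torch.where(norms > maxnorm, maxnorm / (norms + eps),
                         torch.ones_like(norms))
    # One broadcasted multiply applies the normalization factors.
    return x * factor

x = torch.randn(50, 50, 50)
print(torch.allclose(renorm_sketch(x, 2, 0, 1.0), torch.renorm(x, 2, 0, 1.0)))
```

Expressing the scaling as a single broadcasted `mul` is presumably what lets the existing elementwise machinery handle vectorization and parallelism, which is where the speedup comes from.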
Closes #24754, closes #24616, closes #50874
This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns the norm into a normalization factor, then multiply the original tensor using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.

#### Benchmarks (CPU):

| Shape        | Dim | Before  | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

#### Benchmarks (CUDA)

| Shape        | Dim | Before  | After   |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |
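For context, a small usage example of the operator being migrated (illustrative values):

```python
import torch

# Cap the 2-norm of each sub-tensor along dim 0 (each row here) at maxnorm = 1.0.
x = torch.randn(5, 10) * 3
y = torch.renorm(x, p=2, dim=0, maxnorm=1.0)
print(torch.linalg.vector_norm(y, ord=2, dim=1))  # every entry is <= 1.0
```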