Use accscalar_t for CUDA add/sub with Tensor and Scalar
#60454
Conversation
💊 CI failures summary and remediations, as of commit 3ac4b6f (more details on the Dr. CI page and at hud.pytorch.org/pr/60454).
The test_binary_ops_with_scalars failure looks related; I'm not sure about multiMarginWithLoss. Also, a dedicated test for this (with an overflowing scalar) would be good.
The initial code assumed
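Regarding the dedicated-test suggestion above, here is a minimal sketch of such a test (test/class names and values are hypothetical, not from the PR; it assumes the post-change behavior where the CPU scalar is read as accscalar_t, so a scalar that overflows fp16 no longer poisons a result that itself fits in fp16):

```python
import unittest

import torch


class TestAddSubCpuScalarHalf(unittest.TestCase):
    """Sketch: fp16 add/sub with a CPU scalar that overflows fp16.

    Assumes the scalar is fetched as accscalar_t (float for fp16), so the
    math is done in float and only the result is cast back to fp16.
    """

    @unittest.skipUnless(torch.cuda.is_available(), "requires CUDA")
    def test_add_with_fp16_overflowing_scalar(self):
        x = torch.tensor([-60000.0], device="cuda", dtype=torch.half)
        # 100000 is not representable in fp16, but -60000 + 100000 = 40000 is.
        y = torch.add(x, 100000.0)
        self.assertFalse(torch.isinf(y).any().item())
        self.assertEqual(y.item(), 40000.0)

    @unittest.skipUnless(torch.cuda.is_available(), "requires CUDA")
    def test_sub_with_fp16_overflowing_scalar(self):
        x = torch.tensor([60000.0], device="cuda", dtype=torch.half)
        # 60000 - 100000 = -40000, again representable in fp16.
        y = torch.sub(x, 100000.0)
        self.assertEqual(y.item(), -40000.0)


if __name__ == "__main__":
    unittest.main()
```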
Codecov Report
@@            Coverage Diff             @@
##           master   #60454      +/-   ##
==========================================
+ Coverage   76.23%   76.28%   +0.04%
==========================================
  Files        2054     2058       +4
  Lines      205033   208466    +3433
==========================================
+ Hits       156309   159027    +2718
- Misses      48724    49439     +715
if (!isIntegralType(iter.common_dtype(), /* includeBool */ true) && (iter.is_cpu_scalar(1) || iter.is_cpu_scalar(2))) {
  // if common dtype is half the scalar constant can overflow in half precision, and yet the result can
  // still be representable in the half dtype. Cast scalar to acc_type to have better accuracy.
  AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(kHalf, kBool, kBFloat16, iter.common_dtype(), "add_cuda/sub_cuda", [&]() {
iter.common_dtype cannot be integral here, right? So you don't need to dispatch to ALL types, only floating and complex and 3?
Absolutely right, thanks.
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Currently foreach `addcmul` and `addcdiv` cast the scalar to float, so the actual math is done in FP32 when the tensor dtype is Float16/BFloat16, while the regular `addcmul` and `addcdiv` do not.

### Reproducible steps to see the behavioral difference

```ipython
In [1]: import torch; torch.__version__
Out[1]: '1.9.0'

In [2]: a, b, c = torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([-1.0], device='cuda', dtype=torch.half)

In [4]: torch.addcmul(a, b, c, value=2)
Out[4]: tensor([-inf], device='cuda:0', dtype=torch.float16)

In [5]: torch._foreach_addcmul([a], [b], [c], value=2)[0]
Out[5]: tensor([-60000.], device='cuda:0', dtype=torch.float16)
```

### How does foreach cast?

Foreach `addcmul` and `addcdiv` cast the scalar to `opmath_t` (almost equivalent to acc_type) here:
https://github.com/pytorch/pytorch/blob/42c8439b6eaccf175cceaa820452583e2459a521/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu#L30
and cast inputs and results here:
https://github.com/pytorch/pytorch/blob/42c8439b6eaccf175cceaa820452583e2459a521/aten/src/ATen/native/cuda/ForeachFunctors.cuh#L133-L135

Related to #58833 #60227 #60454

cc ptrblck mcarilli ngimel

Pull Request resolved: #60715

Reviewed By: albanD

Differential Revision: D29385715

Pulled By: ngimel

fbshipit-source-id: 8bb2db19ab66fc99d686de056a6ee60f9f71d603
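For context, the gap described in the summary above can be emulated from Python: doing the pointwise math in float32 and casting the result back is essentially what the `opmath_t` promotion buys. This is an illustrative sketch only (variable names are mine, and it needs a CUDA device); the built-in ops' outputs depend on the PyTorch version, as the 1.9.0 repro shows.

```python
import torch

# The three fp16 CUDA tensors from the repro above.
a = torch.tensor([60000.0], device="cuda", dtype=torch.half)
b = torch.tensor([60000.0], device="cuda", dtype=torch.half)
c = torch.tensor([-1.0], device="cuda", dtype=torch.half)

# What the foreach path effectively does: promote to float32 (the opmath/acc
# type for fp16), compute a + value * b * c in full precision, then cast the
# result back to fp16. 60000 + 2 * 60000 * (-1) = -60000, which fits in fp16.
emulated = (a.float() + 2.0 * b.float() * c.float()).half()
print(emulated)  # tensor([-60000.], device='cuda:0', dtype=torch.float16)

# Doing the arithmetic op by op in fp16 reproduces the overflow: the
# intermediate 2 * 60000 = 120000 is materialized in fp16, overflows to inf,
# and the overflow propagates through the remaining ops.
print(a + 2.0 * b * c)  # tensor([-inf], device='cuda:0', dtype=torch.float16)
```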
Follow-up of #60227; related to #59907 & #58833.

With this pull request, `torch.add` & `torch.sub` use `acc_type` for the `Scalar` if either of the two arguments is a `Scalar`. This mimics the behavior of `torch.mul`, `torch._foreach_(add|sub).Scalar`, and `torch._foreach_(add|sub).ScalarList`.

Reference:
- `torch.mul`:
  pytorch/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu, lines 17 to 25 in b0c9762
- `torch._foreach_(add|sub).Scalar` casts the scalar here:
  pytorch/aten/src/ATen/native/cuda/ForeachBinaryOpScalar.cu, line 27 in b0c9762
- `torch._foreach_(add|sub).ScalarList`: `BinaryOpScalarListFunctor`
  (pytorch/aten/src/ATen/native/cuda/ForeachFunctors.cuh, lines 180 to 182 in b0c9762)
  takes `scalar_t` and computes in `opmath_t` (almost equivalent to `accscalar_t`;
  pytorch/aten/src/ATen/native/cuda/MultiTensorApply.cuh, lines 60 to 68 in b0c9762)
  and is used in
  pytorch/aten/src/ATen/native/cuda/ForeachBinaryOpScalarList.cu, line 24 in b0c9762
cc @ngimel @ptrblck @mcarilli
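To make the "mimics torch.mul" comparison concrete, here is a small hedged example (illustrative values of my choosing; requires a CUDA device; the `sub` line shows the behavior expected with this change, while older builds return -inf there):

```python
import torch

# torch.mul already reads a CPU scalar as accscalar_t (BinaryMulDivKernel.cu,
# referenced above): 131072 overflows fp16 (max ~65504), yet
# 0.25 * 131072 = 32768 fits, so the result stays finite.
x = torch.tensor([0.25], device="cuda", dtype=torch.half)
print(torch.mul(x, 131072.0))  # tensor([32768.], device='cuda:0', dtype=torch.float16)

# With this pull request, add/sub with a Scalar take the same route: the
# scalar 70000 overflows fp16, but 30000 - 70000 = -40000 is representable,
# so the expected result is finite instead of -inf.
y = torch.tensor([30000.0], device="cuda", dtype=torch.half)
print(torch.sub(y, 70000.0))   # expected: tensor([-40000.], device='cuda:0', dtype=torch.float16)
```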