Use accscalar_t for CUDA add/sub with Tensor and Scalar
#60454
Conversation
💊 CI failures summary and remediations, as of commit 3ac4b6f (more details on the Dr. CI page and at hud.pytorch.org/pr/60454).
The test_binary_ops_with_scalars failure looks related; I'm not sure about multiMarginWithLoss. Also, a dedicated test for this (with an overflowing scalar) would be good.
The initial code assumed
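Regarding the dedicated-test suggestion above, here is a minimal sketch of such a test (test/class names and values are hypothetical, not from the PR; it assumes the post-change behavior where the CPU scalar is read as accscalar_t, so a scalar that overflows fp16 no longer poisons a result that itself fits in fp16):

```python
import unittest

import torch


class TestAddSubCpuScalarHalf(unittest.TestCase):
    """Sketch: fp16 add/sub with a CPU scalar that overflows fp16.

    Assumes the scalar is fetched as accscalar_t (float for fp16), so the
    math is done in float and only the result is cast back to fp16.
    """

    @unittest.skipUnless(torch.cuda.is_available(), "requires CUDA")
    def test_add_with_fp16_overflowing_scalar(self):
        x = torch.tensor([-60000.0], device="cuda", dtype=torch.half)
        # 100000 is not representable in fp16, but -60000 + 100000 = 40000 is.
        y = torch.add(x, 100000.0)
        self.assertFalse(torch.isinf(y).any().item())
        self.assertEqual(y.item(), 40000.0)

    @unittest.skipUnless(torch.cuda.is_available(), "requires CUDA")
    def test_sub_with_fp16_overflowing_scalar(self):
        x = torch.tensor([60000.0], device="cuda", dtype=torch.half)
        # 60000 - 100000 = -40000, again representable in fp16.
        y = torch.sub(x, 100000.0)
        self.assertEqual(y.item(), -40000.0)


if __name__ == "__main__":
    unittest.main()
```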
Codecov Report
@@            Coverage Diff             @@
##           master   #60454      +/-   ##
==========================================
+ Coverage   76.23%   76.28%   +0.04%
==========================================
  Files        2054     2058       +4
  Lines      205033   208466    +3433
==========================================
+ Hits       156309   159027    +2718
- Misses      48724    49439     +715
if (!isIntegralType(iter.common_dtype(), /* includeBool */ true) && (iter.is_cpu_scalar(1) || iter.is_cpu_scalar(2))) {
  // if common dtype is half the scalar constant can overflow in half precision, and yet the result can
  // still be representable in the half dtype. Cast scalar to acc_type to have better accuracy.
  AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(kHalf, kBool, kBFloat16, iter.common_dtype(), "add_cuda/sub_cuda", [&]() {
iter.common_dtype cannot be integral here, right? So you don't need to dispatch to ALL types, only floating and complex and 3?
Absolutely right, thanks.
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Currently foreach `addcmul` and `addcdiv` cast the scalar to float, so the actual math is done in FP32 when the tensor dtype is Float16/BFloat16, while the regular `addcmul` and `addcdiv` do not.

### Reproducible steps to see the behavioral difference

```ipython
In [1]: import torch; torch.__version__
Out[1]: '1.9.0'

In [2]: a, b, c = torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([-1.0], device='cuda', dtype=torch.half)

In [4]: torch.addcmul(a, b, c, value=2)
Out[4]: tensor([-inf], device='cuda:0', dtype=torch.float16)

In [5]: torch._foreach_addcmul([a], [b], [c], value=2)[0]
Out[5]: tensor([-60000.], device='cuda:0', dtype=torch.float16)
```

### How does foreach cast?

Foreach `addcmul` and `addcdiv` cast the scalar to `opmath_t` (almost equivalent to acc_type) here:
https://github.com/pytorch/pytorch/blob/42c8439b6eaccf175cceaa820452583e2459a521/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu#L30
and cast inputs and results here:
https://github.com/pytorch/pytorch/blob/42c8439b6eaccf175cceaa820452583e2459a521/aten/src/ATen/native/cuda/ForeachFunctors.cuh#L133-L135

Related to #58833 #60227 #60454

cc ptrblck mcarilli ngimel

Pull Request resolved: #60715

Reviewed By: albanD

Differential Revision: D29385715

Pulled By: ngimel

fbshipit-source-id: 8bb2db19ab66fc99d686de056a6ee60f9f71d603
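For context, the gap described in the summary above can be emulated from Python: doing the pointwise math in float32 and casting the result back is essentially what the `opmath_t` promotion buys. This is an illustrative sketch only (variable names are mine, and it needs a CUDA device); the built-in ops' outputs depend on the PyTorch version, as the 1.9.0 repro shows.

```python
import torch

# The three fp16 CUDA tensors from the repro above.
a = torch.tensor([60000.0], device="cuda", dtype=torch.half)
b = torch.tensor([60000.0], device="cuda", dtype=torch.half)
c = torch.tensor([-1.0], device="cuda", dtype=torch.half)

# What the foreach path effectively does: promote to float32 (the opmath/acc
# type for fp16), compute a + value * b * c in full precision, then cast the
# result back to fp16. 60000 + 2 * 60000 * (-1) = -60000, which fits in fp16.
emulated = (a.float() + 2.0 * b.float() * c.float()).half()
print(emulated)  # tensor([-60000.], device='cuda:0', dtype=torch.float16)

# Doing the arithmetic op by op in fp16 reproduces the overflow: the
# intermediate 2 * 60000 = 120000 is materialized in fp16, overflows to inf,
# and the overflow propagates through the remaining ops.
print(a + 2.0 * b * c)  # tensor([-inf], device='cuda:0', dtype=torch.float16)
```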
Follow-up of #60227; related to #59907 & #58833.

With this pull request, `torch.add` & `torch.sub` use `acc_type` for the `Scalar` if either of the two arguments is a `Scalar`. This mimics the behavior of `torch.mul`, `torch._foreach_(add|sub).Scalar`, and `torch._foreach_(add|sub).ScalarList`.

Reference:
- `torch.mul`:
  pytorch/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu, lines 17 to 25 in b0c9762
- `torch._foreach_(add|sub).Scalar` casts the scalar here:
  pytorch/aten/src/ATen/native/cuda/ForeachBinaryOpScalar.cu, line 27 in b0c9762
- `torch._foreach_(add|sub).ScalarList`: `BinaryOpScalarListFunctor`
  (pytorch/aten/src/ATen/native/cuda/ForeachFunctors.cuh, lines 180 to 182 in b0c9762)
  takes `scalar_t` and computes in `opmath_t` (almost equivalent to `accscalar_t`;
  pytorch/aten/src/ATen/native/cuda/MultiTensorApply.cuh, lines 60 to 68 in b0c9762)
  and is used in
  pytorch/aten/src/ATen/native/cuda/ForeachBinaryOpScalarList.cu, line 24 in b0c9762
cc @ngimel @ptrblck @mcarilli
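To make the "mimics torch.mul" comparison concrete, here is a small hedged example (illustrative values of my choosing; requires a CUDA device; the `sub` line shows the behavior expected with this change, while older builds return -inf there):

```python
import torch

# torch.mul already reads a CPU scalar as accscalar_t (BinaryMulDivKernel.cu,
# referenced above): 131072 overflows fp16 (max ~65504), yet
# 0.25 * 131072 = 32768 fits, so the result stays finite.
x = torch.tensor([0.25], device="cuda", dtype=torch.half)
print(torch.mul(x, 131072.0))  # tensor([32768.], device='cuda:0', dtype=torch.float16)

# With this pull request, add/sub with a Scalar take the same route: the
# scalar 70000 overflows fp16, but 30000 - 70000 = -40000 is representable,
# so the expected result is finite instead of -inf.
y = torch.tensor([30000.0], device="cuda", dtype=torch.half)
print(torch.sub(y, 70000.0))   # expected: tensor([-40000.], device='cuda:0', dtype=torch.float16)
```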