Use `accscalar_t` for CUDA add/sub with Tensor and Scalar by crcrpar · Pull Request #60454 · pytorch/pytorch · GitHub

Conversation

@crcrpar
Collaborator

@crcrpar crcrpar commented Jun 22, 2021

Follow-up of #60227; related to #59907 & #58833.

With this pull request, torch.add & torch.sub use acc_type for the Scalar operand when either of the two arguments is a Scalar.
This mimics the behavior of torch.mul, torch._foreach_(add|sub).Scalar, and torch._foreach_(add|sub).ScalarList.
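A rough illustration of the intended effect (a sketch assuming a CUDA device; the printed values are the expected post-PR behavior, not a captured session):

```python
import torch

# A float16 CUDA tensor plus a Python scalar that overflows to inf when cast
# directly to float16, even though the true result (40000) fits in float16.
a = torch.tensor([-60000.0], device='cuda', dtype=torch.half)

# With the scalar accumulated in acc_type (float32), the result stays finite.
print(torch.add(a, 100000.0))   # expected: tensor([40000.], device='cuda:0', dtype=torch.float16)
print(torch.sub(a, -100000.0))  # expected: tensor([40000.], device='cuda:0', dtype=torch.float16)
```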


reference

cc @ngimel @ptrblck @mcarilli

@facebook-github-bot
Contributor

facebook-github-bot commented Jun 22, 2021

💊 CI failures summary and remediations

As of commit 3ac4b6f (more details on the Dr. CI page and at hud.pytorch.org/pr/60454):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)


@ngimel
Collaborator

ngimel commented Jun 22, 2021

The test_binary_ops_with_scalars failure looks related; I'm not sure about the multiMarginWithLoss one. Also, a dedicated test for this (with an overflowing scalar) would be good.
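A minimal sketch of such a test (hypothetical name and structure, not the test that was actually added; assumes a CUDA device):

```python
import torch

def test_addsub_cpu_scalar_overflow():
    # Hypothetical test: 66000 overflows to inf when cast straight to float16,
    # but the exact result (6000) is representable in float16.
    a = torch.tensor([-60000.0], device='cuda', dtype=torch.half)
    expected = torch.tensor([6000.0], device='cuda', dtype=torch.half)
    assert torch.equal(torch.add(a, 66000.0), expected)
    assert torch.equal(torch.sub(a, -66000.0), expected)

test_addsub_cpu_scalar_overflow()
```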

@mrshenli mrshenli requested a review from ngimel June 23, 2021 02:30
@mrshenli mrshenli added module: cuda Related to torch.cuda, and CUDA support in general triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Jun 23, 2021
@crcrpar crcrpar mentioned this pull request Jun 23, 2021
28 tasks
@crcrpar
Collaborator Author

crcrpar commented Jun 23, 2021

The initial code assumed `a - b == b - a`, so the failures were caused by my mistake. I confirmed locally that the failures are fixed.
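For context, a tiny standalone sketch of why the operand order matters once a CPU scalar can sit on either side of the subtraction (hypothetical helper, not the kernel code; torch.sub(a, b, alpha=alpha) computes a - alpha * b):

```cpp
#include <iostream>

// Hypothetical helper illustrating the two sub paths; `accscalar_t` here is
// just a template parameter name, not PyTorch's acc_type machinery.
template <typename accscalar_t>
accscalar_t sub_with_cpu_scalar(accscalar_t tensor_elem, accscalar_t scalar,
                                bool scalar_is_first_operand, accscalar_t alpha) {
  return scalar_is_first_operand
      ? scalar - alpha * tensor_elem   // torch.sub(scalar, tensor, alpha)
      : tensor_elem - alpha * scalar;  // torch.sub(tensor, scalar, alpha)
}

int main() {
  std::cout << sub_with_cpu_scalar(3.0f, 1.0f, false, 1.0f) << " "   // 2
            << sub_with_cpu_scalar(3.0f, 1.0f, true, 1.0f) << "\n";  // -2
}
```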

@codecov

codecov bot commented Jun 23, 2021

Codecov Report

Merging #60454 (3ac4b6f) into master (15dc320) will increase coverage by 0.04%.
The diff coverage is n/a.

```diff
@@            Coverage Diff             @@
##           master   #60454      +/-   ##
==========================================
+ Coverage   76.23%   76.28%   +0.04%     
==========================================
  Files        2054     2058       +4     
  Lines      205033   208466    +3433     
==========================================
+ Hits       156309   159027    +2718     
- Misses      48724    49439     +715     
```

```cpp
if (!isIntegralType(iter.common_dtype(), /* includeBool */ true) && (iter.is_cpu_scalar(1) || iter.is_cpu_scalar(2))) {
  // if common dtype is half the scalar constant can overflow in half precision, and yet the result can
  // still be representable in the half dtype. Cast scalar to acc_type to have better accuracy.
  AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(kHalf, kBool, kBFloat16, iter.common_dtype(), "add_cuda/sub_cuda", [&]() {
```
Collaborator


iter.common_dtype cannot be integral here, right? So you don't need to dispatch to ALL types, only floating and complex and 3?

Collaborator Author


Absolutely right, thanks.
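A sketch of what the narrowed dispatch could look like (an assumption based on the review comment, not necessarily the exact code that was merged): since the guard above already rules out integral and bool common dtypes, only floating, complex, Half, and BFloat16 types remain.

```cpp
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(
    kHalf, kBFloat16, iter.common_dtype(), "add_cuda/sub_cuda", [&]() {
      using accscalar_t = at::acc_type<scalar_t, /*is_cuda=*/true>;
      // ... read the CPU scalar as accscalar_t and launch the add/sub kernel ...
    });
```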

@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in 9e773ea.

@crcrpar crcrpar deleted the addsub-16bits-cast-scalar branch June 24, 2021 05:16
facebook-github-bot pushed a commit that referenced this pull request Jun 25, 2021
Summary:
Currently, foreach `addcmul` and `addcdiv` cast the scalar to float so that the actual math is done in FP32 when the tensor dtype is Float16/BFloat16, while regular `addcmul` and `addcdiv` do not.

### Reproducible steps to see the behavioral difference
```ipython
In [1]: import torch; torch.__version__
Out[1]: '1.9.0'

In [2]: a, b, c = torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([-1.0], device='cuda', dtype=torch.half)

In [4]: torch.addcmul(a, b, c, value=2)
Out[4]: tensor([-inf], device='cuda:0', dtype=torch.float16)

In [5]: torch._foreach_addcmul([a], [b], [c], value=2)[0]
Out[5]: tensor([-60000.], device='cuda:0', dtype=torch.float16)
```

### How does foreach cast?
Foreach `addcmul` and `addcdiv` cast the scalar to `opmath_t` (almost equivalent to acc_type) here: https://github.com/pytorch/pytorch/blob/42c8439b6eaccf175cceaa820452583e2459a521/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu#L30 and cast inputs and results here:
https://github.com/pytorch/pytorch/blob/42c8439b6eaccf175cceaa820452583e2459a521/aten/src/ATen/native/cuda/ForeachFunctors.cuh#L133-L135

Related to #58833 #60227 #60454
cc ptrblck mcarilli ngimel

Pull Request resolved: #60715

Reviewed By: albanD

Differential Revision: D29385715

Pulled By: ngimel

fbshipit-source-id: 8bb2db19ab66fc99d686de056a6ee60f9f71d603
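The casting pattern described in the commit message above, as a minimal standalone sketch (illustrative only; the real implementation lives in ForeachPointwiseOp.cu / ForeachFunctors.cuh and uses PyTorch's opmath_t and multi-tensor-apply machinery):

```cpp
#include <iostream>

// opmath_t stands in for the higher-precision math type (float for
// Half/BFloat16); scalar_t is the storage type of the tensors.
template <typename scalar_t, typename opmath_t>
scalar_t addcmul_one(scalar_t self, scalar_t t1, scalar_t t2, opmath_t value) {
  // Promote inputs to opmath_t, do the math there, cast the result back.
  return static_cast<scalar_t>(
      static_cast<opmath_t>(self) +
      value * static_cast<opmath_t>(t1) * static_cast<opmath_t>(t2));
}

int main() {
  // With float "storage" the promotion is a no-op, but it shows the shape of
  // the computation: 60000 + 2 * 60000 * (-1) = -60000.
  std::cout << addcmul_one<float, float>(60000.f, 60000.f, -1.f, 2.f) << "\n";
}
```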
asuhan pushed a commit to asuhan/pytorch that referenced this pull request Jun 28, 2021
jeffdaily pushed a commit to ROCm/pytorch that referenced this pull request Jun 29, 2021
asuhan pushed a commit that referenced this pull request Jun 30, 2021

Labels

cla signed Merged module: cuda Related to torch.cuda, and CUDA support in general open source triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module


5 participants