fix mean_out: op does not update parameter out for BF16/FP16 dtype on CPU #135174
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135174. Note: links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 1a4ee95 with merge base cf31724. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
@pytorchbot label "topic: not user facing

❌ 🤖 pytorchbot command failed:

@pytorchbot label "topic: not user facing"
Thanks, could you add a test (see test/test_reductions.py)?
3a6ee44 to 464c72c
Thanks for the review. I have added a test to check whether the `out` argument of the mean_out op is an alias of the return value.
@mruberry Please help review the code, thanks.
aten/src/ATen/native/ReduceOps.cpp
Outdated
Is it possible to no longer set `result_mut` and no longer do `auto& result_mut = const_cast<Tensor&>(result);` above?
I think so; let me test it locally.
test/test_reductions.py
Outdated
Thanks for adding the test, but maybe a small correctness check comparing against the out-of-place mean op would also be good?
Does it look good? I use allclose to compare the result with the target.
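For illustration, here is a pure-Python sketch of the kind of correctness check being discussed; `row_means` and `allclose` are stand-ins written for this example, not PyTorch APIs (torch.allclose is the real counterpart):

```python
import math

def row_means(rows):
    # Out-of-place reference computation: mean of each row.
    return [sum(r) / len(r) for r in rows]

def allclose(a, b, rel_tol=1e-5, abs_tol=1e-8):
    # Elementwise tolerance comparison, mimicking torch.allclose.
    return all(math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
               for x, y in zip(a, b))

data = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0], [3.0, 3.0, 3.0, 3.0]]
out = [0.0] * len(data)
for i, m in enumerate(row_means(data)):  # stand-in for mean(..., out=out)
    out[i] = m

print(allclose(out, row_means(data)))  # True
```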
464c72c to 2cd4f06
Thanks!
@pytorchbot merge
Merge failed. Reason: This PR needs a … label. If not, please add the … label. To add a label, you can comment to pytorchbot, for example … For more information, see … Details for Dev Infra team: raised by workflow job.
aten/src/ATen/native/ReduceOps.cpp
Outdated
```cpp
at::sum_out(result_temp, self, opt_dim, keepdim, sum_out_dtype).div_(dim_prod);
// After sum & div, cast result_temp back to BF16 or FP16, if required.
if (is_half_type) {
  result.copy_(result_temp.to(dtype));
}
```
Do we need to call `result_temp.to(dtype)`? `result.copy_(result_temp)` should work, and saves the overhead of the `result_temp.to(dtype)` conversion.
Awesome, I didn't know `copy_` does this conversion. I have changed it.
@CaoE - for accuracy reasons, wouldn't it be better if we had an intermediate FP32 sum output as the input for the division? Thanks
Just noticed that `sum_out_dtype` already ensures it.
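As a side note on why the FP32 intermediate matters, here is a pure-Python illustration (not PyTorch code) of low-precision accumulation, using struct's 'e' half-precision format to emulate FP16 rounding:

```python
import struct

def to_fp16(x):
    # Round a Python float to the nearest IEEE half-precision value
    # via struct's 'e' format (available since Python 3.6).
    return struct.unpack('e', struct.pack('e', x))[0]

# Accumulating directly in FP16: once the running sum reaches 2048,
# the FP16 spacing is 2.0, so adding 1.0 no longer changes the sum
# (2049 is halfway and rounds back to the even value 2048).
naive = 0.0
for _ in range(4096):
    naive = to_fp16(naive + 1.0)

# Accumulating in full precision (as sum_out_dtype = FP32 does) and
# rounding once at the end gives the exact answer here.
accurate = to_fp16(4096 * 1.0)

print(naive, accurate)  # 2048.0 4096.0
```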
2cd4f06 to 66c4558
aten/src/ATen/native/ReduceOps.cpp
Outdated
We'd better add other data types to the test to ensure that they also return the correct result.
```python
@onlyCPU
@dtypes(torch.half, torch.bfloat16, torch.float, torch.double)
def test_mean_out_float16_is_alias_of_return(self, dtype, device):
    a = torch.tensor([[[1.0, 1.0, 1.0, 1.0]], [[2.0, 2.0, 2.0, 2.0]], [[3.0, 3.0, 3.0, 3.0]]],
                     dtype=dtype, device=device)
    ...
```
Can we also avoid creating a new tensor when `is_half_type` is false, to avoid the allocation overhead? For example, we can just use:
```cpp
if (is_half_type) {
  auto _result_mut = result.to(sum_out_dtype);
  at::sum_out(_result_mut, self, opt_dim, keepdim, sum_out_dtype).div_(dim_prod);
  result.copy_(_result_mut);
} else {
  at::sum_out(result, self, opt_dim, keepdim, sum_out_dtype).div_(dim_prod);
}
```
Thanks for your review. I have added it.
Thanks for the fix!
Is there any way we can avoid the copy? If not, can we add a comment on why it's necessary, so that someone reading the code in the future may be able to understand it without going through the history of the file? Thanks!
It cannot be avoided, because the promotion needs more storage. Thus, the intermediate cannot reuse the storage of the input `out` parameter, and the result needs to be copied back into `out`.
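As a rough illustration of the storage argument (plain Python with struct, not ATen code):

```python
import struct

# Each BF16/FP16 element occupies 2 bytes, while the promoted FP32
# intermediate needs 4 bytes per element, so the intermediate cannot
# live in the half-precision "out" buffer in place; a copy back is needed.
n = 1024
half_bytes = n * struct.calcsize('e')  # FP16 storage for n elements
fp32_bytes = n * struct.calcsize('f')  # FP32 storage for n elements
print(half_bytes, fp32_bytes)  # 2048 4096
```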
I have added some comments about it. This is my first contribution on GitHub. Thanks for all of your excellent suggestions!
66c4558 to 5b356f4
d40cd70 to 336eccc
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few are: trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 3, 5, linux.g5.4xlarge.nvidia.gpu). Details for Dev Infra team: raised by workflow job.
❌ 🤖 pytorchbot command failed: Try …
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
…#134848 Signed-off-by: DavidGu-Datong <datong_gu@163.com>
Successfully rebased
336eccc to 1a4ee95
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #134848

For BF16/FP16, when a tensor is specified in the `out` parameter of mean, the mean kernel should use its storage for the output, but that doesn't happen: an `at::to` in the current code causes storage to be allocated again, and the `out` parameter tensor's storage doesn't get updated, resulting in it not holding the mean output.

cc @albanD
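To make the contract concrete, here is a pure-Python sketch (plain lists, not tensors) of the out-parameter aliasing semantics the fix restores; `mean_out` here is an illustrative stand-in, not the ATen function:

```python
# An out= variant must write into the caller-provided buffer and return
# that same object, never a freshly allocated one. The bug was that the
# result landed in newly allocated storage while the caller's buffer
# stayed untouched.
def mean_out(values, out):
    # Accumulate in full precision, then store the result in place.
    out[0] = sum(values) / len(values)
    return out  # alias of the caller's buffer

buf = [0.0]
ret = mean_out([1.0, 2.0, 3.0], buf)
print(ret is buf)  # True
print(buf[0])      # 2.0
```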