Increase some tolerances for tf32 for Conv3d tests #60451

Flamefire · 2021-06-22T09:46:43Z

Allow those tests to pass on A100 GPUs which support tf32

Basically a follow-up to #52871, which also increased some precisions to 0.05.

For reference, these are the failures I see (only the ones in test_nn with 1.9.0):

FAIL: test_Conv3d_pad_same_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 161 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.032408137116391345 (-33.45570601919647 vs. -33.42329788208008), which occurred at index (2, 0, 0, 1, 0).

======================================================================
FAIL: test_Conv3d_pad_same_dilated_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 111 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.024654212557543076 (35.104286017977465 vs. 35.07963180541992), which occurred at index (3, 0, 0, 0, 2).

======================================================================
FAIL: test_Conv3d_pad_valid_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 41 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.010903167642320355 (8.074376869119371 vs. 8.06347370147705), which occurred at index (0, 0, 1, 0, 0).
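To make the numbers in the three failures above easier to compare, here is a minimal sketch of the check that `assertEqual` performs when `rtol=0`: an element passes if its absolute difference is at most `atol`. The differences are copied from the messages above. The value `0.05` is an assumption taken from the description's reference to #52871; the exact tolerance chosen in this PR is not shown in the conversation.

```python
# Greatest absolute differences copied from the three failure messages above.
reported_max_diff = {
    "test_Conv3d_pad_same_cuda_tf32": 0.032408137116391345,
    "test_Conv3d_pad_same_dilated_cuda_tf32": 0.024654212557543076,
    "test_Conv3d_pad_valid_cuda_tf32": 0.010903167642320355,
}

OLD_ATOL = 0.005  # tolerance shown in the assertion messages
NEW_ATOL = 0.05   # assumed new tolerance (value mentioned for #52871)


def passes(max_diff: float, atol: float) -> bool:
    """With rtol=0 the comparison reduces to |cpu - gpu| <= atol."""
    return max_diff <= atol


for name, diff in reported_max_diff.items():
    print(
        f"{name}: diff={diff:.4f} "
        f"old atol ok={passes(diff, OLD_ATOL)} "
        f"new atol ok={passes(diff, NEW_ATOL)}"
    )
```

Under these assumptions all three tests exceed the old tolerance and fit within the larger one, with the worst case (about 0.032) leaving some headroom below 0.05.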

facebook-github-bot · 2021-06-22T09:46:49Z

💊 CI failures summary and remediations

As of commit 61687bd (more details on the Dr. CI page and at hud.pytorch.org/pr/60451):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

codecov · 2021-06-22T13:40:41Z

Codecov Report

Merging #60451 (61687bd) into master (700df82) will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #60451   +/-   ##
=======================================
  Coverage   76.23%   76.23%           
=======================================
  Files        2054     2054           
  Lines      205033   205033           
=======================================
+ Hits       156299   156307    +8     
+ Misses      48734    48726    -8

mruberry · 2021-06-24T04:26:32Z

Holy cow! An order of magnitude precision degradation? @zasdfgbnm are you seeing this, too?

facebook-github-bot · 2021-06-24T04:46:49Z

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zasdfgbnm · 2021-06-24T20:21:09Z

@mruberry I have not been looking at cuDNN tests recently, but in the past I have seen cases where the threshold went up 10x. If the original threshold was very small, then I wouldn't be surprised.
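As a rough illustration of why a very small threshold can be too tight for TF32, the sketch below emulates the reduced mantissa in plain Python. It assumes TF32 keeps 10 explicit mantissa bits of a float32 input, and it emulates this by truncating the low 13 bits; real hardware may round differently, and accumulation still happens in FP32, so this only bounds the per-input error. The helper names `truncate_to_tf32` and `relative_error` are hypothetical and not part of PyTorch.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def truncate_to_tf32(x: float) -> float:
    """Emulate TF32 input precision by dropping the low 13 mantissa bits.

    Assumption: truncation stands in for whatever rounding the hardware uses.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def relative_error(x: float) -> float:
    """Relative error of the truncated value against the float32 value."""
    x32 = to_float32(x)
    return abs(truncate_to_tf32(x32) - x32) / abs(x32)


# Sample magnitudes taken from the failure messages in this PR.
for value in (0.1, 8.06347370147705, 33.42329788208008):
    print(f"{value!r}: relative error {relative_error(value):.2e}")

print(f"truncation bound 2**-10 = {2 ** -10:.2e}")
```

A per-input relative error on the order of 1e-3, applied to outputs with magnitudes around 8 to 35, makes absolute differences of 0.01 to 0.03 plausible, which is consistent with the failures reported in the description.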

facebook-github-bot · 2021-06-24T20:38:10Z

@ngimel merged this pull request in 0ba4044.

Increase some tolerances for tf32 for Conv3d tests

61687bd

Allow those tests to pass on A100 GPUs which support tf32

facebook-github-bot added the cla signed label Jun 22, 2021

Flamefire mentioned this pull request Jun 22, 2021

TF32 threshold twiddling for tests #60209

Closed

pytorchbot added the open source label Jun 22, 2021

mrshenli requested review from mruberry and ngimel June 23, 2021 02:31

ngimel approved these changes Jun 24, 2021

facebook-github-bot closed this in 0ba4044 Jun 24, 2021

facebook-github-bot added the Merged label Jun 24, 2021

Flamefire deleted the Conv3d_tests_tf32 branch June 25, 2021 06:10
