DOC Describes behavior for None in module.register_* by thomasjpfan · Pull Request #60125 · pytorch/pytorch · GitHub

Conversation

@thomasjpfan
Contributor

Fixes #45834
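The linked issue concerns buffers and parameters registered as `None`, which this PR documents. A minimal sketch of that behavior, assuming a recent PyTorch (the module name `Net` and attribute names are illustrative, not from the PR):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Passing None reserves the name without allocating a tensor.
        self.register_buffer("running_stat", None)
        self.register_parameter("weight", None)

net = Net()
# None-valued buffers and parameters are omitted from state_dict() ...
print(sorted(net.state_dict().keys()))  # []

# ... but the attribute exists and can be assigned a tensor later,
# after which it shows up in state_dict() as usual.
net.running_stat = torch.zeros(3)
print(sorted(net.state_dict().keys()))  # ['running_stat']
```

This is why an uninitialised (`None`) buffer does not appear in `state_dict()` until it is given an actual tensor.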

@thomasjpfan added labels on Jun 16, 2021: module: docs (Related to our documentation, both in docs/ and docblocks), module: nn (Related to torch.nn), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
@facebook-github-bot
Contributor

facebook-github-bot commented Jun 16, 2021

💊 CI failures summary and remediations

As of commit 2e53166 (more details on the Dr. CI page and at hud.pytorch.org/pr/60125):



🕵️ 5 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test1 (1/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 16 21:35:43 ERROR [2.823s]: test_multiple_o...rad_is_view (__main__.DistributedDataParallelTest)
Jun 16 21:35:43     loss2.backward()
Jun 16 21:35:43   File "/opt/conda/lib/python3.6/site-packages/torch/_tensor.py", line 256, in backward
Jun 16 21:35:43     torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
Jun 16 21:35:43   File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 149, in backward
Jun 16 21:35:43     allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
Jun 16 21:35:43 RuntimeError: Trying to backward through the graph a second time (or directly access saved variables after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved variables after calling backward.
Jun 16 21:35:43 
Jun 16 21:35:43 
Jun 16 21:35:43 
Jun 16 21:35:43 ======================================================================
Jun 16 21:35:43 ERROR [2.823s]: test_multiple_outputs_multiple_backward_grad_is_view (__main__.DistributedDataParallelTest)
Jun 16 21:35:43 ----------------------------------------------------------------------
Jun 16 21:35:43 Traceback (most recent call last):
Jun 16 21:35:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 398, in wrapper
Jun 16 21:35:43     self._join_processes(fn)
Jun 16 21:35:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 590, in _join_processes
Jun 16 21:35:43     self._check_return_codes(elapsed_time)
Jun 16 21:35:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 633, in _check_return_codes
Jun 16 21:35:43     raise RuntimeError(error)
Jun 16 21:35:43 RuntimeError: Process 0 exited with error code 10 and exception:
Jun 16 21:35:43 Traceback (most recent call last):
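The RuntimeError in the log above is the standard autograd failure when `.backward()` is called twice on the same graph; it is unrelated to this PR's documentation change. A minimal repro of the behavior the error message describes, assuming any recent PyTorch:

```python
import torch

x = torch.ones(2, requires_grad=True)

# With retain_graph=True, the saved intermediate values survive the
# first backward pass, so a second pass is allowed.
y = (x * x).sum()
y.backward(retain_graph=True)
y.backward()
print(x.grad)  # gradients accumulate: tensor([4., 4.])

# Without retain_graph, the second backward raises the RuntimeError
# seen in the CI log.
z = (x * x).sum()
z.backward()
try:
    z.backward()
except RuntimeError as e:
    print("RuntimeError:", e)
```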

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test2 (2/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 16 21:43:53 AssertionError: False is not tr...lowed difference with rtol=0 and atol=0 is only 0!
Jun 16 21:43:53   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 516, in run_test
Jun 16 21:43:53     getattr(self, test_name)()
Jun 16 21:43:53   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 400, in wrapper
Jun 16 21:43:53     fn()
Jun 16 21:43:53   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 4591, in test_ddp_logging_data_cpu
Jun 16 21:43:53     self.assertEqual(ddp_logging_data.get("find_unused_parameters"), 0)
Jun 16 21:43:53   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1419, in assertEqual
Jun 16 21:43:53     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Jun 16 21:43:53   File "/opt/conda/lib/python3.6/unittest/case.py", line 682, in assertTrue
Jun 16 21:43:53     raise self.failureException(msg)
Jun 16 21:43:53 AssertionError: False is not true : Scalars failed to compare as equal! Comparing 1 and 0 gives a difference of 1, but the allowed difference with rtol=0 and atol=0 is only 0!
Jun 16 21:43:53 
Jun 16 21:43:53 
Jun 16 21:43:53 
Jun 16 21:43:53 ----------------------------------------------------------------------
Jun 16 21:43:53 Ran 220 tests in 81.532s
Jun 16 21:43:53 
Jun 16 21:43:53 FAILED (errors=1, skipped=110)
Jun 16 21:43:53 
Jun 16 21:43:53 Generating XML reports...
Jun 16 21:43:53 Generated XML report: test-reports/dist-gloo/distributed.test_distributed_fork/TEST-TestDistBackendWithFork-20210616214231.xml

See CircleCI build pytorch_linux_bionic_cuda10_2_cudnn7_py3_9_gcc7_test2 (3/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 16 20:45:00 AssertionError: True is not false : DDP unused parameters error not raised.
Jun 16 20:45:00   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 400, in wrapper
Jun 16 20:45:00     fn()
Jun 16 20:45:00   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 181, in wrapper
Jun 16 20:45:00     ret = func(*args, **kwargs)
Jun 16 20:45:00   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 93, in wrapper
Jun 16 20:45:00     return func(*args, **kwargs)
Jun 16 20:45:00   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 5857, in test_ddp_unused_params_rebuild_buckets_exception
Jun 16 20:45:00     self.assertFalse(
Jun 16 20:45:00   File "/opt/conda/lib/python3.9/unittest/case.py", line 676, in assertFalse
Jun 16 20:45:00     raise self.failureException(msg)
Jun 16 20:45:00 AssertionError: True is not false : DDP unused parameters error not raised.
Jun 16 20:45:00 
Jun 16 20:45:00 
Jun 16 20:45:00 
Jun 16 20:45:00 ----------------------------------------------------------------------
Jun 16 20:45:00 Ran 220 tests in 231.534s
Jun 16 20:45:00 
Jun 16 20:45:00 FAILED (errors=5, skipped=124)
Jun 16 20:45:00 
Jun 16 20:45:00 Generating XML reports...
Jun 16 20:45:00 Generated XML report: test-reports/dist-nccl/.var.lib.jenkins.workspace.test.distributed.test_distributed_fork/TEST-TestDistBackendWithFork-20210616204109.xml

See CircleCI build pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test2 (4/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 16 20:56:39 AssertionError: True is not false : DDP unused parameters error not raised.
Jun 16 20:56:39   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 400, in wrapper
Jun 16 20:56:39     fn()
Jun 16 20:56:39   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 181, in wrapper
Jun 16 20:56:39     ret = func(*args, **kwargs)
Jun 16 20:56:39   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 93, in wrapper
Jun 16 20:56:39     return func(*args, **kwargs)
Jun 16 20:56:39   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 5858, in test_ddp_unused_params_rebuild_buckets_exception
Jun 16 20:56:39     True, "DDP unused parameters error not raised."
Jun 16 20:56:39   File "/opt/conda/lib/python3.6/unittest/case.py", line 676, in assertFalse
Jun 16 20:56:39     raise self.failureException(msg)
Jun 16 20:56:39 AssertionError: True is not false : DDP unused parameters error not raised.
Jun 16 20:56:39 
Jun 16 20:56:39 
Jun 16 20:56:39 
Jun 16 20:56:39 ----------------------------------------------------------------------
Jun 16 20:56:39 Ran 220 tests in 377.894s
Jun 16 20:56:39 
Jun 16 20:56:39 FAILED (errors=5, skipped=123)
Jun 16 20:56:39 
Jun 16 20:56:39 Generating XML reports...
Jun 16 20:56:39 Generated XML report: test-reports/dist-nccl/distributed.test_distributed_spawn/TEST-TestDistBackendWithSpawn-20210616205021.xml

See CircleCI build pytorch_linux_bionic_cuda10_2_cudnn7_py3_9_gcc7_test1 (5/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 16 20:35:31 ERROR [2.422s]: test_multiple_o...rad_is_view (__main__.DistributedDataParallelTest)
Jun 16 20:35:31     loss2.backward()
Jun 16 20:35:31   File "/opt/conda/lib/python3.9/site-packages/torch/_tensor.py", line 256, in backward
Jun 16 20:35:31     torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
Jun 16 20:35:31   File "/opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py", line 147, in backward
Jun 16 20:35:31     Variable._execution_engine.run_backward(
Jun 16 20:35:31 RuntimeError: Trying to backward through the graph a second time (or directly access saved variables after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved variables after calling backward.
Jun 16 20:35:31 
Jun 16 20:35:31 
Jun 16 20:35:31 
Jun 16 20:35:31 ======================================================================
Jun 16 20:35:31 ERROR [2.422s]: test_multiple_outputs_multiple_backward_grad_is_view (__main__.DistributedDataParallelTest)
Jun 16 20:35:31 ----------------------------------------------------------------------
Jun 16 20:35:31 Traceback (most recent call last):
Jun 16 20:35:31   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 398, in wrapper
Jun 16 20:35:31     self._join_processes(fn)
Jun 16 20:35:31   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 590, in _join_processes
Jun 16 20:35:31     self._check_return_codes(elapsed_time)
Jun 16 20:35:31   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 633, in _check_return_codes
Jun 16 20:35:31     raise RuntimeError(error)
Jun 16 20:35:31 RuntimeError: Process 1 exited with error code 10 and exception:
Jun 16 20:35:31 Traceback (most recent call last):

1 failure not recognized by patterns:

Job: GitHub Actions Linux CI (pytorch-linux-xenial-py3.6-gcc5.4) / calculate-docker-image · Step: Checkout PyTorch

❄️ 2 failures tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_bionic_py3_8_gcc9_coverage_test2 (1/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Jun 16 21:10:50 RuntimeError: Process 0 terminated or timed out after 100.06009531021118 seconds
Jun 16 21:10:50 ======================================================================
Jun 16 21:10:50 ERROR [100.158s]: test_backward_ddp_outside (__main__.TensorPipeDdpUnderDistAutogradTestWithSpawn)
Jun 16 21:10:50 ----------------------------------------------------------------------
Jun 16 21:10:50 Traceback (most recent call last):
Jun 16 21:10:50   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 398, in wrapper
Jun 16 21:10:50     self._join_processes(fn)
Jun 16 21:10:50   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 590, in _join_processes
Jun 16 21:10:50     self._check_return_codes(elapsed_time)
Jun 16 21:10:50   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 638, in _check_return_codes
Jun 16 21:10:50     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time))
Jun 16 21:10:50 RuntimeError: Process 0 terminated or timed out after 100.06009531021118 seconds
Jun 16 21:10:50 
Jun 16 21:10:50 ======================================================================
Jun 16 21:10:50 ERROR [5.196s]: test_backward_ddp_outside_uneven_inputs (__main__.TensorPipeDdpUnderDistAutogradTestWithSpawn)
Jun 16 21:10:50 ----------------------------------------------------------------------
Jun 16 21:10:50 Traceback (most recent call last):
Jun 16 21:10:50   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 398, in wrapper
Jun 16 21:10:50     self._join_processes(fn)
Jun 16 21:10:50   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 590, in _join_processes
Jun 16 21:10:50     self._check_return_codes(elapsed_time)
Jun 16 21:10:50   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 633, in _check_return_codes

See CircleCI build pytorch_linux_bionic_py3_8_gcc9_coverage_test1 (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Jun 16 21:14:20 RuntimeError: Process 0 terminated or timed out after 100.03983163833618 seconds
Jun 16 21:14:20 ======================================================================
Jun 16 21:14:20 ERROR [100.121s]: test_backward_ddp_outside (__main__.ProcessGroupDdpUnderDistAutogradTestWithSpawn)
Jun 16 21:14:20 ----------------------------------------------------------------------
Jun 16 21:14:20 Traceback (most recent call last):
Jun 16 21:14:20   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 398, in wrapper
Jun 16 21:14:20     self._join_processes(fn)
Jun 16 21:14:20   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 590, in _join_processes
Jun 16 21:14:20     self._check_return_codes(elapsed_time)
Jun 16 21:14:20   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 638, in _check_return_codes
Jun 16 21:14:20     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time))
Jun 16 21:14:20 RuntimeError: Process 0 terminated or timed out after 100.03983163833618 seconds
Jun 16 21:14:20 
Jun 16 21:14:20 ======================================================================
Jun 16 21:14:20 ERROR [4.095s]: test_backward_ddp_outside_uneven_inputs (__main__.ProcessGroupDdpUnderDistAutogradTestWithSpawn)
Jun 16 21:14:20 ----------------------------------------------------------------------
Jun 16 21:14:20 Traceback (most recent call last):
Jun 16 21:14:20   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 398, in wrapper
Jun 16 21:14:20     self._join_processes(fn)
Jun 16 21:14:20   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 590, in _join_processes
Jun 16 21:14:20     self._check_return_codes(elapsed_time)
Jun 16 21:14:20   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 633, in _check_return_codes

🚧 4 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch:

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

Contributor

@jbschlosser left a comment

LGTM!

@facebook-github-bot
Contributor

@jbschlosser has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@jbschlosser merged this pull request in 7e032f1.


Labels

cla signed · Merged · module: docs (Related to our documentation, both in docs/ and docblocks) · module: nn (Related to torch.nn) · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


Development

Successfully merging this pull request may close these issues.

Uninitialised buffers don't show up in state_dict()

4 participants