Avoid complex-to-real cast warning in CopyBackward by peterbell10 · Pull Request #60021 · pytorch/pytorch · GitHub

Conversation

@peterbell10
Collaborator

@peterbell10 peterbell10 commented Jun 15, 2021

Stack from ghstack:

Dropping the imaginary component is expected and gives the correct gradient
formula, so silencing the warning is appropriate.

Differential Revision: D29589371
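
A minimal Python sketch of why this is safe (illustration only, not the C++ change in this PR; assumes the usual torch Python API): for a real-to-complex copy, the upstream gradient is complex, and the gradient with respect to the real input is its real part, so discarding the imaginary component is the correct formula.

import torch

# Sketch only: backward through a real -> complex copy.
x = torch.randn(3, requires_grad=True)            # real leaf tensor
y = x.to(torch.complex64)                         # real -> complex copy
grad_out = torch.randn(3, dtype=torch.complex64)  # complex upstream gradient
y.backward(grad_out)
# The gradient w.r.t. the real input is the real part of grad_out,
# i.e. the imaginary component is (correctly) dropped.
assert torch.allclose(x.grad, grad_out.real)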

@facebook-github-bot
Contributor

facebook-github-bot commented Jun 15, 2021

💊 CI failures summary and remediations

As of commit 3b16d43 (more details on the Dr. CI page and at hud.pytorch.org/pr/60021):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_bionic_cuda10_2_cudnn7_py3_9_gcc7_test1 (1/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 15 17:03:30 ERROR [0.001s]: test_multitenancy (__main__.TCPStoreTest)
Jun 15 17:03:27 /opt/conda/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py:162: UserWarning: RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. PyTorch v1.9 will be the last release that carries PROCESS_GROUP RPC backend. If you have concerns or suggestions please comment in https://github.com/pytorch/pytorch/issues/55615
Jun 15 17:03:27   warnings.warn(
Jun 15 17:03:28 ok (0.025s)
Jun 15 17:03:28   test_multi_worker_with_fixed_world_size (__main__.TCPStoreTest) ... ok (0.004s)
Jun 15 17:03:28   test_multi_worker_with_nonfixed_world_size (__main__.TCPStoreTest) ... ok (0.006s)
Jun 15 17:03:28   test_multitenancy (__main__.TCPStoreTest) ... ERROR (0.001s)
Jun 15 17:03:30   test_numkeys_delkeys (__main__.TCPStoreTest) ... ok (2.006s)
Jun 15 17:03:30   test_set_get (__main__.TCPStoreTest) ... ok (0.003s)
Jun 15 17:03:30 
Jun 15 17:03:30 ======================================================================
Jun 15 17:03:30 ERROR [0.001s]: test_multitenancy (__main__.TCPStoreTest)
Jun 15 17:03:30 ----------------------------------------------------------------------
Jun 15 17:03:30 Traceback (most recent call last):
Jun 15 17:03:30   File "/var/lib/jenkins/workspace/test/distributed/test_store.py", line 179, in test_multitenancy
Jun 15 17:03:30     store1 = dist.TCPStore(addr, port, 1, True, multi_tenant=True)  # type: ignore[call-arg] # noqa: F841
Jun 15 17:03:30 RuntimeError: Address already in use
Jun 15 17:03:30 
Jun 15 17:03:30 ----------------------------------------------------------------------
Jun 15 17:03:30 Ran 24 tests in 12.152s
Jun 15 17:03:30 
Jun 15 17:03:30 FAILED (errors=1)

See GitHub Actions build Windows CI (pytorch-win-vs2019-cpu-py3) / test (2/2)

Step: "Run test scripts" (full log | diagnosis details | 🔁 rerun)

2021-06-15T16:49:21.1101950Z AssertionError: Fa...lowed difference with rtol=0 and atol=0 is only 0!
2021-06-15T16:49:21.1094429Z ----------------------------------------------------------------------
2021-06-15T16:49:21.1094896Z Traceback (most recent call last):
2021-06-15T16:49:21.1095684Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_distributed.py", line 398, in wrapper
2021-06-15T16:49:21.1096370Z     self._join_processes(fn)
2021-06-15T16:49:21.1097153Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_distributed.py", line 590, in _join_processes
2021-06-15T16:49:21.1098070Z     self._check_return_codes(elapsed_time)
2021-06-15T16:49:21.1098854Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_distributed.py", line 639, in _check_return_codes
2021-06-15T16:49:21.1099573Z     self.assertEqual(
2021-06-15T16:49:21.1100287Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_utils.py", line 1430, in assertEqual
2021-06-15T16:49:21.1101075Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-06-15T16:49:21.1101950Z AssertionError: False is not true : Scalars failed to compare as equal! Comparing 3221226505 and 0 gives a difference of 3221226505, but the allowed difference with rtol=0 and atol=0 is only 0!
2021-06-15T16:49:21.1102872Z Expect process 1 exit code to match Process 0 exit code of 0, but got 3221226505
2021-06-15T16:49:21.1103196Z 
2021-06-15T16:49:21.3659007Z ----------------------------------------------------------------------
2021-06-15T16:49:21.3659537Z Ran 83 tests in 132.746s
2021-06-15T16:49:21.3659777Z 
2021-06-15T16:49:21.3660096Z FAILED (failures=1, skipped=32)
2021-06-15T16:49:21.3660342Z 
2021-06-15T16:49:21.3660642Z Generating XML reports...
2021-06-15T16:49:21.3661418Z Generated XML report: test-reports\python-unittest\distributed.test_c10d_gloo\TEST-CommTest-20210615164708.xml
2021-06-15T16:49:21.3662859Z Generated XML report: test-reports\python-unittest\distributed.test_c10d_gloo\TEST-DistributedDataParallelTest-20210615164708.xml


peterbell10 added a commit that referenced this pull request Jun 15, 2021
Dropping the imaginary component is expected and gives the correct gradient
formula, so silencing the warning is appropriate.

ghstack-source-id: ba5f4e8
Pull Request resolved: #60021
@peterbell10 peterbell10 requested a review from mruberry June 15, 2021 16:07
@peterbell10 peterbell10 added the module: complex Related to complex number support in PyTorch label Jun 15, 2021
Collaborator

@albanD albanD left a comment


The removal of the warning looks OK.
I do have a couple of questions about the function below, though.

const bool copy = (grad->is_cuda() && grad->device() != src_device);
grad_inputs[1] = grad->to(
    src_options,
    /*non_blocking=*/false,
Collaborator


Why isn't the non_blocking setting from the forward copy propagated here?

Collaborator Author


I think that creates lifetime issues. grad needs to be known to be alive at least until the next device sync, which I don't think we can guarantee.
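
(For context, a hedged Python-level illustration of the lifetime concern; this is not the code path in question, just the general non_blocking semantics: an asynchronous copy can return before the transfer finishes, so the source buffer must stay alive until the stream is synchronized.)

import torch

# Illustration only (assumes a CUDA build): a pinned host -> device copy with
# non_blocking=True is asynchronous, so `src` must remain alive and unmodified
# until the copy has actually completed.
if torch.cuda.is_available():
    src = torch.randn(1 << 20).pin_memory()   # pinned host buffer
    dst = src.to("cuda", non_blocking=True)   # copy may still be in flight here
    torch.cuda.synchronize()                  # after this, the transfer is done
    assert torch.equal(dst.cpu(), src)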

Collaborator


cc @mcarilli, we might want to double-check that one later as part of the syncing in the backward pass.

Collaborator

@albanD albanD left a comment


LGTM

@peterbell10
Collaborator Author

@mruberry ping

Collaborator

@mruberry mruberry left a comment


Thanks @peterbell10!

@mruberry
Collaborator

mruberry commented Jul 7, 2021

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mruberry merged this pull request in 429436e.

} else {
  grad_inputs[1] = grad.to(src_options);
}
const bool copy = (grad->is_cuda() && grad->device() != src_device);
Contributor

@zou3519 zou3519 Jul 9, 2021


Do we need to compute copy at all? If grad's device is different from the src device, then it shouldn't matter whether we pass a copy parameter to grad->to: because the devices are always different, Tensor::to will always copy regardless of what the copy parameter is. Am I missing something?
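
(A hedged Python-level sketch of this point; the behavior is assumed rather than taken from the PR: a cross-device Tensor.to always materializes a new tensor, whether or not a copy flag is passed.)

import torch

# Sketch only (assumes a CUDA build): .to() across devices always copies,
# regardless of the copy flag, so mutating the result leaves the source intact.
if torch.cuda.is_available():
    t = torch.ones(4, device="cuda")
    a = t.to("cpu")               # copy flag left at its default
    a.zero_()                     # mutate the CPU result
    assert t.sum().item() == 4.0  # the CUDA source tensor is unaffected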

Collaborator Author


That's what the next PR in the stack changes :)

Contributor


Aha! Thanks for the clarification, I should have read the stack :)

@facebook-github-bot facebook-github-bot deleted the gh/peterbell10/74/head branch July 11, 2021 14:16

Labels

cla signed · Merged · module: complex (Related to complex number support in PyTorch) · open source


Development

Successfully merging this pull request may close these issues.

Complex number cast warnings from fft2 and fftn during backward pass

6 participants