[reland] Always use intrusive_ptr for Message (2 out of 2) by lw · Pull Request #59206 · pytorch/pytorch · GitHub

Conversation

@lw (Contributor) commented on May 31, 2021

Stack from ghstack:

Reland of #58423

This is part 2 of the previous PR. Here we address the remaining occurrences of "raw" Message, namely the ones inside toMessageImpl. Since these are the last ones, we also make the constructor of Message private, to prevent new usages from emerging.

Differential Revision: [D28623892](https://our.internmc.facebook.com/intern/diff/D28623892/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28623892/)!
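For context, the sketch below shows the general shape of the pattern this stack converges on: a message type that derives from c10::intrusive_ptr_target, keeps its constructor private, and exposes a factory that returns a c10::intrusive_ptr. It is a minimal illustration under those assumptions, not the actual torch::distributed::rpc::Message code; the names RpcMessage and fromPayload are hypothetical.

```cpp
// Minimal sketch of the "private constructor + intrusive_ptr factory" pattern.
// RpcMessage and fromPayload are hypothetical names used only for illustration.
#include <c10/util/intrusive_ptr.h>

#include <utility>
#include <vector>

class RpcMessage : public c10::intrusive_ptr_target {
 public:
  // The only way to obtain an RpcMessage is through this factory, so every
  // instance is born inside an intrusive_ptr and "raw" values cannot exist.
  static c10::intrusive_ptr<RpcMessage> fromPayload(std::vector<char> payload) {
    return c10::make_intrusive<RpcMessage>(std::move(payload));
  }

  const std::vector<char>& payload() const {
    return payload_;
  }

 private:
  explicit RpcMessage(std::vector<char> payload)
      : payload_(std::move(payload)) {}

  // c10::make_intrusive performs the allocation from inside c10::intrusive_ptr,
  // so that class needs access to the private constructor.
  friend class c10::intrusive_ptr<RpcMessage>;

  std::vector<char> payload_;
};
```

With construction funneled through a factory like this, every message lives behind a refcounted pointer, which is what makes privatizing the constructor an effective guard against new by-value usages creeping back in.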

@facebook-github-bot added the "cla signed" and "oncall: distributed" labels on May 31, 2021
@facebook-github-bot (Contributor) commented on May 31, 2021

💊 CI failures summary and remediations

As of commit dedea70 (more details on the Dr. CI page):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_bionic_py3_8_gcc9_coverage_test1 (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

May 31 14:33:03 AssertionError: False is not tr...e-05 and atol=1e-05 is only 1.595822655193914e-05!
May 31 14:33:03   File "test_linalg.py", line 174, in check_correctness_ref
May 31 14:33:03     self.assertEqual(sol, solution_3d.select(0, i), atol=1e-5, rtol=1e-5)
May 31 14:33:03   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1391, in assertEqual
May 31 14:33:03     self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
May 31 14:33:03   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1391, in assertEqual
May 31 14:33:03     self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
May 31 14:33:03   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1272, in assertEqual
May 31 14:33:03     self.assertEqual(x, y.item(), atol=atol, rtol=rtol, msg=msg,
May 31 14:33:03   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1403, in assertEqual
May 31 14:33:03     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
May 31 14:33:03 AssertionError: False is not true : Scalars failed to compare as equal! Comparing 0.5958641115914712 and 0.5958226551939139 gives a difference of 4.145639755737118e-05, but the allowed difference with rtol=1e-05 and atol=1e-05 is only 1.595822655193914e-05!
May 31 14:33:03 
May 31 14:33:04 ----------------------------------------------------------------------
May 31 14:33:04 Ran 2507 tests in 327.277s
May 31 14:33:04 
May 31 14:33:04 FAILED (failures=1, skipped=51)
May 31 14:33:04 
May 31 14:33:04 Generating XML reports...
May 31 14:33:04 Generated XML report: test-reports/python-unittest/test_linalg/TEST-TestLinalgCPU-20210531142735.xml
May 31 14:33:05 Traceback (most recent call last):
May 31 14:33:05   File "test/run_test.py", line 1172, in <module>

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

May 31 14:22:50 RuntimeError: tensorflow/compil...OK() (Unknown: Could not start gRPC server vs. OK)
May 31 14:22:50   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 314, in _setup_replication
May 31 14:22:50     device = xm.xla_device()
May 31 14:22:50   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 232, in xla_device
May 31 14:22:50     devkind=devkind if devkind is not None else None)
May 31 14:22:50   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 137, in get_xla_supported_devices
May 31 14:22:50     xla_devices = _DEVICES.value
May 31 14:22:50   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/utils/utils.py", line 32, in value
May 31 14:22:50     self._value = self._gen_fn()
May 31 14:22:50   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 19, in <lambda>
May 31 14:22:50     _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
May 31 14:22:50 RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Unknown: Could not start gRPC server vs. OK)
May 31 14:22:50 Exception in device=CPU:2: Connection reset by peer
May 31 14:22:50 Traceback (most recent call last):
May 31 14:22:50   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
May 31 14:22:50     _start_fn(index, pf_cfg, fn, args)
May 31 14:22:50   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
May 31 14:22:50     fn(gindex, *args)
May 31 14:22:50   File "/var/lib/jenkins/workspace/xla/test/test_mp_replication.py", line 16, in _mp_fn
May 31 14:22:50     xm.all_reduce(xm.REDUCE_SUM, [xones, xtwos])
May 31 14:22:50   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 558, in all_reduce
May 31 14:22:50     cctx = CollectiveContext(groups=groups)

This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@mrshenli (Contributor) left a comment


Reviewed the previous attempt. The change in this reland is updating internal related code to use intrusive_ptr as well.
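To make "updating internal related code to use intrusive_ptr" concrete, here is a schematic before/after of a serializer-style call site. It reuses the hypothetical RpcMessage type from the sketch further up (and assumes that definition is in scope); it is not the actual internal code referenced in this comment.

```cpp
// Before: the serializer returned the message by value, so a "raw" object
// existed at the call site; with the constructor private this no longer
// compiles.
//
//   RpcMessage toMessageImpl(std::vector<char> buf) {
//     return RpcMessage(std::move(buf));
//   }

// After: the serializer builds the message through the factory and returns a
// refcounted pointer, so no raw message object ever exists.
c10::intrusive_ptr<RpcMessage> toMessageImpl(std::vector<char> buf) {
  return RpcMessage::fromPayload(std::move(buf));
}
```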

@facebook-github-bot (Contributor) commented

This pull request has been merged in b07d68e.

@facebook-github-bot deleted the gh/lw/194/head branch on June 5, 2021 at 14:17
deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
…9206)

Summary:
Pull Request resolved: pytorch#59206

Reland of pytorch#58423

This is part 2 of the previous PR. Here we address the remaining occurrences of "raw" Message, namely the ones within toMessageImpl. And since they're the last ones, we make the constructor of Message private, to prevent new usages from emerging.
ghstack-source-id: 130202848

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623892

fbshipit-source-id: f815cf6b93e488c118e5d2298473e6e9d9f4c132

Labels

cla signed · Merged · oncall: distributed
