Move torch/lib/c10d to torch/csrc/distributed/c10d by lw · Pull Request #60543 · pytorch/pytorch

Conversation


@lw lw commented Jun 23, 2021

Stack from ghstack:

Now that c10d is part of libtorch, it would also be nice if its sources all lived in one place.

Differential Revision: D29062002

NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!
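A move like this also means rewriting include paths across the tree. As a hedged illustration of the mechanical part (this is not the tooling actually used in the PR, and `rewrite_includes` is a hypothetical helper), a small rewrite pass might look like:

```python
import re

# Hypothetical helper: map old-style c10d includes to the new
# torch/csrc/distributed/c10d location. A sketch only, not the
# actual script used for this migration.
OLD = re.compile(r'#include <c10d/(?P<hdr>[^>]+)>')
NEW = r'#include <torch/csrc/distributed/c10d/\g<hdr>>'

def rewrite_includes(source):
    """Return `source` with c10d include paths updated to the new layout."""
    return OLD.sub(NEW, source)

print(rewrite_includes('#include <c10d/ProcessGroup.hpp>'))
```

Lines that do not match the old include pattern pass through unchanged, so the pass can be run over every source file in the tree.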


facebook-github-bot commented Jun 23, 2021

💊 CI failures summary and remediations

As of commit 15a541d (more details on the Dr. CI page and at hud.pytorch.org/pr/60543):



🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (1/3)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 24 19:20:08 *** End stack trace ***
Jun 24 19:20:08 
Jun 24 19:20:08 
Jun 24 19:20:08 During handling of the above exception, another exception occurred:
Jun 24 19:20:08 
Jun 24 19:20:08 Traceback (most recent call last):
Jun 24 19:20:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 397, in instantiated_test
Jun 24 19:20:08     result = test_fn(self, *args)
Jun 24 19:20:08   File "/var/lib/jenkins/workspace/xla/test/../../test/test_nn.py", line 16007, in test_nll_loss_invalid_weights
Jun 24 19:20:08     F.nll_loss(x, t, weight=weight)
Jun 24 19:20:08 AssertionError: "weight tensor should be defined either for all 3 classes or no classes" does not match "/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/xla_client/debug_macros.h:27 : Check failed: status.status() == ::tensorflow::Status::OK() (Invalid argument: Input dimension should be either 1 or equal to the output dimension it is broadcasting into; the 0th operand dimension is 4, the 1th output dimension is 3. vs. OK)
Jun 24 19:20:08 *** Begin stack trace ***
Jun 24 19:20:08 	tensorflow::CurrentStackTrace[abi:cxx11]()
Jun 24 19:20:08 	xla::Shape const* ConsumeValue<xla::Shape const*>(tensorflow::StatusOr<xla::Shape const*>&&)
Jun 24 19:20:08 	torch_xla::XlaHelpers::ShapeOfXlaOp(xla::XlaOp)
Jun 24 19:20:08 	torch_xla::ir::ops::InferOutputShape(absl::lts_20210324::Span<xla::Shape const>, std::function<xla::XlaOp (absl::lts_20210324::Span<xla::XlaOp const>)> const&)
Jun 24 19:20:08 	
Jun 24 19:20:08 	torch_xla::ir::Node::GetOpShape(std::function<xla::Shape ()> const&) const
Jun 24 19:20:08 	torch_xla::ir::Node::Node(torch_xla::ir::OpKind, absl::lts_20210324::Span<torch_xla::ir::Value const>, std::function<xla::Shape ()> const&, unsigned long, absl::lts_20210324::uint128)
Jun 24 19:20:08 	torch_xla::ir::ops::NllLoss::NllLoss(torch_xla::ir::Value const&, torch_xla::ir::Value const&, absl::lts_20210324::optional<torch_xla::ir::Value> const&, torch_xla::ReductionMode, int)
Jun 24 19:20:08 	torch_xla::XLATensor::nll_loss(torch_xla::XLATensor const&, torch_xla::XLATensor const&, torch_xla::XLATensor const&, long, int)

See CircleCI build pytorch_linux_backward_compatibility_check_test (2/3)

Step: "Report results" (full log | diagnosis details | 🔁 rerun)

Jun 24 18:01:41 + export CIRCLE_BRANCH=gh/lw/222/head
Jun 24 18:01:41 + CIRCLE_BRANCH=gh/lw/222/head
Jun 24 18:01:41 + export JOB_BASE_NAME=pytorch_linux_backward_compatibility_check_test
Jun 24 18:01:41 + JOB_BASE_NAME=pytorch_linux_backward_compatibility_check_test
Jun 24 18:01:41 + export CIRCLE_WORKFLOW_ID=a8211869-a318-4372-9694-d41434512c96
Jun 24 18:01:41 + CIRCLE_WORKFLOW_ID=a8211869-a318-4372-9694-d41434512c96
Jun 24 18:01:41 + cd workspace
Jun 24 18:01:41 + export PYTHONPATH=/var/lib/jenkins/workspace
Jun 24 18:01:41 + PYTHONPATH=/var/lib/jenkins/workspace
Jun 24 18:01:41 + python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test
Jun 24 18:01:41 python: can't open file 'tools/print_test_stats.py': [Errno 2] No such file or directory


Exited with code exit status 2

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_jit_legacy_test (3/3)

Step: "Report results" (full log | diagnosis details | 🔁 rerun)

Jun 24 18:04:18 + export CIRCLE_BRANCH=gh/lw/222/head
Jun 24 18:04:18 + CIRCLE_BRANCH=gh/lw/222/head
Jun 24 18:04:18 + export JOB_BASE_NAME=pytorch_linux_xenial_py3_6_gcc5_4_jit_legacy_test
Jun 24 18:04:18 + JOB_BASE_NAME=pytorch_linux_xenial_py3_6_gcc5_4_jit_legacy_test
Jun 24 18:04:18 + export CIRCLE_WORKFLOW_ID=a8211869-a318-4372-9694-d41434512c96
Jun 24 18:04:18 + CIRCLE_WORKFLOW_ID=a8211869-a318-4372-9694-d41434512c96
Jun 24 18:04:18 + cd workspace
Jun 24 18:04:18 + export PYTHONPATH=/var/lib/jenkins/workspace
Jun 24 18:04:18 + PYTHONPATH=/var/lib/jenkins/workspace
Jun 24 18:04:18 + python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test
Jun 24 18:04:18 python: can't open file 'tools/print_test_stats.py': [Errno 2] No such file or directory


Exited with code exit status 2
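Both of these failures come from `tools/print_test_stats.py` no longer being found at its old path on this branch. As a hedged sketch (the candidate paths and `find_stats_script` helper are illustrative, not the repo's actual final layout), a harness could fall back to a relocated copy like so:

```python
import os
import tempfile

# Simulate a checkout where the script moved next to the installed tests.
# These paths are hypothetical, for illustration only.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "test", "tools"))
open(os.path.join(root, "test", "tools", "print_test_stats.py"), "w").close()

def find_stats_script(root):
    """Return the first candidate path that exists, or None."""
    candidates = ("tools/print_test_stats.py", "test/tools/print_test_stats.py")
    for candidate in candidates:
        if os.path.isfile(os.path.join(root, candidate)):
            return candidate
    return None

print(find_stats_script(root))
```

Here the old `tools/` location is missing, so the lookup falls through to the relocated copy under `test/tools/`.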


❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_slow_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Jun 24 18:41:13 ConnectionResetError: [Errno 104] Connection reset by peer
Jun 24 18:41:13   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 455, in accept
Jun 24 18:41:13     deliver_challenge(c, self._authkey)
Jun 24 18:41:13   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 722, in deliver_challenge
Jun 24 18:41:13     response = connection.recv_bytes(256)        # reject large message
Jun 24 18:41:13   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
Jun 24 18:41:13     buf = self._recv_bytes(maxlength)
Jun 24 18:41:13   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
Jun 24 18:41:13     buf = self._recv(4)
Jun 24 18:41:13   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
Jun 24 18:41:13     chunk = read(handle, remaining)
Jun 24 18:41:13 ConnectionResetError: [Errno 104] Connection reset by peer
Jun 24 18:41:13 /opt/conda/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
Jun 24 18:41:13   len(cache))
Jun 24 18:41:16 Process ErrorTrackingProcess-88:
Jun 24 18:41:16 Traceback (most recent call last):
Jun 24 18:41:16   File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
Jun 24 18:41:16     self.run()
Jun 24 18:41:16   File "/var/lib/jenkins/workspace/test/test_dataloader.py", line 374, in run
Jun 24 18:41:16     super(ErrorTrackingProcess, self).run()
Jun 24 18:41:16   File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
Jun 24 18:41:16     self._target(*self._args, **self._kwargs)

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet.


This comment was automatically generated by Dr. CI.

@facebook-github-bot facebook-github-bot added oncall: distributed Add this issue/PR to distributed oncall triage queue cla signed labels Jun 23, 2021
lw added a commit that referenced this pull request Jun 23, 2021
Pull Request resolved: #60543

Since now c10d is part of libtorch, it would also be nice if the sources lived all in one place.
ghstack-source-id: 132184065

Differential Revision: [D29062002](https://our.internmc.facebook.com/intern/diff/D29062002/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D29062002/)!

@cbalioglu cbalioglu left a comment


Overall looks good to me. It is hard to spot whether any items are missing, but since this is mostly migration work, it should be good to go once all CI tests are green. Approving in advance to unblock you.

lw added a commit that referenced this pull request Jun 24, 2021
Pull Request resolved: #60543
ghstack-source-id: 132293145
lw added a commit that referenced this pull request Jun 24, 2021
Pull Request resolved: #60543
ghstack-source-id: 132296287
lw added a commit that referenced this pull request Jun 24, 2021
Pull Request resolved: #60543
ghstack-source-id: 132306292
@lw lw added the ci/master label Jun 24, 2021

@malfet malfet left a comment


👍

@facebook-github-bot

This pull request has been merged in a016150.

malfet added a commit that referenced this pull request Jun 25, 2021
After #60543 they are installed in the same folder as the rest of the tests
facebook-github-bot pushed a commit that referenced this pull request Jun 25, 2021
Summary:
After #60543 they are installed in the same folder as the rest of the tests

Pull Request resolved: #60705

Reviewed By: driazati

Differential Revision: D29380670

Pulled By: malfet

fbshipit-source-id: a432d26c731e9220e00d8c800b1429b37d51655b
@facebook-github-bot facebook-github-bot deleted the gh/lw/222/head branch June 28, 2021 14:17
asuhan pushed a commit to asuhan/pytorch that referenced this pull request Jun 28, 2021
asuhan pushed a commit that referenced this pull request Jun 30, 2021

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
