Make processPythonExecution return a future by lw · Pull Request #57855 · pytorch/pytorch · GitHub

@lw

@lw lw commented May 7, 2021

Stack from ghstack:

We already had a helper to run Python functions, which was nice (it de-duplicated some code). However, this helper took a callback, which, as I said, isn't as nice as returning a Future. Hence this PR changes the helper to return a Future.

Differential Revision: D28253846
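
To make the difference between the two styles concrete, here is a minimal sketch in Python using the standard `concurrent.futures.Future`. It is not the PR's code: the actual change is in PyTorch's C++ RPC layer, and `run_python_function`, `process_with_callback`, and `process_returning_future` are hypothetical names chosen only for illustration.

```python
from concurrent.futures import Future
from typing import Callable


def run_python_function(arg: str) -> str:
    # Hypothetical stand-in for the user function the helper executes.
    return f"result:{arg}"


def process_with_callback(arg: str, on_done: Callable[[str], None]) -> None:
    # Callback style: the helper pushes the result into a caller-supplied
    # function, so the caller cannot easily wait on it or compose it.
    on_done(run_python_function(arg))


def process_returning_future(arg: str) -> "Future[str]":
    # Future style: the helper returns a handle. Both the result and any
    # error travel inside the future instead of through a side channel.
    future: "Future[str]" = Future()
    try:
        result = run_python_function(arg)
    except Exception as exc:
        future.set_exception(exc)
    else:
        future.set_result(result)
    return future


# Callback style usage: the result shows up wherever the callback puts it.
received = []
process_with_callback("x", received.append)

# Future style usage: the caller can block on the result or chain further work.
fut = process_returning_future("x")
chained = []
fut.add_done_callback(lambda f: chained.append(f.result()))
```

With the future-returning version the caller decides whether to wait, chain a continuation, or pass the future along, which is the kind of flexibility the description alludes to.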

@facebook-github-bot facebook-github-bot added the "oncall: distributed" and "cla signed" labels May 7, 2021
@facebook-github-bot

facebook-github-bot commented May 7, 2021

💊 CI failures summary and remediations

As of commit f100675 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/1)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

May 21 12:17:05 torch_xla/csrc/aten_xla_type.cp... match any declaration in 'torch_xla::AtenXlaType'
May 21 12:16:48                   ^
May 21 12:16:52 clang-9 -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c torch_xla/csrc/aten_xla_type_default.cpp -o build/temp.linux-x86_64-3.6/torch_xla/csrc/aten_xla_type_default.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
May 21 12:16:54 clang-9 -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c torch_xla/csrc/aten_xla_type.cpp -o build/temp.linux-x86_64-3.6/torch_xla/csrc/aten_xla_type.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
May 21 12:16:58 clang-9 -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c torch_xla/csrc/batch_norm.cpp -o build/temp.linux-x86_64-3.6/torch_xla/csrc/batch_norm.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
May 21 12:17:05 torch_xla/csrc/aten_xla_type.cpp:1238:25: error: out-of-line definition of 'div' does not match any declaration in 'torch_xla::AtenXlaType'
May 21 12:17:05 at::Tensor AtenXlaType::div(const at::Tensor& self, const at::Tensor& other,
May 21 12:17:05                         ^~~
May 21 12:17:05 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.h:338:74: note: type of 3rd parameter of member declaration does not match definition ('optional<c10::string_view>' vs 'optional<std::string>')
May 21 12:17:05 static at::Tensor div(const at::Tensor & self, const at::Tensor & other, c10::optional<c10::string_view> rounding_mode);
May 21 12:17:05                                                                          ^
May 21 12:17:05 torch_xla/csrc/aten_xla_type.cpp:1257:26: error: out-of-line definition of 'div_' does not match any declaration in 'torch_xla::AtenXlaType'
May 21 12:17:05 at::Tensor& AtenXlaType::div_(at::Tensor& self, const at::Tensor& other,
May 21 12:17:05                          ^~~~
May 21 12:17:05 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.h:160:71: note: type of 3rd parameter of member declaration does not match definition ('optional<c10::string_view>' vs 'optional<std::string>')
May 21 12:17:05 static at::Tensor & div_(at::Tensor & self, const at::Tensor & other, c10::optional<c10::string_view> rounding_mode);
May 21 12:17:05                                                                       ^
May 21 12:17:05 clang-9 -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c torch_xla/csrc/reduction.cpp -o build/temp.linux-x86_64-3.6/torch_xla/csrc/reduction.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
May 21 12:17:08 2 errors generated.
May 21 12:17:08 clang-9 -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c torch_xla/csrc/matrix.cpp -o build/temp.linux-x86_64-3.6/torch_xla/csrc/matrix.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
May 21 12:17:14 1 warning generated.
May 21 12:17:15 clang-9 -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c torch_xla/csrc/pooling.cpp -o build/temp.linux-x86_64-3.6/torch_xla/csrc/pooling.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

lw added 2 commits May 17, 2021 03:53
@mrshenli

Test failures are real:

May 17 22:43:05   test_py_rref_args (__main__.ProcessGroupRpcTestWithSpawn) ... [E request_callback_no_python.cpp:620] Received error while processing request type 260: *ownerRRef->type() == *typeINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":362, please report a bug to PyTorch. OwnerRRef type is Tensor, expected type is Tensor
May 17 22:43:05 Exception raised from getOrCreateOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:362 (most recent call first):
May 17 22:43:05 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f4df0f01799 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
May 17 22:43:05 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd2 (0x7f4df0efd8d2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
May 17 22:43:05 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4e (0x7f4df0eff22e in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
May 17 22:43:05 frame #3: torch::distributed::rpc::RRefContext::getOrCreateOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, std::shared_ptr<c10::Type> const&) + 0x3c1 (0x7f4de9aca2a1 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
May 17 22:43:05 frame #4: torch::distributed::rpc::RequestCallbackNoPython::processBaseScriptRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::function<void (torch::distributed::rpc::Message)> const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> > const&) const + 0x694 (0x7f4de9ab9434 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
May 17 22:43:05 frame #5: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::shared_ptr<torch::distributed::rpc::LazyStreamContext>) const + 0x2a0 (0x7f4de9abf010 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
May 17 22:43:05 frame #6: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::shared_ptr<torch::distributed::rpc::LazyStreamContext>) const + 0x54 (0x7f4df226d734 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
May 17 22:43:05 frame #7: <unknown function> + 0x3f9b95e (0x7f4de9abb95e in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
May 17 22:43:05 frame #8: torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::shared_ptr<torch::distributed::rpc::LazyStreamContext>) const + 0xb60 (0x7f4de9abccb0 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
May 17 22:43:05 frame #9: torch::distributed::rpc::RequestCallback::operator()(torch::distributed::rpc::Message&, std::shared_ptr<torch::distributed::rpc::LazyStreamContext>) const + 0x47 (0x7f4de9ab57d7 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
May 17 22:43:05 frame #10: torch::distributed::rpc::ProcessGroupAgent::handleRecv(torch::distributed::rpc::RecvWork&) + 0x1e0 (0x7f4df224b8b0 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
May 17 22:43:05 frame #11: <unknown function> + 0xd10c42 (0x7f4df224cc42 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
May 17 22:43:05 frame #12: c10::ThreadPool::main_loop(unsigned long) + 0x2a3 (0x7f4df0ed7113 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
May 17 22:43:05 frame #13: <unknown function> + 0xc8421 (0x7f4de5a74421 in /opt/conda/lib/libstdc++.so.6)
May 17 22:43:05 frame #14: <unknown function> + 0x76ba (0x7f4e0033f6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
May 17 22:43:05 frame #15: clone + 0x6d (0x7f4e0007551d in /lib/x86_64-linux-gnu/libc.so.6)

@lw

lw commented May 18, 2021

The test failures turned out to come from the previous PR.


@mrshenli mrshenli left a comment


LGTM

lw added 2 commits May 19, 2021 05:14
dgl-intel pushed a commit to dgl-intel/pytorch that referenced this pull request May 21, 2021
Pull Request resolved: pytorch#57855

We already had a helper to run Python functions, which was nice (it de-duplicated some code). This helper was however taking a callback which, as I said, isn't as nice as it returning a Future. Hence here I change this.
ghstack-source-id: 129567054

Differential Revision: [D28253846](https://our.internmc.facebook.com/intern/diff/D28253846/)
@facebook-github-bot

This pull request has been merged in e220a1b.

@facebook-github-bot facebook-github-bot deleted the gh/lw/169/head branch May 25, 2021 14:16

Labels

cla signed, Merged, oncall: distributed


3 participants