KEMBAR78
Unify invoking JIT functions by lw · Pull Request #57851 · pytorch/pytorch · GitHub
Skip to content

Conversation

@lw
Copy link
Contributor

@lw lw commented May 7, 2021

Stack from ghstack:

The same as the previous PR, but for JIT functions.

Differential Revision: D28253841

The same as the previous PR, but for JIT functions.

Differential Revision: [D28253841](https://our.internmc.facebook.com/intern/diff/D28253841/)

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added oncall: distributed Add this issue/PR to distributed oncall triage queue cla signed labels May 7, 2021
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented May 7, 2021

💊 CI failures summary and remediations

As of commit 9baf610 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/1)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

May 21 12:22:32 torch_xla/csrc/aten_xla_type.cp... match any declaration in 'torch_xla::AtenXlaType'
May 21 12:22:13                   ^
May 21 12:22:18 clang-9 -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c torch_xla/csrc/aten_xla_type_default.cpp -o build/temp.linux-x86_64-3.6/torch_xla/csrc/aten_xla_type_default.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
May 21 12:22:20 clang-9 -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c torch_xla/csrc/aten_xla_type.cpp -o build/temp.linux-x86_64-3.6/torch_xla/csrc/aten_xla_type.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
May 21 12:22:24 clang-9 -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c torch_xla/csrc/batch_norm.cpp -o build/temp.linux-x86_64-3.6/torch_xla/csrc/batch_norm.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
May 21 12:22:32 torch_xla/csrc/aten_xla_type.cpp:1238:25: error: out-of-line definition of 'div' does not match any declaration in 'torch_xla::AtenXlaType'
May 21 12:22:32 at::Tensor AtenXlaType::div(const at::Tensor& self, const at::Tensor& other,
May 21 12:22:32                         ^~~
May 21 12:22:32 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.h:64:74: note: type of 3rd parameter of member declaration does not match definition ('optional<c10::string_view>' vs 'optional<std::string>')
May 21 12:22:32 static at::Tensor div(const at::Tensor & self, const at::Tensor & other, c10::optional<c10::string_view> rounding_mode);
May 21 12:22:32                                                                          ^
May 21 12:22:32 torch_xla/csrc/aten_xla_type.cpp:1257:26: error: out-of-line definition of 'div_' does not match any declaration in 'torch_xla::AtenXlaType'
May 21 12:22:32 at::Tensor& AtenXlaType::div_(at::Tensor& self, const at::Tensor& other,
May 21 12:22:32                          ^~~~
May 21 12:22:32 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.h:117:71: note: type of 3rd parameter of member declaration does not match definition ('optional<c10::string_view>' vs 'optional<std::string>')
May 21 12:22:32 static at::Tensor & div_(at::Tensor & self, const at::Tensor & other, c10::optional<c10::string_view> rounding_mode);
May 21 12:22:32                                                                       ^
May 21 12:22:32 clang-9 -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c torch_xla/csrc/reduction.cpp -o build/temp.linux-x86_64-3.6/torch_xla/csrc/reduction.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
May 21 12:22:35 2 errors generated.
May 21 12:22:35 clang-9 -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c torch_xla/csrc/matrix.cpp -o build/temp.linux-x86_64-3.6/torch_xla/csrc/matrix.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
May 21 12:22:43 clang-9 -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c torch_xla/csrc/pooling.cpp -o build/temp.linux-x86_64-3.6/torch_xla/csrc/pooling.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
May 21 12:22:43 1 warning generated.

1 failure not recognized by patterns:

Job Step Action
CircleCI pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_build Build 🔁 rerun

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

lw added 2 commits May 17, 2021 03:53
The same as the previous PR, but for JIT functions.

Differential Revision: [D28253841](https://our.internmc.facebook.com/intern/diff/D28253841/)

[ghstack-poisoned]
The same as the previous PR, but for JIT functions.

Differential Revision: [D28253841](https://our.internmc.facebook.com/intern/diff/D28253841/)

[ghstack-poisoned]
Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

bool hasOp() const;
std::shared_ptr<Operator> op() const;
bool hasQualifiedName() const;
// FIXME It's weird to return a const _value_, should this return a reference?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, returning const reference look better to me too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change it then


// Helpers to run user-defined functions, operators and other computations.

c10::intrusive_ptr<JitFuture> runJitFunction(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar as the comment in the previous PR, this does not need to be a member function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be used by the subclass later in the stack

Copy link
Contributor

@mrshenli mrshenli May 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, we are adding a new subclass of RequestCallbackImpl? IIUC, it does not have a subclass today?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, no, got confused with the previous PR. This method in fact will just be used in here, so it could perhaps be a simple function, though I think it might use some other methods in turn, and also making a method is more "consistent" with runJitOperator. Does this make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, LGTM

.jitCompilationUnit()
->get_function(scriptRemoteCall.qualifiedName())
.runAsync(stack);
if (jitFuture->completed()) { // short-cut.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are skipping fast passes, let's measure the impact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What benchmarks would you like me to run?

Also, I don't remember if I said it clearly before, so let me explain why I think it's better to avoid these fastpaths/shortcuts: they are terribly dangerous for CUDA sync, because at no point we're synchronizing the events contained in the future with the current streams at the callsite. (It can be done, one just has to call future->wait(), but we didn't do it anywhere, and it's super easy to forget about).

Moreover, I believe they hurt readability, because they force us to lay out the code in a different order than how it's executed. Namely, now we need to do something like this:

continuation = []() { bar(); };
auto fut = foo();
if (fut->completed()) {
  continuation();
} else {
  fut->addCallback(continuation);
}

Note how in the code bar() appears before foo(), whereas in the "logical flow" of the program bar() comes after foo(). This requires the reader to "jump around" and makes it harder to quickly understand what the code is doing. Without shortcuts, the above code becomes:

auto fut = foo();
fut->addCallback([]() { bar(); });

As for the supposed performance benefit of these fastpaths, I honestly don't think that we're introducing a regression. The addCallback method itself already has such a shortcut logic internally, so in fact these fastpaths were redundant. Moreover, after the changes in #57638, invoking addCallback should come with literally zero overhead, and the compiler will most likely inline it all the time. (I said addCallback but I also mean then since they're basically the same).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, @jjlilley added this a while ago when optimizing PyPer perf. So, I assume there was a reason to introduce fast path at least at that time.

Moreover, after the changes in #57638, invoking addCallback should come with literally zero overhead, and the compiler will most likely inline it all the time.

If we are confident there is no additional overhead introduce, this looks good to me.

lw added 3 commits May 18, 2021 02:35
The same as the previous PR, but for JIT functions.

Differential Revision: [D28253841](https://our.internmc.facebook.com/intern/diff/D28253841/)

[ghstack-poisoned]
The same as the previous PR, but for JIT functions.

Differential Revision: [D28253841](https://our.internmc.facebook.com/intern/diff/D28253841/)

[ghstack-poisoned]
The same as the previous PR, but for JIT functions.

Differential Revision: [D28253841](https://our.internmc.facebook.com/intern/diff/D28253841/)

[ghstack-poisoned]
The same as the previous PR, but for JIT functions.

Differential Revision: [D28253841](https://our.internmc.facebook.com/intern/diff/D28253841/)

[ghstack-poisoned]
dgl-intel pushed a commit to dgl-intel/pytorch that referenced this pull request May 21, 2021
Pull Request resolved: pytorch#57851

The same as the previous PR, but for JIT functions.
ghstack-source-id: 129567069

Differential Revision: [D28253841](https://our.internmc.facebook.com/intern/diff/D28253841/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in bfdc279.

@facebook-github-bot facebook-github-bot deleted the gh/lw/165/head branch May 25, 2021 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants