Make RequestCallback collect Futures from methods, rather than providing them (by lw) · Pull Request #57847 · pytorch/pytorch

Conversation


@lw lw commented May 7, 2021

Stack from ghstack:

This is the first PR of a stack that aims to simplify RequestCallback, and I want to start by explaining my intentions.

With the introduction of CUDA support in the TensorPipe agent, we found out that other layers higher up in the stack (RRefs, dist autograd, ...) were not "ready" to support CUDA. One cause of this was that in PyTorch most CUDA state is thread-local, and the RequestCallback class (and others) might execute different steps of an operation on multiple threads. The solution to this problem is to preserve or recreate the CUDA state when switching between threads (propagating streams, or recording events and then waiting on them). If we were to manually do this everywhere it would be tedious, error-prone, and hard to maintain.
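
For concreteness, the manual approach would look roughly like the sketch below. This is my own illustration, not code from this PR, using the at::cuda::CUDAEvent and c10::cuda::CUDAStream APIs; the point is only how much bookkeeping every thread hand-off would need.

```cpp
// Rough sketch of manually carrying CUDA state across a thread hand-off
// (illustration only, not code from this PR).
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDAStream.h>

void handOff(const at::Tensor& t, c10::cuda::CUDAStream consumerStream) {
  // Producer side: record an event on the stream that produced `t`.
  at::cuda::CUDAEvent event;
  event.record(c10::cuda::getCurrentCUDAStream(t.device().index()));
  // Consumer side (possibly another thread): make its stream wait on the
  // producer's work before it touches `t`.
  event.block(consumerStream);
}
```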

In fact, we already have a primitive that can do this for us: CUDAFuture (now known as just Future). If whenever we switch threads we were to pack the values in a CUDAFuture and then unpack them on the other threads, all CUDA state would be taken care of for us.
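
Here is a minimal sketch of that pattern (again my illustration, not code from this PR), assuming the c10::ivalue::Future API roughly as it existed at the time (make_intrusive, markCompleted, then, value). The producer packs its result into a Future and the consumer attaches a continuation; a CUDA-aware Future does the stream/event bookkeeping that the previous snippet did by hand.

```cpp
// Sketch: hand a value across threads by packing it into a Future
// (illustration only; assumes the c10::ivalue::Future API of this era).
#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

using c10::IValue;
using JitFuture = c10::ivalue::Future;

c10::intrusive_ptr<JitFuture> produceOnThreadA(at::Tensor result) {
  auto fut = c10::make_intrusive<JitFuture>(c10::TensorType::get());
  // With a CUDA-aware Future, completing it also records events on the
  // producing thread's current streams (per the description above).
  fut->markCompleted(IValue(std::move(result)));
  return fut;
}

c10::intrusive_ptr<JitFuture> consumeOnAnotherThread(
    const c10::intrusive_ptr<JitFuture>& fut) {
  // The continuation may run on any thread; the future synchronizes that
  // thread's streams with the recorded events before invoking it.
  return fut->then(
      [](JitFuture& parent) { return IValue(parent.value().toTensor() + 1); },
      c10::TensorType::get());
}
```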

If our code leveraged CUDAFuture at its core, this would become the "idiomatic" thing to do, the natural behavior. Later changes would thus also be inclined to follow this pattern, and hence automatically do the right thing.

I also think that, even without these concerns about CUDA, there are benefits to using Futures more extensively. Currently RequestCallback uses a mix of Futures and callbacks. These are two tools for the same job, and mixing them creates confusion. Futures are more powerful than simple callbacks (they can be passed around, inspected, chained, waited on, ...) and thus should be preferred. They also lead to more readable code, as each step can be defined and chained in logical order, whereas callbacks must either be nested, or defined inline, or defined before and used later (thus making the code read out of order).
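
To illustrate the readability point, here is a sketch of a then()-chained pipeline. The helper functions are hypothetical placeholders, not the actual RequestCallback steps; only the chaining pattern is the point.

```cpp
// Sketch: steps chained in the order they execute (placeholders, not the
// real RequestCallback code).
#include <ATen/core/ivalue.h>

using c10::IValue;
using JitFuture = c10::ivalue::Future;

// Hypothetical processing steps.
IValue deserialize(const IValue& raw);
IValue execute(const IValue& request);
IValue serialize(const IValue& result);

c10::intrusive_ptr<JitFuture> handleRequest(
    const c10::intrusive_ptr<JitFuture>& rawRequest) {
  // With nested callbacks this pipeline would read inside-out; with then()
  // each step appears in logical order and can be inspected or waited on.
  return rawRequest
      ->then([](JitFuture& f) { return deserialize(f.value()); },
             c10::AnyType::get())
      ->then([](JitFuture& f) { return execute(f.value()); },
             c10::AnyType::get())
      ->then([](JitFuture& f) { return serialize(f.value()); },
             c10::AnyType::get());
}
```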

In short: I intend to rework RequestCallback to use Futures much more. I believe it will greatly simplify the code, help readability, and prove invaluable to support CUDA.


Until now, we had the final result future being created at the very beginning, and then passed around everywhere, so that the various methods could "fill in" its value. I think it's much lighter to instead allow each method to create or obtain its futures however it wants, and have it return them. I.e., have these futures "bubble up" from the lower layers, rather than being "pushed down" from the upper ones.

In this initial PR, I move the place where we create this "final result future", but I still keep it around. I will then, in later PRs, slowly migrate each method so that it returns a future, and in the end I will avoid creating the final result future.
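
In terms of signatures, the direction of this stack looks roughly like the following (the type and method names are illustrative, not the exact RequestCallback API):

```cpp
#include <ATen/core/ivalue.h>

class RpcRequest;  // hypothetical request type, for illustration only

// Before: the final response future is created up front by the caller and
// "pushed down"; each method fills it in as a side effect.
void processRequest(
    RpcRequest& rpc,
    const c10::intrusive_ptr<c10::ivalue::Future>& responseFuture);

// After: each method creates or obtains its own futures and returns one,
// so results "bubble up" and the caller chains them with then().
c10::intrusive_ptr<c10::ivalue::Future> processRequest(RpcRequest& rpc);
```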

Differential Revision: D28224478

@facebook-github-bot added the oncall: distributed and cla signed labels on May 7, 2021

facebook-github-bot commented May 7, 2021

💊 CI failures summary and remediations

As of commit 05fd523 (more details on the Dr. CI page):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/1)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

May 21 12:16:05 torch_xla/csrc/aten_xla_type.cp... match any declaration in 'torch_xla::AtenXlaType'
May 21 12:16:05 torch_xla/csrc/aten_xla_type.cpp:1238:25: error: out-of-line definition of 'div' does not match any declaration in 'torch_xla::AtenXlaType'
May 21 12:16:05 at::Tensor AtenXlaType::div(const at::Tensor& self, const at::Tensor& other,
May 21 12:16:05                         ^~~
May 21 12:16:05 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.h:257:74: note: type of 3rd parameter of member declaration does not match definition ('optional<c10::string_view>' vs 'optional<std::string>')
May 21 12:16:05 static at::Tensor div(const at::Tensor & self, const at::Tensor & other, c10::optional<c10::string_view> rounding_mode);
May 21 12:16:05                                                                          ^
May 21 12:16:05 torch_xla/csrc/aten_xla_type.cpp:1257:26: error: out-of-line definition of 'div_' does not match any declaration in 'torch_xla::AtenXlaType'
May 21 12:16:05 at::Tensor& AtenXlaType::div_(at::Tensor& self, const at::Tensor& other,
May 21 12:16:05                          ^~~~
May 21 12:16:05 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.h:124:71: note: type of 3rd parameter of member declaration does not match definition ('optional<c10::string_view>' vs 'optional<std::string>')
May 21 12:16:05 static at::Tensor & div_(at::Tensor & self, const at::Tensor & other, c10::optional<c10::string_view> rounding_mode);
May 21 12:16:05                                                                       ^
May 21 12:16:08 2 errors generated.

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_build is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_bionic_cuda10_2_cudnn7_py3_9_gcc7_test2 (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

May 21 15:39:35 unknown file: Failure
May 21 15:39:30 [       OK ] NVFuserTest.FusionBCastConcretizeRfactor_CUDA (0 ms)
May 21 15:39:30 [ RUN      ] NVFuserTest.FusionProveIdEqBasic_CUDA
May 21 15:39:30 [       OK ] NVFuserTest.FusionProveIdEqBasic_CUDA (1 ms)
May 21 15:39:30 [ RUN      ] NVFuserTest.FusionProveIdEqRfactor_CUDA
May 21 15:39:30 [       OK ] NVFuserTest.FusionProveIdEqRfactor_CUDA (0 ms)
May 21 15:39:30 [ RUN      ] NVFuserTest.FusionScalarInputs_CUDA
May 21 15:39:30 [       OK ] NVFuserTest.FusionScalarInputs_CUDA (300 ms)
May 21 15:39:30 [ RUN      ] NVFuserTest.FusionLoopUnroll_CUDA
May 21 15:39:31 [       OK ] NVFuserTest.FusionLoopUnroll_CUDA (366 ms)
May 21 15:39:31 [ RUN      ] NVFuserTest.FusionUnaryOps_CUDA
May 21 15:39:35 unknown file: Failure
May 21 15:39:35 C++ exception with description "
May 21 15:39:35 Op Type: -- round -- had a mismatch.


lw added 2 commits May 17, 2021 03:53
…than providing them"

This is the first PR of a stack that aims to simplify RequestCallback, and I want to start by explaining my intentions.

With the introduction of CUDA support in the TensorPipe agent, we found out that other layers higher up in the stack (RRefs, dist autograd, ...) were not "ready" to support CUDA. One cause of this was that in PyTorch most CUDA state is thread-local, and the RequestCallback class (and others) might execute different steps of an operation on multiple threads. The solution to this problem is to preserve or recreate the CUDA state when switching between threads (propagating streams, or recording events and then wait on them). If we were to manually do this everywhere it would be tedious, error-prone, and hard to maintain.

In fact, we already have a primitive that can do this for us: CUDAFuture (now known as just Future). If whenever we switch threads we were to pack the values in a CUDAFuture and then unpack them on the other threads, all CUDA stuff would be taken care of for us.

If our code leveraged CUDAFuture at its core, thing would become the "idiomatic" thing to do, the natural behavior. Future changes would thus also be inclined to follow this pattern, hence automatically doing the right thing.

I also think that, even without these concerns about CUDA, there are benefits to use Futures more extensively. Currently RequestCallback uses a mix of Futures and callbacks. These are two tools for the same job, and thus mixing them creates confusion. Futures are more powerful than simple callbacks (they can be passed around, inspected, chained, waited on, ...) and thus should be preferred. They also lead to more readable code, as each step can be defined and chained in logical order, whereas callbacks must either be nested, or defined inline, or defined before and used later (thus making the code out-of-order).

In short: I intend to rework RequestCallback to use Futures much more. I believe it will greatly simplify the code, help readability, and prove invaluable to support CUDA.

---

Until now, we had the final result future being created at the very beginning, and then passed around everywhere, so that the various method could "fill in" its value. I think it's much lighter to instead allow each method to create or obtain its futures however it wants, and have it return them. I.e., have these futures "bubble up" from the lower layers, rather than them being "pushed down" from the upper ones.

In this initial PR, I move the place where we create this "final result future", but I still keep it around. I will then, in later PRs, slowly migrate each method so that it returns a future, and in the end I will avoid creating the final result future.

Differential Revision: [D28224478](https://our.internmc.facebook.com/intern/diff/D28224478/)

[ghstack-poisoned]
…than providing them"

This is the first PR of a stack that aims to simplify RequestCallback, and I want to start by explaining my intentions.

With the introduction of CUDA support in the TensorPipe agent, we found out that other layers higher up in the stack (RRefs, dist autograd, ...) were not "ready" to support CUDA. One cause of this was that in PyTorch most CUDA state is thread-local, and the RequestCallback class (and others) might execute different steps of an operation on multiple threads. The solution to this problem is to preserve or recreate the CUDA state when switching between threads (propagating streams, or recording events and then wait on them). If we were to manually do this everywhere it would be tedious, error-prone, and hard to maintain.

In fact, we already have a primitive that can do this for us: CUDAFuture (now known as just Future). If whenever we switch threads we were to pack the values in a CUDAFuture and then unpack them on the other threads, all CUDA stuff would be taken care of for us.

If our code leveraged CUDAFuture at its core, thing would become the "idiomatic" thing to do, the natural behavior. Future changes would thus also be inclined to follow this pattern, hence automatically doing the right thing.

I also think that, even without these concerns about CUDA, there are benefits to use Futures more extensively. Currently RequestCallback uses a mix of Futures and callbacks. These are two tools for the same job, and thus mixing them creates confusion. Futures are more powerful than simple callbacks (they can be passed around, inspected, chained, waited on, ...) and thus should be preferred. They also lead to more readable code, as each step can be defined and chained in logical order, whereas callbacks must either be nested, or defined inline, or defined before and used later (thus making the code out-of-order).

In short: I intend to rework RequestCallback to use Futures much more. I believe it will greatly simplify the code, help readability, and prove invaluable to support CUDA.

---

Until now, we had the final result future being created at the very beginning, and then passed around everywhere, so that the various method could "fill in" its value. I think it's much lighter to instead allow each method to create or obtain its futures however it wants, and have it return them. I.e., have these futures "bubble up" from the lower layers, rather than them being "pushed down" from the upper ones.

In this initial PR, I move the place where we create this "final result future", but I still keep it around. I will then, in later PRs, slowly migrate each method so that it returns a future, and in the end I will avoid creating the final result future.

Differential Revision: [D28224478](https://our.internmc.facebook.com/intern/diff/D28224478/)

[ghstack-poisoned]

@mrshenli mrshenli left a comment


LGTM!

(can be done on the top commit of this stack) Shall we run some micro-benchmarks to understand the perf differences (if there are any) after this stack?

if (wrappedRpcResponseFuture.hasError()) {
  // Propagate error to responseFuture if we had one.
  responseFuture->setError(wrappedRpcResponseFuture.exception_ptr());
  std::rethrow_exception(wrappedRpcResponseFuture.exception_ptr());
Contributor


IIUC, Future::then wraps this callback, captures this exception, and sets the error on the returned child Future?

Contributor Author


Correct: the beauty of then is that we can almost forget about the error path: then will capture and store any exception for us automatically, and it's impossible to mess it up. In addition, calling value on the parent future will automatically rethrow the exception if any. Hence consider this code: childFut.then([](Future& parentFut) { return doSth(parentFut.value()); });. This code does correct error propagation, without even thinking about it.
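
Expanding that example slightly (my illustration, assuming the then()/value() semantics described in this thread): an error stored on the parent future, or thrown inside the callback, becomes the child future's error without any explicit handling.

```cpp
#include <ATen/core/ivalue.h>

using c10::IValue;
using JitFuture = c10::ivalue::Future;

c10::intrusive_ptr<JitFuture> doubled(
    const c10::intrusive_ptr<JitFuture>& parent) {
  return parent->then(
      [](JitFuture& p) {
        // If the parent holds an error, value() rethrows it here; then()
        // catches the exception and stores it on the child future, so no
        // explicit setError/rethrow is needed on the happy path.
        return IValue(p.value().toInt() * 2);
      },
      c10::IntType::get());
}
```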


lw commented May 18, 2021

> Shall we run some micro-benchmarks to understand the perf differences (if there are any) after this stack?

Do you have any benchmark in mind I should run?

lw added 3 commits May 18, 2021 02:35
…than providing them"

This is the first PR of a stack that aims to simplify RequestCallback, and I want to start by explaining my intentions.

With the introduction of CUDA support in the TensorPipe agent, we found out that other layers higher up in the stack (RRefs, dist autograd, ...) were not "ready" to support CUDA. One cause of this was that in PyTorch most CUDA state is thread-local, and the RequestCallback class (and others) might execute different steps of an operation on multiple threads. The solution to this problem is to preserve or recreate the CUDA state when switching between threads (propagating streams, or recording events and then wait on them). If we were to manually do this everywhere it would be tedious, error-prone, and hard to maintain.

In fact, we already have a primitive that can do this for us: CUDAFuture (now known as just Future). If whenever we switch threads we were to pack the values in a CUDAFuture and then unpack them on the other threads, all CUDA stuff would be taken care of for us.

If our code leveraged CUDAFuture at its core, thing would become the "idiomatic" thing to do, the natural behavior. Future changes would thus also be inclined to follow this pattern, hence automatically doing the right thing.

I also think that, even without these concerns about CUDA, there are benefits to use Futures more extensively. Currently RequestCallback uses a mix of Futures and callbacks. These are two tools for the same job, and thus mixing them creates confusion. Futures are more powerful than simple callbacks (they can be passed around, inspected, chained, waited on, ...) and thus should be preferred. They also lead to more readable code, as each step can be defined and chained in logical order, whereas callbacks must either be nested, or defined inline, or defined before and used later (thus making the code out-of-order).

In short: I intend to rework RequestCallback to use Futures much more. I believe it will greatly simplify the code, help readability, and prove invaluable to support CUDA.

---

Until now, we had the final result future being created at the very beginning, and then passed around everywhere, so that the various method could "fill in" its value. I think it's much lighter to instead allow each method to create or obtain its futures however it wants, and have it return them. I.e., have these futures "bubble up" from the lower layers, rather than them being "pushed down" from the upper ones.

In this initial PR, I move the place where we create this "final result future", but I still keep it around. I will then, in later PRs, slowly migrate each method so that it returns a future, and in the end I will avoid creating the final result future.

Differential Revision: [D28224478](https://our.internmc.facebook.com/intern/diff/D28224478/)

[ghstack-poisoned]
…than providing them"

This is the first PR of a stack that aims to simplify RequestCallback, and I want to start by explaining my intentions.

With the introduction of CUDA support in the TensorPipe agent, we found out that other layers higher up in the stack (RRefs, dist autograd, ...) were not "ready" to support CUDA. One cause of this was that in PyTorch most CUDA state is thread-local, and the RequestCallback class (and others) might execute different steps of an operation on multiple threads. The solution to this problem is to preserve or recreate the CUDA state when switching between threads (propagating streams, or recording events and then wait on them). If we were to manually do this everywhere it would be tedious, error-prone, and hard to maintain.

In fact, we already have a primitive that can do this for us: CUDAFuture (now known as just Future). If whenever we switch threads we were to pack the values in a CUDAFuture and then unpack them on the other threads, all CUDA stuff would be taken care of for us.

If our code leveraged CUDAFuture at its core, thing would become the "idiomatic" thing to do, the natural behavior. Future changes would thus also be inclined to follow this pattern, hence automatically doing the right thing.

I also think that, even without these concerns about CUDA, there are benefits to use Futures more extensively. Currently RequestCallback uses a mix of Futures and callbacks. These are two tools for the same job, and thus mixing them creates confusion. Futures are more powerful than simple callbacks (they can be passed around, inspected, chained, waited on, ...) and thus should be preferred. They also lead to more readable code, as each step can be defined and chained in logical order, whereas callbacks must either be nested, or defined inline, or defined before and used later (thus making the code out-of-order).

In short: I intend to rework RequestCallback to use Futures much more. I believe it will greatly simplify the code, help readability, and prove invaluable to support CUDA.

---

Until now, we had the final result future being created at the very beginning, and then passed around everywhere, so that the various method could "fill in" its value. I think it's much lighter to instead allow each method to create or obtain its futures however it wants, and have it return them. I.e., have these futures "bubble up" from the lower layers, rather than them being "pushed down" from the upper ones.

In this initial PR, I move the place where we create this "final result future", but I still keep it around. I will then, in later PRs, slowly migrate each method so that it returns a future, and in the end I will avoid creating the final result future.

Differential Revision: [D28224478](https://our.internmc.facebook.com/intern/diff/D28224478/)

[ghstack-poisoned]
…than providing them"

This is the first PR of a stack that aims to simplify RequestCallback, and I want to start by explaining my intentions.

With the introduction of CUDA support in the TensorPipe agent, we found out that other layers higher up in the stack (RRefs, dist autograd, ...) were not "ready" to support CUDA. One cause of this was that in PyTorch most CUDA state is thread-local, and the RequestCallback class (and others) might execute different steps of an operation on multiple threads. The solution to this problem is to preserve or recreate the CUDA state when switching between threads (propagating streams, or recording events and then wait on them). If we were to manually do this everywhere it would be tedious, error-prone, and hard to maintain.

In fact, we already have a primitive that can do this for us: CUDAFuture (now known as just Future). If whenever we switch threads we were to pack the values in a CUDAFuture and then unpack them on the other threads, all CUDA stuff would be taken care of for us.

If our code leveraged CUDAFuture at its core, thing would become the "idiomatic" thing to do, the natural behavior. Future changes would thus also be inclined to follow this pattern, hence automatically doing the right thing.

I also think that, even without these concerns about CUDA, there are benefits to use Futures more extensively. Currently RequestCallback uses a mix of Futures and callbacks. These are two tools for the same job, and thus mixing them creates confusion. Futures are more powerful than simple callbacks (they can be passed around, inspected, chained, waited on, ...) and thus should be preferred. They also lead to more readable code, as each step can be defined and chained in logical order, whereas callbacks must either be nested, or defined inline, or defined before and used later (thus making the code out-of-order).

In short: I intend to rework RequestCallback to use Futures much more. I believe it will greatly simplify the code, help readability, and prove invaluable to support CUDA.

---

Until now, we had the final result future being created at the very beginning, and then passed around everywhere, so that the various method could "fill in" its value. I think it's much lighter to instead allow each method to create or obtain its futures however it wants, and have it return them. I.e., have these futures "bubble up" from the lower layers, rather than them being "pushed down" from the upper ones.

In this initial PR, I move the place where we create this "final result future", but I still keep it around. I will then, in later PRs, slowly migrate each method so that it returns a future, and in the end I will avoid creating the final result future.

Differential Revision: [D28224478](https://our.internmc.facebook.com/intern/diff/D28224478/)

[ghstack-poisoned]
…than providing them"

This is the first PR of a stack that aims to simplify RequestCallback, and I want to start by explaining my intentions.

With the introduction of CUDA support in the TensorPipe agent, we found out that other layers higher up in the stack (RRefs, dist autograd, ...) were not "ready" to support CUDA. One cause of this was that in PyTorch most CUDA state is thread-local, and the RequestCallback class (and others) might execute different steps of an operation on multiple threads. The solution to this problem is to preserve or recreate the CUDA state when switching between threads (propagating streams, or recording events and then wait on them). If we were to manually do this everywhere it would be tedious, error-prone, and hard to maintain.

In fact, we already have a primitive that can do this for us: CUDAFuture (now known as just Future). If whenever we switch threads we were to pack the values in a CUDAFuture and then unpack them on the other threads, all CUDA stuff would be taken care of for us.

If our code leveraged CUDAFuture at its core, thing would become the "idiomatic" thing to do, the natural behavior. Future changes would thus also be inclined to follow this pattern, hence automatically doing the right thing.

I also think that, even without these concerns about CUDA, there are benefits to use Futures more extensively. Currently RequestCallback uses a mix of Futures and callbacks. These are two tools for the same job, and thus mixing them creates confusion. Futures are more powerful than simple callbacks (they can be passed around, inspected, chained, waited on, ...) and thus should be preferred. They also lead to more readable code, as each step can be defined and chained in logical order, whereas callbacks must either be nested, or defined inline, or defined before and used later (thus making the code out-of-order).

In short: I intend to rework RequestCallback to use Futures much more. I believe it will greatly simplify the code, help readability, and prove invaluable to support CUDA.

---

Until now, we had the final result future being created at the very beginning, and then passed around everywhere, so that the various method could "fill in" its value. I think it's much lighter to instead allow each method to create or obtain its futures however it wants, and have it return them. I.e., have these futures "bubble up" from the lower layers, rather than them being "pushed down" from the upper ones.

In this initial PR, I move the place where we create this "final result future", but I still keep it around. I will then, in later PRs, slowly migrate each method so that it returns a future, and in the end I will avoid creating the final result future.

Differential Revision: [D28224478](https://our.internmc.facebook.com/intern/diff/D28224478/)

[ghstack-poisoned]
dgl-intel pushed a commit to dgl-intel/pytorch that referenced this pull request May 21, 2021
@facebook-github-bot

This pull request has been merged in f6844ea.

@facebook-github-bot deleted the gh/lw/161/head branch on May 25, 2021, 14:16

Labels

cla signed · Merged · oncall: distributed
