Add generic join unit tests by awgu · Pull Request #61786 · pytorch/pytorch · GitHub

@awgu
Collaborator

@awgu awgu commented Jul 16, 2021

Stack from ghstack:

This adds unit tests for the generic join context manager.

gpurun python test/distributed/algorithms/test_join.py

Differential Revision: D29746646

[ghstack-poisoned]
This was referenced Jul 16, 2021
@facebook-github-bot facebook-github-bot added the "oncall: distributed" and "cla signed" labels Jul 16, 2021
@facebook-github-bot
Contributor

facebook-github-bot commented Jul 16, 2021

💊 CI failures summary and remediations

As of commit a7e4496 (more details on the Dr. CI page and at hud.pytorch.org/pr/61786):


  • 6/6 failures possibly* introduced in this PR
    • 1/6 non-scanned failure(s)

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/3)

Step: "Test"

Jul 20 17:21:20 test_remote_message_script_de...yUniqueId(created_on=0, local_id=0) to be created.
Jul 20 17:20:51 frame #12: std::__1::__function::__func<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork>, std::__1::allocator<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork> >, void ()>::operator()() + 42 (0x11bb6db7a in libtorch_cpu.dylib)
Jul 20 17:20:51 frame #13: c10::ThreadPool::main_loop(unsigned long) + 569 (0x10cb64369 in libc10.dylib)
Jul 20 17:20:51 frame #14: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, c10::ThreadPool::ThreadPool(int, int, std::__1::function<void ()>)::$_0> >(void*) + 67 (0x10cb64a13 in libc10.dylib)
Jul 20 17:20:51 frame #15: _pthread_start + 148 (0x7fff6b376109 in libsystem_pthread.dylib)
Jul 20 17:20:51 frame #16: thread_start + 15 (0x7fff6b371b8b in libsystem_pthread.dylib)
Jul 20 17:20:51 
Jul 20 17:20:51 ok (4.556s)
Jul 20 17:21:00   test_remote_message_dropped_pickle (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.695s)
Jul 20 17:21:09   test_remote_message_dropped_pickle_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.687s)
Jul 20 17:21:16   test_remote_message_script_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (7.440s)
Jul 20 17:21:20   test_remote_message_script_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:555] Received error while processing request type 260: falseINTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rref_context.cpp":390, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Jul 20 17:21:20 Exception raised from getOwnerRRef at ../torch/csrc/distributed/rpc/rref_context.cpp:390 (most recent call first):
Jul 20 17:21:20 frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x10f3526b2 in libc10.dylib)
Jul 20 17:21:20 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 106 (0x10f350e2a in libc10.dylib)
Jul 20 17:21:20 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 64 (0x10f351060 in libc10.dylib)
Jul 20 17:21:20 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 1711 (0x11588207f in libtorch_cpu.dylib)
Jul 20 17:21:20 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 86 (0x11586c8d6 in libtorch_cpu.dylib)
Jul 20 17:21:20 frame #5: torch::distributed::rpc::RequestCallbackImpl::processScriptRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 376 (0x1119237a8 in libtorch_python.dylib)
Jul 20 17:21:20 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 437 (0x11586b525 in libtorch_cpu.dylib)
Jul 20 17:21:20 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 74 (0x11192451a in libtorch_python.dylib)
Jul 20 17:21:20 frame #8: c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> > c10::ivalue::Future::thenAsync<torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1>(torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1, std::__1::shared_ptr<c10::Type>)::'lambda'(c10::ivalue::Future&)::operator()(c10::ivalue::Future&) + 223 (0x1158731ef in libtorch_cpu.dylib)

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (2/3)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/binary-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/binary-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/commands.yml
Auto-merging .circleci/verbatim-sources/commands.yml
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (3/3)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/binary-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/binary-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/commands.yml
Auto-merging .circleci/verbatim-sources/commands.yml
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1


2 failures not recognized by patterns:

| Job | Step |
| --- | --- |
| GitHub Actions Windows CI (pytorch-win-vs2019-cpu-py3) / test (default, 1, 2, windows.4xlarge) | Install Visual Studio 2019 toolchain |
| GitHub Actions Test tools / test | Install dependencies |

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

awgu pushed a commit that referenced this pull request Jul 16, 2021
ghstack-source-id: 33abae0
Pull Request resolved: #61786
@awgu
Collaborator Author

awgu commented Jul 16, 2021

@andwgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@mrshenli mrshenli left a comment

Do you need to add this new test file to https://github.com/pytorch/pytorch/blob/master/test/run_test.py to enable it?

awgu pushed a commit that referenced this pull request Jul 19, 2021
ghstack-source-id: 8947bcf
Pull Request resolved: #61786
@awgu
Collaborator Author

awgu commented Jul 19, 2021

@andwgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@mrshenli mrshenli left a comment

LGTM! I left some minor comments. Please make sure tests are launched in CI and tests pass before landing.

When imported, please also add this to internal tests.

inputs = self.construct_uneven_inputs(BASE_NUM_INPUTS, OFFSET)
with _Join([allreducer], run_post_hooks=True):
    for _ in inputs:
        allreducer()
Contributor

Am I correct to assume that the main hook would still run in this case? Compared to test_single_joinable, the only difference seems to be that test_single_joinable_post_hooks does not check the main hook results. If so, would it make sense to remove this test, since it behaves exactly like test_single_joinable and differs only in its asserts?

Collaborator Author

I have now made test_single_joinable_post_hooks() not run the main hooks.
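
To make the main hook versus post-hook distinction in this exchange concrete, here is a minimal single-process sketch. It is not the PyTorch implementation: `ToyJoinable` and `simulate_join` are hypothetical names, ranks are simulated sequentially, and an all-reduce is modeled as a plain sum. It assumes the usual join semantics, where a rank that has run out of inputs runs a main hook once per remaining round to shadow the collective, and every rank runs a post-hook once at the end. With even inputs no rank joins early, so only the post-hooks run; whether the updated test avoids main hooks this way is not stated in the thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyJoinable:
    """Hypothetical stand-in for one rank's joinable; not the PyTorch class."""
    rank: int
    num_inputs: int
    main_hook_calls: int = 0
    post_hook_calls: int = 0
    was_last_joiner: bool = False

    def main_hook(self) -> int:
        # Called once per round after this rank has run out of inputs, so the
        # still-active ranks' collective gets a matching (zero) contribution.
        self.main_hook_calls += 1
        return 0

    def post_hook(self, is_last_joiner: bool) -> None:
        # Called once, after every rank has run out of inputs.
        self.post_hook_calls += 1
        self.was_last_joiner = is_last_joiner


def simulate_join(ranks: List[ToyJoinable]) -> List[int]:
    """Simulate the rounds sequentially and return the per-round totals."""
    max_inputs = max(r.num_inputs for r in ranks)
    totals = []
    for step in range(max_inputs):
        total = 0
        for r in ranks:
            if step < r.num_inputs:
                total += 1              # active rank contributes a real input
            else:
                total += r.main_hook()  # joined rank shadows the collective
        totals.append(total)
    for r in ranks:
        # Simplification: every rank with the maximum input count is flagged.
        r.post_hook(is_last_joiner=(r.num_inputs == max_inputs))
    return totals


uneven = [ToyJoinable(rank=0, num_inputs=2), ToyJoinable(rank=1, num_inputs=3)]
even = [ToyJoinable(rank=0, num_inputs=3), ToyJoinable(rank=1, num_inputs=3)]
uneven_totals = simulate_join(uneven)  # rank 0 runs its main hook in the last round
even_totals = simulate_join(even)      # no main hook runs; post-hooks still do
```

In the uneven case the early-joining rank records one main hook call, while in the even case the main hook counters stay at zero and each rank still records exactly one post-hook call.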

@require_n_gpus_for_nccl_backend(
    WORLD_SIZE, BACKEND
)
def test_single_joinable(self):
Contributor

Does it make sense to dedup this test code with test_single_joinable_main_hooks?

Collaborator Author

Yup, I factored the core logic out into a single _test_join_base() and now use that as the base for all tests.
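
The shape of that refactoring can be sketched with plain `unittest`. This is an illustration only: `run_rounds`, `ToyJoinTest`, and `_test_base` are hypothetical names, and the workload is a stand-in rather than the logic in the PR's `_test_join_base()`. The point is that each public test only supplies a configuration and an expected result, while the shared helper holds the assertions.

```python
import unittest


def run_rounds(num_inputs_per_rank):
    """Hypothetical workload: count how many ranks are active in each round."""
    rounds = max(num_inputs_per_rank)
    return [
        sum(1 for n in num_inputs_per_rank if step < n)
        for step in range(rounds)
    ]


class ToyJoinTest(unittest.TestCase):
    def _test_base(self, num_inputs_per_rank, expected_totals):
        # Shared core logic: run the workload once and compare the result.
        self.assertEqual(run_rounds(num_inputs_per_rank), expected_totals)

    def test_even_inputs(self):
        self._test_base([2, 2], [2, 2])

    def test_uneven_inputs(self):
        self._test_base([1, 3], [2, 1, 1])
```

Adding a new scenario then means adding one short method, which keeps the per-test differences visible in the arguments instead of in copied bodies.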

@require_n_gpus_for_nccl_backend(
    WORLD_SIZE, BACKEND
)
def test_multiple_joinables(self):
Contributor

It looks like this can also be deduplicated with test_single_joinable_main_hooks by passing a list of hooks and expected results to a common function.

allreducer = AllReducer(self.device, self.process_group)
inputs = self.construct_uneven_inputs(BASE_NUM_INPUTS, OFFSET)
allreduce_total = 0
with self.assertRaises(RuntimeError):
Contributor

Shall we check that the error message is the expected one?

Collaborator Author

Done.
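
One standard way to check the message as well as the exception type is `unittest`'s `assertRaisesRegex`. The sketch below is illustrative: `check_inputs` and its message text are invented for this example and are not taken from the PR, whose actual error strings are not shown in the thread.

```python
import unittest


def check_inputs(num_inputs):
    # Hypothetical function under test; the message text is invented here.
    if num_inputs < 0:
        raise RuntimeError(
            f"Expected a non-negative input count but got {num_inputs}"
        )
    return num_inputs


class ErrorMessageTest(unittest.TestCase):
    def test_message(self):
        # assertRaisesRegex checks the exception type and also searches the
        # exception message for the given regular expression.
        with self.assertRaisesRegex(
            RuntimeError, r"non-negative input count but got -1"
        ):
            check_inputs(-1)
```

Compared with a bare `assertRaises(RuntimeError)`, this fails if an unrelated `RuntimeError` is raised, so the test pins down which failure path was exercised.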

]
inputs = self.construct_uneven_inputs(BASE_NUM_INPUTS, OFFSET)
allreduce_total = 0
with self.assertRaises(RuntimeError):
Contributor

Ditto: shall we check that the error message is the expected one here as well?

Collaborator Author

Done.

@awgu
Collaborator Author

awgu commented Jul 19, 2021

@andwgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

awgu pushed a commit that referenced this pull request Jul 19, 2021
ghstack-source-id: b4395d6
Pull Request resolved: #61786
@awgu
Collaborator Author

awgu commented Jul 19, 2021

@andwgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

awgu pushed a commit that referenced this pull request Jul 20, 2021
ghstack-source-id: 01f9b28
Pull Request resolved: #61786
@awgu
Collaborator Author

awgu commented Jul 20, 2021

@andwgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@andwgu merged this pull request in c2cc6a9.

@facebook-github-bot facebook-github-bot deleted the gh/andwgu/7/head branch July 24, 2021 14:17
Labels

cla signed, Merged, oncall: distributed

3 participants