Minor documentation fixes by awgu · Pull Request #61785 · pytorch/pytorch · GitHub

awgu commented Jul 16, 2021

Stack from ghstack:

Differential Revision: D29746648

[ghstack-poisoned]

facebook-github-bot commented Jul 16, 2021

💊 CI failures summary and remediations

As of commit f8b0221 (more details on the Dr. CI page and at hud.pytorch.org/pr/61785):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/1)

Step: "Test"

Jul 16 23:35:54 test_remote_message_script_de...yUniqueId(created_on=0, local_id=0) to be created.
Jul 16 23:35:26 frame #12: std::__1::__function::__func<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork>, std::__1::allocator<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork> >, void ()>::operator()() + 42 (0x111b58b7a in libtorch_cpu.dylib)
Jul 16 23:35:26 frame #13: c10::ThreadPool::main_loop(unsigned long) + 569 (0x10c39f369 in libc10.dylib)
Jul 16 23:35:26 frame #14: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, c10::ThreadPool::ThreadPool(int, int, std::__1::function<void ()>)::$_0> >(void*) + 67 (0x10c39fa13 in libc10.dylib)
Jul 16 23:35:26 frame #15: _pthread_start + 148 (0x7fff70eb2109 in libsystem_pthread.dylib)
Jul 16 23:35:26 frame #16: thread_start + 15 (0x7fff70eadb8b in libsystem_pthread.dylib)
Jul 16 23:35:26 
Jul 16 23:35:26 ok (4.051s)
Jul 16 23:35:35   test_remote_message_dropped_pickle (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.269s)
Jul 16 23:35:43   test_remote_message_dropped_pickle_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.250s)
Jul 16 23:35:50   test_remote_message_script_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (7.097s)
Jul 16 23:35:54   test_remote_message_script_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:555] Received error while processing request type 260: falseINTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rref_context.cpp":390, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Jul 16 23:35:54 Exception raised from getOwnerRRef at ../torch/csrc/distributed/rpc/rref_context.cpp:390 (most recent call first):
Jul 16 23:35:54 frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x1173bf6b2 in libc10.dylib)
Jul 16 23:35:54 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 106 (0x1173bde2a in libc10.dylib)
Jul 16 23:35:54 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 64 (0x1173be060 in libc10.dylib)
Jul 16 23:35:54 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 1711 (0x11c98407f in libtorch_cpu.dylib)
Jul 16 23:35:54 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 86 (0x11c96e8d6 in libtorch_cpu.dylib)
Jul 16 23:35:54 frame #5: torch::distributed::rpc::RequestCallbackImpl::processScriptRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 376 (0x118a257a8 in libtorch_python.dylib)
Jul 16 23:35:54 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 437 (0x11c96d525 in libtorch_cpu.dylib)
Jul 16 23:35:54 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 74 (0x118a2651a in libtorch_python.dylib)
Jul 16 23:35:54 frame #8: c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> > c10::ivalue::Future::thenAsync<torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1>(torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1, std::__1::shared_ptr<c10::Type>)::'lambda'(c10::ivalue::Future&)::operator()(c10::ivalue::Future&) + 223 (0x11c9751ef in libtorch_cpu.dylib)

This comment was automatically generated by Dr. CI.

rohan-varma left a comment


Thanks for enhancing the docs!

class _Joinable(ABC):
    r"""
    This defines an abstract base class for joinable classes. A joinable class
    (inheriting from :class:`_Joinable`) should implement a private

Is _Joinable ever intended to be used by non-PyTorch developers? If so, we should eventually remove the _ prefix and make it a public API.

Collaborator Author

Yup, I think it is intended to be used by non-PyTorch developers eventually. Everything (meaning _Joinable, _Join, _JoinHook, and also DistributedDataParallel._join_hook() and ZeroRedundancyOptimizer._join_hook()) has a leading underscore for now. I am not sure when the right time to remove it would be, and I was originally waiting for the feature to be approved. I also think it would be good to discuss exactly which components will be made public and which will be kept private.

awgu commented Jul 16, 2021

@andwgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@andwgu merged this pull request in 3e3acf8.

facebook-github-bot deleted the gh/andwgu/6/head branch on July 23, 2021 at 14:17

Labels: cla signed, Merged, oncall: distributed