KEMBAR78
Fix logging in socket.cpp by ppwwyyxx · Pull Request #137745 · pytorch/pytorch · GitHub
Skip to content

Conversation

@ppwwyyxx
Copy link
Collaborator

@ppwwyyxx ppwwyyxx commented Oct 10, 2024

Formatter shall avoid throwing exceptions as much as possible.

Fixes #128673 (comment)

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137745

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 20a2072 with merge base ad38bad (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Oct 10, 2024
@ppwwyyxx
Copy link
Collaborator Author

Relevant crash:

[W1010 23:30:01.169683897 TCPStore.cpp:358] [c10d] TCP client failed to connect/validate to host 33.193.13.94:2222 - retrying (try=2, timeout=900000ms, delay=1307ms): failed to format addr, unknown family=0
Exception raised from format at ../torch/csrc/distributed/c10d/socket.cpp:239 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::string> const> (), c10::SetStackTraceFetcher(std::function<std::string ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
#6 decltype(auto) fmt::v11::formatter<addrinfo, char, void>::format<fmt::v11::context>(addrinfo const&, fmt::v11::context&) const from :0
#7 void fmt::v11::detail::value<fmt::v11::context>::format_custom_arg<addrinfo, fmt::v11::formatter<addrinfo, char, void> >(void*, fmt::v11::basic_format_parse_context<char>&, fmt::v11::context&) from :0
#8 char const* fmt::v11::detail::parse_replacement_field<char, fmt::v11::detail::vformat_to<char>(fmt::v11::detail::buffer<char>&, fmt::v11::basic_string_view<char>, fmt::v11::detail::vformat_args<char>::type, fmt::v11::detail::locale_ref)::format_handler&>(char const*, char const*, fmt::v11::detail::vformat_to<char>(fmt::v11::detail::buffer<char>&, fmt::v11::basic_string_view<char>, fmt::v11::detail::vformat_args<char>::type, fmt::v11::detail::locale_ref)::format_handler&) from :0
#9 void fmt::v11::detail::vformat_to<char>(fmt::v11::detail::buffer<char>&, fmt::v11::basic_string_view<char>, fmt::v11::detail::vformat_args<char>::type, fmt::v11::detail::locale_ref) from :0
#10 std::remove_cv<std::remove_reference<fmt::v11::basic_appender<char> >::type>::type fmt::v11::vformat_to<fmt::v11::basic_appender<char>, 0>(fmt::v11::basic_appender<char>&&, fmt::v11::basic_string_view<char>, fmt::v11::basic_format_args<fmt::v11::context>) from :0
#11 decltype(auto) fmt::v11::formatter<c10d::detail::SocketImpl, char, void>::format<fmt::v11::context>(c10d::detail::SocketImpl const&, fmt::v11::context&) const from :0
#12 void fmt::v11::detail::value<fmt::v11::context>::format_custom_arg<c10d::detail::SocketImpl, fmt::v11::formatter<c10d::detail::SocketImpl, char, void> >(void*, fmt::v11::basic_format_parse_context<char>&, fmt::v11::context&) from :0
#13 char const* fmt::v11::detail::parse_replacement_field<char, fmt::v11::detail::vformat_to<char>(fmt::v11::detail::buffer<char>&, fmt::v11::basic_string_view<char>, fmt::v11::detail::vformat_args<char>::type, fmt::v11::detail::locale_ref)::format_handler&>(char const*, char const*, fmt::v11::detail::vformat_to<char>(fmt::v11::detail::buffer<char>&, fmt::v11::basic_string_view<char>, fmt::v11::detail::vformat_args<char>::type, fmt::v11::detail::locale_ref)::format_handler&) from :0
#14 void fmt::v11::detail::vformat_to<char>(fmt::v11::detail::buffer<char>&, fmt::v11::basic_string_view<char>, fmt::v11::detail::vformat_args<char>::type, fmt::v11::detail::locale_ref) from :0
#15 fmt::v11::vformat(fmt::v11::basic_string_view<char>, fmt::v11::basic_format_args<fmt::v11::context>) from :0
#16 c10d::detail::(anonymous namespace)::SocketConnectOp::tryConnect(int) from socket.cpp:0
#17 c10d::detail::(anonymous namespace)::SocketConnectOp::run() [clone .constprop.0] from socket.cpp:0
#18 c10d::detail::Socket::connect(std::string const&, unsigned short, c10d::detail::SocketOptions const&) from :0
#19 c10d::detail::TCPClient::connect(c10d::detail::SocketAddress const&, c10d::TCPStoreOptions const&, std::shared_ptr<c10d::Backoff>) from :0
#20 c10d::TCPStore::TCPStore(std::string, c10d::TCPStoreOptions const&) from ??:0
#21 pybind11::cpp_function::initialize<pybind11::detail::initimpl::factory<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::string const&, unsigned short, std::optional<int>, bool, std::chrono::duration<long, std::ratio<1l, 1000l> >, bool, bool, std::optional<int>, bool)#38}, pybind11::detail::void_type (*)(), c10::intrusive_ptr<c10d::TCPStore, c10::detail::intrusive_target_default_null_type<c10d::TCPStore> > (std::string const&, unsigned short, std::optional<int>, bool, std::chrono::duration<long, std::ratio<1l, 1000l> >, bool, bool, std::optional<int>, bool), pybind11::detail::void_type ()>::execute<pybind11::class_<c10d::TCPStore, c10::intrusive_ptr<c10d::TCPStore, c10::detail::intrusive_target_default_null_type<c10d::TCPStore> > >, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::class_<c10d::TCPStore, c10::intrusive_ptr<c10d::TCPStore, c10::detail::intrusive_target_default_null_type<c10d::TCPStore> > >&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&) &&::{lambda(pybind11::detail::value_and_holder&, std::string const&, unsigned short, std::optional<int>, bool, std::chrono::duration<long, std::ratio<1l, 1000l> >, bool, bool, std::optional<int>, bool)#1}, void, pybind11::detail::value_and_holder&, std::string const&, unsigned short, std::optional<int>, bool, std::chrono::duration<long, std::ratio<1l, 1000l> >, bool, bool, std::optional<int>, bool, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::detail::is_new_style_constructor, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::class_<c10d::TCPStore, c10::intrusive_ptr<c10d::TCPStore, c10::detail::intrusive_target_default_null_type<c10d::TCPStore> > >&&, void (*)(pybind11::detail::value_and_holder&, std::string const&, unsigned short, std::optional<int>, bool, std::chrono::duration<long, std::ratio<1l, 1000l> >, bool, bool, std::optional<int>, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::detail::is_new_style_constructor const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
#22 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0

@ppwwyyxx
Copy link
Collaborator Author

ppwwyyxx commented Oct 11, 2024

The following log is from a successful run using torch2.4:

[I1010 23:53:41.791359160 socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (33.193.19.246, 2222).
[I1010 23:53:41.820038800 TCPStore.cpp:312] [c10d - debug] The server has started on port = 2222.
[I1010 23:53:41.820047660 TCPStoreLibUvBackend.cpp:1067] [c10d - debug] Uv main loop running
[I1010 23:53:41.825145144 socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (33.193.19.246, 2222).
[I1010 23:53:41.835779701 socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (33.193.19.246, 2222).
[I1010 23:53:41.850837523 socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (33.193.19.246, 2222).
[I1010 23:53:41.862635881 socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (33.193.19.246, 2222).
[I1010 23:53:41.871480728 socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (33.193.19.246, 2222).
[I1010 23:53:41.873013978 socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (33.193.19.246, 2222).
[I1010 23:53:41.873068367 socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (33.193.19.246, 2222).
[I1010 23:53:41.850860565 socket.cpp:796] [c10d - trace] The client socket is attempting to connect to ?UNKNOWN?.
[I1010 23:53:41.862657464 socket.cpp:796] [c10d - trace] The client socket is attempting to connect to ?UNKNOWN?.
[I1010 23:53:45.867694477 socket.cpp:884] [c10d] The client socket has connected to ?UNKNOWN? on ?UNKNOWN?.
[I1010 23:53:45.879781309 TCPStore.cpp:350] [c10d - debug] TCP client connected to host 33.193.19.246:2222
[I1010 23:53:45.880181696 ProcessGroupNCCL.cpp:852] [PG 0 Rank 1] ProcessGroupNCCL initialization options: size: 8, global rank: 1, TIMEOUT(ms): 900000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
[I1010 23:53:45.880198859 ProcessGroupNCCL.cpp:861] [PG 0 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.20.5, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 1000, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
[I1010 23:53:41.873031760 socket.cpp:796] [c10d - trace] The client socket is attempting to connect to ?UNKNOWN?.
[I1010 23:53:41.871495547 socket.cpp:796] [c10d - trace] The client socket is attempting to connect to ?UNKNOWN?.
[I1010 23:53:41.791381050 socket.cpp:796] [c10d - trace] The client socket is attempting to connect to ?UNKNOWN?.
[I1010 23:53:41.835796122 socket.cpp:796] [c10d - trace] The client socket is attempting to connect to ?UNKNOWN?.
[I1010 23:53:41.825152953 socket.cpp:796] [c10d - trace] The client socket is attempting to connect to ?UNKNOWN?.
[I1010 23:53:46.853026221 socket.cpp:884] [c10d] The client socket has connected to ?UNKNOWN? on ?UNKNOWN?.
[I1010 23:53:46.864387431 TCPStore.cpp:350] [c10d - debug] TCP client connected to host 33.193.19.246:2222
[I1010 23:53:46.855658659 socket.cpp:884] [c10d] The client socket has connected to ?UNKNOWN? on ?UNKNOWN?.

Note the "?UNKNOWN?" part. This means that it is possible to make a successful connection with "UNKNOWN" addrinfo. The detail reason of that to be found, but we should not crash with a seemingly invalid addrinfo.

Copy link
Member

@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kit1980 kit1980 added this to the 2.5.1 milestone Oct 11, 2024
@ppwwyyxx
Copy link
Collaborator Author

@pytorchbot merge

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 12, 2024

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@Skylion007
Copy link
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 12, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@cyyever
Copy link
Collaborator

cyyever commented Oct 13, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@cyyever
Copy link
Collaborator

cyyever commented Oct 13, 2024

@pytorchbot merge -i

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, lf.linux.g5.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@kit1980
Copy link
Contributor

kit1980 commented Oct 23, 2024

2.5.1 is an emergency patch release to address known large regressions, moving this to 2.6.0

@kit1980 kit1980 modified the milestones: 2.5.1, 2.6.0 Oct 23, 2024
@ppwwyyxx
Copy link
Collaborator Author

This is a regression: distributed code worked in 2.4 but crashed in 2.5

@kit1980
Copy link
Contributor

kit1980 commented Oct 24, 2024

@ppwwyyxx we are targeting specific critical regressions with 2.5.1.
We might have 2.5.2 later.

@kit1980
Copy link
Contributor

kit1980 commented Oct 24, 2024

Also, cherry-picks to patch releases should have tests demonstrating the issue is fixed.

@kit1980
Copy link
Contributor

kit1980 commented Jan 24, 2025

Verified for 2.6 release that the change is there in https://github.com/pytorch/pytorch/blob/v2.6.0-rc8/torch/csrc/distributed/c10d/socket.cpp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants