Use CapturedTraceback symbolizer for C++ exceptions from Python library by ezyang · Pull Request #113207 · pytorch/pytorch

Conversation

@ezyang
Contributor

@ezyang ezyang commented Nov 7, 2023

Stack from ghstack (oldest at bottom):

This is the cheap and cheerful implementation, which is only enabled when TORCH_SHOW_CPP_STACKTRACES is set, because it *eagerly* symbolizes at exception throw time, even if the exception ends up getting caught. It would be better to do this lazily and only symbolize when we try to print the exception, but that requires a more involved refactor of c10::Error that I don't feel like doing.

Compare the output before:

```
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x95 (0x7fa21b99d975 in /data/users/ezyang/c/pytorch/torch/lib/libc10.so)
frame #1: c10::TensorImpl::throw_cannot_call_with_symbolic(char const*) const + 0x8d (0x7fa21b951269 in /data/users/ezyang/c/pytorch/torch/lib/libc10.so)
frame #2: c10::TensorImpl::sizes_custom() const + 0x9f (0x7fa21b9770df in /data/users/ezyang/c/pytorch/torch/lib/libc10.so)
frame #3: at::meta::structured_mm::meta(at::Tensor const&, at::Tensor const&) + 0x31e (0x7fa20a202a8e in /data/users/ezyang/c/pytorch/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x29f34de (0x7fa20b5f34de in /data/users/ezyang/c/pytorch/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x2a1fd8e (0x7fa20b61fd8e in /data/users/ezyang/c/pytorch/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x6b907b (0x7fa2142b907b in /data/users/ezyang/c/pytorch/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6b6175 (0x7fa2142b6175 in /data/users/ezyang/c/pytorch/torch/lib/libtorch_python.so)
```

and after:

```
#4 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#5 c10::TensorImpl::throw_cannot_call_with_symbolic(char const*) const from ??:0
#6 c10::TensorImpl::sizes_custom() const [clone .localalias] from TensorImpl.cpp:0
#7 at::meta::structured_mm::meta(at::Tensor const&, at::Tensor const&) from ??:0
#8 at::(anonymous namespace)::wrapper_Meta_mm_out_out(at::Tensor const&, at::Tensor const&, at::Tensor&) from RegisterMeta.cpp:0
#9 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (at::Tensor const&, at::Tensor const&, at::Tensor&), &at::(anonymous namespace)::wrapper_Meta_mm_out_out>, at::Tensor&, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from RegisterMeta.cpp:0
```
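
A minimal sketch for trying this out locally (assuming any c10::Error will do as a trigger; the shape-mismatch mm below is just a convenient way to provoke one):

```
import os

# Set TORCH_SHOW_CPP_STACKTRACES before importing torch to be safe,
# since the value may be cached when torch initializes.
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1"

import torch

try:
    # A shape mismatch raises a c10::Error on the C++ side,
    # surfaced in Python as a RuntimeError.
    torch.mm(torch.randn(2, 3), torch.randn(4, 5))
except RuntimeError as e:
    # With the env var set, the message ends with a symbolized C++ stack trace.
    print(e)
```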

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Nov 7, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113207

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit d8e4dc2 with merge base 68dead4:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ezyang added a commit that referenced this pull request Nov 7, 2023
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: c1e01cf
Pull Request resolved: #113207
@ezyang ezyang requested a review from zdevito November 7, 2023 22:11
@ezyang ezyang added the ciflow/trunk, release notes: cpp, and topic: new features labels Nov 8, 2023
@ezyang
Contributor Author

ezyang commented Nov 8, 2023

Wondering if I should prune the top three frames lol

Contributor

@zdevito zdevito left a comment


I think we should leave a way to get the old behavior, because I'm worried about the case where you're debugging and the attempt to symbolize the stack trace hangs talking to addr2line, possibly because the process is in a bad state.

@ezyang
Contributor Author

ezyang commented Nov 8, 2023

I'll add another env var.
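
(For the record, the escape hatch shipped as TORCH_DISABLE_ADDR2LINE, per the warning text quoted later in this thread. A minimal sketch of opting out while keeping C++ stack traces on, assuming the variable simply disables the addr2line-based symbolizer:)

```
import os

# Keep C++ stack traces enabled, but skip addr2line-based symbolization,
# e.g. when the process may be in a bad state and addr2line could hang.
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1"
os.environ["TORCH_DISABLE_ADDR2LINE"] = "1"

import torch  # import after setting the env vars so they take effect
```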

Collaborator

@albanD albanD left a comment


That might conflict with people changing the fetcher as well (in fbcode?).
I agree that making this the default is good though.

Also removing the top 3 frames sgtm

…ython library"


This is the cheap and cheerful implementation, which is only enabled on TORCH_SHOW_CPP_STACKTRACES, because it *eagerly* symbolizes immediately at exception throw time, even if the exception will end up getting caught. It would be better to do this lazily and only symbolize when we try to print the exception, but that requires a more involved refactor of c10::Error that I don't feel like doing.

Compare the output before:

```
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x95 (0x7fa21b99d975 in /data/users/ezyang/c/pytorch/torch/lib/libc10.so)                                                                                                                                                                                                         
frame #1: c10::TensorImpl::throw_cannot_call_with_symbolic(char const*) const + 0x8d (0x7fa21b951269 in /data/users/ezyang/c/pytorch/torch/lib/libc10.so)                                                  
frame #2: c10::TensorImpl::sizes_custom() const + 0x9f (0x7fa21b9770df in /data/users/ezyang/c/pytorch/torch/lib/libc10.so)                                                                                
frame #3: at::meta::structured_mm::meta(at::Tensor const&, at::Tensor const&) + 0x31e (0x7fa20a202a8e in /data/users/ezyang/c/pytorch/torch/lib/libtorch_cpu.so)                                           
frame #4: <unknown function> + 0x29f34de (0x7fa20b5f34de in /data/users/ezyang/c/pytorch/torch/lib/libtorch_cpu.so)                                                                                        
frame #5: <unknown function> + 0x2a1fd8e (0x7fa20b61fd8e in /data/users/ezyang/c/pytorch/torch/lib/libtorch_cpu.so)                                                                                        
frame #6: <unknown function> + 0x6b907b (0x7fa2142b907b in /data/users/ezyang/c/pytorch/torch/lib/libtorch_python.so)                                                                                      
frame #7: <unknown function> + 0x6b6175 (0x7fa2142b6175 in /data/users/ezyang/c/pytorch/torch/lib/libtorch_python.so) 
```

and after:

```
#1 torch::CapturedTraceback::gather(bool, bool, bool) from ??:0                                                                                                                                            
#2 THPModule_initExtension(_object*, _object*)::{lambda()#1}::operator()() const [clone .constprop.0] from Module.cpp:0                                                                                    
#3 std::_Function_handler<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > (), THPModule_initExtension(_object*, _object*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Module.cpp:0                                                                                                                                                                                          
#4 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0                                                                       
#5 c10::TensorImpl::throw_cannot_call_with_symbolic(char const*) const from ??:0
#6 c10::TensorImpl::sizes_custom() const [clone .localalias] from TensorImpl.cpp:0
#7 at::meta::structured_mm::meta(at::Tensor const&, at::Tensor const&) from ??:0
#8 at::(anonymous namespace)::wrapper_Meta_mm_out_out(at::Tensor const&, at::Tensor const&, at::Tensor&) from RegisterMeta.cpp:0
#9 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (at::Tensor const&, at::Tensor const&, at::Tensor&), &at::(anonymous namespace)::wrapper_Meta_mm_out_out>, at::Tensor&, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from RegisterMeta.cpp:0
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

[ghstack-poisoned]
@ezyang
Contributor Author

ezyang commented Nov 8, 2023

oh yeah, I need to turn this off in fbcode. hmm...

…ython library"


ezyang added a commit that referenced this pull request Nov 8, 2023
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 91d2e0e
Pull Request resolved: #113207
@ezyang
Contributor Author

ezyang commented Nov 9, 2023

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-focal-py3.8-clang10 / test (dynamo, 1, 2, linux.2xlarge), pull / linux-focal-py3.11-clang10 / test (dynamo, 1, 2, linux.2xlarge)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@facebook-github-bot facebook-github-bot deleted the gh/ezyang/2420/head branch November 12, 2023 15:24
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
…ry (pytorch#113207)


Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: pytorch#113207
Approved by: https://github.com/Skylion007
@ppwwyyxx
Collaborator

ppwwyyxx commented May 9, 2024

After upgrading to 2.2 we observed a ton of extra warning messages. A minimal repro:

```
import torch
import torch.distributed as dist

if __name__ == '__main__':
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank())
    dist.all_gather_object([None, None], dist.get_rank())
```

output:

```
$ TORCH_SHOW_CPP_STACKTRACES=1 torchrun --nnodes=1 --nproc-per-node=2 a.py

[2024-05-09 19:42:07,153] torch.distributed.run: [WARNING]
[2024-05-09 19:42:07,153] torch.distributed.run: [WARNING] *****************************************
[2024-05-09 19:42:07,153] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-05-09 19:42:07,153] torch.distributed.run: [WARNING] *****************************************
[W Module.cpp:156] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W Module.cpp:156] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W Module.cpp:156] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
```

Note that the code executes correctly. I'm not sure why all_gather_object would trigger an exception in the first place, but IMO the extra warning shouldn't appear in any case; as it stands it's not very user friendly.

@ezyang
Contributor Author

ezyang commented May 10, 2024

#125750 should help here

@ppwwyyxx
Collaborator

The above repro no longer prints the warning. But this still does:

```
import torch
import torch.distributed as dist
from torch.distributed._tensor import distribute_tensor, Shard
from torch.distributed.device_mesh import init_device_mesh

if __name__ == '__main__':
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank())
    t = torch.randn(256, device='cuda')
    mesh = init_device_mesh('cuda', (2, ))
    dt = distribute_tensor(t, mesh, [Shard(0)])
```

Tested with 2.4.0

