[c10d] Use TORCH_CHECK for monitored barrier error by rohan-varma · Pull Request #59667 · pytorch/pytorch · GitHub
rohan-varma (Contributor) commented Jun 8, 2021

Stack from ghstack:

Use `TORCH_CHECK` instead of `throw std::runtime_error` in monitored barrier so
that it works with `TORCH_SHOW_CPP_STACKTRACES` to reveal the entire call stack
where the monitored barrier failed, which can help determine where the
particular rank encountered an issue.

We should eventually replace all uses of throwing `std::runtime_error` with `TORCH_CHECK` in distributed C++ code, as the latter can provide C++ stack traces.
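For readers unfamiliar with the pattern, here is a minimal sketch of the kind of change described above. It is not the actual diff from this PR. To compile without PyTorch headers it uses a hypothetical stand-in macro, `SKETCH_CHECK`, and a hypothetical `SketchError` type in place of `TORCH_CHECK` and `c10::Error`; the function names and the message wording are also assumptions made for illustration.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for c10::Error. The real type can also carry a C++
// backtrace; this sketch records only the file and line of the failed check.
class SketchError : public std::runtime_error {
 public:
  SketchError(const std::string& msg, const char* file, int line)
      : std::runtime_error(msg), file_(file), line_(line) {}
  const char* file() const { return file_; }
  int line() const { return line_; }

 private:
  const char* file_;
  int line_;
};

// Concatenate heterogeneous message parts into one string, similar in spirit
// to how TORCH_CHECK accepts its trailing message arguments.
inline void appendParts(std::ostringstream&) {}

template <typename T, typename... Rest>
void appendParts(std::ostringstream& os, const T& first, const Rest&... rest) {
  os << first;
  appendParts(os, rest...);
}

template <typename... Parts>
std::string joinParts(const Parts&... parts) {
  std::ostringstream os;
  appendParts(os, parts...);
  return os.str();
}

// Hypothetical stand-in for TORCH_CHECK: throws if the condition is false and
// attaches the source location of the check.
#define SKETCH_CHECK(cond, ...)                                      \
  do {                                                               \
    if (!(cond)) {                                                   \
      throw SketchError(joinParts(__VA_ARGS__), __FILE__, __LINE__); \
    }                                                                \
  } while (0)

// Before: a plain std::runtime_error with no location information.
void ensureRankRespondedOld(bool responded, int rank) {
  if (!responded) {
    throw std::runtime_error(
        "Rank " + std::to_string(rank) + " failed to pass monitoredBarrier");
  }
}

// After: a single check macro; message parts are passed as arguments.
void ensureRankRespondedNew(bool responded, int rank) {
  SKETCH_CHECK(responded, "Rank ", rank, " failed to pass monitoredBarrier");
}
```

In real c10d code the second function would call `TORCH_CHECK` from `c10/util/Exception.h` directly, and the extra context would come from the C++ stack trace rather than the file and line fields of this stand-in.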

Differential Revision: D28974510

facebook-github-bot (Contributor) commented Jun 8, 2021

💊 CI failures summary and remediations

As of commit 7462f94 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)


facebook-github-bot added the `oncall: distributed` and `cla signed` labels Jun 8, 2021
rohan-varma added a commit that referenced this pull request Jun 8, 2021
ghstack-source-id: 130879792
Pull Request resolved: #59667
cbalioglu (Contributor) left a comment

LGTM

facebook-github-bot (Contributor) commented

This pull request has been merged in fc0582e.

facebook-github-bot deleted the gh/rohan-varma/325/head branch June 13, 2021 14:15