KEMBAR78
distributed debug handlers by d4l3k · Pull Request #126601 · pytorch/pytorch · GitHub
Skip to content

Conversation

d4l3k
Copy link
Member

@d4l3k d4l3k commented May 18, 2024

This adds debug handlers as described in:

This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR.

This adds 2 handlers out of the box:

  • /handler/ping for testing purposes
  • /handler/dump_nccl_trace_pickle as a POC integration with Flight Recorder

Test plan:

python test/distributed/elastic/test_control_plane.py

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

@pytorch-bot
Copy link

pytorch-bot bot commented May 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126601

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ You can merge normally! (1 Unrelated Failure)

As of commit 469b55d with merge base 38a33c3 (image):

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (torchelastic) labels May 18, 2024
@d4l3k d4l3k force-pushed the d4l3k/control_plane branch 4 times, most recently from 26c5779 to 7b71448 Compare May 21, 2024 18:06
@d4l3k d4l3k changed the title distributed/control_plane distributed debug handlers May 21, 2024
@d4l3k d4l3k marked this pull request as ready for review May 21, 2024 18:11
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a more descriptive name, given that is from another module?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about following the hydra pattern here: https://hydra.cc/docs/intro/

In practice this should be something like @worker_server.main.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UX:

  • Do we need to register the handler first?
  • If not for some cases, how would users know what is enabled by default?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should adopt some RESTful API approach?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handlers do need to be registered before the post request -- we're currently statically registering all handlers so they'll always be present by this call

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RESTful is a pretty overloaded term could you clarify what you mean? I'm not really sure how to express "dump_nccl_trace_pickle" in the rest model since it doesn't really map to a data object with CRUD

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess naively, GET "/worker/handler/nccl_trace" seems 'restful' as you're reading the state of an object. I'm not sure if the RESTful model accounts for an object that is constantly changing, but otherwise that seems ok to me?

or is the difference that we are requesting a dump to disk in this call vs a return of the current value?

Copy link
Member Author

@d4l3k d4l3k May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns the data directly. For this one I think it's more or less fine to do get, but for some other potential handlers it's weirder. "Start tracing for 10 seconds" is not a very restful behavior.

I think the prior art for that is a CREATE/POST call to create the profiling and then other requests to get the data. i.e.

CREATE /profile {"time": 10} -> {"id": 15}
GET /profile/15 -> {...}

vs

POST /handler/profile {"time": 10} -> {...}

We'd have to overhaul the simpler handler design to support that I think -- so in someways would just prefer keeping it all POSTs even if it isn't pure "rest"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RESTful is a pretty overloaded term could you clarify what you mean?

Worth exploring HATEOAS that could support various cases given pluggable nature of handlers.

Copy link
Member Author

@d4l3k d4l3k May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were going that route I think something like openapi or Swagger might be best. Could potentially generate thrift defs off of Swagger as well

I do like keeping things simple though initially until we flesh out what these interfaces should be

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a standard pattern to update shared state in a constructor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a static object guard -- we use the same thing for torch custom op registration. The shared state is in a meyers singleton in getHandlersRegistry so it neatly avoids any static initialization ordering issues.

Ex:

explicit RegisterOperators(std::vector<std::optional<Operator>> operators) {

@d4l3k d4l3k force-pushed the d4l3k/control_plane branch from 7b71448 to 1e2c7fc Compare May 21, 2024 19:28
@wconstab
Copy link
Contributor

didnt do a careful review yet but it looks great to me overall!

@d4l3k d4l3k force-pushed the d4l3k/control_plane branch 2 times, most recently from 6cfac6d to 9dbfd35 Compare May 23, 2024 18:52
Copy link
Contributor

@kurman kurman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, approving to unblock

  • Would be good to get additional reviews on C++ code structure
  • Address failing tests

@d4l3k d4l3k force-pushed the d4l3k/control_plane branch 4 times, most recently from 256640a to f9ed48c Compare May 23, 2024 22:46
@requires_cuda
def test_dump_nccl_trace_pickle(self) -> None:
with local_worker_server() as pool:
resp = pool.request("POST", "/handler/dump_nccl_trace_pickle")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!


auto it = handlers_.find(name);
if (it == handlers_.end()) {
throw std::runtime_error(fmt::format("Failed to find handler {}", name));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this cause a crash at runtime if I call with a BAD request?

Or will it be caught somewhere?
Maybe it is better to just return some HTTP friendly code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that it doesn't cause a crash and just returns a 500 erorr. I added a custom error handling though to report a sane error instead.

Copy link
Contributor

@c-p-i-o c-p-i-o left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - aside from the concern of crash when calling
curl http://host/NON_EXISTENT_PATH

@d4l3k d4l3k force-pushed the d4l3k/control_plane branch from f9ed48c to 041cfe5 Compare May 24, 2024 20:52
pytorch-bot bot pushed a commit that referenced this pull request May 30, 2024
This adds debug handlers as described in:
* https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy)
* https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy)

This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR.

This adds 2 handlers out of the box:

* `/handler/ping` for testing purposes
* `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder

Test plan:

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: #126601
Approved by: https://github.com/kurman, https://github.com/c-p-i-o
@d4l3k d4l3k deleted the d4l3k/control_plane branch May 30, 2024 15:59
@PaliC
Copy link
Contributor

PaliC commented May 31, 2024

@pytorchbot revert -m "breaking internal typechecking tests" -c "ghfirst"

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

@d4l3k your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request May 31, 2024
This reverts commit 3d54183.

Reverted #126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](#126601 (comment)))
d4l3k added a commit that referenced this pull request Jun 3, 2024
pytorchmergebot pushed a commit that referenced this pull request Jun 4, 2024
bigfootjon pushed a commit that referenced this pull request Jun 5, 2024
This adds debug handlers as described in:
* https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy)
* https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy)

This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR.

This adds 2 handlers out of the box:

* `/handler/ping` for testing purposes
* `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder

Test plan:

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: #126601
Approved by: https://github.com/kurman, https://github.com/c-p-i-o

(cherry picked from commit 3d54183)
bigfootjon pushed a commit that referenced this pull request Jun 5, 2024
This reverts commit 3d54183.

Reverted #126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](#126601 (comment)))

(cherry picked from commit 7646825)
bigfootjon pushed a commit that referenced this pull request Jun 5, 2024
This reverts commit 7646825.

Pull Request resolved: #127805
Approved by: https://github.com/PaliC

(cherry picked from commit 597922b)
petrex pushed a commit to petrex/pytorch that referenced this pull request Jun 5, 2024
This reverts commit 3d54183.

Reverted pytorch#126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](pytorch#126601 (comment)))
izaitsevfb added a commit that referenced this pull request Jul 8, 2024
pytorchmergebot pushed a commit that referenced this pull request Jul 9, 2024
pytorchmergebot pushed a commit that referenced this pull request Jul 10, 2024
#126601 (internally [D58103182](https://www.internalfb.com/diff/D58103182)) was exported missing one class definition. This PR brings github repo in sync with fbcode.
Pull Request resolved: #130296
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Jul 25, 2024
pytorch#126601 (internally [D58103182](https://www.internalfb.com/diff/D58103182)) was exported missing one class definition. This PR brings github repo in sync with fbcode.
Pull Request resolved: pytorch#130296
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
pytorchmergebot pushed a commit that referenced this pull request Aug 27, 2024
Because right now it leads to symbol conflict from binary builds.
Use of `std::filesystem::file_exists` was introduced by #126601 and in this PR it is replaced with a very straightforward implementation that calls `stat` on the given path, which is a classic C-way of checking for the file existence.

This PR should be reverted once one figures out how to keep `std::filesystem` methods linked into the binary private

Fixes symptoms of #133437

Pull Request resolved: #134494
Approved by: https://github.com/atalman, https://github.com/d4l3k
atalman pushed a commit to atalman/pytorch that referenced this pull request Aug 27, 2024
Because right now it leads to symbol conflict from binary builds.
Use of `std::filesystem::file_exists` was introduced by pytorch#126601 and in this PR it is replaced with a very straightforward implementation that calls `stat` on the given path, which is a classic C-way of checking for the file existence.

This PR should be reverted once one figures out how to keep `std::filesystem` methods linked into the binary private

Fixes symptoms of pytorch#133437

Pull Request resolved: pytorch#134494
Approved by: https://github.com/atalman, https://github.com/d4l3k
atalman added a commit that referenced this pull request Aug 27, 2024
Because right now it leads to symbol conflict from binary builds.
Use of `std::filesystem::file_exists` was introduced by #126601 and in this PR it is replaced with a very straightforward implementation that calls `stat` on the given path, which is a classic C-way of checking for the file existence.

This PR should be reverted once one figures out how to keep `std::filesystem` methods linked into the binary private

Fixes symptoms of #133437

Pull Request resolved: #134494
Approved by: https://github.com/atalman, https://github.com/d4l3k

Co-authored-by: Nikita Shulga <nshulga@meta.com>
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
Because right now it leads to symbol conflict from binary builds.
Use of `std::filesystem::file_exists` was introduced by pytorch#126601 and in this PR it is replaced with a very straightforward implementation that calls `stat` on the given path, which is a classic C-way of checking for the file existence.

This PR should be reverted once one figures out how to keep `std::filesystem` methods linked into the binary private

Fixes symptoms of pytorch#133437

Pull Request resolved: pytorch#134494
Approved by: https://github.com/atalman, https://github.com/d4l3k
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (torchelastic) Reverted

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants