KEMBAR78
[dynamo] implement framelocals mapping as c++ object by williamwen42 · Pull Request #140063 · pytorch/pytorch · GitHub
Skip to content

Conversation

@williamwen42
Copy link
Member

@williamwen42 williamwen42 commented Nov 7, 2024

Stack from ghstack (oldest at bottom):

Implements #93753 - move frame local guard accessors to C++.

Before, we used dict accessors on a Python dict representing the frame's fastlocals that we manually build. We move this accessor to C++ and additionally use the fastlocal index whenever possible.

Some implementation notes:

  • FrameLocalsMapping is now initialized as a C++ vector of PyObjects. We do not just use the frame's localsplus/fastlocals buffer because we also unbox cells.
  • FrameLocalsMapping can still be converted into a Python dict representing the frame's fastlocals, but it is done lazily.
  • We update LeafGuard, GuardAccessor, and GuardManager's check_nopybind methods to accept FrameLocalsMapping. By default, we convert the FrameLocalsMapping to a Python dict and run the original check_nopybind on it, but in some cases, conversion is not needed.
  • We add a new guard accessor FrameLocalsGuardAccessor, which is similar to DictGetItemGuardAccessor but has special handling for FrameLocalsMapping. We create a separate class to emphasize different use cases, but we could probably combine these two (can do in a follow up)

dynamo_guard_eval.py microbenchmark update:

  • 713.2us -> 630.0us (3.10)
  • 598.8us -> 530.7us (3.12)

Other followups:

  • Add FrameLocalsMapping version for check_verbose_nopybind in order to match behavior between check_nopybind and check_verbose_nopybind. This can prevent difficult debugging situations where guards fail (check_nopybind returns false) but no guard error message is generated (check_verbose_nopybind succeeds).
  • Rewrite the SHAPE_ENV guard into C++ - it is a fairly common guard that results in FrameLocalsMapping needing to convert to a dict

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140063

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit c01e3fd with merge base 7ab3177 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@williamwen42 williamwen42 added topic: not user facing topic category ciflow/trunk Trigger trunk jobs on your pull request labels Nov 7, 2024
Part of implementing #93753. Implement `FrameLocalsMapping` as a C++ `unordered_map` over a `py::dict`. This PR ensures we can pass around the C++ object around C Dynamo properly. It is still converted back into a `py::dict` for guards and convert_frame, but the next step will involve removing the `py::dict` conversion for guards.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
williamwen42 added a commit that referenced this pull request Nov 7, 2024
ghstack-source-id: be903d9
Pull Request resolved: #140063
Part of implementing #93753. Implement `FrameLocalsMapping` as a C++ `unordered_map` over a `py::dict`. This PR ensures we can pass around the C++ object around C Dynamo properly. It is still converted back into a `py::dict` for guards and convert_frame, but the next step will involve removing the `py::dict` conversion for guards.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
Part of implementing #93753. Implement `FrameLocalsMapping` as a C++ `unordered_map` over a `py::dict`. This PR ensures we can pass around the C++ object around C Dynamo properly. It is still converted back into a `py::dict` for guards and convert_frame, but the next step will involve removing the `py::dict` conversion for guards.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]

void framelocals_mapping_free(FrameLocalsMapping* map);

// Borrowed reference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you say borrowed, but above you use py::reinterpret_steal which seems wrong if this is borrowed.


typedef struct VISIBILITY_HIDDEN FrameLocalsMapping {
private:
std::unordered_map<std::string, PyObject*> _map;
Copy link
Contributor

@jansel jansel Nov 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a std::unordered_map actually much faster than a Python dict?

The thing I was suggesting in #93753 was replacing:

f_locals["foo"]

with

fastlocals[3]

where the conversion to figure out that the variable "foo" is stored at index 3 can be done ahead of time. Thus making the lookup just a single load instruction (rather than a hashtable lookup).

Since both PyDict and unordered_map are hashtables, I'd expect similar performance.

You should compute the fastlocals offset ahead of time when the guard object is constructed, and store it on the guard. So when we evaluate the guard you can replace the hashtable lookup with an array lookup.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you could use pybind11 to do this casting for you?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe there is a guarantee on the ordering of the fastlocals names (but in practice, this may be true). The safe thing to do would be to add a check that fastlocals name order is the same.

I'd also think that a C++ unordered has less overhead than using Python dict, although I'd have to measure to be sure. The point of this PR is not to introduce the complete optimization, but rather, is a step to moving framelocals/fastlocals lookup to C++. I plan on incrementally introducing optimizations in followup PRs. For this PR, we just need to make sure there are no regressions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe there is a guarantee on the ordering of the fastlocals names (but in practice, this may be true). The safe thing to do would be to add a check that fastlocals name order is the same.

Isn't the order defined on the code object which is a constant? How could the order change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I hadn't considered that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think directly using fastlocals will actually be easier to implement than a unordered_map. This also matches what CPython does, I believe:
LOAD_FAST 3 is just push(fastlocals[3])

Copy link
Contributor

@jansel jansel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments above.

Part of implementing #93753. Implement `FrameLocalsMapping` as a C++ `unordered_map` over a `py::dict`. This PR ensures we can pass around the C++ object around C Dynamo properly. It is still converted back into a `py::dict` for guards and convert_frame, but the next step will involve removing the `py::dict` conversion for guards.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
Part of implementing #93753. Implement `FrameLocalsMapping` as a C++ `unordered_map` over a `py::dict`. This PR ensures we can pass around the C++ object around C Dynamo properly. It is still converted back into a `py::dict` for guards and convert_frame, but the next step will involve removing the `py::dict` conversion for guards.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
Part of implementing #93753. Implement `FrameLocalsMapping` as a C++ `unordered_map` over a `py::dict`. This PR ensures we can pass around the C++ object around C Dynamo properly. It is still converted back into a `py::dict` for guards and convert_frame, but the next step will involve removing the `py::dict` conversion for guards.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
Part of implementing #93753. Implement `FrameLocalsMapping` as a C++ `unordered_map` over a `py::dict`. This PR ensures we can pass around the C++ object around C Dynamo properly. It is still converted back into a `py::dict` for guards and convert_frame, but the next step will involve removing the `py::dict` conversion for guards.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
Part of implementing #93753. Implement `FrameLocalsMapping` as a C++ `unordered_map` over a `py::dict`. This PR ensures we can pass around the C++ object around C Dynamo properly. It is still converted back into a `py::dict` for guards and convert_frame, but the next step will involve removing the `py::dict` conversion for guards.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
Part of implementing #93753. Implement `FrameLocalsMapping` as a C++ `unordered_map` over a `py::dict`. This PR ensures we can pass around the C++ object around C Dynamo properly. It is still converted back into a `py::dict` for guards and convert_frame, but the next step will involve removing the `py::dict` conversion for guards.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
Part of implementing #93753. Implement `FrameLocalsMapping` as a C++ `unordered_map` over a `py::dict`. This PR ensures we can pass around the C++ object around C Dynamo properly. It is still converted back into a `py::dict` for guards and convert_frame, but the next step will involve removing the `py::dict` conversion for guards.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
Part of implementing #93753. Implement `FrameLocalsMapping` as a C++ `unordered_map` over a `py::dict`. This PR ensures we can pass around the C++ object around C Dynamo properly. It is still converted back into a `py::dict` for guards and convert_frame, but the next step will involve removing the `py::dict` conversion for guards.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
TODO: update PR description

Benchmark update:
- 713.2us -> 630.0us (3.10)
- 598.8us -> 530.7us (3.12)

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
williamwen42 added a commit that referenced this pull request Dec 11, 2024
ghstack-source-id: 67b5fac
Pull Request resolved: #140063
Move frame local guard accessors to C++.

Before, we used dict accessors on a Python dict representing the frame's fastlocals that we manually build. We move this accessor to C++ and additionally use the fastlocal index whenever possible.

Some implementation notes:
- `FrameLocalsMapping` is now initialized as a C++ vector of `PyObject`s. We do not just use the frame's localsplus/fastlocals buffer because we also unbox cells.
- `FrameLocalsMapping` can still be converted into a Python dict representing the frame's fastlocals, but it is done lazily.
- We update `LeafGuard`, `GuardAccessor`, and `GuardManager`'s `check_nopybind` methods to accept `FrameLocalsMapping`. By default, we convert the `FrameLocalsMapping` to a Python dict and run the original `check_nopybind` on it, but in some cases, conversion is not needed.
- We add a new guard accessor `FrameLocalsGuardAccessor`, which is similar to `DictGetItemGuardAccessor` but has special handling for `FrameLocalsMapping`.

dynamo_guard_eval.py microbenchmark update:
- 713.2us -> 630.0us (3.10)
- 598.8us -> 530.7us (3.12)

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
williamwen42 added a commit that referenced this pull request Dec 13, 2024
ghstack-source-id: da32c91
Pull Request resolved: #140063
DICT_SUBCLASS_GUARD_MANAGER = 3


@functools.lru_cache(None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a limit here?

Comment on lines +982 to +989
framelocals_names_reversed = code_framelocals_names_reversed_cached(
self.f_code
)
framelocals_idx = (
len(framelocals_names_reversed)
- framelocals_names_reversed.index(source.local_name)
- 1
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the first one return a dict so you can compute framelocals_idx in O(1) time?

@williamwen42
Copy link
Member Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Dec 16, 2024
ghstack-source-id: 9b5e82d
Pull Request resolved: #140063
@pytorchmergebot
Copy link
Collaborator

Successfully rebased gh/williamwen42/160/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/140063)

@williamwen42
Copy link
Member Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Dec 18, 2024
Summary:
Implements pytorch/pytorch#93753 - move frame local guard accessors to C++.

Before, we used dict accessors on a Python dict representing the frame's fastlocals that we manually build. We move this accessor to C++ and additionally use the fastlocal index whenever possible.

Some implementation notes:
- `FrameLocalsMapping` is now initialized as a C++ vector of `PyObject`s. We do not just use the frame's localsplus/fastlocals buffer because we also unbox cells.
- `FrameLocalsMapping` can still be converted into a Python dict representing the frame's fastlocals, but it is done lazily.
- We update `LeafGuard`, `GuardAccessor`, and `GuardManager`'s `check_nopybind` methods to accept `FrameLocalsMapping`. By default, we convert the `FrameLocalsMapping` to a Python dict and run the original `check_nopybind` on it, but in some cases, conversion is not needed.
- We add a new guard accessor `FrameLocalsGuardAccessor`, which is similar to `DictGetItemGuardAccessor` but has special handling for `FrameLocalsMapping`. We create a separate class to emphasize different use cases, but we could probably combine these two (can do in a follow up)

dynamo_guard_eval.py microbenchmark update:
- 713.2us -> 630.0us (3.10)
- 598.8us -> 530.7us (3.12)

Other followups:
- Add `FrameLocalsMapping` version for `check_verbose_nopybind` in order to match behavior between `check_nopybind` and `check_verbose_nopybind`. This can prevent difficult debugging situations where guards fail (`check_nopybind` returns false) but no guard error message is generated (`check_verbose_nopybind` succeeds).
- Rewrite the `SHAPE_ENV` guard into C++ - it is a fairly common guard that results in `FrameLocalsMapping` needing to convert to a dict

X-link: pytorch/pytorch#140063
Approved by: https://github.com/jansel
ghstack dependencies: #142117, #142430

Reviewed By: huydhn

Differential Revision: D67355697

Pulled By: williamwen42

fbshipit-source-id: 930855f49db629b2bc2d7b7ff14c065bd044e894
@github-actions github-actions bot deleted the gh/williamwen42/160/head branch January 18, 2025 02:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants