-
-
Notifications
You must be signed in to change notification settings - Fork 33.2k
gh-136459: Add perf trampoline support for macOS #136461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc @pablogsal. I went ahead and submitted this PR for #136459 but please let me know what you think! |
A running process may create a file in the ``/tmp`` directory, which contains entries | ||
that can map a section of executable code to a name. This interface is described in the | ||
profiling tool (such as `perf <https://perf.wiki.kernel.org/index.php/Main_Page>`_ or | ||
`samply <https://github.com/mstange/samply/>`_). A running process may create a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a shameless plug to samply 😄 Disclaimer: I'm not the maintainer of the project, but the maintainer is my colleague. But it doesn't change the fact that it's an awesome profiler! But I can revert it if you prefer not to include :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy with the plug, but this docs are going to need much more than this then. If samply
is the main way to use this on macOS then we will need to update https://docs.python.org/3/howto/perf_profiling.html with full instructions for samply :)
f0887a1
to
f663627
Compare
Python/perf_jit_trampoline.c
Outdated
|
||
/* These constants are defined inside <elf.h>, which we can't use outside of linux. */ | ||
#if !defined(__linux__) | ||
# define EM_386 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to define the variable for different platforms. like this https://github.com/python/cpython/blob/main/Python/perf_jit_trampoline.c#L126-L135
Not define all the possible variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I just changed them to be defined depending on the platform. I actually copy pasted this from the old code but your suggestion is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, the trampoline depens on the mmap.
But it has record about the mmap performance issue on macOS.
I think we need a benchmark here
FYI
Edit: ah I think you are referring to the mmap to tell Perf about the maps file no? Sorry I thought you were referring to the jit trampoline compiler that itself uses mmap to get the chunks. |
@canova i am currently on vacation until EOW I will answer more when I have some time later today or tomorrow but I would love to get macOS support. I have some questions I think we need to answer first to ensure we provide a good UX if we are adding more profilers. |
Ah I think you are right. I actually looked for this and only found one mmap inside the cpython/Python/perf_jit_trampoline.c Lines 945 to 961 in e697f5e
I updated the PR to remove this on macOS, let me know what you think!
Thanks! I'm glad that you would like to support it on macOS. And I'm happy to discuss the details further and update the code later. (Enjoy your vacation!) |
@@ -1,13 +1,21 @@ | |||
.text | |||
#if defined(__APPLE__) | |||
.globl __Py_trampoline_func_start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, Apple and their symbol choices...:S
@canova One thing I am concerned is that for perf I had to add jit support because nothing compiles with frame pointers in the wild which means that unless people compile Python themselves (and take a big-ish hit unfortunately) the frame-pointer version is suboptimal. What's the status of this for samply? Does it also support the perf jit interface? Can you ping here the samply maintainer if the answer is nuanced? |
Samply maintainer here - I don't understand the question, I'm afraid. What do you mean by "add jit support", and what do you mean by "the perf jit interface"? Samply supports unwinding with DWARF information if available, but only for regular code mapped from a binary / shared library, not for JIT code. When it encounters JIT code during unwinding, it falls back to using frame pointers for that frame. It does not make use of any |
Perf supports two modes to deal with JIT compilers which is what we are leveraging to make it work with Python via the trampolines:
We do implement both. The reason we implement both is that the simple maps interface works as long as perf can unwind through the JIT frames. Perf uses Unfortunately, nobody compiles Python with frame pointers because it's very slow. See #96174. Not only that, even if you compile with frame pointers most of the time wheels and binary packages are not compiled with frame pointers so this is sort of not very useful in the wild unless you have control over your full ecosystem. To deal with this we implement the much more complex JIT interface. This allows us to provide to perf DWARF for the trampolines so we can pass the I hope this gives some background for the question. I really want to know if simply is going to be able cope with the general case in general in Linux and Mac for all arches that we support. I am asking this among other things because we already had problems with code that doesn't have frame pointers around the trampolines. See #130856. |
Also, we probably need a buildbot and tests for this to ensure this doesn't break in the future |
Thanks for the pointer! Wow, I feel slightly nauseous after reading the part about having to to space out the trampolines by the size of the unwind info... Anyway: samply supports both perf.map file and jitdump. But as I said above, it does not respect the JIT unwind info. Instead, when finding the return address for a JIT frame, samply always uses framepointer unwinding. Looking at the actual assembly for the trampolines, I can see the following:
For samply's current approach to work on x86_64, the x86_64 trampoline just needs to set up a framepointer - add a |
Thanks a lot @mstange, that makes perfect sense and really clarifies the situation! If we’re going to add samply support, we need to ensure it works reliably on both aarch64 and x86_64 across both macOS and Linux. Based on your analysis, we should definitely add frame pointers to the x86_64 trampoline in this PR to ensure consistent behavior across architectures. I can help set up some buildbot infrastructure to ensure all of this works properly with samply once the PR is ready for more comprehensive testing. I also think we should have full end-to-end documentation that covers how to use this with Python, similar to our existing perf docs. We should probably make the docs more generic to cover both perf and samply workflows, since users will likely want guidance on both tools depending on their platform and preferences. @canova would you be willing to add the frame pointer setup to the x86_64 trampoline and the DWARF as part of this PR? It seems like the right time to ensure sample works everywhere and we can advertise it properly |
I feel you: I was certainly nauseous when I had to debug that nonsense for hours and hours and even more when I had to implement the hack to "fix" it. |
Yes, I mean maps file here. Sorry for confusing |
SGTM |
ev.base.time_stamp = get_current_monotonic_ticks(); | ||
ev.process_id = getpid(); | ||
#if defined(__APPLE__) | ||
pthread_threadid_np(NULL, &ev.thread_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just use this implementation in here?
Lines 193 to 255 in b44316a
_Py_ThreadId(void) | |
{ | |
uintptr_t tid; | |
#if defined(_MSC_VER) && defined(_M_X64) | |
tid = __readgsqword(48); | |
#elif defined(_MSC_VER) && defined(_M_IX86) | |
tid = __readfsdword(24); | |
#elif defined(_MSC_VER) && defined(_M_ARM64) | |
tid = __getReg(18); | |
#elif defined(__MINGW32__) && defined(_M_X64) | |
tid = __readgsqword(48); | |
#elif defined(__MINGW32__) && defined(_M_IX86) | |
tid = __readfsdword(24); | |
#elif defined(__MINGW32__) && defined(_M_ARM64) | |
tid = __getReg(18); | |
#elif defined(__i386__) | |
__asm__("movl %%gs:0, %0" : "=r" (tid)); // 32-bit always uses GS | |
#elif defined(__MACH__) && defined(__x86_64__) | |
__asm__("movq %%gs:0, %0" : "=r" (tid)); // x86_64 macOSX uses GS | |
#elif defined(__x86_64__) | |
__asm__("movq %%fs:0, %0" : "=r" (tid)); // x86_64 Linux, BSD uses FS | |
#elif defined(__arm__) && __ARM_ARCH >= 7 | |
__asm__ ("mrc p15, 0, %0, c13, c0, 3\nbic %0, %0, #3" : "=r" (tid)); | |
#elif defined(__aarch64__) && defined(__APPLE__) | |
__asm__ ("mrs %0, tpidrro_el0" : "=r" (tid)); | |
#elif defined(__aarch64__) | |
__asm__ ("mrs %0, tpidr_el0" : "=r" (tid)); | |
#elif defined(__powerpc64__) | |
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer) | |
tid = (uintptr_t)__builtin_thread_pointer(); | |
#else | |
// r13 is reserved for use as system thread ID by the Power 64-bit ABI. | |
register uintptr_t tp __asm__ ("r13"); | |
__asm__("" : "=r" (tp)); | |
tid = tp; | |
#endif | |
#elif defined(__powerpc__) | |
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer) | |
tid = (uintptr_t)__builtin_thread_pointer(); | |
#else | |
// r2 is reserved for use as system thread ID by the Power 32-bit ABI. | |
register uintptr_t tp __asm__ ("r2"); | |
__asm__ ("" : "=r" (tp)); | |
tid = tp; | |
#endif | |
#elif defined(__s390__) && defined(__GNUC__) | |
// Both GCC and Clang have supported __builtin_thread_pointer | |
// for s390 from long time ago. | |
tid = (uintptr_t)__builtin_thread_pointer(); | |
#elif defined(__riscv) | |
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer) | |
tid = (uintptr_t)__builtin_thread_pointer(); | |
#else | |
// tp is Thread Pointer provided by the RISC-V ABI. | |
__asm__ ("mv %0, tp" : "=r" (tid)); | |
#endif | |
#else | |
// Fallback to a portable implementation if we do not have a faster | |
// platform-specific implementation. | |
tid = _Py_GetThreadLocal_Addr(); | |
#endif | |
return tid; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! It looks like _Py_ThreadId
is defined only when Py_GIL_DISABLED
is defined and Py_LIMITED_API
is not defined:
Line 189 in d754f75
#if defined(Py_GIL_DISABLED) && !defined(Py_LIMITED_API) |
It doesn't compile because of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I meant just for inline assembly because it was for low over head at free threading, maybe you can just pick it for apple implementation.
@mstange @pablogsal Thanks for the investigation!
Sure that sounds good to me, I'll update the PR soon with this and add some more tests. Would you like me to update the documentation in this PR as well? |
On macOS, we don't need to call mmap because samply has already detected the file path during the call to `open` before (it interposes `open` with a preloaded library), and because the mmap call can be slow.
Co-authored-by: Hugo van Kemenade <1324225+hugovk@users.noreply.github.com>
2830252
to
2812159
Compare
Thanks for the suggestions. I've updated the docs and rebased the whole PR on top of the current |
2812159
to
7b463c8
Compare
7b463c8
to
3ae5cb2
Compare
I'm landing this for now but we should work on improving the docs separately. And there is still the Windows question for @canova to answer. |
Thanks for the reviews and help @pablogsal and @hugovk! I will file an issue for the docs improvements.
I'm not so sure as I'm not familiar with its implementation in samply. Maybe @mstange could give more insight into it. |
samply does work on Windows but it uses ETW JSCRIPT9::MethodLoad events instead of jitdump. Firefox and Chrome both emit these events for jitted JS code. We emit the events here: https://searchfox.org/mozilla-central/rev/387160feb07b75ae76bfc12df035a10f58d25168/js/src/jit/PerfSpewer.cpp#643 It probably makes sense for python to follow a similar approach as that will allow it to work properly with Windows Performance Analyzer as well. |
Ah, thanks for the information, Jeff! I filed #137022 for this. |
pythongh-136461 added perf support for macOS, with ifdefs around all changes except increasing thread_id to 64 bits. Make that change Apple-specific too.
…ythonGH-137031) pythongh-136461 added perf support for macOS, with ifdefs around all changes except increasing thread_id to 64 bits. Make that change Apple-specific too.
…ythonGH-137031) pythongh-136461 added perf support for macOS, with ifdefs around all changes except increasing thread_id to 64 bits. Make that change Apple-specific too.
This PR adds perf trampoline support for macOS. Previously it was Linux-only, which is not actually needed it to be. We can share the implementation of Linux and macOS with some minor changes done in this PR.
For example, here's a before and after profile of this PR, captured by
samply
(for the same dummy script):Before / After
As you can see, now we have blue frames that mentions the real Python function names, which makes it a lot more useful to understand this profile. Without it, it's just native frames that's really difficult to understand.
This is my first PR in this project, so please let me know if there is anything I didn't do or should do. Thanks!
📚 Documentation preview 📚: https://cpython-previews--136461.org.readthedocs.build/