-
-
Notifications
You must be signed in to change notification settings - Fork 33.2k
gh-110481: Implement biased reference counting #110764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Note for reviewer: the first commit in this PR refactors the internal-only |
@vstinner, will you have time to look at this PR this week? |
I read your PEP and related papers, so I can review for specific implementation of your works. I will leave review for this PR in this week :) |
Include/internal/pycore_object.h
Outdated
{ | ||
if (op) { | ||
#ifdef Py_NOGIL | ||
op->ob_tid = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we declare the constant variable for this? Something like UNKNOWN_TID
?
I mean if someone already read the PEP, the person knows what the state that tid is zero stands for, but if not some kind of variable name will be helpful to read.
But I am not sure which will be the best naming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've named it _Py_UNOWNED_TID
because these are objects that are not owned by any thread.
Include/object.h
Outdated
// zero. The call will deallocate the object if the shared refcount is also | ||
// zero. Otherwise, the thread gives up ownership and merges the reference | ||
// count fields. | ||
PyAPI_FUNC(void) _Py_MergeZeroRefcount(PyObject *); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I read the code, the purpose of this function is to handle the object when the local ref count is zero.
So might be, we can improve function name to understand what's happening
PyAPI_FUNC(void) _Py_MergeZeroRefcount(PyObject *); | |
PyAPI_FUNC(void) _Py_MergeZeroRefLocalcount(PyObject *); |
or
PyAPI_FUNC(void) _Py_MergeZeroRefcount(PyObject *); | |
PyAPI_FUNC(void) _Py_MergeZeroLocalRefcount(PyObject *); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've named it _Py_MergeZeroLocalRefcount
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me.
I just leave the styling things for other readers.
IMO, It will help to understand the mechanism between local refcount and shared ref count.
Py_ssize_t | ||
_Py_ExplicitMergeRefcount(PyObject *op, Py_ssize_t extra) | ||
{ | ||
Py_ssize_t refcnt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, Don't we have to add assert statement that op should not be an immortal object?
assert(!_Py_IsImmortal(op));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good assertion to add.
2144e0e
to
9bc2b6b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing this! I've updated the PR and rebased on the main branch.
Include/internal/pycore_object.h
Outdated
{ | ||
if (op) { | ||
#ifdef Py_NOGIL | ||
op->ob_tid = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've named it _Py_UNOWNED_TID
because these are objects that are not owned by any thread.
Include/object.h
Outdated
// zero. The call will deallocate the object if the shared refcount is also | ||
// zero. Otherwise, the thread gives up ownership and merges the reference | ||
// count fields. | ||
PyAPI_FUNC(void) _Py_MergeZeroRefcount(PyObject *); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've named it _Py_MergeZeroLocalRefcount
Py_ssize_t | ||
_Py_ExplicitMergeRefcount(PyObject *op, Py_ssize_t extra) | ||
{ | ||
Py_ssize_t refcnt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good assertion to add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@colesbury In my macOS, the PR is failed to build :) Pleace check
9 warnings generated.
gcc -fno-strict-overflow -Wsign-compare -Wunreachable-code -DNDEBUG -g -O3 -Wall -std=c11 -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wstrict-prototypes -Werror=implicit-function-declaration -fvisibility=hidden -I./Include/internal -I. -I./Include -c ./Modules/xxsubtype.c -o Modules/xxsubtype.o
In file included from ./Modules/_scproxy.c:9:
In file included from ./Include/Python.h:52:
./Include/object.h:276:22: error: implicit declaration of function '_Py_atomic_load_uint32_relaxed' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
uint32_t local = _Py_atomic_load_uint32_relaxed(&ob->ob_ref_local);
^
./Include/object.h:280:25: error: implicit declaration of function '_Py_atomic_load_ssize_relaxed' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
Py_ssize_t shared = _Py_atomic_load_ssize_relaxed(&ob->ob_ref_shared);
^
./Include/object.h:280:25: note: did you mean '_Py_atomic_load_uint32_relaxed'?
./Include/object.h:276:22: note: '_Py_atomic_load_uint32_relaxed' declared here
uint32_t local = _Py_atomic_load_uint32_relaxed(&ob->ob_ref_local);
^
./Include/object.h:343:9: error: implicit declaration of function '_Py_IsOnwedByCurrentThread' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
if (_Py_IsOnwedByCurrentThread(ob)) {
^
These aren't necessary for correctness, just performance.
d0b6720
to
ac2c957
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@colesbury cc @vstinner @ambv
LGTM Again
Let's not block the PEP-703 works for the things.
It will not affect the default implementation, and for the limited C API issue, I believe that we can solve it before we want to use free-threaded
mode by default.
Let's solve it as a separate issue.
|
|
|
|
For the build bot issue, please follow up: #111503 |
For the limited C API, I wrote a first change: PR #111505. |
Congrats @colesbury, that's a massive change :-) |
#elif defined(__arm__) | ||
__asm__ ("mrc p15, 0, %0, c13, c0, 3\nbic %0, %0, #3" : "=r" (tid)); | ||
#elif defined(__aarch64__) && defined(__APPLE__) | ||
__asm__ ("mrs %0, tpidrro_el0" : "=r" (tid)); | ||
#elif defined(__aarch64__) | ||
__asm__ ("mrs %0, tpidr_el0" : "=r" (tid)); | ||
#else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caveat: I'm not an expert.
In case of x86/x86-64 above, the tid values has to be copied to a regular register for comparison.
In case of ARM32, the comparison could be made directly, were it not for the 2 lsb.
In case of ARM64, the comparison could be made directly.
Leaving aside the less common case of stamping the biased reference count block, and focusing on the more common case of comparing the current tid with object owner tid:
Does the compiler optimise this move away?
Does it perhaps not matter, as move is fast compared to loading the object's value from ram or cache?
Or would a custom comparison macro make sense here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is backwards: on x86/x86-64 you can do a comparison without copying tid
to a register; you can't on ARM.
On x86, many instructions, such as cmp
, can operate on memory addresses, including addresses with fs:
or gs:
prefixes. It doesn't matter much here because because the object's owner tid also requires a memory load. So either one or the other needs to be moved to a temporary registers and it's more convenient to move the current tid to a temporary register because it may be re-used across multiple reference count operations.
On ARM you can't do a comparison with a system or coprocessor register. You need to move it to a gpr first.
I ran this PR on the Faster CPython team's benchmarking infrastructure, and its effect is basically immeasurable, 1.00x slower with a reliability of 95%: https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20231030-3.13.0a1%2B-6dfb8fe/bm-20231030-linux-x86_64-python-6dfb8fe0236718e9afc8-3.13.0a1%2B-6dfb8fe-vs-base.md EDIT: As @ericsnowcurrently pointed out, this is with a default build, so is not terribly interesting. I'll rerun with |
This PR only impacts Python if Python is configured with |
And the correct benchmarking results are in. With |
@colesbury: An interesting thing the benchmarking revealed is the |
@mdboom, Yeah the
|
CPython's current reference counting implementation would not be thread-safe without the GIL. This implements biased reference counting in
--disable-gil
builds, which is thread-safe and has lower execution overhead compared to plain atomic reference counting. The design is described in the "Biased Reference Counting" section of PEP 703.--disable-gil
builds #110481