The CPython reference counting semantics generally fall into the following three categories:

  1. Return borrowed reference

  2. Return new reference

  3. Steal reference
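To make the three categories concrete, here is a minimal sketch in plain C. It is a toy model rather than CPython code: `Obj`, `Box`, and every function below are hypothetical names invented for illustration. In the real API, `PyList_GetItem` returns a borrowed reference, `PyLong_FromLong` returns a new reference, and `PyList_SetItem` steals a reference to the item.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for PyObject: just a reference count (hypothetical type). */
typedef struct { long refcnt; } Obj;

/* Toy container with one slot; the slot owns a reference to its item. */
typedef struct { Obj *slot; } Box;

static void obj_incref(Obj *o) { o->refcnt++; }
static void obj_decref(Obj *o) { o->refcnt--; }

/* 1. Return borrowed reference: the count is untouched, the box still owns it. */
static Obj *box_get_borrowed(Box *b) { return b->slot; }

/* 2. Return new reference: the caller now owns one and must release it. */
static Obj *box_get_new(Box *b) { obj_incref(b->slot); return b->slot; }

/* 3. Steal reference: the box takes over the caller's reference and
      releases whatever it held before. */
static void box_set_steal(Box *b, Obj *item) {
    Obj *old = b->slot;
    b->slot = item;
    if (old != NULL) {
        obj_decref(old);
    }
}

/* Walks through all three categories and returns the final count. */
static long demo_categories(void) {
    Obj item = { 1 };            /* the caller starts with one owned reference */
    Box box = { NULL };

    box_set_steal(&box, &item);  /* ownership moves into the box, count unchanged */
    assert(item.refcnt == 1);

    Obj *borrowed = box_get_borrowed(&box);
    assert(borrowed->refcnt == 1);   /* borrowing does not change the count */

    Obj *owned = box_get_new(&box);
    assert(owned->refcnt == 2);      /* the caller holds an extra reference */
    obj_decref(owned);               /* and is responsible for releasing it */

    return item.refcnt;
}
```

The only thing that differs between the three functions is who is responsible for the eventual decrement, which is exactly the information an analyzer has to recover.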

Currently, there are two issues with CPython reference counting semantics:

  1. Some APIs in the CPython documentation lack clear reference counting semantics.

  2. The refcount.dat file does not represent the semantics of reference counting—it only tracks changes in reference count values.

I have used static analysis to partially recover these reference counting semantics. However, there are some false positives and false negatives. I would like to know how the community views such efforts—is very high accuracy required?

Additionally, besides the functions exported with PyAPI_FUNC, there are other functions that are not officially part of the API. Should these functions also have documented reference counting semantics? I believe this could also be helpful for developers, but I’m curious about others’ opinions on this.

P.S.: Based on manual analysis of the static analysis tool’s reports:

  1. Accuracy for “return new reference” is 99.25%

  2. Accuracy for “return borrowed reference” is 76.6%

  3. Accuracy for “steal reference” is 35.8%

Let me add one to that list: return an immortal reference, which doesn’t need to be reference counted at all. A reference count analyzer should be aware of immortal objects, because sometimes developers omit Py_INCREF/Py_DECREF calls that would otherwise be necessary for a strong reference.

Yeah, definitely. For example, PyList_SET_ITEM is a static inline function and not an actual symbol, but has some interesting reference count semantics:

  1. References passed to it are stolen, but existing references at the position are “leaked”.
  2. In turn, borrowed references to the object at that position are turned into strong references.

To visualize, the following is valid:

PyObject *item = PyList_GET_ITEM(xyz, 0); // item is borrowed
PyList_SET_ITEM(xyz, 0, Py_NewRef(Py_None)); // Py_None is immortal; ignore that
// item is now a strong reference
Py_DECREF(item);

This kind of thing is very difficult for static analyzers to detect, and to make matters worse, this pattern actually exists in CPython: see Modules/_heapqmodule.c at commit 921f61bd82908ab245d6776068a366da152788d4 in python/cpython on GitHub.
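The ownership transfer can be traced with a toy model in plain C. This is only a sketch: `Obj`, `Box`, and `box_set_item_no_release` are hypothetical names, and the setter is assumed to mimic the behaviour described above (store the new item, steal its reference, never release the previous occupant).

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for PyObject (hypothetical type). */
typedef struct { long refcnt; } Obj;

/* Toy one-slot list; the slot owns a reference to its item. */
typedef struct { Obj *slot; } Box;

/* Stores item and steals the caller's reference to it, but does NOT
   release the item that was in the slot before. */
static void box_set_item_no_release(Box *b, Obj *item) { b->slot = item; }

/* Returns the old item's count after the caller releases it. */
static long demo_overwrite(void) {
    Obj old_item = { 1 };   /* its single reference is owned by the box */
    Obj new_item = { 1 };   /* owned by the caller, about to be stolen */
    Box box = { &old_item };

    Obj *item = box.slot;   /* borrowed: the count is still 1 */
    box_set_item_no_release(&box, &new_item);

    /* The box no longer points at old_item, yet its count is still 1.
       The only pointer left is `item`, so that pointer now carries the
       reference the box used to own. */
    assert(item->refcnt == 1);
    assert(box.slot == &new_item);

    item->refcnt--;         /* the caller releases the inherited reference */
    return old_item.refcnt;
}
```

An analyzer that labels `item` as borrowed at the first step and never revisits that label would report the final decrement as an over-release, even though the counts balance.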

Ideally, it should be as accurate as possible, but it’s fine if there are false positives as long as there’s a way to shut up the linter (for example, a // reference-count: ignore comment).

Yes, I also noticed the impact of immortal PyObjects on reference counting, and there are quite a few instances of Py_INCREF(x) in the code where x is an immortal PyObject.

While researching the source code, I extracted some immortal PyObjects, such as:

static PyObject* constants[] = {
    &_Py_NoneStruct,                   // Py_CONSTANT_NONE
    (PyObject*)(&_Py_FalseStruct),     // Py_CONSTANT_FALSE
    (PyObject*)(&_Py_TrueStruct),      // Py_CONSTANT_TRUE
    &_Py_EllipsisObject,               // Py_CONSTANT_ELLIPSIS
    &_Py_NotImplementedStruct,         // Py_CONSTANT_NOT_IMPLEMENTED
    NULL,  // Py_CONSTANT_ZERO
    NULL,  // Py_CONSTANT_ONE
    NULL,  // Py_CONSTANT_EMPTY_STR
    NULL,  // Py_CONSTANT_EMPTY_BYTES
    NULL,  // Py_CONSTANT_EMPTY_TUPLE
};

Based on the immortal PyObjects in this array, I also found many functions that return what I will temporarily call “immortal references”. I have a question: should developers treat these returned immortal references as borrowed references? Would this be a more conservative or accurate approach?

Besides these, there are many other immortal objects, such as small integer objects and certain Unicode objects…
However, due to the complexity of the code, I still haven’t fully figured out which PyObjects are immortal and which are not. I would appreciate some advice on this matter.

Unfortunately, we haven’t been able to completely agree on how to reference count immortal objects. Some people treat them like actual references, while others don’t attempt to do any reference counting operations on them at all. The latter is generally more common in CPython, but I see the former in older extensions (because prior to 3.12, there were no immortal objects).

It’s not correct to always treat immortal references as borrowed references, because reference stealing doesn’t affect immortals. For example, PyTuple_SET_ITEM(tup, 0, xyz) is valid if xyz is immortal, but invalid if it’s borrowed.
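A small toy model in plain C shows why the two cases differ. All names here are hypothetical, and immortality is modelled as a simple flag that turns the decrement into a no-op; CPython marks immortal objects differently, so treat this as a sketch of the reasoning only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for PyObject with an immortality flag (an assumption of
   this sketch, not how CPython represents immortal objects). */
typedef struct { long refcnt; bool immortal; } Obj;

/* Toy one-slot tuple; the slot owns a reference to its item. */
typedef struct { Obj *slot; } Box;

/* Decrementing an immortal object does nothing. */
static void obj_decref(Obj *o) {
    if (!o->immortal) {
        o->refcnt--;
    }
}

/* Steals the caller's reference to item. */
static void box_set_steal(Box *b, Obj *item) { b->slot = item; }

/* Releases the reference held by the slot, as tuple deallocation would. */
static void box_clear(Box *b) {
    if (b->slot != NULL) {
        obj_decref(b->slot);
        b->slot = NULL;
    }
}

/* Passes item to a stealing setter without owning a reference to it, then
   destroys the container. Returns how many references the real owner lost. */
static long steal_without_owning(Obj *item) {
    long before = item->refcnt;
    Box box = { NULL };
    box_set_steal(&box, item);   /* the caller only had a borrowed pointer */
    box_clear(&box);             /* the box releases a reference it never got */
    return before - item->refcnt;
}
```

With a mortal object the function reports one lost reference, which is the over-release that makes stealing a borrowed reference invalid. With an immortal object it reports zero, so the same call is harmless.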

I would only deal with static singletons (things like Py_None, Py_True, PyExc_RuntimeError, etc.), because developers know that those will always be immortal – interned strings and small integers are typically going to be much less predictable, so people tend to keep treating them as if they were real references.