KEMBAR78
gh-135552: Make the GC clear weakrefs later by nascheme · Pull Request #136189 · python/cpython · GitHub
Skip to content

Conversation

@nascheme
Copy link
Member

@nascheme nascheme commented Jul 1, 2025

Fix two bugs related to the interaction of weakrefs and the garbage
collector. The weakrefs in the tp_subclasses dictionary are needed in
order to correctly invalidate type caches (for example, by calling
PyType_Modified()). Clearing weakrefs before calling finalizers causes
the caches to not be correctly invalidated. That can cause crashes since the
caches can refer to invalid objects. This is fixed by deferring the
clearing of weakrefs to classes and without callbacks until after finalizers
are executed.

The second bug is caused by weakrefs created while running finalizers. Those
weakrefs can be outside the set of unreachable garbage and therefore survive
the delete_garbage() phase (where tp_clear() is called on objects).
Those weakrefs can expose to Python-level code objects that have had
tp_clear() called on them. See GH-91636 as an example of this kind of
bug. This is fixed be clearing all weakrefs to unreachable objects after
finalizers have been executed.

Clear the weakrefs to unreachable objects after finalizers are called.
@neonene
Copy link
Contributor

neonene commented Jul 1, 2025

I can confirm this PR fixes the gh-132413 issue as well.

@nascheme
Copy link
Member Author

nascheme commented Jul 1, 2025

I think this fixes (or mostly fixes) gh-91636 as well.

@nascheme nascheme requested a review from pablogsal July 1, 2025 23:33
@nascheme nascheme added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 1, 2025
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @nascheme for commit 12f0b5c 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F136189%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 1, 2025
@nascheme
Copy link
Member Author

nascheme commented Jul 2, 2025

This introduces refleaks, it seems. One of the leaking tests:
test.test_concurrent_futures.test_shutdown.ProcessPoolSpawnProcessPoolShutdownTest.test_shutdown_gh_132969_case_1
My unconfirmed suspicion is that a finalizer is now resurrecting an object via a weakref. Previously that weakref would be cleared before the finalizer is run. The multiprocessing finalizer logic seems very complicated. :-/

@nascheme
Copy link
Member Author

nascheme commented Jul 2, 2025

This is the smallest leaking example I could make so far. Something with ProcessPoolExecutor leaking maybe.

class LeakTest(unittest.TestCase):
    @classmethod
    def _fail_task(cls, n):
        raise ValueError("failing task")

    def test_leak_case(self):
        # this leaks references
        executor = futures.ProcessPoolExecutor(
                max_workers=1,
                max_tasks_per_child=1,
                )
        f2 = executor.submit(LeakTest._fail_task, 0)
        try:
            f2.result()
        except ValueError:
            pass

        # Ensure that the executor cleans up after called
        # shutdown with wait=False
        executor_manager_thread = executor._executor_manager_thread
        executor.shutdown(wait=False)
        time.sleep(0.2)
        executor_manager_thread.join()

    def test_leak_case2(self):
        # this does not leak
        with futures.ProcessPoolExecutor(
                max_workers=1,
                max_tasks_per_child=1,
                ) as executor:
            f2 = executor.submit(LeakTest._fail_task, 0)
            try:
                f2.result()
            except ValueError:
                pass

@neonene
Copy link
Contributor

neonene commented Jul 2, 2025

Other leaking examples (on Windows):

1. test_logging:
import logging
import logging.config
import logging.handlers
from multiprocessing import Queue, Manager

class ConfigDictTest(unittest.TestCase):
    def test_multiprocessing_queues_XXX(self):
        config = {
            'version': 1,
            'handlers' : {
                'spam' : {
                    'class': 'logging.handlers.QueueHandler',
                    'queue': Manager().Queue()  ,         # Leak
                    # 'queue': Manager().JoinableQueue()  # Leak
                    # 'queue': Queue(),                   # No leak

                },
            },
            'root': {'handlers': ['spam']}
        }
        logger = logging.getLogger()
        logging.config.dictConfig(config)
        while logger.handlers:
            h = logger.handlers[0]
            logger.removeHandler(h)
            h.close()
2. test_interpreters.test_api:
import contextlib
import threading
import types
from concurrent import interpreters

def func():
    raise Exception('spam!')

@contextlib.contextmanager
def captured_thread_exception():
    ctx = types.SimpleNamespace(caught=None)
    def excepthook(args):
        ctx.caught = args
    orig_excepthook = threading.excepthook
    threading.excepthook = excepthook
    try:
        yield ctx
    finally:
        threading.excepthook = orig_excepthook

class TestInterpreterCall(unittest.TestCase):
    def test_call_in_thread_XXX(self):
        interp = interpreters.create()
        call = (interp._call, interp.call)[1]   # 0: No leak, 1: Leak
        with captured_thread_exception() as _:
            t = threading.Thread(target=call, args=(interp, func, (), {}))
            t.start()
            t.join()

UPDATE:

import weakref, _interpreters
wd = weakref.WeakValueDictionary()

class TestInterpreterCall(unittest.TestCase):
    def test_call_in_thread_XXX(self):
        id = _interpreters.create()
        wd[id] = type("", (), {})
        _interpreters.destroy(id)

@nascheme
Copy link
Member Author

nascheme commented Jul 2, 2025

The majority (maybe all) of these leaks are caused by the WeakValueDictionary used as multiprocessing.util._afterfork_registry. That took some digging to find. I'm not yet sure of a good fix for this. Explicitly cleaning the dead weak references from the .data dict works but it not too elegant.

This avoids breaking tests while fixing the issue with tp_subclasses. In
the long term, it would be better to defer the clear of all weakrefs,
not just the ones referring to classes.  However, that is a more
distruptive change and would seem to have a higher chance of breaking
user code.  So, it would not be something to do in a bugfix release.
@nascheme
Copy link
Member Author

nascheme commented Jul 3, 2025

The majority (maybe all) of these leaks are caused by the WeakValueDictionary used as multiprocessing.util._afterfork_registry. That took some digging to find. I'm not yet sure of a good fix for this. Explicitly cleaning the dead weak references from the .data dict works but it not too elegant.

Nope, that doesn't fix all the leaks. And having to explicitly clean the weakrefs from the WeakValueDictionary really shouldn't be needed, I think. The KeyedRef class uses a callback and so they should be cleaned from the dict when the referred value dies. So, I'm not exactly sure what's going on there.

For the purposes of having a fix that we can backport (should probably be backported to all maintained Python versions), a less disruptive fix would be better. To that end, I've changed this PR to only defer clearing weakrefs to class objects. That fixes the tp_subclasses bug but should be less likely to break currently working code.

@nascheme nascheme added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 3, 2025
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @nascheme for commit 2f3daba 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F136189%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 3, 2025
We need to clear those before executing the callback.  Since this
ensures they can't run a second time, we don't need
_PyGC_SET_FINALIZED().  Revise comments.
@nascheme
Copy link
Member Author

nascheme commented Jul 3, 2025

The KeyedRef class uses a callback and so they should be cleaned from the dict when the referred value dies. So, I'm not exactly sure what's going on there.

Ah, the KeyedRef callback requires that the weakref is cleared when it is called, otherwise it was not deleting the item from the weak dictionary. So we need to clear the weakrefs with callbacks before executing them. That fixes the refleaks, I believe.

I revised this PR to be something that is potentially suitable for backporting. To minimize the behaviour change, I'm only deferring the clear of weakrefs that refer to types (in order to allow tp_subclasses to work) or with callbacks. I still have an extra clearing pass that gets done after the finalizers are called. That avoids bugs like #91636.

If this gets merged, I think we should consider deferring clearing all weakrefs (without callbacks) until after finalizers are executed. I think that's more correct since it gives the finalizers a better opportunity to do their job.

* *not* be cleared so that caches based on the type version are correctly
* invalidated (see GH-135552 as a bug caused by this).
*/
clear_weakrefs(&final_unreachable);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ough! Nice trick with final_unreachable!

@pablogsal pablogsal self-assigned this Jul 3, 2025
nascheme added 3 commits July 3, 2025 06:04
This is a bit trickly because the wrlist can be changing as we
iterate over it.
@nascheme nascheme requested review from abalkin and pganssle as code owners July 8, 2025 21:54
@nascheme
Copy link
Member Author

nascheme commented Jul 8, 2025

GH-136401 was merged. I think it should be backported as well since it should be pretty low risk.

The main branch has been merged into this PR. I've updated the PR to defer clearing all weakrefs without callbacks (not just weakrefs to classes). This seems more correct to me (rather than special casing weakrefs to classes) and seems the best way to fix GH-132413.

I think GH-132413 is actually more likely to be encountered in real code. For example, if you have some logging function in a __del__ method that uses the datetime module. So, maybe this PR should be backported to 3.13 and 3.14 as well.

@sergey-miryanov
Copy link
Contributor

Looks good to me. Thanks for detailed comments!

@nascheme
Copy link
Member Author

@pablogsal are you planning to take another look at this? If so, no problem, there is no great rush. If not, I think I'll do another review pass myself and then merge.

@pablogsal
Copy link
Member

pablogsal commented Jul 14, 2025

Yeah i have it pending for this week (sorry fixing a lot of bugs for 3.14 currently) but if you feel I am taking too much time I am happy if you go ahead I don't want to block it on this

@sergey-miryanov
Copy link
Contributor

@nascheme I see you updated comment for handle_weakrefs and removed mentions of GC_REACHABLE and GC_TENTATIVELY_UNREACHABLE.

Would you mind also removing another one from subtract_refs comment?

cpython/Python/gc.c

Lines 572 to 576 in 9363703

/* Subtract internal references from gc_refs. After this, gc_refs is >= 0
* for all objects in containers, and is GC_REACHABLE for all tracked gc
* objects not in containers. The ones with gc_refs > 0 are directly
* reachable from outside containers, and so can't be collected.
*/

@efimov-mikhail
Copy link
Member

Do we want to merge this PR w/o tests on subclasses destruction from #135552?
Or maybe it will be better to place them here?

@nascheme
Copy link
Member Author

Do we want to merge this PR w/o tests on subclasses destruction from #135552?
Or maybe it will be better to place them here?

Sergey has tests in another PR and I'll merge that right after I merge this one.

Consolidate most of the comments related to handling of weakrefs.  This
avoids having the information repeated in different comments.
@nascheme nascheme requested a review from AA-Turner as a code owner August 7, 2025 21:35
@AA-Turner AA-Turner changed the title gh-135552: Make the GC clear weakrefs later. gh-135552: Make the GC clear weakrefs later Aug 7, 2025
@nascheme nascheme merged commit 350c58b into python:main Aug 7, 2025
41 checks passed
@nascheme
Copy link
Member Author

nascheme commented Aug 7, 2025

@pablogsal I'm merging this since I want to get it into the alpha release. However, if you have time to make another review, that would still be appreciated.

nascheme added a commit that referenced this pull request Aug 8, 2025
These are tests to ensure behaviour introduced by GH-136189 is working as expected.

Co-authored-by: Mikhail Borisov <43937008+fxeqxmulfx@users.noreply.github.com>
Co-authored-by: Kumar Aditya <kumaraditya@python.org>
Co-authored-by: Neil Schemenauer <nas-github@arctrix.com>
@sergey-miryanov
Copy link
Contributor

Hooray! Thanks all!

Agent-Hellboy pushed a commit to Agent-Hellboy/cpython that referenced this pull request Aug 19, 2025
Fix a bug caused by the garbage collector clearing weakrefs too early.  The
weakrefs in the ``tp_subclasses`` dictionary are needed in order to correctly
invalidate type caches (for example, by calling ``PyType_Modified()``).
Clearing weakrefs before calling finalizers causes the caches to not be
correctly invalidated.  That can cause crashes since the caches can refer to
invalid objects.  Defer the clearing of weakrefs without callbacks until after
finalizers are executed.
Agent-Hellboy pushed a commit to Agent-Hellboy/cpython that referenced this pull request Aug 19, 2025
These are tests to ensure behaviour introduced by pythonGH-136189 is working as expected.

Co-authored-by: Mikhail Borisov <43937008+fxeqxmulfx@users.noreply.github.com>
Co-authored-by: Kumar Aditya <kumaraditya@python.org>
Co-authored-by: Neil Schemenauer <nas-github@arctrix.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants