KEMBAR78
gh-113993: Allow interned strings to be mortal, and fix related issues by encukou · Pull Request #120520 · python/cpython · GitHub
Skip to content

Conversation

@encukou
Copy link
Member

@encukou encukou commented Jun 14, 2024

I've spent too much time looking at this myself, it wants more eyes :)

I spent a week learning about the string interning mechanism, and wrote up how I think it should work in an InternalDocs file I'm adding here.

I found a bunch of ... quirks if not outright bugs. For example, we have duplicate singletons (e.g. _Py_ID(a) and the latin1 short string a). I don't think I can bring back mortal interned strings without getting my idea of the design in sync with the code, so, this ended up being a big PR.


  • Add an InternalDocs file describing how interning should work and how to use it.
    (Please review this first!)

  • Add internal functions to explicitly request what kind of interning is done:

    • _PyUnicode_InternMortal
    • _PyUnicode_InternImmortal
    • _PyUnicode_InternStatic
  • Switch uses of PyUnicode_InternInPlace to those.

  • Disallow using _Py_SetImmortal on strings directly.
    You should use _PyUnicode_InternImmortal instead:

    • Strings should be interned before immortalization, otherwise you're possibly
      interning a immortalizing copy.
    • _Py_SetImmortal doesn't handle the SSTATE_INTERNED_MORTAL to
      SSTATE_INTERNED_IMMORTAL update, and those flags can't be changed in
      backports, as they are now part of public API and version-specific ABI.
  • Add private _only_immortal argument for sys.XXX, used in refleak test machinery.

  • Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:

    • _Py_ID
    • _Py_STR (including the empty string)
    • one-character latin-1 singletons

    Now, when you intern a singleton, that exact singleton will be interned.

  • Add a _Py_LATIN1_CHR macro, use it instead of _Py_ID/_Py_STR for one-character latin-1 singletons everywhere (including Clinic).

  • Intern _Py_STR singletons at startup.

    Try this in 3.12: (click to expand)
    import sys
    
    a = sys.intern('<module>')  # normal string
    print('a', id(a), sys.getrefcount(a))
    try:
        raise Exception()
    except Exception as err:
        b = err.__traceback__.tb_frame.f_code.co_name  # same string via _Py_STR
    
    assert sys.intern(a) is sys.intern(b)

    In 3.13 the reproducer doesn't work but I don't think the underlying unsoundness was fixed.

  • For free-threaded builds, intern _Py_LATIN1_CHR singletons at startup.

  • Beef up the tests. Cover internal details (marked with @cpython_only).

  • Add lots of assertions

encukou added 30 commits June 14, 2024 21:24
Also, the `PyUnicode_InternImmortal` function (with a public-looking name)
is switched to use this.

(I picked a relatively inconsequential module on purpose.)
AFAIUI, this is now handled by the `statically_allocated` flag.
…rned

Mortal interned strings don't count references from the interned_dict in
their refcount. This menans a -2 at interning time, a +2 in ClearInterned,
and special handling in unicode_dealloc.

Note that unicode_dealloc will currently immortalize an interned string.
That means we shouldn't get refleaks *yet*.
bpo-40521 (pythonGH-84701) is now long solved.
It's marshal and the compiler that immortalize strings, not the
code object.
@encukou encukou added the needs backport to 3.13 bugs and security fixes label Jun 21, 2024
@encukou
Copy link
Member Author

encukou commented Jun 21, 2024

The buildbot failures are unrelated; I'll merge.

Thank you for the reviews!

@encukou encukou merged commit 6f1d448 into python:main Jun 21, 2024
@miss-islington-app
Copy link

Thanks @encukou for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

@encukou encukou deleted the immortal-interned branch June 21, 2024 15:19
@miss-islington-app
Copy link

Sorry, @encukou, I could not cleanly backport this to 3.13 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 6f1d448bc110633eda110310fd833bd46e7b30f2 3.13

@miss-islington-app
Copy link

Sorry, @encukou, I could not cleanly backport this to 3.12 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 6f1d448bc110633eda110310fd833bd46e7b30f2 3.12

@ericsnowcurrently
Copy link
Member

Thanks for taking care of this, @encukou!

encukou added a commit to encukou/cpython that referenced this pull request Jun 24, 2024
…related issues (pythonGH-120520)

* Add an InternalDocs file describing how interning should work and how to use it.

* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`

* Switch uses of `PyUnicode_InternInPlace` to those.

* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you're possibly
    interning a immortalizing copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.

* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.

* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

* Intern `_Py_STR` singletons at startup.

* For free-threaded builds, intern `_Py_LATIN1_CHR` singletons at startup.

* Beef up the tests. Cover internal details (marked with `@cpython_only`).

* Add lots of assertions

Co-Authored-By: Eric Snow <ericsnowcurrently@gmail.com>
@bedevere-app
Copy link

bedevere-app bot commented Jun 24, 2024

GH-120945 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Jun 24, 2024
encukou added a commit that referenced this pull request Jun 24, 2024
…d issues (GH-120520) (GH-120945)

* Add an InternalDocs file describing how interning should work and how to use it.

* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`

* Switch uses of `PyUnicode_InternInPlace` to those.

* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you're possibly
    interning a immortalizing copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.

* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.

* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

* Intern `_Py_STR` singletons at startup.

* For free-threaded builds, intern `_Py_LATIN1_CHR` singletons at startup.

* Beef up the tests. Cover internal details (marked with `@cpython_only`).

* Add lots of assertions

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
mrahtz pushed a commit to mrahtz/cpython that referenced this pull request Jun 30, 2024
… issues (pythonGH-120520)


* Add an InternalDocs file describing how interning should work and how to use it.

* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`

* Switch uses of `PyUnicode_InternInPlace` to those.

* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you're possibly
    interning a immortalizing copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.

* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.

* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

* Intern `_Py_STR` singletons at startup.

* For free-threaded builds, intern `_Py_LATIN1_CHR` singletons at startup.

* Beef up the tests. Cover internal details (marked with `@cpython_only`).

* Add lots of assertions

Co-Authored-By: Eric Snow <ericsnowcurrently@gmail.com>
@eduardo-elizondo
Copy link
Contributor

This is great! Quick follow-up @encukou, should we rebase this one now: #113601?

noahbkim pushed a commit to hudson-trading/cpython that referenced this pull request Jul 11, 2024
… issues (pythonGH-120520)


* Add an InternalDocs file describing how interning should work and how to use it.

* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`

* Switch uses of `PyUnicode_InternInPlace` to those.

* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you're possibly
    interning a immortalizing copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.

* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.

* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

* Intern `_Py_STR` singletons at startup.

* For free-threaded builds, intern `_Py_LATIN1_CHR` singletons at startup.

* Beef up the tests. Cover internal details (marked with `@cpython_only`).

* Add lots of assertions

Co-Authored-By: Eric Snow <ericsnowcurrently@gmail.com>
estyxx pushed a commit to estyxx/cpython that referenced this pull request Jul 17, 2024
… issues (pythonGH-120520)


* Add an InternalDocs file describing how interning should work and how to use it.

* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`

* Switch uses of `PyUnicode_InternInPlace` to those.

* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you're possibly
    interning a immortalizing copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.

* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.

* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

* Intern `_Py_STR` singletons at startup.

* For free-threaded builds, intern `_Py_LATIN1_CHR` singletons at startup.

* Beef up the tests. Cover internal details (marked with `@cpython_only`).

* Add lots of assertions

Co-Authored-By: Eric Snow <ericsnowcurrently@gmail.com>
@encukou
Copy link
Member Author

encukou commented Jul 22, 2024

@eduardo-elizondo Yup, I've updated it, and left some notes about the docs.
If you want to delegate the writing, let me know and I'll take over the PR :)

@eduardo-elizondo
Copy link
Contributor

@eduardo-elizondo Yup, I've updated it, and left some notes about the docs. If you want to delegate the writing, let me know and I'll take over the PR :)

Fixing them now! Let's close out in the other PR

encukou added a commit to encukou/cpython that referenced this pull request Aug 16, 2024
…related issues (pythonGH-120520)

* Add an InternalDocs file describing how interning should work and how to use it.

* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`

* Switch uses of `PyUnicode_InternInPlace` to those.

* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you're possibly
    interning a immortalizing copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.

* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.

* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

* Intern `_Py_STR` singletons at startup.

* Beef up the tests. Cover internal details (marked with `@cpython_only`).

* Add lots of assertions

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
Yhg1s pushed a commit that referenced this pull request Sep 27, 2024
…H-121903, GH-122303) (#123065)

This backports several PRs for gh-113993, making interned strings mortal so they can be garbage-collected when no longer needed.

* Allow interned strings to be mortal, and fix related issues (GH-120520)

  * Add an InternalDocs file describing how interning should work and how to use it.

  * Add internal functions to *explicitly* request what kind of interning is done:
    - `_PyUnicode_InternMortal`
    - `_PyUnicode_InternImmortal`
    - `_PyUnicode_InternStatic`

  * Switch uses of `PyUnicode_InternInPlace` to those.

  * Disallow using `_Py_SetImmortal` on strings directly.
    You should use `_PyUnicode_InternImmortal` instead:
    - Strings should be interned before immortalization, otherwise you're possibly
      interning a immortalizing copy.
    - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
      `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
      backports, as they are now part of public API and version-specific ABI.

  * Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

   Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
    - `_Py_ID`
    - `_Py_STR` (including the empty string)
    - one-character latin-1 singletons

    Now, when you intern a singleton, that exact singleton will be interned.

  * Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

  * Intern `_Py_STR` singletons at startup.

  * Beef up the tests. Cover internal details (marked with `@cpython_only`).

  * Add lots of assertions

* Don't immortalize in PyUnicode_InternInPlace; keep immortalizing in other API (GH-121364)

  * Switch PyUnicode_InternInPlace to _PyUnicode_InternMortal, clarify docs

  * Document immortality in some functions that take `const char *`

  This is PyUnicode_InternFromString;
  PyDict_SetItemString, PyObject_SetAttrString;
  PyObject_DelAttrString; PyUnicode_InternFromString;
  and the PyModule_Add convenience functions.

  Always point out a non-immortalizing alternative.

  * Don't immortalize user-provided attr names in _ctypes

* Immortalize names in code objects to avoid crash (GH-121903)

* Intern latin-1 one-byte strings at startup (GH-122303)

There are some 3.12-specific changes, mainly to allow statically allocated strings in deepfreeze. (In 3.13, deepfreeze switched to the general `_Py_ID`/`_Py_STR`.)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
@@ -0,0 +1,122 @@
# String interning
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have a link to this file from the index in InternalDocs/README.md.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants