KEMBAR78
Constant hash value for None to aid reproducibility · Issue #99540 · python/cpython · GitHub
Skip to content

Constant hash value for None to aid reproducibility #99540

@yonillasky

Description

@yonillasky

Feature or enhancement

Fix hash(None) to a constant value.

Pitch

(Updated 2022.11.18)

  • Under current behavior, the runtime leaks the ASLR offset, since the original address of the None singleton is fixed and _Py_HashPointerRaw is reversible. Admittedly, there are other similar objects, like NotImplemented or Ellipsis that also have this problem, and need to be similarly fixed.

  • Because of ASLR, hash(None) changes every run; that consequently means the hash of many useful "key" types changes every run, particularly tuples, NamedTuples and frozen dataclasses that have Optional fields.

  • The other source of hash value instability across runs in common "key" types like str or Enum, can be fixed using the PYTHONHASHSEED environment var.

  • other singletons commonly used as (or as part of) mapping keys, True and False already have fixed hash values.

CPython's builtin set classes, as do all other non-concurrent hash-tables, either open or closed, AFAIK, grant the user a certain stability property. Given a specific sequence of initialization and subsequent mutation (if any), and given specific inputs with certain hash values, if one were to "replay" it, the result set will be in the same observable state every time: not only have the same items (correctness), but also they would be retrieved from the set in the same order when iterated.

This property means that code that starts out with identical data, performs computations and makes decisions based on the results will behave identically between runs. For example, if based on some mathematical properties of the input, we have computed a set of N valid choices, they are given integer scores, then we pick the first choice that has maximal score. If the set guarantees the property described above, we are also guaranteed that the exact same choice will be made every time this code runs, even in case of ties. This is very helpful for reproducibility, especially in complex algorithmic code that makes a lot of combinatorial decisions of that kind.

There is a counterargument that we should simply just offer StableSet and StableFrozenSet that guarantee a specific order, the same way that dict does.
A few things to note about that:

  • I've written such set classes as an adapter over dict[T, None], there is a substantial perf overhead to that
  • Is it worth the extra "weight" in code inside the core? That's suspect - why hasn't it been added all those years?
  • In a large codebase, it requires automated code inspection and editing tools to enforce this. It's all too easy, and natural, to add a seemingly harmless set comprehension somewhere and defeat the whole effort
  • The insertion-order-as-iteration-order guarantee is stronger than what we actually require, in order to have the "reproducability" property I've described, so we're paying extra for something we don't really need.

My PR makes a small change to CPython, in objects.c, that sets the tp_hash descriptor of NoneType to a function that simply returns a constant value.

Admittedly, determinism between runs isn't a concern that most users/programs care about. It is rather niche. However, I argue that still, there is no externalized cost to this change.

Previous discussion

https://discuss.python.org/t/constant-hash-for-none/21110

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Labels

    type-featureA feature request or enhancement

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions