-
-
Notifications
You must be signed in to change notification settings - Fork 33.2k
Description
Feature or enhancement
Fix hash(None)
to a constant value.
Pitch
(Updated 2022.11.18)
-
Under current behavior, the runtime leaks the ASLR offset, since the original address of the
None
singleton is fixed and_Py_HashPointerRaw
is reversible. Admittedly, there are other similar objects, likeNotImplemented
orEllipsis
that also have this problem, and need to be similarly fixed. -
Because of ASLR,
hash(None)
changes every run; that consequently means the hash of many useful "key" types changes every run, particularly tuples, NamedTuples and frozen dataclasses that haveOptional
fields. -
The other source of hash value instability across runs in common "key" types like str or Enum, can be fixed using the
PYTHONHASHSEED
environment var. -
other singletons commonly used as (or as part of) mapping keys,
True
andFalse
already have fixed hash values.
CPython's builtin set classes, as do all other non-concurrent hash-tables, either open or closed, AFAIK, grant the user a certain stability property. Given a specific sequence of initialization and subsequent mutation (if any), and given specific inputs with certain hash values, if one were to "replay" it, the result set will be in the same observable state every time: not only have the same items (correctness), but also they would be retrieved from the set in the same order when iterated.
This property means that code that starts out with identical data, performs computations and makes decisions based on the results will behave identically between runs. For example, if based on some mathematical properties of the input, we have computed a set of N valid choices, they are given integer scores, then we pick the first choice that has maximal score. If the set guarantees the property described above, we are also guaranteed that the exact same choice will be made every time this code runs, even in case of ties. This is very helpful for reproducibility, especially in complex algorithmic code that makes a lot of combinatorial decisions of that kind.
There is a counterargument that we should simply just offer StableSet
and StableFrozenSet
that guarantee a specific order, the same way that dict
does.
A few things to note about that:
- I've written such set classes as an adapter over
dict[T, None]
, there is a substantial perf overhead to that - Is it worth the extra "weight" in code inside the core? That's suspect - why hasn't it been added all those years?
- In a large codebase, it requires automated code inspection and editing tools to enforce this. It's all too easy, and natural, to add a seemingly harmless set comprehension somewhere and defeat the whole effort
- The insertion-order-as-iteration-order guarantee is stronger than what we actually require, in order to have the "reproducability" property I've described, so we're paying extra for something we don't really need.
My PR makes a small change to CPython, in objects.c
, that sets the tp_hash
descriptor of NoneType
to a function that simply returns a constant value.
Admittedly, determinism between runs isn't a concern that most users/programs care about. It is rather niche. However, I argue that still, there is no externalized cost to this change.
Previous discussion
https://discuss.python.org/t/constant-hash-for-none/21110