KEMBAR78
bpo-45250: fix docs regarding `__iter__` and iterators being inconsistently required by CPython by brettcannon · Pull Request #29170 · python/cpython · GitHub
Skip to content

Conversation

@brettcannon
Copy link
Member

@brettcannon brettcannon commented Oct 22, 2021

It is now considered a historical accident that e.g. for loops and the iter() built-in function do not require the iterators they work with to define __iter__, only __next__.

https://bugs.python.org/issue45250

@bedevere-bot bedevere-bot added the docs Documentation in the Doc dir label Oct 22, 2021
@brettcannon brettcannon added the needs backport to 3.10 only security fixes label Oct 22, 2021
@brettcannon brettcannon self-assigned this Oct 22, 2021
@srittau
Copy link
Contributor

srittau commented Oct 25, 2021

I'm strongly opposed to changing the meaning of a term that has had a clear definition for decades (and is implemented as collections.abc.Iterator) by an ambiguous definition. It's clear that some methods accept a super type, instead of requiring a full iterator, but that means the description of those methods should be changed, not the definition.

Addendum: As a typeshed maintainer I see the problem of ill-defined terms and protocols in Python everyday ("file-like objects" are my personal bane). Making existing clear definitions less clear is not the right way forward.

@brettcannon
Copy link
Member Author

The issue is the definition as written down is wrong/inaccurate from the perspective of Python itself. So it isn't that this redefines a term as much as actually makes the glossary reflect the real world as to how Python itself uses and considered what iterators are (i.e. this properly reflects what PyIter_Check() actually checks for). For instance, if a pass an iterable to a for loop and it returns something that only defines __next__(), it will work. If you were to ask someone what that object returned by the iterable is, do you think people would argue that it isn't an iterator simply because it doesn't define __iter__(), or that it is an iterator because a for loop accepts it?

And I don't quite understand how the definition is ambiguous? An iterator defines __next__(). You probably should define __iter__(), but it isn't strictly necessary to meet the iterator protocol, it just so happens the commonly used ABC for showing iterator compliance provides an __iter__() definition for free. If the ABC wasn't used for type hints then I don't think anyone would care, but because the class predates protocols then people have accidentally thought __iter__() was a required part of being an iterator (along with the inaccurate glossary definitions).

What's your proposal otherwise? To create a brand new term of what e.g. a for loop expects an iterable to return that isn't an iterator? To my knowledge everyone considers that an iterator, so trying to redefine that is also a major ask and something that I personally think is less helpful than simply getting the definition to be accurate.

Do note that Guido and other folks agree with this plan in https://mail.python.org/archives/list/python-dev@python.org/thread/3W7TDX5KNVQVGT5CUHBK33M7VNTP25DZ/#3W7TDX5KNVQVGT5CUHBK33M7VNTP25DZ, so this isn't entirely without discussion and some agreement. We can ask the SC to make a final call on this if you really want to (I will abstain from voting on it if it comes to that).

@srittau
Copy link
Contributor

srittau commented Oct 26, 2021

(I'll use the terms "partial iterator" for objects just defining __next__() and "full iterator" for objects also defining __iter__().)

I'm not looking at this from the perspective how this is implemented in Python, but from a user's perspective. I don't think it's correct to say Python considers an iterator to have only __next__(). That may be true for the CPython core, but it's clearly not the case for the standard library (which I consider to be part of Python), where collections.abc.Iterator and typing.Iterator require the __iter__() method.

Practically, the status quo (some function, methods, statements don't require __iter__()) is a non-issue in practice. But saying "__iter__() is optional (but recommended)" turns this into a practical communication issue. Now, if I have one API that returns an iterator and a second API that accepts an iterator, I can't be sure that they are compatible. I can only hope that either the API returning the iterator follows the recommendation or the API consuming it doesn't access __iter__(). In the future, APIs need to document whether they include __iter__() or not. It also means that all existing documentation that mentions "iterators" needs to be updated to clarify whether they mean partial or full iterators.

I see two solutions to this problem:

  1. Introduce a new term as you suggest (e.g. "partial iterator") and define a link between the "iterator" and "partial iterator" glossary entries. This has the advantage that both concepts are clearly named.
  2. My preferred choice: Keep the definition of "iterator" as is and update the documentation where applicable, either with a note that technically __iter__() is not required, or by replacing the term "iterator" with a sentence like "an object implementing the __next__() protocol, e.g. an iterator".

@brettcannon
Copy link
Member Author

collections.abc.Iterator and typing.Iterator require the iter() method.

I think this is where our views are differing. It's an unfortunate side-effect/bug, from my perspective, that because collections.abc.Iterator includes a default implementation for __iter__() that typing considers that method as required (tangent: this is the same reason I want to create protocols for importlib as importlib.abc has classes that are too broad for typing purposes due to the helper methods provided).

In other words I don't view collections.abc.Iterator as the official definition/type of what an iterator is, but instead an ABC that helps you implement an iterator.

Maybe this is suggesting it's time to have a lower-level protocols module or something which gets all of these interfaces correctly based on the language definition, then collections.abc can inherit from that new module? Or perhaps typing.Iterator comes back as a protocol or there's a new typing.SupportsIteration that does the right thing while collections.abc.Iterator inherits from it?

@brettcannon brettcannon reopened this Oct 28, 2021
@ambv
Copy link
Contributor

ambv commented Oct 28, 2021

The relevant python-dev thread doesn't seem to have reached consensus and died out.

@brettcannon
Copy link
Member Author

I was planning to ask the SC to make a call.

@godlygeek
Copy link
Contributor

godlygeek commented Oct 30, 2021

For instance, if a pass an iterable to a for loop and it returns something that only defines __next__(), it will work. If you were to ask someone what that object returned by the iterable is, do you think people would argue that it isn't an iterator simply because it doesn't define __iter__(), or that it is an iterator because a for loop accepts it?

I would absolutely make the argument that it isn't an iterator because it doesn't define __iter__(). It happens to work for a for loop anyway, despite the fact that it doesn't follow the full protocol for an iterator, but that is a perfectly natural thing to happen in a language that uses duck typed protocols: if a class implements only one part of a protocol, it may or may not work in places where that protocol is required, depending on whether or not the parts of the protocol it implements are the only parts needed.

It seems to me that you're proposing a change to the language here, not just to the documentation. As things stand today iter(iter(obj)) has a well defined, consistent meaning, because iterators are required to be iterable. If this proposed change is made, that will no longer be the case, which seems like it could easily break the assumptions of real world code.

For instance, this recipe is in the itertools documentation:

def sliding_window(iterable, n):
    # sliding_window('ABCDEFG', 4) -> ABCD BCDE CDEF DEFG
    it = iter(iterable)
    window = collections.deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield tuple(window)
    for x in it:
        window.append(x)
        yield tuple(window)

As things are defined today, this function will work for all iterables, because an iterable is defined to return an iterator from __iter__, and because an iterator is defined to itself be an iterable (and therefore valid input to islice). This function would be broken if the iterable's __iter__ returned an object with __next__ but without __iter__, but as the language is defined today, that's an invalid implementation of an iterator, and so the thing that's returning something that isn't quite an iterator is itself not quite an iterable, and so it's invalid input to this function.

If we change the language definition to say that iter() need not return something iterable, then we need a new way to describe the preconditions of this function: it needs not just any iterable as input, but an iterable whose iterators are iterable.

As things stand today, if an iterator has __next__ but not __iter__, it is operating out of contract and therefore not a valid iterator, and libraries like itertools have no obligation to support it. How much code in the wild assumes that iterators are always iterable, and will break if we give people leave to begin making non-iterable iterators? The "roughly equivalent" pure Python implementations of accumulate, dropwhile, islice, and zip_longest all assume that the iterable's iterator is iterable; it's not crazy to think that real world Python code commonly makes the same assumption.

@SebastiaanZ
Copy link
Contributor

SebastiaanZ commented Oct 30, 2021

There are more recipes in the itertools documentation that highlight how common of a pattern it is to rely on iterators having an __iter__ method, such as this one:

def grouper(iterable, n, fillvalue=None):
    "Collect data into non-overlapping fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

While I'd agree that repices show usage of the language and do not define it, I can't help but think that those recipes show that solving iteration problems with patterns that rely on iterators being iterable (as-in having an __iter__ method that returns self) are commonplace.

This makes this change a breaking change and unfortunately one that makes code that uses such patterns have a hidden bug, as most iterators will have an __iter__ method (if people heed the encouragement in the docs), but not all of them, since it's no longer a requirement. Existing code may fail, but it's not going to be obvious right away.

@PythonCHB
Copy link

There are more recipes in the itertools documentation that highlight how common of a pattern it is to rely on iterators having an iter method, such as this one:

def grouper(iterable, n, fillvalue=None):
    "Collect data into non-overlapping fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

Hmm -- the name chosen for that parameter is "iterable", not "iterator". So it seems to me that it's expecting that all iterables will return an iterator when passed to iter() -- which indeed is what an iterable is. However, of course, the rest of itertools (and many builtins) also expect an iteralbe, so if you are going to chain them, like in this example, the iterators involved need to also be iterable.

I will say that it took me years to clearly "get" the distinction myself, but I think there are two things here:

An iterator returns an item when passed to next() e.g. has a __next__ method, and raises StopIteration when there are no more items.

An iterable returns an iterator when passed to the iter() function, e.g. has an __iter__ method.

In addition, it is a very common, very useful, and highly recommended convention that all iterators also be iterables, i.e. an __iter__ method that returns itself.

Personally, I think it's both correct and useful to define things this way -- that is "iterator" is about the iteration, and "iterable" is about being able to produce an iterator.

It is very, very, convenient that in most places in Python, one can pass either an iterator or iterable in places where an iterator is required -- this saves us from having to call iter() directly very often at all. But it does mean that most iterators should be iterable.

But they are still distinct protocols, we should make that clear in the docs.

All that being said: an enormous amount of code (including most of itertools) expects iterables rather than iterators, so the docs should make it very clear that an iterator that does not support iter() will be of little general utility.

@SebastiaanZ
Copy link
Contributor

SebastiaanZ commented Oct 31, 2021

There are more recipes in the itertools documentation that highlight how common of a pattern it is to rely on iterators having an iter method, such as this one:

def grouper(iterable, n, fillvalue=None):
    "Collect data into non-overlapping fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

Hmm -- the name chosen for that parameter is "iterable", not "iterator". So it seems to me that it's expecting that all iterables will return an iterator when passed to iter() -- which indeed is what an iterable is. However, of course, the rest of itertools (and many builtins) also expect an iteralbe, so if you are going to chain them, like in this example, the iterators involved need to also be iterable.

I'm sorry, I should have been clearer. Even if you pass an iterable to the grouper function in this recipe, if that iterable's __iter__ method returns an iterator that does not have an __iter__ method that returns self itself, this recipe will fail. This is because this recipe itself will implicitly call iter twice:

    # here you make an iterator from whatever iterable you pass in
    args = [iter(iterable)] * n 

    # that iterator is then passed to `zip_longest`, which will in turn
    # call `iter` on the iterator this function has already created.
    return zip_longest(*args, fillvalue=fillvalue)

Even if we agree that iterators are not always iterable and therefore not suitable for this recipe, if the actual iterable you pass to this recipe has a non-iterable iterator, this recipe will still fail. This gist demonstrates what I mean.

@godlygeek
Copy link
Contributor

Hmm -- the name chosen for that parameter is "iterable", not "iterator". So it seems to me that it's expecting that all iterables will return an iterator when passed to iter() -- which indeed is what an iterable is.

Yes, but that's not all it's assuming. It's also assuming that the iterator returned by iter() is itself an iterable, because it passes it to zip_longest, which requires iterables. That's what @SebastiaanZ was pointing out.

As things are defined today, that's a valid assumption, because iterators are required to be iterable. If we change the documentation to say that iterators are not required to be iterable, this function will go from working for all iterables to working for only some iterables (those whose iterators are iterable). And we won't have any word to describe the subset of iterables for which the function works.

But they are still distinct protocols, we should make that clear in the docs.

The docs today clearly specify that all iterators must be iterable. This isn't clarifying the requirements for iterators, it's changing them.

@godlygeek
Copy link
Contributor

godlygeek commented Nov 2, 2021

It looks like even the interpreter core assumes that all iterators are iterable.

Unpacking like a, b, *rest = iterable is done by unpack_iterable, which calls PyObject_GetIter() on the iterable, and later passes that iterator to PySequence_List, which treats it as an iterable.

So an iterable whose iterators are not iterable can be used in a for loop or consumed by list, but it can't be unpacked:

>>> class Iterator:
...     def __next__(self):
...         if hasattr(self, "consumed"):
...             raise StopIteration
...         self.consumed = True
...         return 42
...
>>> class Iterable:
...     def __iter__(self):
...         return Iterator()
...
>>> list(Iterable())
[42]
>>> first, *rest = Iterable()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'Iterator' object is not iterable

Which fits with my mental model that this Iterator is not really an iterator because it implements only half of the iterator protocol, and so it will be broken in contexts that require the whole protocol.

@brettcannon
Copy link
Member Author

The SC discussed this and the decision was to update the definitions to say it's a CPython implementation detail that it is inconsistent in its checking whether what an iterable returns is a proper iterator or not (i.e. keep the __iter__ requirement but admit it isn't rigorously enforced by CPython as a historical accident).

I have not decided if I'm going to update this PR or close it and open a new one with those changes.

@brettcannon brettcannon changed the title bpo-45250: fix docs regarding __iter__ and iterators bpo-45250: fix docs regarding __iter__ and iterators being inconsistently required by CPython Nov 20, 2021
@brettcannon brettcannon merged commit be36e06 into python:main Nov 20, 2021
@brettcannon brettcannon deleted the iterator-definition branch November 20, 2021 00:40
@miss-islington
Copy link
Contributor

Thanks @brettcannon for the PR 🌮🎉.. I'm working now to backport this PR to: 3.10.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

@miss-islington
Copy link
Contributor

Sorry @brettcannon, I had trouble checking out the 3.10 backport branch.
Please backport using cherry_picker on command line.
cherry_picker be36e0634060c7d5dee8e8876fb888bbb53d992a 3.10

brettcannon added a commit to brettcannon/cpython that referenced this pull request Nov 20, 2021
…nconsistently required by CPython (pythonGH-29170)

It is now considered a historical accident that e.g. `for` loops and the `iter()` built-in function do not require the iterators they work with to define `__iter__`, only `__next__`.
(cherry picked from commit be36e06)

Co-authored-by: Brett Cannon <brett@python.org>
@bedevere-bot
Copy link

GH-29650 is a backport of this pull request to the 3.10 branch.

@bedevere-bot bedevere-bot removed the needs backport to 3.10 only security fixes label Nov 20, 2021
brettcannon added a commit that referenced this pull request Nov 22, 2021
…nconsistently required by CPython (GH-29170) (GH-29650)

It is now considered a historical accident that e.g. `for` loops and the `iter()` built-in function do not require the iterators they work with to define `__iter__`, only `__next__`.
(cherry picked from commit be36e06)

Co-authored-by: Brett Cannon <brett@python.org>
remykarem pushed a commit to remykarem/cpython that referenced this pull request Dec 7, 2021
…tently required by CPython (pythonGH-29170)

It is now considered a historical accident that e.g. `for` loops and the `iter()` built-in function do not require the iterators they work with to define `__iter__`, only `__next__`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Documentation in the Doc dir

Projects

None yet

Development

Successfully merging this pull request may close these issues.