KEMBAR78
gh-127647: Add typing.Reader and Writer protocols by srittau · Pull Request #127648 · python/cpython · GitHub
Skip to content

Conversation

@srittau
Copy link
Contributor

@srittau srittau commented Dec 5, 2024

@srittau
Copy link
Contributor Author

srittau commented Dec 5, 2024

A few design considerations:

  • I used the names Reader and Writer to match existing ABC names in typing/collections.abc. Alternatives are Readable and Writable, but I think they are used more rarely for these kind of (pseudo-)protocols. SupportsX would work for Writer, but not for Reader, since the latter supports multiple methods. (Any we maybe want to use SupportsRead etc. at some point for tighter protocols.)
  • I would prefer these protocols to be in io, where they fit better thematically, but that would require importing typing with unforseeable (performance) implications for such a foundational module.
  • I deliberated using tighter protocols, but for ease of use reasons – and since they are mostly an alternative to using IO et al. – I went with larger protocols for now.
  • Seekable (with methods seek and tell) would be an interesting addition as well, but I think this should wait for easy protocol composition.

@srittau
Copy link
Contributor Author

srittau commented Dec 5, 2024

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit sad that we can't use covariance and contravariance in the protocols since a subclass could use it with an invariant type.

Small improvements to the docstrings and signature

Protocol for reading from a file or other input stream.

.. method:: read(size=...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. method:: read(size=...)
.. method:: read(size=..., /)

(Same for other methods)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unsure of how much of the signature to document. For example, would it also make sense to add argument and/or return types? I've added the slashes for now.

@AlexWaygood
Copy link
Member

I would prefer these protocols to be in io, where they fit better thematically, but that would require importing typing with unforseeable (performance) implications for such a foundational module.

io feels like a more natural home to me too. Could you not do something similar to the pseudo-protocol ABCs in collections.abc, contextlib and os.path? The approach there is:

  • Make the classes ABCs rather than protocols
  • But document them as protocols
  • And write __subclasshook__ methods so that they behave identically to runtime-checkable Protocols in their isinstance()/issubclass() behaviour:
    @classmethod
    def __subclasshook__(cls, C):
    if cls is AbstractContextManager:
    return _collections_abc._check_methods(C, "__enter__", "__exit__")
    return NotImplemented
  • And special-case the ABCs in typing.py so that the typing module treats them identically to protocols:

    cpython/Lib/typing.py

    Lines 1940 to 1948 in 023b7d2

    _PROTO_ALLOWLIST = {
    'collections.abc': [
    'Callable', 'Awaitable', 'Iterable', 'Iterator', 'AsyncIterable',
    'AsyncIterator', 'Hashable', 'Sized', 'Container', 'Collection',
    'Reversible', 'Buffer',
    ],
    'contextlib': ['AbstractContextManager', 'AbstractAsyncContextManager'],
    'os': ['PathLike'],
    }
  • And pretend in typeshed that they're protocols, so that type checkers treat them as protocols (I feel like python/typeshed@f554f54 missed the point there slightly)

@srittau
Copy link
Contributor Author

srittau commented Dec 6, 2024

That feels terribly hackish to me, even if there is precedence. And in my experience, hackish stuff often falls on one's feet one way or the other. But I can move it to io if that's the consensus.

@AlexWaygood
Copy link
Member

This has been the way the collections.abc and contextlib ABCs have always worked. As far as I'm aware, it's worked pretty well up to now. And it doesn't actually feel like the most hackish part of collections.abc ;)

_sys.modules['collections.abc'] = _collections_abc
abc = _collections_abc

@srittau
Copy link
Contributor Author

srittau commented Dec 6, 2024

Well, considering that typing.is_protocol uses the _is_protocol marker to determine whether something is a protocol, I guess that's the official way to mark protocols at runtime, not the fact that they are derived from Protocol. Which contradicts the typing spec at the moment. This all seems a bit precarious to me.

@AlexWaygood
Copy link
Member

I'm open to something less hacky if you can find a generalized solution that doesn't make _collections_abc depend on typing (which would significantly slow down startup for all Python applications, since _collections_abc is imported as part of Python's initialisation).

@AlexWaygood
Copy link
Member

considering that typing.is_protocol uses the _is_protocol marker to determine whether something is a protocol, I guess that's the official way to mark protocols at runtime, not the fact that they are derived from Protocol. Which contradicts the typing spec at the moment.

I don't think the typing spec needs to be cognisant of the hacks we use to make things work at runtime in the stdlib! This knowledge is irrelevant for third-party users of protocols, to users of type checkers and to implementers of type checkers. It's definitely not encouraged for any external users to make use of the _is_protocol attribute.

@srittau
Copy link
Contributor Author

srittau commented Dec 6, 2024

CI seems flaky, failures are unrelated.

srittau and others added 2 commits February 27, 2025 13:08
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Copy link
Member

@AlexWaygood AlexWaygood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM now but it might be good to have one or two simple tests covering the new code. It probably doesn't need to be much more than this (added to test_io.py)

def test_supports_index(self):
self.assertIsSubclass(int, typing.SupportsIndex)
self.assertNotIsSubclass(str, typing.SupportsIndex)

and maybe an addition to the test_typing tests similar to this one:

def test_collections_protocols_allowed(self):
@runtime_checkable
class Custom(collections.abc.Iterable, Protocol):
def close(self): ...
class A: pass
class B:
def __iter__(self):
return []
def close(self):
return 0
self.assertIsSubclass(B, Custom)
self.assertNotIsSubclass(A, Custom)
@runtime_checkable
class ReleasableBuffer(collections.abc.Buffer, Protocol):
def __release_buffer__(self, mv: memoryview) -> None: ...
class C: pass
class D:
def __buffer__(self, flags: int) -> memoryview:
return memoryview(b'')
def __release_buffer__(self, mv: memoryview) -> None:
pass
self.assertIsSubclass(D, ReleasableBuffer)
self.assertIsInstance(D(), ReleasableBuffer)
self.assertNotIsSubclass(C, ReleasableBuffer)
self.assertNotIsInstance(C(), ReleasableBuffer)

Copy link
Member

@AlexWaygood AlexWaygood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you! Just a few docs nits remaining

Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
@AlexWaygood
Copy link
Member

AlexWaygood commented Feb 27, 2025

I'll leave it a little bit in case any of the other reviewers here have any remaining concerns before merging

__slots__ = ()

@abc.abstractmethod
def read(self, size=..., /):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if this has been discussed before, but I'm unsure on the runtime use of size=... (I didn't notice this earlier in my documentation review, sorry).

Almost every other read(size) method I can find has a default of either None or -1. I also can't find another method in the stdlib with a default of ... (outside of the recently-added protocols in wsgiref.types).

Would it be better to have size=-1, to indicate that the method takes an int? I'm not sure how much we want typeshed-like practices to leak into the standard library.

Suggested change
def read(self, size=..., /):
def read(self, size=-1, /):

A

Copy link
Contributor Author

@srittau srittau Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mandating defaults is not really something you can do in a protocol. I also wouldn't want to mandate that implementors have to use a default of -1, because – as you said – some implementations use None.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but this isn't a protocol -- it's an ABC, which do have defaults -- see e.g. collections.abc.Generator. All of the read() methods in io are documented as having read(size=-1, /), and given these ABCs are going into io, I think we should be consistent with that interface, or have good reason to diverge from it (& document why).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's supposed to be a protocol, not an ABC. (Notwithstanding the fact that all protocols are ABCs.) It's just a protocol in the implementation for performance reasons. And using -1 as a default would give users the very wrong impression that they can use read(-1) when that may or may not actually be supported.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using -1 as a default would give users the very wrong impression that they can use read(-1) when that may or may not actually be supported.

From the documentation of io.RawIOBase.read():

Read up to size bytes from the object and return them. As a convenience, if size is unspecified or -1, all bytes until EOF are returned.

I would expect the io.Reader.read() ABC/protocol to have this same guarantee, for a 'properly' implemented read() method (according to the io expecations). In the proposed documentation, we say:

Read data from the input stream and return it. If size is specified, it should be an integer, and at most size items (bytes/characters) will be read.

This forbids None (good!), but is silent on what happens should size be omitted. I still think we should use -1 instead of ..., but at the very least we should include in the documentation the contract for what happens when read() is called with no arguments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. -1 is the default for some implementations, but the protocol should not make any default mandatory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure I follow, though -- currently the ABC/protocol has the mandatory default of size=..., which is always invalid and not the correct type. -1 is valid for all interfaces specified by the io documentation, which is where this new type is being added.

I ran the following with mypy --strict and it passed, so I don't think type checkers care about default values (and as discussed, the type forbids using non-integer types):

import abc

class Reader(metaclass=abc.ABCMeta):
    __slots__ = ()

    @abc.abstractmethod
    def read(self, size: int = -1, /) -> bytes: pass

class CustomReader(Reader):
    def read(self, size: int = -1, /) -> bytes:
        return b''

class CustomReaderZero(Reader):
    def read(self, size: int = 0, /) -> bytes:
        return b''

assert issubclass(CustomReader, Reader)
assert issubclass(CustomReaderZero, Reader)
assert isinstance(CustomReader(), Reader)
assert isinstance(CustomReaderZero(), Reader)

def reader_func(r: Reader) -> None:
    r.read()

reader_func(CustomReader())
reader_func(CustomReaderZero())

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Sebastian here; we should use ... because the protocol need not mandate any particular default.

-1 is valid for all interfaces specified by the io documentation

The protocol should also match other file-like classes defined elsewhere in the stdlib or even in third-party libraries. When defining a protocol it's often useful to be permissive, so that all objects that are intended to match the protocol actually match it.

@srittau
Copy link
Contributor Author

srittau commented Mar 6, 2025

Could this be merged before the next alpha next week?

@JelleZijlstra JelleZijlstra merged commit c6dd234 into python:main Mar 6, 2025
39 checks passed
@srittau srittau deleted the typing-readable-writable branch March 6, 2025 16:02
@bluetech
Copy link
Contributor

bluetech commented Mar 7, 2025

I only saw the update now, great that it's merged!

I have a few comments. I hope they're not too bothersome - I just think these protocols are important so it's worth the effort.

Can the docs include type annotations? I know that Sphinx supports it. Given that this is meant for static typing, it makes sense to me to include.

I think it is important to document the expected semantics of these methods, otherwise you don't really know what you're going to get when you accept a Reader or Writer. What I can think of:

  • What should happen on EOF?
  • Is EOF sticky?
  • Short reads/writes
  • What does it mean to omit size?
  • What happens on size 0?
  • Is size allowed to be -1 or more generally negative?
  • Can read return value be None? Empty?
  • Can write return value be 0?
  • Expected exceptions -- IOError?

I don't like the parameter name size, because from the caller perspective it is not exactly a size of anything. I would have preferred just n. But, I see this name is already used e.g. IOBase and consistency probably wins the argument here.

The docs say "If size is specified, it should be an integer, and at most size items (bytes/characters) will be read" and "Write data to the output stream and return the number of items (bytes/characters) written". I would drop the "(bytes/characters)" since T is not necessarily that.

The docs say "The following protocols can be used for annotating function and method arguments for simple stream reading or writing operations". I have two nitpicks:

  • What is annotated is the parameter, not the argument (= a value for a parameter in a specific invocation). So would change "arguments" -> "parameters".
  • I don't think it is just for simple operations, so I'd remove "simple".

I think T should have some bound. I think at least Sized, it's hard to interpret the return value without a notion of size.

@srittau
Copy link
Contributor Author

srittau commented Mar 9, 2025

A merged PR is probably not the best place to discuss this, but a few points:

Can the docs include type annotations?

Personally, I'd be fine with, and actually prefer if the docs used type annotations more, but I didn't want to divert from the existing style in the documentation.

I think it is important to document the expected semantics of these methods

While it would make sense to me to document some of those semantics (e.g. size must be >= 1), most are not part of this protocol. Exceptions and EOF are not concepts that are understood by those protocols.

I would drop the "(bytes/characters)" since T is not necessarily that.

It might have been clearer to say "(e.g. bytes or characters)", but I think calling out those specific types here makes the documentation clearer, as in 99% of cases it will be either of those. Otherwise the concept of "item" is a bit nebulous.

I think T should have some bound.

I don't think it's necessary to make the protocol more complex. Whoever is using the protocol will have to choose an type that they can work with.

@srittau
Copy link
Contributor Author

srittau commented Mar 9, 2025

Also in general, protocols are not a replacement for API documentation. It doesn't make too much sense to include expectations around a protocol that can't be checked statically. To document these expectations is the responsibility of the author of a function that uses these protocols, and to follow them is the responsibility of the user. These expectations can vary significantly from function to function.

@bluetech
Copy link
Contributor

(Please feel free to ignore this if you don't want to discuss this further -- I couldn't resist posting another comment...)

If I understand correctly, you prefer a narrow approach to these protocols because 1) they can't be checked statically 2) semantics can vary between uses. My replies:

  1. In an ideal world, it would be possible to convey all semantics in a static way. But if it's not possible, documentation is the alternative. Stated differently, the Platonic ideal of the Reader interface contains all of its (agreed) semantics, but its projection onto the Python type system is incomplete, so the documentation should try its best to fill the gap. That's how I view it at least.

  2. Why should the semantics vary between uses? A single set of semantics has a big advantage - it reduces (N impls * M uses) coordination points to N + M. As an implementer, I can (implicitly or explicitly) follow the protocol and don't need to invent or document my semantics. As a user, I write to a single set of semantics, not needing to handle multiple sets, defensively protect against unknown ones, or document my expectations. As a "middleware" (e.g. a BufferedReader which adds buffering to any reader), I can take any generic reader. What's the advantage of allowing implementations to be inconsistent on something like how EOF is signaled?

@cmaloney
Copy link
Contributor

cmaloney commented Sep 2, 2025

@srittau I've been working on refactoring the I/O tests and I don't think ProtocolsTest in test_io is being run at the moment.

test_io.py (now test_io/test_general.py) currently uses load_tests to customize what tests are found / run rather than the more typical auto discovery based on subclassing and the ProtocolsTest added in this PR but not added to the test list. Adding the test to the list for me it doesn't pass, not sure what your preferred way to get it passing would be; would appreciate if you could have a look.

You should be able to run just the ProtocolsTest by:

  1. Update load_tests test list to include it:
    tests = (CIOTest, PyIOTest, APIMismatchTest,
    CBufferedReaderTest, PyBufferedReaderTest,
    CBufferedWriterTest, PyBufferedWriterTest,
    CBufferedRWPairTest, PyBufferedRWPairTest,
    CBufferedRandomTest, PyBufferedRandomTest,
    StatefulIncrementalDecoderTest,
    CIncrementalNewlineDecoderTest, PyIncrementalNewlineDecoderTest,
    CTextIOWrapperTest, PyTextIOWrapperTest,
    CMiscIOTest, PyMiscIOTest,
    CSignalsTest, PySignalsTest, TestIOCTypes,
    )
  2. ./python -m test test_io.test_general -v -m 'ProtocolsTest'

With that I currently get:

./python -m test test_io.test_general -v -m 'ProtocolsTest'
== CPython 3.15.0a0 (heads/exp/test_bufferedio_split_v0-dirty:f4f150e5bbd, Sep 1 2025, 17:15:19) [Clang 20.1.8 ]
== Linux-6.16.4-arch1-1-x86_64-with-glibc2.42 little-endian
== Python build: debug
== cwd: <build_dir>/build/build/test_python_worker_790373æ
== CPU count: 32
== encodings: locale=UTF-8 FS=utf-8
== resources: all test resources are disabled, use -u option to unskip tests

Using random seed: 1859464968
0:00:00 load avg: 0.74 Run 1 test sequentially in a single process
0:00:00 load avg: 0.74 [1/1] test_io.test_general
test_reader_subclass (test.test_io.test_general.ProtocolsTest.test_reader_subclass) ... ERROR
test_writer_subclass (test.test_io.test_general.ProtocolsTest.test_writer_subclass) ... ERROR

======================================================================
ERROR: test_reader_subclass (test.test_io.test_general.ProtocolsTest.test_reader_subclass)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<source_dir>/cpython/Lib/test/test_io/test_general.py", line 5032, in test_reader_subclass
    self.assertIsSubclass(MyReader, io.Reader[bytes])
                          ^^^^^^^^
NameError: name 'MyReader' is not defined. Did you mean: 'self.MyReader'?

======================================================================
ERROR: test_writer_subclass (test.test_io.test_general.ProtocolsTest.test_writer_subclass)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<source_dir>/cpython/Lib/test/test_io/test_general.py", line 5036, in test_writer_subclass
    self.assertIsSubclass(MyWriter, io.Writer[bytes])
                          ^^^^^^^^
NameError: name 'MyWriter' is not defined. Did you mean: 'self.MyWriter'?

----------------------------------------------------------------------
Ran 2 tests in 0.003s

FAILED (errors=2)
test test_io.test_general failed
0:00:00 load avg: 0.74 [1/1/1] test_io.test_general failed (2 errors)

== Tests result: FAILURE ==

1 test failed:
    test_io.test_general

Total duration: 55 ms
Total tests: run=2 (filtered)
Total test files: run=1/1 (filtered) failed=1
Result: FAILURE

@srittau
Copy link
Contributor Author

srittau commented Sep 2, 2025

@cmaloney Thanks, #138369 should fix that! Do you think it would make sense to add a warning to the top of the file (similar to the warning about testing both the C and Python versions)?

@cmaloney
Copy link
Contributor

cmaloney commented Sep 2, 2025

@srittau PR looks good. I'm hoping to remove the custom load_tests in test_io to resolve. Found this while experimenting with that :).


#138366 gets load_tests to just have a list of tests, I have a couple PRs I'm planning after that to split up the list (into files which use autodetect), then a PR to move to autodetect for what remains in test_general; main change there is making the "helper" classes being found/run by autodetect by removing their direct inheritance from unittest.TestCase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants