tarfile's cache balloons in memory when streaming a big tarfile · Issue #102120 · python/cpython · GitHub
tarfile's cache balloons in memory when streaming a big tarfile #102120

@spenczar

Description

Bug report

I've got a bunch of tar files containing millions of small files. Never mind how we got here - I need to process those tars, handling each of the files inside. Furthermore, I need to process a lot of these tars, and I'd like to do it relatively quickly on cheapish hardware, so I'm at least a little sensitive to memory consumption.

The natural thing to do is to iterate through the tarfile, extracting each file one at a time, and carefully closing them when done:

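import tarfile

with tarfile.open(filepath, "r:gz") as tar:
    for member in tar:  # reads one header per iteration
        file_buf = tar.extractfile(member)
        if file_buf is None:  # directories and other non-file members
            continue
        try:
            handle(file_buf)
        finally:
            file_buf.close()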

This looks like it should handle each small file and discard it when done, so memory should stay pretty tame. I was very surprised to discover that this actually uses gigabytes of memory. That's fixed if you do this:

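import tarfile

with tarfile.open(filepath, "r:gz") as tar:
    for member in tar:
        file_buf = tar.extractfile(member)
        if file_buf is None:  # directories and other non-file members
            continue
        try:
            handle(file_buf)
        finally:
            file_buf.close()
            tar.members = []  # evil! drops the cached headers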

That works because tarfile.TarFile has a cache, self.members. TarFile.next() appends to that cache, and TarFile.__iter__ calls TarFile.next() for every member it yields.

That cache is storing TarInfos, which are headers describing each file. They're not very large, but with lots and lots of files, those headers can add up.
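To make the growth concrete, here is a minimal sketch that counts the cached headers after a full pass over a small in-memory archive. The helper names build_tar and cached_after_iteration are made up for this illustration, and the sketch leans on the same undocumented TarFile.members attribute as the workaround above, so treat it as a demonstration rather than a supported API.

```python
import io
import tarfile


def build_tar(n_files: int) -> bytes:
    """Build an uncompressed in-memory tar holding n_files tiny members."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(n_files):
            payload = b"x"
            info = tarfile.TarInfo(name=f"file_{i}.txt")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def cached_after_iteration(data: bytes, reset: bool) -> int:
    """Iterate over every member and return len(tar.members) at the end."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
        for member in tar:
            file_buf = tar.extractfile(member)
            if file_buf is not None:
                file_buf.close()
            if reset:
                # Assumption: relies on the undocumented members list.
                tar.members = []
        return len(tar.members)


if __name__ == "__main__":
    data = build_tar(1000)
    print("without reset:", cached_after_iteration(data, reset=False))
    print("with reset:   ", cached_after_iteration(data, reset=True))
```

Without the reset, the count should equal the number of members in the archive, which is the linear growth described above. With the reset, it should stay at or near zero.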

The net result is that it's not possible to stream a tarfile's contents without memory growing linearly with the number of files in the tarfile. This has been partially addressed in the past (see #46334, from way back in 2008), but never fully resolved. It shows up on Stack Overflow and probably elsewhere, where the clumsy recommended solution is to reset tar.members on each iteration, but there ought to be a better way.
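Until something better exists, the reset can at least be confined to one place. The sketch below wraps it in a generator. stream_members is a hypothetical helper, not part of tarfile, and it still depends on the undocumented TarFile.members attribute. One likely caveat: members that are links may fail to resolve once earlier headers have been discarded.

```python
import io
import tarfile
from typing import IO, Iterator, Tuple


def stream_members(
    tar: tarfile.TarFile,
) -> Iterator[Tuple[tarfile.TarInfo, IO[bytes]]]:
    """Yield (TarInfo, file object) for each regular file in the archive.

    Hypothetical helper: clears the undocumented TarFile.members cache
    after each member so headers do not pile up during iteration.
    """
    for member in tar:
        file_buf = tar.extractfile(member)
        # Same workaround as in the report: drop the headers seen so far.
        tar.members = []
        if file_buf is None:
            # Directories and other non-file members have no content.
            continue
        try:
            yield member, file_buf
        finally:
            file_buf.close()
```

A caller would then write `for member, file_buf in stream_members(tar): handle(file_buf)` and never touch tar.members directly.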

Your environment

CPython 3.10, mostly; I don't think the OS or architecture is relevant.

Labels: stdlib, type-bug
