Introduce support for PDFs · Issue #7318 · huggingface/datasets · GitHub

Introduce support for PDFs #7318

@yabramuvdi

Feature request

The idea (discussed on the Discord server with @lhoestq) is to add a Pdf type alongside Image/Audio/Video. For example, Video was recently added, and it defines how to decode a video file, stored as a dictionary like {"path": ..., "bytes": ...}, into a VideoReader using decord. We want to do the same for PDFs and get a pypdfium2.PdfDocument.
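To make the decoding step concrete, here is a minimal sketch of what it could look like. It is not the actual `datasets` implementation: `PdfStorage`, `resolve_pdf_source` and `decode_pdf` are hypothetical names, and preferring in-memory bytes over the path is an assumption. The only library call relied on is the `pypdfium2.PdfDocument` constructor, which accepts either a file path or raw bytes.

```python
from typing import Optional, TypedDict, Union


class PdfStorage(TypedDict):
    """Hypothetical storage layout mirroring the {"path": ..., "bytes": ...} dict."""

    path: Optional[str]
    bytes: Optional[bytes]


def resolve_pdf_source(value: PdfStorage) -> Union[str, bytes]:
    """Pick the input to hand to the PDF reader.

    Assumption: in-memory bytes take priority over the path when both are set.
    """
    if value.get("bytes") is not None:
        return value["bytes"]
    if value.get("path") is not None:
        return value["path"]
    raise ValueError("A PDF example needs at least one of 'path' or 'bytes'.")


def decode_pdf(value: PdfStorage):
    """Decode a stored example into a pypdfium2.PdfDocument."""
    # Imported lazily so this module still loads when the optional
    # dependency is not installed.
    import pypdfium2

    return pypdfium2.PdfDocument(resolve_pdf_source(value))
```

With pypdfium2 installed, `decode_pdf({"path": "paper.pdf", "bytes": None})` would return a document object, and `len(...)` on it gives the page count ("paper.pdf" is a placeholder file name).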

Motivation

In many cases PDFs contain very valuable information beyond text (e.g. images, figures). Support for PDFs would help create datasets where all the information is preserved.

Your contribution

I can start the implementation of the Pdf type :)
