Fix polars cast column image by CloseChoice · Pull Request #7800 · huggingface/datasets · GitHub
Contributor

@CloseChoice CloseChoice commented Oct 8, 2025

Fixes #7765

The problem is that polars exports string columns (here, image paths) as pyarrow large_string, while pandas and others use the plain string type. This PR handles that case and adds a test.

import polars as pl
from datasets import Dataset
import pandas as pd
import pyarrow as pa
from pathlib import Path

shared_datadir = Path("tests/features/data")
image_path = str(shared_datadir / "test_image_rgb.jpg")

# Load via polars
df_polars = pl.DataFrame({"image_path": [image_path]})
dataset_polars = Dataset.from_polars(df_polars)
print("Polars DF is large string:", pa.types.is_large_string(df_polars.to_arrow().schema[0].type))
print("Polars DF is string:", pa.types.is_string(df_polars.to_arrow().schema[0].type))

# Load via pandas
df_pandas = pd.DataFrame({"image_path": [image_path]})
dataset_pandas = Dataset.from_pandas(df_pandas)
arrow_table_pd = pa.Table.from_pandas(df_pandas)
print("Pandas DF is large string", pa.types.is_large_string(arrow_table_pd.schema[0].type))
print("Pandas DF is string", pa.types.is_string(arrow_table_pd.schema[0].type))

Outputs:

Polars DF is large string: True
Polars DF is string: False
Pandas DF is large string False
Pandas DF is string True

@CloseChoice CloseChoice marked this pull request as ready for review October 8, 2025 10:02
@lhoestq
Member

lhoestq commented Oct 10, 2025

The Image() type is defined with string storage for "path", not large_string. So while your change does make the conversion work, it can create issues in other places. For example, I'm pretty sure you wouldn't be able to concatenate the resulting dataset with a dataset whose Image() column uses string.

Maybe we can convert large_string data to string somehow to make this work?

@CloseChoice
Contributor Author

CloseChoice commented Oct 12, 2025

@lhoestq thanks for the review. Just to be thorough I checked the concat example and this seems to work:

import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

import pandas as pd
import polars as pl
from datasets import Dataset, Image, concatenate_datasets
import pyarrow as pa

image_path = "tests/features/data/test_image_rgb.jpg"


df_pl = pl.DataFrame({"image": [image_path]})
dset_pl = Dataset.from_polars(df_pl).cast_column("image", Image())


df_pd = pd.DataFrame({"image": [image_path]})
dset_pd = Dataset.from_pandas(df_pd).cast_column("image", Image())


concatenated = concatenate_datasets([dset_pl, dset_pd])
print(concatenated._data)

outputs:

ConcatenationTable
image: struct<bytes: binary, path: string>
  child 0, bytes: binary
  child 1, path: string
----
image: [
  -- is_valid: all not null
  -- child 0 type: binary
[null]
  -- child 1 type: string
["tests/features/data/test_image_rgb.jpg"],
  -- is_valid: all not null
  -- child 0 type: binary
[null]
  -- child 1 type: string
["tests/features/data/test_image_rgb.jpg"]]

(I'm not quite sure, though, whether this is really what you meant.) I agree that relying on implicit conversion could cause a lot of problems, so I updated the PR. I also checked the exception handling locally and it works. I'm unsure, though, whether we want to create such large objects in CI; if desired, I can add a test for that.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@lhoestq lhoestq left a comment

Awesome ! LGTM :)

@lhoestq lhoestq merged commit aa7f2a9 into huggingface:main Oct 13, 2025
14 checks passed
@CloseChoice CloseChoice deleted the fix-polars-cast-column-image branch October 13, 2025 16:01
Sanjaykumar030 added a commit to Sanjaykumar030/datasets that referenced this pull request Oct 18, 2025
@Sanjaykumar030
Contributor

Sanjaykumar030 commented Oct 18, 2025

Apologies @lhoestq @CloseChoice, I unintentionally reverted this PR earlier. Leaving it as is.

Linked issue: polars dataset cannot cast column to Image/Audio/Video

4 participants