HF_DATASETS_CACHE ignored? · Issue #7480 · huggingface/datasets · GitHub

HF_DATASETS_CACHE ignored? #7480

@stephenroller

Describe the bug

I'm struggling to get things to respect HF_DATASETS_CACHE.

Rationale: I'm on a system that uses NFS for homedir, so downloading to NFS is expensive, slow, and wastes valuable quota compared to local disk. Despite setting HF_DATASETS_CACHE, downloads seem to go mostly to the hub cache (HF_HUB_CACHE).
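
For context, here is a rough stdlib-only sketch of how the two cache locations appear to be resolved from the environment. It is an assumption based on the documented defaults (`HF_HOME` falling back to `~/.cache/huggingface`, with `hub` and `datasets` subdirectories), not the libraries' actual code, and `resolve_hf_cache_dirs` is a hypothetical helper name. Real behavior may differ by version.

```python
import os
from pathlib import Path


def resolve_hf_cache_dirs(env=None):
    """Approximate where the hub and datasets caches land for an environment.

    Assumption: mirrors the documented defaults only; it does not import
    or inspect `datasets` or `huggingface_hub`.
    """
    env = os.environ if env is None else env
    xdg_cache = env.get("XDG_CACHE_HOME", "~/.cache")
    hf_home = Path(env.get("HF_HOME", os.path.join(xdg_cache, "huggingface"))).expanduser()
    return {
        # Each cache has its own override and otherwise sits under HF_HOME.
        "HF_HUB_CACHE": Path(env.get("HF_HUB_CACHE", hf_home / "hub")).expanduser(),
        "HF_DATASETS_CACHE": Path(env.get("HF_DATASETS_CACHE", hf_home / "datasets")).expanduser(),
    }


if __name__ == "__main__":
    for name, path in resolve_hf_cache_dirs().items():
        print(f"{name} -> {path}")
```

Under this assumption, setting only HF_DATASETS_CACHE leaves the hub cache at `~/.cache/huggingface/hub`, which would match the `du` output in the repro. Pointing HF_HOME or HF_HUB_CACHE at local disk might move it, but that is untested here.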

Current version: 3.2.1dev. In the process of testing 3.4.0.

Steps to reproduce the bug

[Currently writing using datasets 3.2.1dev. Will follow up with 3.4.0 results]

dump.py:

from datasets import load_dataset
dataset = load_dataset("HuggingFaceFW/fineweb", name="sample-100BT", split="train")

Repro steps

# ensure no cache
$ mv ~/.cache/huggingface ~/.cache/huggingface.bak

$ export HF_DATASETS_CACHE=/tmp/roller/datasets
$ rm -rf ${HF_DATASETS_CACHE}
$ env | grep HF | grep -v TOKEN
HF_DATASETS_CACHE=/tmp/roller/datasets

$ python dump.py
# (omitted for brevity)

# (while downloading) 
$ du -hcs ~/.cache/huggingface/hub
18G     hub
18G     total

# (after downloading)
$ du -hcs ~/.cache/huggingface/hub

It's a shame because datasets supports S3 (which I could really use right now) but hub does not.

Expected behavior

  • ~/.cache/huggingface/hub stays empty
  • /tmp/roller/datasets becomes full of stuff
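
The two expectations above could be checked with a small script instead of running `du` by hand. This is only a sketch: `dir_size_bytes` is a hypothetical helper, the paths are the ones from this report, and it sums apparent file sizes, so the numbers will not exactly match `du` (which reports disk blocks).

```python
import os
from pathlib import Path


def dir_size_bytes(root):
    """Sum the sizes of regular files under root; return 0 if root is missing."""
    root = Path(root).expanduser()
    if not root.exists():
        return 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted twice.
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # Files can disappear while a download is in progress.
                pass
    return total


if __name__ == "__main__":
    for label, path in [
        ("hub cache", "~/.cache/huggingface/hub"),
        ("datasets cache", "/tmp/roller/datasets"),
    ]:
        print(f"{label}: {dir_size_bytes(path) / 1e9:.2f} GB ({path})")
```

If HF_DATASETS_CACHE were fully respected, the first number would stay at zero while the second grows during the download.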

Environment info

[Currently writing using datasets 3.2.1dev. Will follow up with 3.4.0 results]
