Describe the bug
I'm struggling to get things to respect HF_DATASETS_CACHE; instead, downloads seem to land mostly under HF_HUB_CACHE.
Rationale: I'm on a system that uses NFS for the home directory, so downloading to NFS is expensive, slow, and wastes valuable quota compared to local disk.
Current version: 3.2.1dev. In the process of testing 3.4.0.
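For reference, a small diagnostic sketch to print where each library resolves its cache (this assumes datasets.config.HF_DATASETS_CACHE and huggingface_hub.constants.HF_HUB_CACHE hold the resolved paths, which is my reading of the current code):

check_caches.py:
import os
import datasets.config
import huggingface_hub.constants

# Where does each library think its cache lives? (diagnostic sketch)
print("HF_DATASETS_CACHE env :", os.environ.get("HF_DATASETS_CACHE"))
print("datasets cache        :", datasets.config.HF_DATASETS_CACHE)
print("hub cache             :", huggingface_hub.constants.HF_HUB_CACHE)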
Steps to reproduce the bug
[Currently writing using datasets 3.2.1dev. Will follow up with 3.4.0 results]
dump.py:
from datasets import load_dataset
dataset = load_dataset("HuggingFaceFW/fineweb", name="sample-100BT", split="train")

Repro steps:
# ensure no cache
$ mv ~/.cache/huggingface ~/.cache/huggingface.bak
$ export HF_DATASETS_CACHE=/tmp/roller/datasets
$ rm -rf ${HF_DATASETS_CACHE}
$ env | grep HF | grep -v TOKEN
HF_DATASETS_CACHE=/tmp/roller/datasets
$ python dump.py
# (omitted for brevity)
# (while downloading)
$ du -hcs ~/.cache/huggingface/hub
18G hub
18G total
# (after downloading)
$ du -hcs ~/.cache/huggingface/hub

It's a shame because datasets supports s3 (which I could really use right now) but hub does not.
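One workaround sketch I'm considering is passing cache_dir to load_dataset directly so the datasets cache at least lands on local disk (cache_dir is a documented load_dataset argument; I'm not sure whether the hub download cache still gets used underneath):

from datasets import load_dataset

# Workaround sketch: force the datasets cache onto local disk explicitly.
# The underlying hub download cache may still be consulted regardless.
dataset = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-100BT",
    split="train",
    cache_dir="/tmp/roller/datasets",
)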
Expected behavior
- ~/.cache/huggingface/hub stays empty
- /tmp/roller/datasets becomes full of stuff
Environment info
[Currently writing using datasets 3.2.1dev. Will follow up with 3.4.0 results]