Concurrent push_to_hub by lhoestq · Pull Request #7708 · huggingface/datasets · GitHub

@lhoestq lhoestq commented Jul 29, 2025

Retry the step that downloads, updates, and uploads the README.md, using create_commit(..., parent_commit=...), if another commit landed in the meantime. This should enable concurrent push_to_hub() calls, since they will no longer overwrite each other's README.md metadata.
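For readers unfamiliar with the pattern, here is a minimal sketch of the retry loop described above. It is not the code from this PR: `FakeRepo`, `StaleParentError`, and `update_readme_with_retry` are hypothetical names, and the in-memory repo only stands in for the Hub so the example runs without network access. In the real flow, the commit step would be a create_commit(..., parent_commit=...) call that fails when the branch head no longer matches parent_commit.

```python
import itertools


class StaleParentError(Exception):
    """Raised when the branch head moved past the expected parent commit."""


class FakeRepo:
    """In-memory stand-in for a Hub dataset repo (hypothetical, for illustration)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.head = "c0"
        self.readme = {}  # pretend README.md metadata: split name -> num_examples

    def snapshot(self):
        # Analogous to reading the current head sha and downloading README.md.
        return self.head, dict(self.readme)

    def create_commit(self, new_readme, parent_commit):
        # Analogous to create_commit(..., parent_commit=...): refuse if head moved.
        if parent_commit != self.head:
            raise StaleParentError(f"expected {parent_commit}, head is {self.head}")
        self.readme = dict(new_readme)
        self.head = f"c{next(self._ids)}"
        return self.head


def update_readme_with_retry(repo, update, max_retries=5):
    """Download + update + upload, retrying if someone committed in the meantime."""
    for _ in range(max_retries):
        parent, readme = repo.snapshot()
        new_readme = update(readme)
        try:
            return repo.create_commit(new_readme, parent_commit=parent)
        except StaleParentError:
            continue  # re-read the latest README.md and re-apply the update
    raise RuntimeError("gave up after too many concurrent commits")
```

Because each attempt re-reads the README.md before re-applying the update, metadata written by a concurrent pusher is merged rather than overwritten.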

Note: we fixed an issue server side to make this work:

DO NOT MERGE FOR NOW since it seems there is one bug that prevents this logic from working:

I'm using parent_commit to build a retry mechanism that enables concurrent push_to_hub() in datasets, but I keep running into a weird situation.
Sometimes create_commit(..., parent_commit=...) returns a 500 error, yet the commit did happen on the Hub side without respecting parent_commit.

e.g. request id

huggingface_hub.errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/api/datasets/lhoestq/tmp/commit/main (Request ID: Root=1-6888d8af-2ce517bc60c69cb378b51526;d1b17993-c5d0-4ccd-9926-060c45f9ed61)

fix coming in internal

close #7600

@lhoestq lhoestq marked this pull request as ready for review July 31, 2025 09:53
@lhoestq lhoestq merged commit 0fc5418 into main Jul 31, 2025
13 of 15 checks passed
@lhoestq lhoestq deleted the concurrent-push_to_hub branch July 31, 2025 10:00

Development

Successfully merging this pull request may close these issues.

push_to_hub is not concurrency safe (dataset schema corruption)