[trainer] sharded _load_best_model by stas00 · Pull Request #17150 · huggingface/transformers · GitHub

Conversation

stas00 (Contributor) commented on May 10, 2022

Looks like a copy-and-paste issue. This code path is probably untested.
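
(For context only, not the actual diff in this PR: a minimal sketch of how a sharded best-checkpoint reload typically looks in `transformers`, using the public `load_sharded_checkpoint` helper. The function name `load_best_checkpoint` and its arguments are hypothetical, chosen just for illustration.)

```python
# Illustrative sketch only -- not the exact change made in this PR.
import os

import torch

from transformers.modeling_utils import load_sharded_checkpoint
from transformers.utils import WEIGHTS_INDEX_NAME, WEIGHTS_NAME


def load_best_checkpoint(model, best_model_checkpoint):
    """Reload the best checkpoint, whether it was saved as one file or as shards."""
    weights_file = os.path.join(best_model_checkpoint, WEIGHTS_NAME)
    if os.path.exists(weights_file):
        # Single pytorch_model.bin file.
        state_dict = torch.load(weights_file, map_location="cpu")
        model.load_state_dict(state_dict)
    elif os.path.exists(os.path.join(best_model_checkpoint, WEIGHTS_INDEX_NAME)):
        # Sharded checkpoint: pytorch_model.bin.index.json plus shard files.
        load_sharded_checkpoint(model, best_model_checkpoint)
```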

@sgugger

HuggingFaceDocBuilderDev commented on May 10, 2022

The documentation is not available anymore as the PR was closed or merged.

sgugger (Collaborator) left a comment


Thanks for fixing. It is currently untested, as there is no way to activate checkpoint sharding from the Trainer without training a very large model, which is unfeasible on any of the CI runners.
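
(For reference, checkpoint sharding can be reproduced outside the Trainer by lowering `max_shard_size` in `save_pretrained`; the tiny test model and shard size below are assumptions chosen only to force sharding for illustration.)

```python
# A minimal sketch: force sharding on a tiny model by shrinking max_shard_size.
# The Trainer only produces sharded checkpoints when the weights exceed the
# default shard size (10GB), which is why CI would need a very large model.
from transformers import AutoModel

model = AutoModel.from_pretrained("hf-internal-testing/tiny-random-bert")

# Any shard size smaller than the model's weights yields
# pytorch_model.bin.index.json plus several shard files.
model.save_pretrained("tiny-sharded-checkpoint", max_shard_size="20KB")
```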

stas00 (Contributor, Author) commented on May 10, 2022

Thank you for explaining why testing this path is complicated, Sylvain.

I think I can get this path partially tested by using ZeRO-3 without "stage3_gather_16bit_weights_on_model_save", which would make it fall through to this code and at least exercise that condition. I will be adding these tests in #17151.
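
(For reference, a minimal sketch of the DeepSpeed ZeRO-3 setting being referred to; the surrounding keys and the way the dict is passed to the Trainer are illustrative assumptions.)

```python
# Minimal sketch of the relevant ZeRO-3 setting; the rest of the config is
# illustrative. With the gather flag off, DeepSpeed does not write a
# consolidated 16-bit checkpoint at save time, so loading the best model
# falls through to the non-consolidated path.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": False,
    },
    "train_micro_batch_size_per_gpu": "auto",
}

# Passed to the Trainer via TrainingArguments(..., deepspeed=ds_config).
```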

stas00 merged commit 9aeacfe into main on May 10, 2022
stas00 deleted the stas00-patch-1 branch on May 10, 2022 at 14:58
ArthurZucker pushed a commit to ArthurZucker/transformers that referenced this pull request May 12, 2022
* [trainer] sharded _load_best_model

probably needs a test?

* undo delete
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
* [trainer] sharded _load_best_model

probably needs a test?

* undo delete
