Fix RNG reload in resume training from epoch checkpoint #17055

sgugger · 2022-05-02T20:22:15Z

What does this PR do?

This PR fixes reproducibility when resuming training from checkpoints that are saved every epoch. The main reason it was failing (as pointed out in #17032) is that the RNG states were never reloaded. They need to be reloaded immediately before iterating through the new epoch, because starting that iteration changes the global PyTorch RNG state (even if the dataloader uses its own generator). The newly added test checks this reproducibility.
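To illustrate why the restore has to happen right before the epoch starts, here is a minimal sketch that uses Python's standard-library `random` module as a stand-in for the global PyTorch RNG. The `train_epoch` function and all variable names are hypothetical and are not taken from the Trainer code; the shuffle merely imitates an epoch iterator that consumes global RNG state when it is created.

```python
import random


def train_epoch(num_samples):
    """Run one toy epoch and return (sample order, per-step noise)."""
    # Stand-in for starting the epoch iteration: shuffling draws from the global RNG.
    order = list(range(num_samples))
    random.shuffle(order)
    # Stand-in for randomness inside the model (for example dropout).
    noise = [random.random() for _ in order]
    return order, noise


# Uninterrupted run: epoch 0, save the RNG state at the checkpoint, then epoch 1.
random.seed(42)
train_epoch(8)
checkpoint_rng_state = random.getstate()
expected_epoch_1 = train_epoch(8)

# Resumed run: a fresh process starts from an unrelated RNG state ...
random.seed(1234)
# ... so the saved state is restored immediately before iterating the new epoch.
random.setstate(checkpoint_rng_state)
resumed_epoch_1 = train_epoch(8)

# The resumed epoch reproduces both the sample order and the model noise.
assert resumed_epoch_1 == expected_epoch_1
```

If the state were restored only after the shuffle, the sample order would come from the unrelated RNG state and the noise would be drawn from a different position in the random stream, so the resumed epoch would generally not match the uninterrupted one.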

While debugging this, two other issues came up, which this PR also fixes.

Multiple warnings were emitted about the computation of flos when the model is not an NLP model. This PR reduces them to a single warning.
The reproducibility test is flaky on multiple GPUs because it relies on randomness inside the model, and the two "copies" of the model executed by DataParallel call the PyTorch RNG in a nondeterministic order. (This would not happen with DistributedDataParallel, but the test would then need to be run via a launcher.) So the test now uses PyTorch randomness only when there is one GPU or none, which fixes the flakiness.
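The order dependence described above can be sketched without any GPU. In this toy example a single `random.Random` instance stands in for the shared PyTorch RNG, and the replica names and the `run_replicas` function are hypothetical. It only shows that the value each replica receives depends on which replica draws first.

```python
import random


def run_replicas(call_order, seed=0):
    """Let each replica draw one value from a shared RNG, in the given order."""
    shared_rng = random.Random(seed)  # stand-in for the shared global RNG
    draws = {}
    for replica in call_order:
        draws[replica] = shared_rng.random()
    return draws


first_run = run_replicas(["gpu0", "gpu1"])
second_run = run_replicas(["gpu1", "gpu0"])

# Same seed, same stream of numbers, but assigned to different replicas.
assert first_run["gpu0"] == second_run["gpu1"]
assert first_run["gpu0"] != second_run["gpu0"]
```

Because the call order is not controlled under DataParallel, two runs with the same seed can hand different random values to the same replica, which is enough to make an exact-match reproducibility test fail intermittently.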

Fixes #17032

HuggingFaceDocBuilderDev · 2022-05-02T20:39:08Z

The documentation is not available anymore as the PR was closed or merged.

atreyasha · 2022-05-03T11:03:56Z

tests/trainer/test_trainer.py

+        # For more than 1 GPUs, since the randomness is introduced in the model and with DataParallel (which is used
+        # in this test for more than 2 GPUs), the calls to the torch RNG will happen in a random order (sometimes
+        # GPU 0 will call first and sometimes GPU 1).
+        random_torch = torch.cuda.is_available() and torch.cuda.device_count() >= 1


Sorry, just a question regarding this line. AFAICT random_torch would only be True if at least one GPU is available. But this would mean this test case will not cover torch randomness when using the CPU. The unit test before this commit however did test randomness on the CPU, or at least was able to if no GPU was available. Is this change intended?

sgugger · 2022-05-03

Good catch! I'll fix this :-)
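One way to express the intended condition ("one or zero GPU", as stated in the PR description) is sketched below. This is an assumption about what the fix could look like, not necessarily the exact expression used in the follow-up "Fix test" commit, and the helper name `should_use_torch_randomness` is hypothetical.

```python
def should_use_torch_randomness(cuda_available: bool, device_count: int) -> bool:
    """Return True on CPU-only machines and on machines with at most one GPU."""
    return (not cuda_available) or device_count <= 1


# In the test this would be fed from PyTorch, for example:
#   should_use_torch_randomness(torch.cuda.is_available(), torch.cuda.device_count())
cpu_only = should_use_torch_randomness(cuda_available=False, device_count=0)
single_gpu = should_use_torch_randomness(cuda_available=True, device_count=1)
multi_gpu = should_use_torch_randomness(cuda_available=True, device_count=2)
```

Unlike the condition in the reviewed diff, this keeps PyTorch randomness enabled on CPU-only machines and disables it only when DataParallel would actually be used across several GPUs.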

LysandreJik

LGTM, thanks @sgugger!

Fix RNG reload in resume training from epoch checkpoint

1461257

sgugger requested a review from LysandreJik May 2, 2022 20:22

atreyasha reviewed May 3, 2022

Fix test

19f6c20

LysandreJik approved these changes May 3, 2022

sgugger merged commit 1c9fcd0 into main May 3, 2022

sgugger deleted the randomness_resume_epocj branch May 3, 2022 14:31

stevhliu pushed a commit to stevhliu/transformers that referenced this pull request May 3, 2022

Fix RNG reload in resume training from epoch checkpoint (huggingface#…

ced44b3

…17055) * Fix RNG reload in resume training from epoch checkpoint * Fix test

elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022

Fix RNG reload in resume training from epoch checkpoint (huggingface#…

272dd2a

…17055) * Fix RNG reload in resume training from epoch checkpoint * Fix test
