Fix no_trainer examples to properly calculate the number of samples by muellerzr · Pull Request #17046 · huggingface/transformers · GitHub

Conversation

@muellerzr
Contributor

Fix number of samples for no_trainer scripts

What does this add?

This PR fixes all of the no_trainer scripts to use the right number of training steps after the length of the dataloader has been changed by accelerator.prepare.

Why is it needed?

Currently, in a multi-process setup, the progress bar still shows the original number of samples, and the break condition that ends training is still set to the original number of steps, even though the length of each dataloader changed after sharding.

Simplified example:

If the dataloader starts with 128 batches and 2 GPUs are used, each process's dataloader has 64 batches. The progress bar should therefore use 64, and the break condition also needs to know there are only 64; both currently still use 128.
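For illustration, a minimal sketch (hypothetical sizes, not taken from any of the scripts) of how the dataloader length shrinks after accelerator.prepare, assuming a launch with 2 processes:

    from accelerate import Accelerator
    from torch.utils.data import DataLoader

    accelerator = Accelerator()

    # 1024 dummy samples with batch_size=8 -> 128 batches before sharding
    train_dataloader = DataLoader(list(range(1024)), batch_size=8)
    print(len(train_dataloader))  # 128

    # After prepare, each of the 2 processes only iterates over its shard
    train_dataloader = accelerator.prepare(train_dataloader)
    print(len(train_dataloader))  # 64 on each process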

What parts of the API does this impact?

User-facing:

All scripts recalculate max_train_steps after accelerator.prepare

Basic Usage Example(s):

    # Prepare everything with our `accelerator`.
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

    # We need to recalculate our total training steps
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
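The progress bar is then sized from the recalculated value, so it matches the sharded dataloader length; a minimal sketch, assuming tqdm is imported as in the example scripts:

    # Only show the progress bar once on each machine, sized from the
    # recalculated number of training steps
    progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)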

When would I use it, and when wouldn't I?

While this recalculation always runs, it is technically only needed when training with more than one process.

@muellerzr muellerzr added Examples Which is related to examples in general PyTorch Anything PyTorch labels May 2, 2022
@muellerzr muellerzr requested a review from sgugger May 2, 2022 15:17
Collaborator

@sgugger sgugger left a comment


Thanks for fixing! LGTM with one nit to propagate!

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 2, 2022

The documentation is not available anymore as the PR was closed or merged.

@muellerzr muellerzr merged commit f275e59 into main May 2, 2022
@muellerzr muellerzr deleted the muellerzr-fix_num_samples branch May 2, 2022 15:56
stevhliu pushed a commit to stevhliu/transformers that referenced this pull request May 3, 2022
@kowndinya-renduchintala

kowndinya-renduchintala commented May 30, 2022

Hi @muellerzr, @sgugger, if I specify the argument max_train_steps instead of num_train_epochs when launching the training script, I need to recalculate num_train_epochs after accelerator.prepare instead of max_train_steps, right? Am I missing something?
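For reference, a minimal sketch of the inverse recalculation this comment describes (not taken from the PR diff), reusing the num_update_steps_per_epoch computed after accelerator.prepare:

    # If max_train_steps was passed explicitly, derive the number of epochs
    # from it rather than the other way around
    args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)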

@muellerzr
Contributor Author

elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
