New features for CodeParrot training script by loubnabnl · Pull Request #16851 · huggingface/transformers · GitHub

Conversation

@loubnabnl
Contributor

This PR adds some features to the CodeParrot training script.

  • Add TFLOPs to the logging
  • Use Accelerate checkpointing and tracking for Weights & Biases and TensorBoard
  • Fix gradient accumulation for DDP (Fix nlp_example accelerate#106)
  • Scale the loss appropriately for the last batch
  • Fix a typo in the README

cc @lvwerra @LysandreJik

@lvwerra lvwerra self-requested a review April 20, 2022 11:01
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 20, 2022

The documentation is not available anymore as the PR was closed or merged.

Member

@lvwerra lvwerra left a comment


Thanks @loubnabnl, looks pretty clean already! I left a few minor comments, mainly to make the code a bit more concise.

Regarding the saving of the state: it is great that we can now save it, but what I think is currently missing is a mechanism to restart the script from a saved state. I don't think we need to do much; you can probably follow the example here:
https://github.com/huggingface/accelerate/blob/main/examples/complete_nlp_example.py
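
For context, a minimal sketch of what resuming could look like, loosely following the linked complete_nlp_example.py; the --resume_from_checkpoint argument and the step_{N} folder naming are illustrative assumptions, not something prescribed by this review:

import os

# Sketch only: restore a state written by accelerator.save_state() and skip the
# batches that were already consumed before the checkpoint.
if args.resume_from_checkpoint is not None:
    # Restores model, optimizer, lr scheduler and RNG states.
    accelerator.load_state(args.resume_from_checkpoint)
    # Recover the global step from the checkpoint folder name, e.g. ".../step_5000".
    resume_step = int(os.path.basename(args.resume_from_checkpoint).split("_")[-1])
else:
    resume_step = 0

for step, batch in enumerate(train_dataloader, start=1):
    if step <= resume_step:
        continue  # fast-forward past batches seen before the checkpoint
    ...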

Comment on lines 228 to 239
elapsed_time_per_iteration = time.time() - t_start
checkpoint_factor = 4 if args.gradient_checkpointing else 3
batch_size = args.train_batch_size * accelerator.state.num_processes * args.gradient_accumulation_steps
factor = (
    24 * checkpoint_factor * batch_size * args.seq_length * config_model.n_layer * (config_model.n_embd**2)
)
flops_per_iteration = factor * (
    1.0
    + (args.seq_length / (6.0 * config_model.n_embd))
    + (tokenizer.vocab_size / (16.0 * config_model.n_layer * config_model.n_embd))
)
tflops = flops_per_iteration / (elapsed_time_per_iteration * accelerator.state.num_processes * (10**12))
Member


Could we move that to a dedicated function, e.g. compute_tflops(elapsed_time, accelerator, args)? It would be nice if the main training loop stayed concise, to make it clearer what's going on.
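
For illustration, the helper could look like the sketch below, which simply lifts the quoted block out of the loop; it assumes model and tokenizer remain reachable from the enclosing scope, as they are in the script:

def compute_tflops(elapsed_time, accelerator, args):
    # TFLOPs estimate per process (see the formula discussion further down).
    config_model = accelerator.unwrap_model(model).config
    checkpoint_factor = 4 if args.gradient_checkpointing else 3
    batch_size = args.train_batch_size * accelerator.state.num_processes * args.gradient_accumulation_steps
    factor = 24 * checkpoint_factor * batch_size * args.seq_length * config_model.n_layer * (config_model.n_embd**2)
    flops_per_iteration = factor * (
        1.0
        + (args.seq_length / (6.0 * config_model.n_embd))
        + (tokenizer.vocab_size / (16.0 * config_model.n_layer * config_model.n_embd))
    )
    return flops_per_iteration / (elapsed_time * accelerator.state.num_processes * (10**12))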

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(args.save_dir, save_function=accelerator.save)
accelerator.save_state(args.save_dir)
Member


What exactly does save_state save? A bunch of files? Maybe we could put them in a folder, e.g. args.save_dir + "/state/".

Contributor Author


save_state saves a bunch of files (model, optimizer, ...). I'm now saving them in folders corresponding to the steps, so that training can be resumed from these steps later.

Contributor Author


And since save_state already saves the model in the step folder, I now use save_pretrained on the unwrapped model only for the last checkpoint, saving it in args.save_dir so it can be loaded directly from there later.
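
Roughly, the layout described above could look like this sketch; the step_{step} folder name is an illustrative choice and the save_checkpoint_steps argument name is assumed here:

import os

# During training: periodically save the full state (model, optimizer,
# scheduler, RNG) into a per-step folder, so training can be resumed from it.
if step % args.save_checkpoint_steps == 0:
    accelerator.save_state(os.path.join(args.save_dir, f"step_{step}"))

# After training: save only the final model weights directly in args.save_dir,
# so they can later be loaded with from_pretrained.
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(args.save_dir, save_function=accelerator.save)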

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(args.save_dir, save_function=accelerator.save)
accelerator.save_state(args.save_dir)
Member


Same as above

Member

@lvwerra lvwerra left a comment


Only two minor comments and then it is good to go! 🚀



def compute_tflops(elapsed_time, accelerator, args):
    config_model = accelerator.unwrap_model(model).config
Member


Minor thing: can you add a link to the formula here, either the BigScience repo or the paper itself? That way somebody can find out where that black magic formula actually comes from :)

Contributor Author


Done!
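
For reference, the estimate in the quoted code appears to correspond to the FLOPs-per-iteration approximation from the Megatron-LM paper (Narayanan et al., 2021); with B the global batch size, s the sequence length, l the number of layers, h the hidden size (n_embd) and V the vocabulary size:

% FLOPs per iteration; c = 4 with gradient checkpointing (activation
% recomputation), c = 3 without, matching checkpoint_factor in the code.
\[
F = 24\,c\,B\,s\,l\,h^{2}\left(1 + \frac{s}{6h} + \frac{V}{16\,l\,h}\right)
\]
% Logged throughput: FLOPs per iteration divided by the wall-clock time per
% iteration, the number of processes, and 10^{12} to express TFLOPs per process.
\[
\text{TFLOPs} = \frac{F}{\text{elapsed\_time} \times \text{num\_processes} \times 10^{12}}
\]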

loubnabnl and others added 2 commits April 21, 2022 17:18
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
@lvwerra lvwerra merged commit d918413 into huggingface:main Apr 21, 2022
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
* add tflops logging and fix grad accumulation

* add accelerate tracking and checkpointing

* scale loss of last batch correctly

* fix typo

* compress loss computation

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* add resume from checkpoint argument

* add load_state accelerate from checkpoint, register lr scheduler and add tflops function

* reformat code

* reformat code

* add condition on path for resume checkpoint

* combine if conditions

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* add source for tflops formula

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>