Fix t5 shard on TPU Pods by agemagician · Pull Request #16527 · huggingface/transformers · GitHub

Conversation

@agemagician
Copy link
Contributor

The current script doesn't work properly on a TPU pod because the global batch is not divided correctly across hosts.
This pull request fixes the issue by dividing the global batch per host before it is sharded across each host's local devices.
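The fix described above can be sketched as follows. This is a minimal illustration of the idea, not the exact diff: the helper name and the explicit `process_index`/`process_count` arguments are assumptions for the example (in the real script these would come from `jax.process_index()` and `jax.process_count()`), and the host-local slice would then be passed to `flax.jax_utils`-style sharding across local devices.

```python
import numpy as np

def host_local_batch(global_batch: np.ndarray,
                     process_index: int,
                     process_count: int) -> np.ndarray:
    """Take this host's slice of the global batch *before* sharding it
    across the host's local devices (hypothetical helper)."""
    per_host = global_batch.shape[0] // process_count
    start = process_index * per_host
    return global_batch[start : start + per_host]

# Example: a global batch of 16 samples on a pod with 2 hosts.
batch = np.arange(16).reshape(16, 1)
host0 = host_local_batch(batch, process_index=0, process_count=2)  # samples 0..7
host1 = host_local_batch(batch, process_index=1, process_count=2)  # samples 8..15
```

Without this per-host division, every host would try to shard the full global batch across only its own devices, which is the mismatch the PR fixes.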

Fixes #16470

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Models:

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Mar 31, 2022

The documentation is not available anymore as the PR was closed or merged.

@patrickvonplaten
Copy link
Contributor

This looks good to me!

@patil-suraj @borisdayma - could you take a look here?

@borisdayma
Copy link
Contributor

Yes this approach works!

@borisdayma
Copy link
Contributor

Thinking about it, I think there could be some issues with the last batch, so we probably need to ensure that all batches have the same number of items and that this number is a multiple of the number of local devices.
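The concern above is that a trailing partial batch would not split evenly across a host's local devices. One common way to handle it is to trim the sample indices to a multiple of the local device count before batching; the helper below is an illustrative sketch of that idea, not code from the script.

```python
import numpy as np

def trim_to_device_multiple(samples_idx: np.ndarray,
                            local_device_count: int) -> np.ndarray:
    """Drop trailing samples so the host's batch splits evenly across
    its local devices (illustrative helper, names are assumptions)."""
    usable = (len(samples_idx) // local_device_count) * local_device_count
    return samples_idx[:usable]

# 10 samples on a host with 8 local devices: the last 2 are dropped.
idx = trim_to_device_multiple(np.arange(10), local_device_count=8)
```

The alternative is to pad the final batch instead of dropping samples, which matters more at evaluation time than during training.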

Copy link
Contributor

@patil-suraj patil-suraj left a comment


Thanks a lot for the PR, LGTM!

@patil-suraj patil-suraj merged commit 5e68675 into huggingface:main Apr 11, 2022
@patil-suraj
Copy link
Contributor

@borisdayma
This line already makes sure that all batches are of the same length.

train_batch_idx = generate_batch_splits(train_samples_idx, train_batch_size)
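For readers following along, a sketch of how a `generate_batch_splits`-style helper can guarantee equal-length batches: it drops any incomplete final batch before splitting. This is an illustration consistent with the behavior described in the comment, not necessarily the exact implementation in the script.

```python
import numpy as np

def generate_batch_splits(samples_idx: np.ndarray, batch_size: int):
    """Split sample indices into equal-length batches, dropping the
    remainder so every batch has exactly `batch_size` items."""
    samples_to_remove = len(samples_idx) % batch_size
    if samples_to_remove != 0:
        samples_idx = samples_idx[:-samples_to_remove]
    num_batches = len(samples_idx) // batch_size
    return np.split(samples_idx, num_batches)

# 10 samples with batch_size=4: two full batches, 2 samples dropped.
splits = generate_batch_splits(np.arange(10), batch_size=4)
```

Because every batch that reaches the sharding step has the same length, the per-host division from this PR always produces slices that shard cleanly.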

elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
* Fix t5 shard on TPU Pods

The current script doesn't work properly on a TPU pod because the global batch is not divided correctly per host.
This pull request fixes this issue by dividing the global batch to each host before it is shared on each host.

* fix style

Co-authored-by: ahmed-elnaggar <ahmed.elnaggar@allianz.com>
