fix eval when using subset host loading data by aireenmei · Pull Request #951 · AI-Hypercomputer/maxtext · GitHub

Conversation

aireenmei
Collaborator

@aireenmei aireenmei commented Oct 6, 2024

@aireenmei aireenmei marked this pull request as ready for review October 7, 2024 17:31
@aireenmei aireenmei requested a review from ZhiyuLi-goog October 7, 2024 17:32
Collaborator

@khatwanimohit khatwanimohit left a comment


LGTM!

@ZhiyuLi-goog
Collaborator

Hi, @aireenmei

Awesome and thank you for the fix!

Do we expect to unblock eval_per_device_batch_size < 1. with this PR?
And do you happen to have some test results?

@aireenmei
Collaborator Author

@ZhiyuLi-goog It should support eval_per_device_batch_size < 1. But let me actually run the tests to make sure. Will get back to you.

@aireenmei
Collaborator Author

@khatwanimohit @ZhiyuLi-goog
I made more changes. Now it supports eval_per_device_batch_size < 1. Could both of you review again?

Cases tested on v4-128:
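For context on why eval_per_device_batch_size < 1 needs special handling, here is a minimal sketch of the batch-size arithmetic involved. The function name and rounding choice are illustrative assumptions, not MaxText's actual implementation: a fractional per-device value still has to resolve to an integer global eval batch across all devices of the slice.

```python
import math

def global_eval_batch_size(eval_per_device_batch_size: float, num_devices: int) -> int:
    """Hypothetical helper: map a (possibly fractional) per-device batch
    size to an integer global batch size for the whole slice.

    With eval_per_device_batch_size < 1, each step's global batch is
    smaller than the device count, so only a subset of hosts needs to
    load data; rounding up keeps the global batch at least 1.
    """
    return math.ceil(eval_per_device_batch_size * num_devices)
```

For example, on a v4-128 slice (64 chips), eval_per_device_batch_size=0.5 would yield a global eval batch of 32 under this sketch.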

@ZhiyuLi-goog
Collaborator

> @khatwanimohit @ZhiyuLi-goog I made more changes. Now it supports eval_per_device_batch_size < 1. Could both of you review again?
>
> Cases tested on v4-128:

Thank you @aireenmei for those experiments!

Looks great to me!

Collaborator

@ZhiyuLi-goog ZhiyuLi-goog left a comment


Thank you @aireenmei. Awesome.
LGTM!

Collaborator

@khatwanimohit khatwanimohit left a comment


LGTM. Thank you @aireenmei

TRAIN_CMD="python3 MaxText/train.py MaxText/configs/base.yml \
  steps=$STEPS eval_steps=$EVAL_STEPS eval_interval=$STEPS \
- per_device_batch_size=8.0 learning_rate=3e-4 enable_checkpointing=false \
+ per_device_batch_size=1.0 learning_rate=3e-4 enable_checkpointing=false \
Collaborator


Do we want to change the per-device batch size here?

Collaborator Author


Thank you for catching it! This is for my testing only and shouldn't be checked in. I removed it.

@aireenmei aireenmei force-pushed the aireen/fix_eval branch 2 times, most recently from e400f7b to 45fc74d Compare October 13, 2024 22:19
@copybara-service copybara-service bot merged commit 20f57fb into main Oct 14, 2024
13 checks passed
@copybara-service copybara-service bot deleted the aireen/fix_eval branch October 14, 2024 16:22
@aireenmei
Collaborator Author

@ZhiyuLi-goog I realized _tfds_data_processing_c4_mlperf.py also needs an update. Please review my latest push.

@aireenmei aireenmei restored the aireen/fix_eval branch February 24, 2025 18:45
