fix eval when using subset host loading data by aireenmei · Pull Request #951 · AI-Hypercomputer/maxtext · GitHub

Conversation

aireenmei
Collaborator

@aireenmei aireenmei commented Oct 6, 2024

@aireenmei aireenmei marked this pull request as ready for review October 7, 2024 17:31
@aireenmei aireenmei requested a review from ZhiyuLi-goog October 7, 2024 17:32
Collaborator

@khatwanimohit khatwanimohit left a comment


LGTM!

@ZhiyuLi-goog
Collaborator

Hi, @aireenmei

Awesome and thank you for the fix!

Do we expect to unblock eval_per_device_batch_size < 1. with this PR?
And do you happen to have some test results?

@aireenmei
Collaborator Author

@ZhiyuLi-goog It should support eval_per_device_batch_size < 1. But let me actually run the tests to make sure. Will get back to you.

@aireenmei
Collaborator Author

@khatwanimohit @ZhiyuLi-goog
I made more changes. Now it supports eval_per_device_batch_size < 1. Could both of you review again?

Cases tested on v4-128:
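For context on why eval_per_device_batch_size < 1 needs special handling, here is a minimal sketch of the batch-size arithmetic involved. The function name and rounding choice are illustrative assumptions, not MaxText's actual implementation: a fractional per-device value still has to resolve to an integer global eval batch across all devices of the slice.

```python
import math

def global_eval_batch_size(eval_per_device_batch_size: float, num_devices: int) -> int:
    """Hypothetical helper: map a (possibly fractional) per-device batch
    size to an integer global batch size for the whole slice.

    With eval_per_device_batch_size < 1, each step's global batch is
    smaller than the device count, so only a subset of hosts needs to
    load data; rounding up keeps the global batch at least 1.
    """
    return math.ceil(eval_per_device_batch_size * num_devices)
```

For example, on a v4-128 slice (64 chips), eval_per_device_batch_size=0.5 would yield a global eval batch of 32 under this sketch.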

@ZhiyuLi-goog
Collaborator

> @khatwanimohit @ZhiyuLi-goog I made more changes. Now it supports eval_per_device_batch_size < 1. Could both of you review again?
>
> Cases tested on v4-128:

Thank you @aireenmei for those experiments!

Looks great to me!

Collaborator

@ZhiyuLi-goog ZhiyuLi-goog left a comment


Thank you @aireenmei. Awesome.
LGTM!

Collaborator

@khatwanimohit khatwanimohit left a comment


LGTM. Thank you @aireenmei

TRAIN_CMD="python3 MaxText/train.py MaxText/configs/base.yml \
  steps=$STEPS eval_steps=$EVAL_STEPS eval_interval=$STEPS \
- per_device_batch_size=8.0 learning_rate=3e-4 enable_checkpointing=false \
+ per_device_batch_size=1.0 learning_rate=3e-4 enable_checkpointing=false \
Collaborator


Do we want to change the per-device batch size here?

Collaborator Author


Thank you for catching it! This is for my testing only and shouldn't be checked in. I removed it.

@aireenmei aireenmei force-pushed the aireen/fix_eval branch 2 times, most recently from e400f7b to 45fc74d Compare October 13, 2024 22:19
@copybara-service copybara-service bot merged commit 20f57fb into main Oct 14, 2024
13 checks passed
@copybara-service copybara-service bot deleted the aireen/fix_eval branch October 14, 2024 16:22
@aireenmei
Collaborator Author

@ZhiyuLi-goog I realized _tfds_data_processing_c4_mlperf.py also needs an update. Please review my latest push.

@aireenmei aireenmei restored the aireen/fix_eval branch February 24, 2025 18:45
