feat: optimize refit by preparing refit info ahead of time #638
@yuki-666
Yup, I added it in ebb874a.
Thank you @yuki-666. LGTM!
Signed-off-by: Yuki Huang <yukih@nvidia.com>
…mcore for speedup Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
…Mo#638) Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
)" This reverts commit 8f7d71e
…Mo#638) Signed-off-by: Yuki Huang <yukih@nvidia.com>
…Mo#638) Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>
Separate the refit process changes from #613.
What does this PR do?
e_score_correction_bias) will change during training, so it needs special handling, and `refit_param_info_mcore` is not cached for now because of this.

Test Result
convergence
time cost
In mcore w/ packing (dsv3 w/ 64 tp)
*The ~20s overhead is due to offload.

Refit Process Changes
Colocated

Previous
1. `prepare_weights_for_ipc` on the train side.
2. `get_weights_ipc_handles` on the train side and `update_weights_from_ipc_handles` on the inference side.

Now
1. `prepare_refit_info` on the train side.
2. `prepare_weights_for_ipc` on the train side.
3. `get_weights_ipc_handles` on the train side and `update_weights_from_ipc_handles` on the inference side.

Non-colocated

Previous
1. `init_collective` on both the train and inference sides.
2. `prepare_info_for_collective` on the train side.
3. `broadcast_weights_for_collective` on the train side and `update_weights_from_collective` on the inference side.

Now
1. `init_collective` on both the train and inference sides.
2. `prepare_refit_info` on both the train and inference sides.
3. `broadcast_weights_for_collective` on the train side and `update_weights_from_collective` on the inference side.
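
The new call order can be illustrated with a toy sketch. It is not the project's code: the `Recorder` stub, the `setup_*` / `refit_*` / `run` helpers, and the argument-free calls are all hypothetical, and only the method names come from the lists above. It also assumes that `prepare_refit_info` (and `init_collective`) run once ahead of time while the remaining steps repeat on every refit, which is how the PR title reads. In the real system the train-side and inference-side halves of the last step presumably run together; here they are logged sequentially for simplicity.

```python
class Recorder:
    """Hypothetical stub worker: records which refit method was called."""

    def __init__(self, side, log):
        self.side = side
        self.log = log

    def __getattr__(self, name):
        # Only reached for names that are not real attributes, i.e. the
        # refit methods. Calling one appends "<side>.<method>" to the log.
        def call(*args, **kwargs):
            self.log.append(f"{self.side}.{name}")
        return call


def setup_colocated(train, inference):
    # Assumed to run once, before any refit.
    train.prepare_refit_info()


def refit_colocated(train, inference):
    # Assumed to run on every refit.
    train.prepare_weights_for_ipc()
    train.get_weights_ipc_handles()
    inference.update_weights_from_ipc_handles()


def setup_non_colocated(train, inference):
    # Assumed to run once, before any refit.
    train.init_collective()
    inference.init_collective()
    train.prepare_refit_info()
    inference.prepare_refit_info()


def refit_non_colocated(train, inference):
    # Assumed to run on every refit.
    train.broadcast_weights_for_collective()
    inference.update_weights_from_collective()


def run(setup, refit, steps):
    """Run one setup followed by `steps` refits and return the call log."""
    log = []
    train, inference = Recorder("train", log), Recorder("inference", log)
    setup(train, inference)
    for _ in range(steps):
        refit(train, inference)
    return log


if __name__ == "__main__":
    for line in run(setup_non_colocated, refit_non_colocated, steps=2):
        print(line)
```

Under these assumptions, the refit info is prepared once per run rather than once per refit, so the per-refit path in the non-colocated case is just the broadcast and the matching update.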