[Reland2][DDP] Merge work and future_work in reducer #59574
Conversation
Remove the `work` attribute from the Reducer class in favor of `future_work`. Additionally, remove the `copy_grad_to_bucket` method, since it is now a one-line implementation, and add a new C++ comm hook called `_AllReduceCommHookWithDivFactor` that replaces the built-in allreduce and also supports handling uneven inputs.
1) Compared with the reverted #58937, `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` now applies the division first, and hence avoids FP16 overflow.
2) Compared with the reverted #59520, `test_DistributedDataParallel_non_default_stream` is disabled on AMD, because applying the division first hurts the gradient averaging accuracy on AMD. See [07:48:26]: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.2-py3.6-test1/1129/console
Original PR Issue: #41266
Differential Revision: [D28940800](https://our.internmc.facebook.com/intern/diff/D28940800/)
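For illustration, here is a minimal Python sketch of the divide-before-allreduce idea behind `_AllReduceCommHookWithDivFactor`, written against the public DDP comm hook API rather than the internal C++ hook added here. It assumes a recent PyTorch where `GradBucket.buffer()` is available, and it simply uses the world size as the division factor, whereas the actual hook also adjusts the factor to handle uneven inputs.

```python
import torch
import torch.distributed as dist


def allreduce_with_div_first_hook(
    state: object, bucket: dist.GradBucket
) -> torch.futures.Future[torch.Tensor]:
    """Illustrative hook: divide the bucket BEFORE the allreduce."""
    # Dividing first keeps the values being summed small, so the FP16 sum
    # across ranks stays within range (see point 1 above).
    div_factor = dist.get_world_size()
    tensor = bucket.buffer().div_(div_factor)

    fut = dist.all_reduce(tensor, async_op=True).get_future()

    def extract(fut: torch.futures.Future) -> torch.Tensor:
        # The allreduce future resolves to a list containing a single tensor.
        return fut.value()[0]

    return fut.then(extract)


# Hypothetical usage, assuming `ddp_model` is a DistributedDataParallel instance:
# ddp_model.register_comm_hook(state=None, hook=allreduce_with_div_first_hook)
```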
💊 CI failures summary and remediations: as of commit 65c1c2a (more details on the Dr. CI page), 1 failure was not recognized by patterns.
LGTM to unblock for now; it would be great to file an issue to investigate why it fails on ROCm. Thanks!
A couple of questions:
@jithunnair-amd See [07:48:26]:
This pull request has been merged in 6575975.
Summary: Pull Request resolved: pytorch#59574

Remove the `work` attribute from the Reducer class in favor of `future_work`. Additionally, remove the `copy_grad_to_bucket` method, since it is now a one-line implementation, and add a new C++ comm hook called `_AllReduceCommHookWithDivFactor` that replaces the built-in allreduce and also supports handling uneven inputs.
1) Compared with the reverted pytorch#58937, `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` now applies the division first, and hence avoids FP16 overflow.
2) Compared with the reverted pytorch#59520, `test_DistributedDataParallel_non_default_stream` is disabled on AMD, because applying the division first hurts the gradient averaging accuracy on AMD. See [07:48:26]: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.2-py3.6-test1/1129/console

Original PR Issue: pytorch#41266
ghstack-source-id: 130752393

Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_non_default_stream

Reviewed By: rohan-varma
Differential Revision: D28940800
fbshipit-source-id: 1ba727ac951ebc1e7875dc1a1be8108a2c8d9462
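To make point (1) above concrete, here is a toy, single-process illustration (hypothetical numbers, no actual allreduce) of why summing FP16 gradients across ranks before dividing can overflow, while dividing first keeps every intermediate value within FP16 range:

```python
import torch

# FP16 can only represent values up to 65504, so a sum of large gradients
# across many ranks can overflow to inf before the averaging division happens.
world_size = 8
per_rank_grad = torch.full((1,), 10000.0, dtype=torch.float16)

# Sum first, divide later (the order in the reverted #58937):
# 8 * 10000 = 80000 > 65504, so the simulated allreduce sum overflows.
summed = per_rank_grad * world_size
print(summed / world_size)        # tensor([inf], dtype=torch.float16)

# Divide first, sum later (the order used by _AllReduceCommHookWithDivFactor):
# each rank contributes 10000 / 8 = 1250, and the sum is a finite 10000.
pre_divided = per_rank_grad / world_size
print(pre_divided * world_size)   # tensor([10000.], dtype=torch.float16)
```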
… of fusing copy and division when no comm hook is specified (#61379)

Summary: Pull Request resolved: #61379

The optimization was accidentally removed in #59574. It saves a scan over all of the input parameters by fusing the copy and div operations. Now the default temporary hook is allreduce by sum, and no extra division is done inside the hook.

ghstack-source-id: 133288529

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_non_default_stream
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_sparse_gradient
buck test mode/dev-nosan caffe2/test/distributed:c10 -- test_ddp_checkpointing_once
buck test mode/dev-nosan caffe2/test/distributed:c10 -- test_ddp_checkpointing_twice

Reviewed By: rohan-varma
Differential Revision: D29597614
fbshipit-source-id: 2434e4fd4e6abad7871cfe47886fe97b6e4ba28f
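As a rough sketch of the copy/division fusion that this follow-up restores (the tensor names and sizes below are made up; the real logic lives in the C++ reducer), dividing while copying each gradient into the flat bucket buffer removes the need for a second pass over all of the bucketed parameters:

```python
import torch

world_size = 8
# Hypothetical stand-ins for per-parameter gradients and the flat bucket buffer.
grads = [torch.randn(4), torch.randn(6)]
bucket_buffer = torch.empty(sum(g.numel() for g in grads))

# Unfused: copy every gradient into the bucket, then divide the whole bucket,
# which costs an extra scan over all of the input parameters.
offset = 0
for g in grads:
    bucket_buffer[offset:offset + g.numel()].copy_(g.view(-1))
    offset += g.numel()
bucket_buffer.div_(world_size)

# Fused: divide while copying, writing straight into the bucket view, so the
# copy and the division happen in a single pass; the default hook can then be
# a plain allreduce-by-sum with no extra division inside it.
offset = 0
for g in grads:
    view = bucket_buffer[offset:offset + g.numel()]
    torch.div(g.view(-1), world_size, out=view)
    offset += g.numel()
```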
May I ask for your advice:
Stack from ghstack:
Remove `work` attribute from Reducer class in favor of `future_work`.
Additionally, remove `copy_grad_to_bucket` method since it is now a one-line implementation, and create a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also support handling uneven inputs.
1) Compared with the reverted [DDP] Merge work and future_work in reducer #58937, updated `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` to apply division first and hence avoid FP16 overflow.
2) Compared with the reverted [Reland][DDP] Merge work and future_work in reducer #59520, disabled `test_DistributedDataParallel_non_default_stream` on AMD, because now applying division first hurts the gradient averaging accuracy on AMD.
See [07:48:26]: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.2-py3.6-test1/1129/console
Original PR Issue: #41266
Differential Revision: D28940800