Add overlap with DDP to ZeRO (two approaches) #62157
Conversation
💊 CI failures summary and remediations (Dr. CI): as of commit ab54c52, 💚 Looks good so far! There are no failures yet. 💚
Would I be correct to assume this has no impact on non-overlapping ZeRO performance or correctness? If so, I don't have concerns about landing this. We can land and improve in followup PRs.
This map also lives in DistributedOptimizer. Shall we consolidate these two into a single source of truth?
pytorch/torch/distributed/optim/optimizer.py, lines 196 to 209 in 73f1e2d:

```python
# dict to map a user passed in optimizer_class to a functional
# optimizer class if we have already defined inside the
# distributed.optim package, this is so that we hide the
# functional optimizer to user and still provide the same API.
functional_optim_map = {
    optim.Adagrad: _FunctionalAdagrad,
    optim.Adam: _FunctionalAdam,
    optim.AdamW: _FunctionalAdamW,
    optim.SGD: _FunctionalSGD,
    optim.Adadelta: _FunctionalAdadelta,
    optim.RMSprop: _FunctionalRMSprop,
    optim.Rprop: _FunctionalRprop,
    optim.Adamax: _FunctionalAdamax,
}
```
This looks fine for now. But let's add a followup PR to automatically do the translation for users, similar to how DistributedOptimizer handles this.
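For illustration, a sketch of what that automatic translation could look like, assuming the functional_optim_map from the snippet above is in scope (the helper name here is hypothetical):

```python
import torch.optim as optim

def _as_functional_optim(optimizer_class):
    # Hypothetical helper: look up the functional counterpart of a
    # user-facing optimizer class, erroring out if none is registered.
    try:
        return functional_optim_map[optimizer_class]
    except KeyError:
        raise ValueError(
            f"{optimizer_class} does not have a functional counterpart"
        )

functional_class = _as_functional_optim(optim.Adam)  # -> _FunctionalAdam
```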
nit (can be done in followup PRs): curious, since callbacks on buckets fire in order, could this be simplified to a list of GradBucket objects?
For a given rank, we only need to store references to the GradBuckets that the rank uses to update. As such, even though the buckets fire in order, the "keys" for this data structure are only a subset of all indices.
If the total number of buckets is not large, using a list and either storing all buckets or storing Nones in unused positions may be faster than a dict, but I used a dict for now.
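For concreteness, a small sketch of the two storage options (names and counts here are illustrative; strings stand in for GradBucket objects so the snippet is self-contained):

```python
from typing import Dict, List, Optional

NUM_BUCKETS = 8    # assumed total number of DDP buckets
assigned = [1, 5]  # bucket indices this rank updates

# Dict approach: keys are only the assigned subset of bucket indices.
buckets_by_index: Dict[int, str] = {i: f"GradBucket {i}" for i in assigned}

# List alternative: one slot per bucket, None in unused positions;
# indexing avoids hashing at the cost of storing placeholders.
buckets_list: List[Optional[str]] = [None] * NUM_BUCKETS
for i in assigned:
    buckets_list[i] = f"GradBucket {i}"
```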
Ah, I see, makes sense.
nit (can be done in followup PRs): it looks like each Future in this dict is read just once, so this cache probably won't save on Python-C++ context-switch cost. Maybe this can be deleted and bucket.get_future() used directly?
In general, I would try to avoid the state vars unless that would give us either perf gain or considerable code simplification. As more vars might make it harder to understand and maintain.
I may be misunderstanding, but the reason I stored this dict is to have a reference to all relevant Futures when running the hook for the last bucket so that I can call wait() on them.

bucket is of type GradBucket, which does not offer a get_future() method; rather, c10d.ProcessGroup.work objects provide get_future(). At the moment, this seems a bit backward: the hook passed into hook_then_zero_step() (most typically allreduce_hook()) returns a Future, where that Future is exactly obtained via dist.all_reduce().get_future().

Either way, I do not see how to get around storing at least something per bucket-to-update that gives the rank access to the bucket's corresponding Future when executing the hook for the last bucket.
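As a rough sketch of this store-then-wait pattern (the function and variable names are illustrative, not the PR's actual code):

```python
# Futures accumulated across hook invocations, keyed by bucket index.
_bucket_index_to_future = {}

def _hook_then_zero_step_sketch(wrapped_hook, state, bucket, bucket_index,
                                is_last_bucket):
    # The wrapped hook (most typically allreduce_hook) returns the future
    # obtained via dist.all_reduce(...).get_future() for this bucket.
    fut = wrapped_hook(state, bucket)
    _bucket_index_to_future[bucket_index] = fut
    if is_last_bucket:
        # Only now does the rank hold references to every relevant
        # future, so the optimizer step can wait on all of them.
        for f in _bucket_index_to_future.values():
            f.wait()
    return fut
```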
curious, why this is device_to_params instead of device_to_buckets?
Good point -- I will change it. It was from before I created the class _DDPBucket, when I instead had a List[torch.Tensor] as the innermost element of the dict value.
(can be done in followup PRs) This conceptually overlaps with DDP's Gradient Bucket. We probably want to let DDP expose its GradBuckets instead of creating new classes?
I think that it may be possible to get rid of this class altogether with how DDP's GradBucket is exposed currently. I can investigate this in a follow-up PR.
This is still the allreduce future, right? Meaning that waiting on this future (by DDP backward()) won't guarantee that the ZeRO broadcast is done? We probably need to return torch.futures.collect_all(all_zero_broadcast_futures) for the last bucket instead.
Among these reviews, this is the most important one, as it might have an impact on correctness.
Resolved offline.
@andwgu Do you mind describing how this was resolved?
https://github.com/andwgu/pytorch/blob/ab54c52fa54dac017ff6c9eadae49a97aa117432/torch/distributed/algorithms/ddp_comm_hooks/ddp_zero_hook.py#L175
That line should provide the necessary synchronization already. (The wait() on the returned future in finalize_backward() will not guarantee that the ZeRO broadcasts are done, but these additional wait()s will.)
Also, as a side note, the torch.futures.collect_all(all_zero_broadcast_futures) would change the hook semantics since it would then return a future for the broadcasts instead of a future for the bucket all-reduce.
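To illustrate the semantics point, a self-contained snippet (not from the PR) showing that collect_all() produces a future over all of its inputs:

```python
import torch

futs = [torch.futures.Future(), torch.futures.Future()]
combined = torch.futures.collect_all(futs)  # Future[List[Future]]

for i, f in enumerate(futs):
    f.set_result(torch.full((1,), float(i)))

# Waiting on `combined` waits for *every* input future, which is why
# returning it from the hook would tie DDP's final wait to the ZeRO
# broadcasts rather than to the last bucket's all-reduce alone.
values = [f.value() for f in combined.wait()]
```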
@andwgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This is a great amount of work! Amazing job! Thanks a lot for putting in significant effort to explore, diagnose, and optimize different approaches and to deliver high-quality code and docs. The code looks good to me.
Not stamping yet, mainly because we need tests ;)
state (object): ... ?
Fixed.
Actually, I was reading this: https://stackoverflow.com/questions/39817081/typing-any-vs-object
Even though the type annotation for state in register_comm_hook() in distributed.py is object, I wonder if that should be Any instead. If I am understanding correctly, if you annotate state with object, then to appease a type checker, every operation involving state must be valid for any object whatsoever. In other words, annotating with object is actually restrictive rather than all-encompassing.
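A minimal example of the distinction, following mypy's rules:

```python
from typing import Any

class HookState:
    bucket_index = 0

def use_object(state: object) -> None:
    # A type checker rejects this: `object` has no attribute
    # `bucket_index`, since the annotation only permits operations
    # that are valid for *every* object.
    print(state.bucket_index)

def use_any(state: Any) -> None:
    # Accepted: Any disables attribute checking entirely.
    print(state.bucket_index)

use_any(HookState())
```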
Curious, is this the conventional way to document a function signature? Why does it not match Callable[[Any, dist.GradBucket], torch.futures.Future]?
Good catch. I accidentally used the type signature style from the language Standard ML.
:meth:step -> :meth:`step`?
Fixed.
print a warning if parameters_as_bucket_view=True?
Added a warning using logging.warning() in that case.
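Presumably something along these lines (the wording of the message is my paraphrase, not the PR's exact text, and the helper is hypothetical):

```python
import logging

def _warn_if_bucket_view(parameters_as_bucket_view: bool) -> None:
    # Hypothetical helper; the PR likely emits the warning inline.
    if parameters_as_bucket_view:
        logging.warning(
            "parameters_as_bucket_view=True is ineffective when "
            "overlap_with_ddp=True since DDP's buckets define the views"
        )
```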
add some explanation that ZeroRedundancyOptimizer is fully initialized after DDP buckets are rebuilt?
Added.
assert overlap_with_ddp == False?
Added.
IIUC, the recommended new API for this is param.set_(). Let's do that change in a followup PR, as there might be subtle differences.
https://pytorch.org/docs/stable/generated/torch.Tensor.set_.html#torch.Tensor.set_
One issue with using set_() is that we must set requires_grad=False and later reset it to True, since otherwise we get an error saying that a leaf Variable that requires grad is being used in an in-place operation.

More importantly, I am unable to get it to manage the strides correctly. It seems that if you specify an offset, then you must also specify size and stride arguments, and the method uses that stride when accessing the shared underlying storage. However, we want param to preserve its non-flattened stride while sharing storage with the flattened bucket. This suggests the need for additional view manipulation (which I have not figured out how to get correct). Hence, as far as I can see, using set_() would not be any simpler for this copying.
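For context, a self-contained sketch of view-based sharing that preserves the parameter's shape (my sketch under assumed sizes, not necessarily the PR's exact code), illustrating why set_()'s explicit size/stride arguments are awkward here:

```python
import torch

param = torch.nn.Parameter(torch.randn(2, 4))
bucket = torch.zeros(16)   # flattened storage shared by several params
offset = 0                 # this param's start index within the bucket

with torch.no_grad():
    numel = param.numel()
    # Copy the parameter into its slice of the flattened bucket ...
    bucket[offset:offset + numel].copy_(param.data.flatten())
    # ... and repoint the parameter at that slice, restoring its original
    # 2x4 shape; set_() would instead demand an explicit size/stride.
    param.data = bucket[offset:offset + numel].view(param.shape)

# The parameter now shares memory with the bucket.
bucket.fill_(1.0)
assert bool((param.data == 1.0).all())
```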
Do we need to error out if the backend is not NCCL? (as gloo might hang today)
Added check.
this is a bit different from the above hook, which also checks assert overlap_info.status == _OverlapStatus.UNINITIALIZED
This means there is no optimizer step in the first 2 iterations and gradients are discarded? Do we need to mention this in the doc, or do you plan to change this in followup PRs?
I added warnings to mention this in both hook functions and ZeroRedundancyOptimizer.
The motivation is to avoid initializing a non-functional local optimizer and then re-initializing a functional optimizer soon thereafter. Even if we did that, we would have to transfer the optimizer state correctly.
We do not know until DDP buckets are built which parameters a given rank is assigned to update. I have not yet come up with a clean solution that allows not throwing away the first two inputs.
Small note: The first three inputs are unused if static_graph=True for the DDP model.
(can be done in followup PRs)
We do not know until DDP buckets are built which parameters a given rank is assigned to update. I have not yet come up with a clean solution that allows not throwing away the first two inputs.
Since ZeRO's overlap state knows whether the buckets have been rebuilt, is it possible that, in the first 2-3 iterations, ZeroRedundancyOptimizer.step() still performs computation + communication the same way as in the no-overlap case, and only after the buckets are rebuilt does ZeroRedundancyOptimizer.step() become a no-op?
Yes, we can do that. Originally, I was averse to the idea of re-initializing the local optimizer, but upon further thought, providing an equivalent functionality to the user is definitely more important.
What we can do is assume parameters_as_bucket_view=False behavior until buckets are rebuilt. This means that the parameters will be partitioned according to the normal sorted-greedy algorithm and a local optimizer will be initialized as normal at construction time. Then, when buckets are rebuilt, the parameter partitioning will change and the local optimizer will be reinitialized. I can do this in a follow-up PR.
I made these uniform across the file to use two colons. I checked the render, and the warning still appears as desired.
Unit tests should be coming tomorrow.
I added unit tests. The tests are based on those by @rohan-varma:
However, I chose not to de-duplicate them since there were still some notable differences in the setup, and since we may want to keep ZeroRedundancyOptimizer tests fully separate from general distributed tests.
This will not actually be rendered since the method is private, but just in case, I think the docstring render expects a newline before any bulleted list.
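That is, something like:

```python
def _example(self) -> None:
    """One-line summary.

    - The blank line above lets this bulleted list render correctly.
    - Without it, the bullets get merged into the preceding paragraph.
    """
```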
Similar thing here.
This rename is to prevent implicit shadowing of the method's argument params_per_rank.
If not, this should error out, as it might hang with Gloo?
NVM, I saw there are TODOs to cover CPU tests as well.
Curious, why do we need a barrier here?
Good point. Removed.
sort?
Sorted.
Do we have to explicitly import distributed_c10d? If the goal was to use Backend.NCCL, will dist.Backend.NCCL work?
https://pytorch.org/docs/master/distributed.html#torch.distributed.Backend
Changed to dist.get_backend() and dist.Backend.NCCL.
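That is, a check along these lines (the exact error message is approximate, and the wrapping helper is hypothetical):

```python
import torch.distributed as dist

def _check_nccl_backend() -> None:
    # Assumes the default process group has already been initialized.
    if dist.get_backend() != dist.Backend.NCCL:
        raise RuntimeError(
            "Overlapping DDP with ZeroRedundancyOptimizer requires the "
            "NCCL backend; other backends (e.g., Gloo) may hang"
        )
```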
Let's cite issue #62300 here
Added.
LGTM. Left some minor comments. Stamp to unblock. Please wait for tests.
Shall we add a check here to error out if this assumption was violated? (maybe by remembering the last bucket index)
This is a good idea. Currently, I am a bit uncertain about how to do this; I will address it in a follow-up PR.

Remembering the last bucket index might be circular since the way we currently tell that a bucket is the last one is by its index. If the buckets fire in order, then we should be guaranteed that the CPU part of each hook executes in order, right? In that case, we can reuse the bucket_indices_seen data structure used in the other hook and check that the current bucket index is 1 more than the previous index seen (clearing it after each iteration).
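A rough sketch of that check, reusing names from this thread (the method names themselves are hypothetical):

```python
class _OverlapInfo:
    def __init__(self) -> None:
        self.bucket_indices_seen = []

    def check_fires_in_order(self, bucket_index: int) -> None:
        # Each hook call must see the index immediately following the
        # previously seen one if buckets truly fire in order.
        prev = self.bucket_indices_seen[-1] if self.bucket_indices_seen else -1
        assert bucket_index == prev + 1, (
            f"bucket {bucket_index} fired out of order (previous: {prev})"
        )
        self.bucket_indices_seen.append(bucket_index)

    def clear_per_iteration_info(self) -> None:
        # Reset at the end of each backward pass.
        self.bucket_indices_seen.clear()
```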
Let's add a short comment to explain why this cannot be chained as a callback to the returned future.
Added:

```python
# NOTE: This should not be chained as a callback to the last bucket's
# all-reduce future since that would add synchronization that delays
# all optimizer computation to wait for that last all-reduce
```
nit (can be done in a followup PR, and this is just my personal preference, please feel free to ignore): the _OverlapInfo type is not owned by this hook function and lives a bit far away from this file. It might be better to wrap its state manipulations into member methods on _OverlapInfo instead of directly manipulating them here and above.
I think this definitely makes sense. I will try to refactor this in a follow-up PR.
Ah, I see, makes sense.
Test failures on ROCm are real. If this is ROCm only, it's OK to skip tests on ROCm for now.
21:51:57 ======================================================================
21:51:57 ERROR [100.077s]: test_ddp_with_zero_step_interleaved_parity_gpu (__main__.TestZeroRedundancyOptimizerDistributed)
21:51:57 ----------------------------------------------------------------------
21:51:57 Traceback (most recent call last):
21:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 406, in wrapper
21:51:57 self._join_processes(fn)
21:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 608, in _join_processes
21:51:57 self._check_return_codes(elapsed_time)
21:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 653, in _check_return_codes
21:51:57 raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time))
21:51:57 RuntimeError: Process 0 terminated or timed out after 100.057626247406 seconds
21:51:57
21:51:57 ======================================================================
21:51:57 ERROR [100.061s]: test_ddp_with_zero_step_parity_gpu (__main__.TestZeroRedundancyOptimizerDistributed)
21:51:57 ----------------------------------------------------------------------
21:51:57 Traceback (most recent call last):
21:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 406, in wrapper
21:51:57 self._join_processes(fn)
21:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 608, in _join_processes
21:51:57 self._check_return_codes(elapsed_time)
21:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 653, in _check_return_codes
21:51:57 raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time))
21:51:57 RuntimeError: Process 0 terminated or timed out after 100.03943800926208 seconds
Lint failures are real. Please fix before landing.
I will ask @rohan-varma why he skips ROCm for his tests: it is likely a similar issue.

Lint issues were resulting from the ...
```diff
     ddp: DistributedDataParallel,
     zero: ZeroRedundancyOptimizer,
-) -> Callable[[Any, dist.GradBucket], torch.futures.Future]:
+) -> Callable[[Any, dist.GradBucket], torch.futures.Future[torch.Tensor]]:
```
I ran all of the tests via ci-all, and for a few of the tests, I saw:
ValueError: Communication hook: return annotation should be torch.futures.Future[torch.Tensor].
As a result, I changed torch.futures.Future -> torch.futures.Future[torch.Tensor].
Codecov Report
```
@@            Coverage Diff             @@
##           master   #62157      +/-   ##
==========================================
- Coverage   60.32%   60.21%   -0.12%
==========================================
  Files         660      661       +1
  Lines       86440    86710     +270
==========================================
+ Hits        52148    52215      +67
- Misses      34292    34495     +203
```
Overview: This refactors some commonalities between the two approaches to overlapping DDP with ZeRO. This also partially addresses this comment: #62157 (comment)

Test Plan:

```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```

Differential Revision: D30058543 (https://our.internmc.facebook.com/intern/diff/D30058543)
Overview:

This adds two approaches to overlapping DistributedDataParallel.backward() with ZeroRedundancyOptimizer.step() by providing two hook constructors: hook_with_zero_step() and hook_with_zero_step_interleaved(). The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.

Both approaches share changes to ZeroRedundancyOptimizer. A user should pass overlap_with_ddp=True to ZeroRedundancyOptimizer, construct a DDP communication hook using either hook_with_zero_step() or hook_with_zero_step_interleaved(), and register that communication hook. ZeroRedundancyOptimizer.step() should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the ZeroRedundancyOptimizer's local optimizer based on that bucketing. (A usage sketch follows the test plan below.)

Test Plan:

The existing ZeroRedundancyOptimizer tests pass, and new unit tests for both hooks pass:
- test_ddp_with_zero_step_parity_cpu (removed for now due to flakiness in CI -- under investigation; could possibly be a similar Gloo issue as with hook_with_zero_step_interleaved())
- test_ddp_with_zero_step_parity_gpu
- test_ddp_with_zero_step_interleaved_parity_gpu

These were tested on the AI AWS cluster.

An analogous test_ddp_with_zero_step_interleaved_parity_cpu is missing due to existing bugs with Gloo. See #62302.

Both approaches have been verified using an internal accuracy benchmark.
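Based on this description, a hedged end-to-end usage sketch (import paths follow the files touched by this PR, but exact signatures may have shifted; treat this as a sketch, not canonical API):

```python
import torch
import torch.distributed as dist
from torch.distributed.algorithms.ddp_comm_hooks.ddp_zero_hook import (
    hook_with_zero_step,
)
from torch.distributed.algorithms.ddp_comm_hooks.default_hooks import (
    allreduce_hook,
)
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes dist.init_process_group("nccl", ...) has already run and this
# process owns one GPU.
model = DDP(
    torch.nn.Linear(10, 10).cuda(),
    device_ids=[torch.cuda.current_device()],
)
zero = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.SGD,
    overlap_with_ddp=True,
    lr=1e-2,
)
model.register_comm_hook(None, hook_with_zero_step(allreduce_hook, model, zero))

for _ in range(5):  # recall: the first two iterations are vacuous
    loss = model(torch.randn(8, 10).cuda()).sum()
    loss.backward()
    zero.step()  # optimizer work actually originates from the comm hook
```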