Make _Join, _Joinable, _JoinHook public #62605

awgu · 2021-08-02T20:35:44Z

Overview:
This removes the preceding _ from _Join, _Joinable, and _JoinHook in preparation for adding the generic join context manager tutorial (see here). This also adds a docs page, which can be linked from the tutorial. Here is a render of the docs page.

Test Plan:
DistributedDataParallel.join():

touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception

ZeroRedundancyOptimizer:

gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py

NOTE: DDP overlap tests are failing due to a landing race. See #62592. Once the fix is landed, I will rebase, and tests should be passing.

Join:

gpurun4 python test/distributed/algorithms/test_join.py

facebook-github-bot · 2021-08-02T20:35:50Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/62605
📄 Preview docs built from this PR

💊 CI failures summary and remediations

As of commit e0cc3ac (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


awgu · 2021-08-02T20:38:02Z

Should Joinable._join_hook(), Joinable._join_device(), and Joinable._join_process_group() be made public as well (i.e. have their preceding _ removed)?

facebook-github-bot · 2021-08-02T22:09:13Z

@andwgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

mrshenli

LGTM. Added some minor comments. Thanks!

mrshenli · 2021-08-03T01:56:35Z

torch/distributed/algorithms/join.py

        ``throw_on_early_termination`` is enabled, both of which using an all-
        reduce.

    Arguments:


This renders a bit differently than other modules. E.g., below is DDP's parameters:

And this is Join's parameters:

Here is DDP's args docstring, though I am not sure whether changing Arguments to Args is sufficient. This is a minor thing, though; we can fix it in follow-up PRs.

pytorch/torch/nn/parallel/distributed.py

Lines 385 to 388 in c4196be

    Args:
        module (Module): module to be parallelized
        device_ids (list of int or torch.device): CUDA devices.

I have not looked into it too deeply, but I think Sphinx may have been updated recently (#61601). When I look at DistributedDataParallel's render from my local build, it looks similar, and changing Arguments to Args does not make a difference.
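For reference, the `Args:` layout discussed in this thread can be sketched as below. `Wrapper` and `section_headers` are hypothetical names used only for illustration; they are not part of PyTorch. To my understanding, Sphinx's Napoleon extension treats `Args` and `Arguments` as aliases for the same section, which would be consistent with the observation that swapping them does not change the render, but that depends on the docs build configuration.

```python
import inspect


class Wrapper:
    """Hypothetical stand-in used only to illustrate the docstring layout.

    Args:
        module (object): module to be parallelized
        device_ids (list of int): CUDA devices.
    """

    def __init__(self, module, device_ids=None):
        self.module = module
        self.device_ids = device_ids


def section_headers(obj):
    """Return unindented docstring lines that end with a colon."""
    doc = inspect.getdoc(obj) or ""
    return [
        line
        for line in doc.splitlines()
        if line and not line.startswith(" ") and line.endswith(":")
    ]
```

Calling `section_headers(Wrapper)` lists the section headers that a docs tool would see once the docstring indentation is normalized.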

mrshenli · 2021-08-03T01:59:22Z

torch/distributed/algorithms/join.py


    @abstractmethod
-    def _join_hook(self, **kwargs) -> _JoinHook:
+    def _join_hook(self, **kwargs) -> JoinHook:


Do we need to make join_hook, join_device, and join_process_group public?

#62605 (comment)
I was wondering that. I will make them public.

mrshenli · 2021-08-03T02:00:12Z

torch/distributed/algorithms/join.py

        """
        ...




I couldn't comment on line 81, so adding comments here. Any reason _join_process_group's return type is Any? Is it because ProcessGroup is not a public type?

Yes, that is the reason. I think for now, we have to type all process groups as Any.

mrshenli · 2021-08-03T02:02:56Z

torch/distributed/algorithms/join.py

    To implement a join hook for the generic join context manager, define a
-    class that inherits from :class:`_JoinHook`, override ``main_hook()`` and
+    class that inherits from :class:`JoinHook`, override ``main_hook()`` and
    ``post_hook()`` as appropriate, and override ``device()`` and


device() and process_group() methods are not available in JoinHook. Do you mean join_device() and join_process_group() in Joinable?

Good catch. This is leftover from when device and process_group were part of JoinHook. I will fix this.
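To make the split discussed in this thread concrete, here is a minimal standard-library sketch of the interface shape after the rename: the hook carries `main_hook()` and `post_hook()`, while the joinable carries `join_hook()`, `join_device`, and `join_process_group`. Every class below is a stand-in (the `Sketch` suffix, `CountingHook`, `Counter`, and `shadow` are hypothetical names); none of them is the real `torch.distributed.algorithms.join` code. The `is_last_joiner` parameter and the use of properties are assumptions based on my reading of the PyTorch API, and the process group is typed as `Any` as explained in the earlier thread.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class JoinHookSketch:
    """Stand-in for JoinHook: both hooks default to no-ops."""

    def main_hook(self) -> None:
        """Run once per iteration that a joined rank has to shadow."""

    def post_hook(self, is_last_joiner: bool) -> None:
        """Run once after all ranks have joined (parameter is an assumption)."""


class JoinableSketch(ABC):
    """Stand-in for Joinable: device and process group live here, not on the hook."""

    @abstractmethod
    def join_hook(self, **kwargs) -> JoinHookSketch:
        ...

    @property
    @abstractmethod
    def join_device(self) -> Any:
        ...

    @property
    @abstractmethod
    def join_process_group(self) -> Any:
        # Typed as Any because ProcessGroup is not a public type.
        ...


class CountingHook(JoinHookSketch):
    """Hypothetical hook that records which callbacks were invoked."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def main_hook(self) -> None:
        self.log.append("main")

    def post_hook(self, is_last_joiner: bool) -> None:
        self.log.append(f"post(last={is_last_joiner})")


class Counter(JoinableSketch):
    """Hypothetical joinable that hands out a CountingHook."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def join_hook(self, **kwargs) -> JoinHookSketch:
        return CountingHook(self.log)

    @property
    def join_device(self) -> Any:
        return "cpu"  # placeholder value; the real API returns a device object

    @property
    def join_process_group(self) -> Any:
        return None  # placeholder; no process group exists in this sketch


def shadow(joinable: JoinableSketch, pending_iterations: int, is_last_joiner: bool) -> List[str]:
    """Toy driver, NOT the real Join logic: call main_hook N times, then post_hook."""
    hook = joinable.join_hook()
    for _ in range(pending_iterations):
        hook.main_hook()
    hook.post_hook(is_last_joiner)
    return joinable.log
```

The toy `shadow` driver only shows the order in which the hooks fire; the real context manager coordinates ranks through collective communication, which this sketch does not attempt.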

mrshenli · 2021-08-03T02:04:11Z

docs/source/distributed.algorithms.join.rst

+
+Generic Join Context Manager
+============================
+


Shall we add a short paragraph describing the purpose of this join context manager?

TODO: when this lands, and when the tutorial lands, let's also add a link on this doc page pointing to the tutorial page.

facebook-github-bot · 2021-08-03T03:47:48Z

@andwgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-08-03T19:22:06Z

@andwgu merged this pull request in 62a90c2.

Summary: Addresses: #62605 (comment) Pull Request resolved: #62785 Test Plan: I checked the render, and the link redirects as desired. Reviewed By: mrshenli Differential Revision: D30133229 Pulled By: andwgu fbshipit-source-id: baefe0d1f1b78ece44bb42e67629bc130dbf8e9a

awgu requested review from H-Huang, cbalioglu, mingzhe09088, mrshenli, pritamdamania87, rohan-varma, wayi1 and zhaojuanmao as code owners August 2, 2021 20:35

facebook-github-bot added oncall: distributed Add this issue/PR to distributed oncall triage queue cla signed labels Aug 2, 2021

Make _Join, _Joinable, _JoinHook public

fc268de

awgu force-pushed the public_join branch from b0858cf to fc268de Compare August 3, 2021 01:02

mrshenli approved these changes Aug 3, 2021

View reviewed changes

Make _join_hook, _join_process_group, and _join_device public

e0cc3ac

facebook-github-bot closed this in 62a90c2 Aug 3, 2021

facebook-github-bot added the Merged label Aug 3, 2021

This was referenced Aug 3, 2021

Add Tutorial for Generic Join Context Manager pytorch/tutorials#1610

Merged

Add tutorial link #62785

Closed

awgu deleted the public_join branch February 3, 2022 00:44

