KEMBAR78
Make _Join, _Joinable, _JoinHook public by awgu · Pull Request #62605 · pytorch/pytorch · GitHub
Skip to content

Conversation

@awgu
Copy link
Collaborator

@awgu awgu commented Aug 2, 2021

Overview:
This removes the preceding _ from _Join, _Joinable, and _JoinHook in preparation for adding the generic join context manager tutorial (see here). This also adds a docs page, which can be linked from the tutorial. Here is a render of the docs page.

Test Plan:
DistributedDataParallel.join():

touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception

ZeroRedundancyOptimizer:

gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py

NOTE: DDP overlap tests are failing due to a landing race. See #62592. Once the fix is landed, I will rebase, and tests should be passing.

Join:

gpurun4 python test/distributed/algorithms/test_join.py

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Aug 2, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit e0cc3ac (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@awgu
Copy link
Collaborator Author

awgu commented Aug 2, 2021

Should Joinable._join_hook(), Joinable._join_device(), and Joinable._join_process_group() be made public as well (i.e. have their preceding _ removed)?

@facebook-github-bot
Copy link
Contributor

@andwgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Added some minor comments. Thanks!

``throw_on_early_termination`` is enabled, both of which using an all-
reduce.
Arguments:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This renders a bit differently than other modules. E.g., below is DDP's parameters:

Screen Shot 2021-08-02 at 9 53 40 PM

And this is Join's parameters:

Screen Shot 2021-08-02 at 9 53 51 PM

Here is how DDP's args docstring, though I am not sure if changing Arguments to Args is sufficient. But this is a minor thing, we can fix that in followup PRs.

Args:
module (Module): module to be parallelized
device_ids (list of int or torch.device): CUDA devices.

Copy link
Collaborator Author

@awgu awgu Aug 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ddp_render

I have not looked into it too deeply, but I think sphinx may have updated recently (#61601). When I look at DistributedDataParallel 's render from my local build, it is similar, and changing Arguments to Args does not make a difference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see


@abstractmethod
def _join_hook(self, **kwargs) -> _JoinHook:
def _join_hook(self, **kwargs) -> JoinHook:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to make join_hook, join_device, and join_process_group public?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#62605 (comment)
I was wondering that. I will make them public.

"""
...


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't comment on line 81, so adding comments here. Any reason _join_process_group 's return type is Any? Is it because ProcessGroup is not a public type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is the reason. I think for now, we have to type all process groups as Any.

To implement a join hook for the generic join context manager, define a
class that inherits from :class:`_JoinHook`, override ``main_hook()`` and
class that inherits from :class:`JoinHook`, override ``main_hook()`` and
``post_hook()`` as appropriate, and override ``device()`` and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

device() and process_group() methods are not available in JoinHook. Do you mean join_device() and join_process_group() in Joinable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This is leftover from when device and process_group were part of JoinHook. I will fix this.


Generic Join Context Manager
============================

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a short paragraph describing the purpose of this join context manager?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: when this lands, and when the tutorial lands, let's also add a link to this doc page to pointing to the tutorial page.

@facebook-github-bot
Copy link
Contributor

@andwgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@andwgu merged this pull request in 62a90c2.

facebook-github-bot pushed a commit that referenced this pull request Aug 6, 2021
Summary:
Addresses: #62605 (comment)

Pull Request resolved: #62785

Test Plan: I checked the render, and the link redirects as desired.

Reviewed By: mrshenli

Differential Revision: D30133229

Pulled By: andwgu

fbshipit-source-id: baefe0d1f1b78ece44bb42e67629bc130dbf8e9a
@awgu awgu deleted the public_join branch February 3, 2022 00:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants