[RFC] Allow lazy global init + eager subgroup init · Issue #137018 · pytorch/pytorch

@kwen2501

Description

🚀 The feature, motivation and pitch

Today eager init is enabled by providing a device_id kwarg to init_process_group:

device_id (torch.device, optional) – a single, specific device to “bind” this process to, allowing for backend-specific optimizations. Currently this has two effects, only under NCCL: the communicator is immediately formed (calling ncclCommInit* immediately rather than the normal lazy call) and sub-groups will use ncclCommSplit when possible to avoid unnecessary overhead of group creation. If you want to know NCCL initialization error early, you can also use this field.

Providing it to init_process_group eagerly initializes all groups, including the global one.

However, in 3D training, eagerly initializing the global group may not be desirable if no communication ever happens on it.

We should provide a way for users to eagerly initialize only the subgroups.

A potential way is to add a device kwarg to the new_group API as well.
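As a sketch of what that could look like, the snippet below computes the rank lists for the tensor-parallel subgroups of a flat layout, which is the input new_group would take; the device_id kwarg on new_group shown in the comment is the API proposed in this issue, not something PyTorch offers today.

```python
# Hedged sketch: build the rank lists for contiguous tensor-parallel
# subgroups. The commented-out call illustrates the *proposed* API --
# a device_id kwarg on new_group is hypothetical here.

def tensor_parallel_groups(world_size: int, tp_size: int) -> list[list[int]]:
    """Split ranks [0, world_size) into contiguous groups of tp_size."""
    assert world_size % tp_size == 0, "world_size must be divisible by tp_size"
    return [list(range(s, s + tp_size)) for s in range(0, world_size, tp_size)]

for ranks in tensor_parallel_groups(8, 2):
    # Proposed usage: eagerly init only these subgroups while the
    # global group stays lazy (device_id kwarg is the proposal):
    # dist.new_group(ranks, device_id=torch.device("cuda", local_rank))
    pass
```

With this shape, a 3D setup would call new_group once per subgroup it actually communicates on, and init_process_group would be left without a device_id so the global communicator is never formed.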

Alternatives

No response

Additional context

No response

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

Metadata

Labels

module: c10d (Issues/PRs related to collective communications and process groups)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
