[RFC] Allow lazy global init + eager subgroup init

### 🚀 The feature, motivation and pitch

Today, eager init is enabled by passing the `device_id` kwarg to `init_process_group`:

> device_id ([torch.device](https://pytorch.org/docs/stable/tensor_attributes.html#torch.device), optional) – a single, specific device to “bind” this process to, allowing for backend-specific optimizations. Currently this has two effects, only under NCCL: the communicator is immediately formed (calling ncclCommInit* immediately rather than the normal lazy call) and sub-groups will use ncclCommSplit when possible to avoid unnecessary overhead of group creation. If you want to know NCCL initialization error early, you can also use this field.

Passing it to `init_process_group` eagerly initializes ***all*** groups, including the global one.
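For concreteness, here is a minimal sketch of the current behavior. It assumes a launch via `torchrun` (which sets `LOCAL_RANK`), one GPU per process, and the NCCL backend; the helper name `init_eager_everywhere` and the choice of subgroup ranks are illustrative only.

```python
import os


def init_eager_everywhere():
    """Sketch of today's behavior: device_id makes init eager for every group."""
    # Imported inside the function so that merely defining this helper
    # does not require PyTorch or a GPU to be present.
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])  # assumption: set by torchrun
    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)

    # Passing device_id binds this process to one GPU and, under NCCL,
    # forms the global communicator immediately instead of lazily.
    dist.init_process_group(backend="nccl", device_id=device)

    # Illustrative subgroup: the first half of the ranks (at least one rank).
    # Every process must call new_group with the same ranks.
    world_size = dist.get_world_size()
    sub_ranks = list(range(max(1, world_size // 2)))
    subgroup = dist.new_group(ranks=sub_ranks)
    return subgroup
```

With this setup there is no way to keep the global group lazy while still getting eager subgroups, which is the gap this RFC is about.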

However, in 3D training, eagerly initializing the global group may be undesirable if no communication ever runs on it.

We should provide a way for users to eagerly initialize the subgroups **only**.

One potential way is to add a `device` kwarg to the `new_group` API as well.
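To make the intended semantics concrete, below is a toy model in plain Python. It is not PyTorch code: `ToyGroup` and the two functions are hypothetical stand-ins that only track whether a group's communicator has been created, and the `device` kwarg on `new_group` is the proposal from this issue, not an existing API. It also does not model how the communicator would actually be created under NCCL.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyGroup:
    """Hypothetical stand-in for a process group; tracks communicator state only."""
    ranks: List[int]
    device: Optional[str] = None  # per-group device binding (the proposed knob)
    comm_ready: bool = False

    def __post_init__(self):
        # Eager path: a bound device means the communicator is formed right away.
        if self.device is not None:
            self._init_comm()

    def _init_comm(self):
        self.comm_ready = True

    def all_reduce(self, value):
        # Lazy path: the communicator is formed on first use.
        if not self.comm_ready:
            self._init_comm()
        return value


def init_process_group(world_size, device_id=None):
    # Toy version: without device_id the global group stays lazy.
    return ToyGroup(list(range(world_size)), device=device_id)


def new_group(ranks, device=None):
    # Toy version of the proposal: `device` opts this subgroup into eager init.
    return ToyGroup(list(ranks), device=device)


world = init_process_group(world_size=8)          # global group: lazy
tp = new_group([0, 1, 2, 3], device="cuda:0")     # subgroup: eager
```

In this model the global group never creates a communicator unless a collective is actually issued on it, while the subgroup is ready (and would surface initialization errors) as soon as it is created.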

### Alternatives

_No response_

### Additional context

_No response_

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[RFC] Allow lazy global init + eager subgroup init #137018

🚀 The feature, motivation and pitch

Alternatives

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[RFC] Allow lazy global init + eager subgroup init #137018

Description

🚀 The feature, motivation and pitch

Alternatives

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions