Description
🚀 The feature, motivation and pitch
Today, eager init is enabled by passing a device_id kwarg to init_process_group. From the documentation:
device_id (torch.device, optional) – a single, specific device to “bind” this process to, allowing for backend-specific optimizations. Currently this has two effects, only under NCCL: the communicator is immediately formed (calling ncclCommInit* immediately rather than the normal lazy call) and sub-groups will use ncclCommSplit when possible to avoid unnecessary overhead of group creation. If you want to know NCCL initialization error early, you can also use this field.
Passing it to init_process_group eagerly initializes all groups, including the global one.
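For reference, a minimal sketch of the current behavior. It assumes the job is launched with torchrun (which exports LOCAL_RANK), the NCCL backend, and one GPU per process. The helper names local_rank_from_env and init_eagerly are illustrative and not part of PyTorch.

```python
import os


def local_rank_from_env(default: int = 0) -> int:
    """Read the per-node rank that torchrun exports as LOCAL_RANK."""
    return int(os.environ.get("LOCAL_RANK", default))


def init_eagerly():
    """Bind this process to one GPU and initialize the global group eagerly."""
    # Imported lazily so the helper above stays usable without a GPU build.
    import torch
    import torch.distributed as dist

    device = torch.device("cuda", local_rank_from_env())
    torch.cuda.set_device(device)
    # Passing device_id requests eager NCCL communicator creation for the
    # global group; per the docs quoted above, subgroups created afterwards
    # use ncclCommSplit when possible.
    dist.init_process_group(backend="nccl", device_id=device)
    return device
```

With this pattern there is no way to opt the global group out: every rank pays for its communicator even if only subgroups are ever used.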
However, in 3D training, eagerly initializing the global group may not be desirable if no communication ever happens on it.
We should provide a way for users to eagerly initialize only the subgroups.
One possible approach is to add a device kwarg to the new_group API as well.
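A rough sketch of how the proposal might look from the user side. The device= kwarg on new_group is hypothetical: it is the API requested in this issue, and builds that lack it would raise a TypeError on that call. The contiguous rank layout and the helper names tensor_parallel_ranks and init_subgroups_eagerly are also assumptions made only for illustration.

```python
def tensor_parallel_ranks(world_size: int, tp_size: int) -> list[list[int]]:
    """Split ranks 0..world_size-1 into contiguous groups of tp_size."""
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    return [list(range(start, start + tp_size))
            for start in range(0, world_size, tp_size)]


def init_subgroups_eagerly(tp_size: int, local_rank: int):
    """Leave the global group lazy and eagerly initialize only the subgroups."""
    import torch
    import torch.distributed as dist

    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)
    # No device_id here, so the global group keeps the normal lazy init.
    dist.init_process_group(backend="nccl")

    my_group = None
    # new_group must be called by every rank for every group, in the same order.
    for ranks in tensor_parallel_ranks(dist.get_world_size(), tp_size):
        # HYPOTHETICAL kwarg: `device=` is what this issue proposes.
        group = dist.new_group(ranks=ranks, device=device)
        if dist.get_rank() in ranks:
            my_group = group
    return my_group
```

For example, tensor_parallel_ranks(8, 2) yields four two-rank groups, each of which would get its own eagerly created communicator while the eight-rank global communicator is never formed unless something communicates on it.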
Alternatives
No response
Additional context
No response
cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o