Labels: oncall: distributed
🐛 Describe the bug
In the source code of distributed_c10d.py, there is an inconsistency between a code comment and the actual behavior. The comment states that when the default value Backend.UNDEFINED is used, both the gloo and nccl backends will be created, but according to the code only the backend for the current accelerator is initialized, not both.
I also found a previously merged PR, #142216, which mentions changing the default behavior to initialize only a single communication backend.
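For context, this code path is reached by a plain default call like the single-process sketch below (the env:// rendezvous values here are assumptions for illustration, not part of the report):

```python
import os

import torch.distributed as dist

# Assumed rendezvous values for a single-process env:// run (illustration only).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# No backend argument: init_process_group falls back to Backend.UNDEFINED and
# passes it down to _new_process_group_helper, hitting the branch quoted below.
dist.init_process_group(rank=0, world_size=1)

dist.destroy_process_group()
```

The relevant excerpt from _new_process_group_helper: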
```python
def _new_process_group_helper(
    group_size,
    group_rank,
    global_ranks_in_group,
    backend,
    store,
    group_name,
    backend_options=None,
    timeout=None,
    pg_tag=None,
    device_id=None,
    group_desc=None,
):
    ...
    if "," not in str(backend) and ":" not in str(backend):
        assert backend in Backend.backend_type_map, f"Unknown backend type {backend}"
        if backend == Backend.UNDEFINED:
            # Currently when backend is UNDEFINED, both ``gloo`` and ``nccl`` backends
            # will be created, we use nccl(if cuda is available) or gloo as default
            # backend so we can correctly call getDefaultBackend which in ProcessGroup.
            if Backend.NCCL in backend_config.get_device_backend_map().values():
                pg._set_default_backend(ProcessGroup.BackendType.NCCL)
            else:
                pg._set_default_backend(ProcessGroup.BackendType.GLOO)
        else:
            pg._set_default_backend(Backend.backend_type_map[backend])
```
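To make the selection concrete, here is a standalone sketch of the same decision; the plain dicts stand in for backend_config.get_device_backend_map() and are not the real BackendConfig object:

```python
def pick_default_backend(device_backend_map):
    """Mirror of the UNDEFINED branch above: prefer nccl when any device
    maps to it, otherwise fall back to gloo."""
    if "nccl" in device_backend_map.values():
        return "nccl"
    return "gloo"

# With CUDA available the map contains an nccl entry, so nccl is chosen:
print(pick_default_backend({"cpu": "gloo", "cuda": "nccl"}))  # -> nccl
# On a CPU-only host only gloo is mapped:
print(pick_default_backend({"cpu": "gloo"}))  # -> gloo
```

Note that this branch only sets the default backend type on the process group; under the current behavior only the accelerator's backend is actually instantiated, which is exactly the mismatch with the comment quoted above.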
Versions

PyTorch 2.8.0
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim