The code comments are inconsistent with the source code logic in distributed_c10d.py · Issue #162157 · pytorch/pytorch · GitHub


@haochen-shen

Description


🐛 Describe the bug

In `distributed_c10d.py`, a code comment is inconsistent with the actual behavior. The comment says that when `backend` is `Backend.UNDEFINED`, both the `gloo` and `nccl` backends will be created. According to the code, however, when the default value `Backend.UNDEFINED` is used, only the backend for the detected accelerator is initialized, not both `gloo` and `nccl`.

I also found an earlier pull request, #142216, that mentions changing the default behavior so that only a single communication backend is initialized.

```python
def _new_process_group_helper(
    group_size,
    group_rank,
    global_ranks_in_group,
    backend,
    store,
    group_name,
    backend_options=None,
    timeout=None,
    pg_tag=None,
    device_id=None,
    group_desc=None,
):
    # ... (code omitted) ...
    if "," not in str(backend) and ":" not in str(backend):
        assert backend in Backend.backend_type_map, f"Unknown backend type {backend}"
        if backend == Backend.UNDEFINED:
            # Currently when backend is UNDEFINED, both ``gloo`` and ``nccl`` backends
            # will be created, we use nccl(if cuda is available) or gloo as default
            # backend so we can correctly call getDefaultBackend which in ProcessGroup.
            if Backend.NCCL in backend_config.get_device_backend_map().values():
                pg._set_default_backend(ProcessGroup.BackendType.NCCL)
            else:
                pg._set_default_backend(ProcessGroup.BackendType.GLOO)
        else:
            pg._set_default_backend(Backend.backend_type_map[backend])
```
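To make the inconsistency concrete, the `UNDEFINED` branch above can be modeled in plain Python. This is a minimal sketch under stated assumptions, not PyTorch code. A plain dict stands in for `backend_config.get_device_backend_map()`, plain strings stand in for `ProcessGroup.BackendType` values, and `pick_default_backend` and `backends_created` are hypothetical helpers. It also assumes, following the description above, that one backend is created per entry in the device map and that with `Backend.UNDEFINED` the map holds only the detected accelerator's entry.

```python
# Hypothetical model of the UNDEFINED branch in _new_process_group_helper.
# Plain strings stand in for Backend / ProcessGroup.BackendType values.
NCCL = "nccl"
GLOO = "gloo"


def pick_default_backend(device_backend_map: dict[str, str]) -> str:
    """Mirror the excerpt: NCCL if any device maps to it, otherwise GLOO."""
    if NCCL in device_backend_map.values():
        return NCCL
    return GLOO


def backends_created(device_backend_map: dict[str, str]) -> set[str]:
    """Assumption: one backend is created per entry in the device map."""
    return set(device_backend_map.values())


# What the comment implies: both gloo and nccl are created.
comment_view = {"cpu": GLOO, "cuda": NCCL}

# What the issue reports for Backend.UNDEFINED on a CUDA machine:
# only the accelerator's backend is initialized.
reported_view = {"cuda": NCCL}

# CPU-only machine under the same assumption.
cpu_only_view = {"cpu": GLOO}

for name, mapping in [
    ("comment", comment_view),
    ("reported", reported_view),
    ("cpu-only", cpu_only_view),
]:
    print(name, sorted(backends_created(mapping)), pick_default_backend(mapping))
```

Under these assumptions the chosen default is NCCL whenever a device maps to NCCL, whether or not gloo also exists. Only the set of created backends differs, which is exactly the part the comment describes as "both ``gloo`` and ``nccl``".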

Versions

PyTorch 2.8.0

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim

Metadata

Assignees: No one assigned
Labels: oncall: distributed
Type: No type
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests