Closed
Labels
oncall: distributed, triaged
Description
🐛 Describe the bug
From the NCCL documentation:
ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t* newcomm, ncclConfig_t* config)
Ranks which pass the same color value will be part of the same group; color must be a non-negative value.
If it is passed as NCCL_SPLIT_NOCOLOR, it means that the rank will not be part of any group.
However, the current code can pass a negative color (other than NCCL_SPLIT_NOCOLOR) to the NCCL API.
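To make the contract concrete, here is a minimal Python sketch of the check the documentation implies. It also shows one plausible way a value like the one in the error below could arise, assuming the color is derived from a wider unsigned value and then narrowed to a C `int`. This report does not confirm that mechanism. The helper names (`is_valid_split_color`, `to_c_int`, `to_non_negative_color`) are hypothetical and not part of PyTorch or NCCL, and the masking approach is only one possible remedy.

```python
NCCL_SPLIT_NOCOLOR = -1  # value quoted in the error message in this report


def is_valid_split_color(color: int) -> bool:
    """True if `color` satisfies the documented ncclCommSplit contract."""
    return color >= 0 or color == NCCL_SPLIT_NOCOLOR


def to_c_int(value: int) -> int:
    """Emulate narrowing an integer to a 32-bit signed C int (two's complement)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def to_non_negative_color(value: int) -> int:
    """Hypothetical remedy: clear the sign bit so the result is in [0, 2**31 - 1]."""
    return value & 0x7FFFFFFF


# Illustrative input only: an unsigned 32-bit value with the top bit set.
raw = 2237119502
print(to_c_int(raw), is_valid_split_color(to_c_int(raw)))
print(to_non_negative_color(raw), is_valid_split_color(to_non_negative_color(raw)))
```

Any mapping used this way has to be deterministic across ranks, since ranks that should land in the same group must compute the same color.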
Repro:
import torch
import os
import torch.distributed as dist

def repro(rank, world_size):
    device = torch.device("cuda", rank)
    dist.init_process_group(
        "nccl",
        rank=rank,
        world_size=world_size,
        device_id=device,
    )
    device_mesh = dist.init_device_mesh(
        "cuda", (2, world_size // 2)
    )
    dist.destroy_process_group()
    print("clean exit")

if __name__ == "__main__":
    repro(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]))
TORCH_CPP_LOG_LEVEL=INFO TORCH_NCCL_USE_COMM_NONBLOCKING=1 torchrun --nproc-per-node 4 repro.py
Running this fails with:
[rank2]:   File "/data/users/kw2501/nb_mesh/repro.py", line 13, in repro
[rank2]:     device_mesh = dist.init_device_mesh(
[rank2]:                   ^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/device_mesh.py", line 958, in init_device_mesh
[rank2]:     device_mesh = DeviceMesh(
[rank2]:                   ^^^^^^^^^^^
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/device_mesh.py", line 453, in __init__
[rank2]:     self._init_process_groups()
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/device_mesh.py", line 556, in _init_process_groups
[rank2]:     dim_group = new_group(
[rank2]:                 ^^^^^^^^^^
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/c10d_logger.py", line 97, in wrapper
[rank2]:     func_return = func(*args, **kwargs)
[rank2]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py", line 4675, in new_group
[rank2]:     return _new_group_with_tag(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py", line 4758, in _new_group_with_tag
[rank2]:     pg, pg_store = _new_process_group_helper(
[rank2]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py", line 1960, in _new_process_group_helper
[rank2]:     eager_backend.eager_connect_single_device(device_id)
[rank2]: RuntimeError: Color must be a non-negative value or NCCL_SPLIT_NOCOLOR (-1), but got -2057847794
Versions
main as of 10/13/2024
cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o