KEMBAR78
GitHub Β· Where software is built
Skip to content

Dreambooth Flux training failed on saving a checkpointΒ #9500

@kopyl

Description

@kopyl

Describe the bug

I run the training but get this error

Reproduction

Run accelerate config

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: true
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
!git clone https://github.com/huggingface/diffusers
%cd diffusers

%cd diffusers

!pip install -e .
!pip install -r examples/dreambooth/requirements_flux.txt
!pip install prodigyopt

import huggingface_hub
huggingface_hub.notebook_login()

MODEL_NAME="black-forest-labs/FLUX.1-dev"
INSTANCE_DIR="/dreambooth-datasets/yaremovaa"
OUTPUT_DIR="/flux-dreambooth-outputs/dreamboot-yaremovaa"

!accelerate launch examples/dreambooth/train_dreambooth_flux.py \
  --pretrained_model_name_or_path={MODEL_NAME}  \
  --instance_data_dir={INSTANCE_DIR} \
  --output_dir={OUTPUT_DIR} \
  --mixed_precision="bf16" \
  --instance_prompt="a photo of sks girl" \
  --resolution=512 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=4 \
  --optimizer="prodigy" \
  --learning_rate=1. \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1000 \
  --seed="0"

Logs

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_cpu_threads_per_process` was set to `40` to improve out-of-box performance when training on CPUs
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
09/23/2024 14:04:51 - INFO - __main__ - Distributed environment: FSDP  Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
09/23/2024 14:04:51 - INFO - __main__ - Distributed environment: FSDP  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: bf16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Downloading shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 15947.92it/s]
Downloading shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 17476.27it/s]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00,  7.41it/s]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00,  3.43it/s]
Fetching 3 files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 9446.63it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Fetching 3 files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 77672.30it/s]
Using decoupled weight decay
Using decoupled weight decay
09/23/2024 14:05:08 - INFO - __main__ - ***** Running training *****
09/23/2024 14:05:08 - INFO - __main__ -   Num examples = 10
09/23/2024 14:05:08 - INFO - __main__ -   Num batches each epoch = 5
09/23/2024 14:05:08 - INFO - __main__ -   Num Epochs = 500
09/23/2024 14:05:08 - INFO - __main__ -   Instantaneous batch size per device = 1
09/23/2024 14:05:08 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 8
09/23/2024 14:05:08 - INFO - __main__ -   Gradient Accumulation steps = 4
09/23/2024 14:05:08 - INFO - __main__ -   Total optimization steps = 1000
Steps:   0%|                                           | 0/1000 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
Steps:   0%|                           | 0/1000 [00:25<?, ?it/s, loss=0.4, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|                         | 0/1000 [00:26<?, ?it/s, loss=0.416, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|                         | 0/1000 [00:27<?, ?it/s, loss=0.327, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|               | 1/1000 [00:30<8:20:36, 30.07s/it, loss=0.592, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|               | 2/1000 [00:32<3:50:25, 13.85s/it, loss=0.456, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|               | 2/1000 [00:33<3:50:25, 13.85s/it, loss=0.563, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|               | 2/1000 [00:34<3:50:25, 13.85s/it, loss=0.355, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|               | 2/1000 [00:36<3:50:25, 13.85s/it, loss=0.399, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|               | 3/1000 [00:38<2:51:47, 10.34s/it, loss=0.438, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|               | 4/1000 [00:40<1:58:38,  7.15s/it, loss=0.347, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|               | 4/1000 [00:42<1:58:38,  7.15s/it, loss=0.585, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|               | 4/1000 [00:43<1:58:38,  7.15s/it, loss=0.652, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|               | 4/1000 [00:44<1:58:38,  7.15s/it, loss=0.336, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|               | 5/1000 [00:46<1:51:42,  6.74s/it, loss=0.323, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   1%|               | 6/1000 [00:49<1:28:16,  5.33s/it, loss=0.622, lr=1]Passing `txt_ids` ...
Steps:  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–       | 499/1000 [35:19<37:43,  4.52s/it, loss=0.531, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ       | 500/1000 [35:21<32:10,  3.86s/it, loss=0.531, lr=1]09/23/2024 14:40:30 - INFO - accelerate.accelerator - Saving current state to /flux-dreambooth-outputs/dreamboot-yaremovaa/checkpoint-500
09/23/2024 14:40:30 - INFO - accelerate.accelerator - Saving FSDP model
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[rank0]:[E923 14:50:30.723437511 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=217211, OpType=_ALLGATHER_BASE, NumelIn=32062496, NumelOut=64124992, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
[rank0]:[E923 14:50:30.723605561 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 217211, last enqueued NCCL work: 217211, last completed NCCL work: 217210.
[rank1]:[E923 14:50:30.752543374 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=217211, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600050 milliseconds before timing out.
[rank1]:[E923 14:50:30.752732295 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 217211, last enqueued NCCL work: 217211, last completed NCCL work: 217210.
[rank1]:[E923 14:50:30.965671382 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 217211, last enqueued NCCL work: 217211, last completed NCCL work: 217210.
[rank1]:[E923 14:50:30.965698772 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E923 14:50:30.965704392 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E923 14:50:30.975447887 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=217211, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600050 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f05df318f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f0532b0e8f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f0532b15333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0532b1771c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f05e0957df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f05e4690609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f05e47ca353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=217211, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600050 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f05df318f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f0532b0e8f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f0532b15333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0532b1771c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f05e0957df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f05e4690609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f05e47ca353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f05df318f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f05327a0a84 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f05e0957df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f05e4690609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f05e47ca353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank0]:[E923 14:50:31.301026792 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 217211, last enqueued NCCL work: 217211, last completed NCCL work: 217210.
[rank0]:[E923 14:50:31.301055303 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E923 14:50:31.301060703 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E923 14:50:31.302001924 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=217211, OpType=_ALLGATHER_BASE, NumelIn=32062496, NumelOut=64124992, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f80188dbf86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f7f6c0d18f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f7f6c0d8333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7f6c0da71c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f8019f1adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f801dc53609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f801dd8d353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=217211, OpType=_ALLGATHER_BASE, NumelIn=32062496, NumelOut=64124992, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f80188dbf86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f7f6c0d18f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f7f6c0d8333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7f6c0da71c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f8019f1adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f801dc53609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f801dd8d353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f80188dbf86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f7f6bd63a84 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f8019f1adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f801dc53609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f801dd8d353 in /lib/x86_64-linux-gnu/libc.so.6)

W0923 14:51:08.002416 140631144875840 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 97042 closing signal SIGTERM
W0923 14:51:38.003378 140631144875840 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 97042 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0923 14:51:44.734884 140631144875840 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 97041) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-23_14:51:08
  host      : x2-h100.internal.cloudapp.net
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 97041)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 97041
======================================================

System Info

Ubuntu 20.04
x2 NVIDIA H100
CUDA 12.2
torch==2.4.1
torchvision==0.19.1
Diffusers commit: ba5af5a

Who can help?

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingstaleIssues that haven't received updates

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions