Add DDP for controlnetLLLite by sdbds · Pull Request #1363 · kohya-ss/sd-scripts · GitHub

@sdbds
Contributor

sdbds commented Jun 5, 2024

Prevent multi-card training from reporting errors.

@kohya-ss
Owner

kohya-ss commented Jun 6, 2024

Thank you for this. However, because this unwraps the model from the DDP wrapper, I wonder whether gradients would still be accumulated across the distributed GPUs. It also unwraps Accelerate's wrapper, so I am concerned that it may break other distribution strategies and the mixed-precision wrapper.

Without this update, what kind of error occurs?
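
The concern about gradient accumulation can be illustrated with a toy, pure-Python simulation. It is not PyTorch code and not how DDP is implemented; the functions `local_gradient` and `train`, the data shards, and the learning rate are all hypothetical. Each "rank" fits `y = w * x` on its own shard, and the only difference between the two runs is whether the per-rank gradients are averaged before the update, which is the role DDP's all-reduce plays.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def train(shards, steps, lr, sync):
    """Train one scalar weight per simulated rank; average gradients if sync."""
    weights = [0.0] * len(shards)
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        if sync:
            # Stand-in for all-reduce: every rank applies the mean gradient.
            mean = sum(grads) / len(grads)
            grads = [mean] * len(grads)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Hypothetical shards: rank 0 sees y = 2x, rank 1 sees y = 3x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(1.0, 3.0), (2.0, 6.0)]]
synced = train(shards, steps=50, lr=0.05, sync=True)
unsynced = train(shards, steps=50, lr=0.05, sync=False)
print("synced:", synced)
print("unsynced:", unsynced)
```

With averaging, both simulated ranks hold the same weight after every step; without it, each rank drifts toward the optimum of its own shard, which is the kind of silent divergence the comment is worried about.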

@sdbds
Contributor Author

sdbds commented Jun 6, 2024

> Thank you for this. However, because this unwraps the model from the DDP wrapper, I wonder whether gradients would still be accumulated across the distributed GPUs. It also unwraps Accelerate's wrapper, so I am concerned that it may break other distribution strategies and the mixed-precision wrapper.
>
> Without this update, what kind of error occurs?

I found this solution in train_controlnet.py, and it works.

Original error:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
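
The error message points at parameters that did not take part in producing the loss. One common way to locate them, sketched here on the assumption of a single-process run without DDP, is to list the trainable parameters whose `.grad` is still `None` after the first backward pass. The helper name `params_without_grad` is hypothetical; it relies only on the `requires_grad` and `grad` attributes that PyTorch parameters expose, so the demonstration uses stand-in objects rather than a real model.

```python
from types import SimpleNamespace


def params_without_grad(named_parameters):
    """Return names of trainable parameters that received no gradient."""
    return [
        name
        for name, param in named_parameters
        if param.requires_grad and param.grad is None
    ]


# Stand-ins for (name, parameter) pairs; the names are made up for the demo.
fake_params = [
    ("down.weight", SimpleNamespace(requires_grad=True, grad=0.1)),
    ("up.weight", SimpleNamespace(requires_grad=True, grad=None)),
    ("frozen.weight", SimpleNamespace(requires_grad=False, grad=None)),
]
unused = params_without_grad(fake_params)
print(unused)
```

With a real model this would be called as `params_without_grad(model.named_parameters())` right after the backward pass of the first training step, before any gradients are cleared.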

kohya-ss added a commit that referenced this pull request Jun 9, 2024
kohya-ss added a commit that referenced this pull request Jun 9, 2024
@kohya-ss
Owner

kohya-ss commented Jun 9, 2024

Thank you for the details! I'm not sure why train_controlnet.py works with this update, but the error message suggests that there are some issues in the code.

So I investigated and found some issues:

  1. One of the parameters is used multiple times in LLLite training. This is intended, but it seems to cause an error by default, so I added _set_static_graph() to avoid the error.
  2. The forward and backward paths in LLLiteLinear/Conv2D are different. This may be a potential bug even in single-GPU training.

The dev branch is updated, so please test with the latest version of the dev branch. Thank you!
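
The first fix can be sketched as follows. `_set_static_graph()` is the method named in the comment above; the helper `mark_static_graph` and the two stand-in classes are hypothetical and exist only so the snippet runs without a distributed setup. This is an illustration of the idea, not the code from the referenced commit.

```python
def mark_static_graph(model):
    """Call _set_static_graph() when the model exposes it (e.g. a DDP wrapper).

    Returns True if the flag was set, False for models without the method.
    """
    setter = getattr(model, "_set_static_graph", None)
    if callable(setter):
        setter()
        return True
    return False


class FakeDDP:
    """Stand-in for a DDP-wrapped model; only records that the flag was set."""

    def __init__(self):
        self.static_graph = False

    def _set_static_graph(self):
        self.static_graph = True


class PlainModule:
    """Stand-in for an unwrapped model on a single GPU."""


wrapped = FakeDDP()
applied_wrapped = mark_static_graph(wrapped)
applied_plain = mark_static_graph(PlainModule())
print(applied_wrapped, applied_plain)
```

The call belongs after the model has been wrapped and before the first training iteration. Recent PyTorch releases also expose the same setting as the documented `static_graph=True` argument of `torch.nn.parallel.DistributedDataParallel`.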

sdbds closed this Jun 23, 2024
sdbds deleted the DDP-for-LLLite branch June 23, 2024 19:29
@zixuzhuang

zixuzhuang commented Jul 9, 2024

Hi, I tested the newest dev version, but LLLite still cannot train with both DDP and gradient checkpointing enabled.
I hit this error at accelerator.backward(loss):
RuntimeError: expect_autograd_hooks_ INTERNAL ASSERT FAILED at "../torch/csrc/distributed/c10d/reducer.cpp":1591, please report a bug to PyTorch.

KohakuBlueleaf added a commit to KohakuBlueleaf/sd-scripts that referenced this pull request Jul 31, 2024
* Final implementation

* Skip the final 1 step

* fix alpha mask without disk cache closes kohya-ss#1351, ref kohya-ss#1339

* update for corner cases

* Bump crate-ci/typos from 1.19.0 to 1.21.0, fix typos, and updated _typos.toml (Close kohya-ss#1307)

* set static graph flag when DDP ref kohya-ss#1363

* make forward/backward pathes same ref kohya-ss#1363

* update README

* add grad_hook after restore state closes kohya-ss#1344

* fix to work cache_latents/text_encoder_outputs

* show file name if error in load_image ref kohya-ss#1385

---------

Co-authored-by: Kohya S <ykumeykume@gmail.com>
Co-authored-by: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Co-authored-by: Yuta Hayashibe <yuta@hayashibe.jp>
feffy380 added a commit to feffy380/sd-scripts that referenced this pull request Oct 13, 2024
Squashed commit of the following:

commit 56bb81c
Author: Kohya S <ykumeykume@gmail.com>
Date:   Wed Jun 12 21:39:35 2024 +0900

    add grad_hook after restore state closes kohya-ss#1344

commit 22413a5
Merge: 3259928 18d7597
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Tue Jun 11 19:52:03 2024 +0900

    Merge pull request kohya-ss#1359 from kohya-ss/train_resume_step

    Train resume step

commit 18d7597
Author: Kohya S <ykumeykume@gmail.com>
Date:   Tue Jun 11 19:51:30 2024 +0900

    update README

commit 4a44188
Merge: 4dbcef4 3259928
Author: Kohya S <ykumeykume@gmail.com>
Date:   Tue Jun 11 19:27:37 2024 +0900

    Merge branch 'dev' into train_resume_step

commit 3259928
Merge: 1a104dc 5bfe5e4
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun Jun 9 19:26:42 2024 +0900

    Merge branch 'dev' of https://github.com/kohya-ss/sd-scripts into dev

commit 1a104dc
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun Jun 9 19:26:36 2024 +0900

    make forward/backward pathes same ref kohya-ss#1363

commit 58fb648
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun Jun 9 19:26:09 2024 +0900

    set static graph flag when DDP ref kohya-ss#1363

commit 5bfe5e4
Merge: e5bab69 4ecbac1
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Thu Jun 6 21:23:24 2024 +0900

    Merge pull request kohya-ss#1361 from shirayu/update/github_actions/crate-ci/typos-1.21.0

    Bump crate-ci/typos from 1.19.0 to 1.21.0, fix typos, and updated _typos.toml (Close kohya-ss#1307)

commit 4ecbac1
Author: Yuta Hayashibe <yuta@hayashibe.jp>
Date:   Wed Jun 5 16:31:44 2024 +0900

    Bump crate-ci/typos from 1.19.0 to 1.21.0, fix typos, and updated _typos.toml (Close kohya-ss#1307)

commit 4dbcef4
Author: Kohya S <ykumeykume@gmail.com>
Date:   Tue Jun 4 21:26:55 2024 +0900

    update for corner cases

commit 321e24d
Merge: e5bab69 3eb27ce
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Tue Jun 4 19:30:11 2024 +0900

    Merge pull request kohya-ss#1353 from KohakuBlueleaf/train_resume_step

    Resume correct step for "resume from state" feature.

commit e5bab69
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun Jun 2 21:11:40 2024 +0900

    fix alpha mask without disk cache closes kohya-ss#1351, ref kohya-ss#1339

commit 3eb27ce
Author: Kohaku-Blueleaf <59680068+KohakuBlueleaf@users.noreply.github.com>
Date:   Fri May 31 12:24:15 2024 +0800

    Skip the final 1 step

commit b2363f1
Author: Kohaku-Blueleaf <59680068+KohakuBlueleaf@users.noreply.github.com>
Date:   Fri May 31 12:20:20 2024 +0800

    Final implementation

commit 0d96e10
Merge: ffce3b5 fc85496
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Mon May 27 21:41:16 2024 +0900

    Merge pull request kohya-ss#1339 from kohya-ss/alpha-masked-loss

    Alpha masked loss

commit fc85496
Author: Kohya S <ykumeykume@gmail.com>
Date:   Mon May 27 21:25:06 2024 +0900

    update docs for masked loss

commit 2870be9
Merge: 71ad3c0 ffce3b5
Author: Kohya S <ykumeykume@gmail.com>
Date:   Mon May 27 21:08:43 2024 +0900

    Merge branch 'dev' into alpha-masked-loss

commit 71ad3c0
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Mon May 27 21:07:57 2024 +0900

    Update masked_loss_README-ja.md

    add sample images

commit ffce3b5
Merge: fb12b6d d50c1b3
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Mon May 27 21:00:46 2024 +0900

    Merge pull request kohya-ss#1349 from rockerBOO/patch-4

    Update issue link

commit a4c3155
Author: Kohya S <ykumeykume@gmail.com>
Date:   Mon May 27 20:59:40 2024 +0900

    add doc for mask loss

commit 58cadf4
Merge: e8cfd4b fb12b6d
Author: Kohya S <ykumeykume@gmail.com>
Date:   Mon May 27 20:02:32 2024 +0900

    Merge branch 'dev' into alpha-masked-loss

commit d50c1b3
Author: Dave Lage <rockerboo@gmail.com>
Date:   Mon May 27 01:11:01 2024 -0400

    Update issue link

commit e8cfd4b
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 26 22:01:37 2024 +0900

    fix to work cond mask and alpha mask

commit fb12b6d
Merge: febc5c5 00513b9
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Sun May 26 19:45:03 2024 +0900

    Merge pull request kohya-ss#1347 from rockerBOO/lora-plus-log-info

    Add LoRA+ LR Ratio info message to logger

commit 00513b9
Author: rockerBOO <rockerboo@gmail.com>
Date:   Thu May 23 22:27:12 2024 -0400

    Add LoRA+ LR Ratio info message to logger

commit da6fea3
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 19 21:26:18 2024 +0900

    simplify and update alpha mask to work with various cases

commit f2dd43e
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 19 19:23:59 2024 +0900

    revert kwargs to explicit declaration

commit db67529
Author: u-haru <40634644+u-haru@users.noreply.github.com>
Date:   Sun May 19 19:07:25 2024 +0900

    Add option to use the image's alpha channel as a loss mask (kohya-ss#1223)

    * Add alpha_mask parameter and apply masked loss

    * Fix type hint in trim_and_resize_if_required function

    * Refactor code to use keyword arguments in train_util.py

    * Fix alpha mask flipping logic

    * Fix alpha mask initialization

    * Fix alpha_mask transformation

    * Cache alpha_mask

    * Update alpha_masks to be on CPU

    * Set flipped_alpha_masks to Null if option disabled

    * Check if alpha_mask is None

    * Set alpha_mask to None if option disabled

    * Add description of alpha_mask option to docs

commit febc5c5
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 19 19:03:43 2024 +0900

    update README

commit 4c79812
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 19 19:00:32 2024 +0900

    update README

commit 38e4c60
Merge: e4d9e3c fc37437
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Sun May 19 18:55:50 2024 +0900

    Merge pull request kohya-ss#1277 from Cauldrath/negative_learning

    Allow negative learning rate

commit e4d9e3c
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 19 17:46:07 2024 +0900

    remove dependency for omegaconf #ref 1284

commit de0e0b9
Merge: c68baae 5cb145d
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Sun May 19 17:39:15 2024 +0900

    Merge pull request kohya-ss#1284 from sdbds/fix_traincontrolnet

    Fix train controlnet

commit c68baae
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 19 17:21:04 2024 +0900

    add `--log_config` option to enable/disable output training config

commit 47187f7
Merge: e3ddd1f b886d0a
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Sun May 19 16:31:33 2024 +0900

    Merge pull request kohya-ss#1285 from ccharest93/main

    Hyperparameter tracking

commit e3ddd1f
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 19 16:26:10 2024 +0900

    update README and format code

commit 0640f01
Merge: 2f19175 793aeb9
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Sun May 19 16:23:01 2024 +0900

    Merge pull request kohya-ss#1322 from aria1th/patch-1

    Accelerate: fix get_trainable_params in controlnet-llite training

commit 2f19175
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 19 15:38:37 2024 +0900

    update README

commit 146edce
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sat May 18 11:05:04 2024 +0900

    support Diffusers' based SDXL LoRA key for inference

commit 153764a
Author: Kohya S <ykumeykume@gmail.com>
Date:   Wed May 15 20:21:49 2024 +0900

    add prompt option '--f' for filename

commit 589c2aa
Author: Kohya S <ykumeykume@gmail.com>
Date:   Mon May 13 21:20:37 2024 +0900

    update README

commit 16677da
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 22:15:07 2024 +0900

    fix create_network_from_weights doesn't work

commit a384bf2
Merge: 1c296f7 8db0cad
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Sun May 12 21:36:56 2024 +0900

    Merge pull request kohya-ss#1313 from rockerBOO/patch-3

    Add caption_separator to output for subset

commit 1c296f7
Merge: e96a521 dbb7bb2
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Sun May 12 21:33:12 2024 +0900

    Merge pull request kohya-ss#1312 from rockerBOO/patch-2

    Fix caption_separator missing in subset schema

commit e96a521
Merge: 39b82f2 fdbb03c
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Sun May 12 21:14:50 2024 +0900

    Merge pull request kohya-ss#1291 from frodo821/patch-1

    removed unnecessary `torch` import on line 115

commit 39b82f2
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 20:58:45 2024 +0900

    update readme

commit 3701507
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 20:56:56 2024 +0900

    raise original error if error is occured in checking latents

commit 7802093
Merge: 9ddb4d7 040e26f
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Sun May 12 20:46:25 2024 +0900

    Merge pull request kohya-ss#1278 from Cauldrath/catch_latent_error_file

    Display name of error latent file

commit 9ddb4d7
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 17:55:08 2024 +0900

    update readme and help message etc.

commit 8d1b1ac
Merge: 02298e3 64916a3
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Sun May 12 17:43:44 2024 +0900

    Merge pull request kohya-ss#1266 from Zovjsra/feature/disable-mmap

    Add "--disable_mmap_load_safetensors" parameter

commit 02298e3
Merge: 1ffc0b3 4419041
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Sun May 12 17:04:58 2024 +0900

    Merge pull request kohya-ss#1331 from kohya-ss/lora-plus

    Lora plus

commit 4419041
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 17:01:20 2024 +0900

    update docs etc.

commit 3c8193f
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 17:00:51 2024 +0900

    revert lora+ for lora_fa

commit c6a4370
Merge: e01e148 1ffc0b3
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 16:18:57 2024 +0900

    Merge branch 'dev' into lora-plus

commit 1ffc0b3
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 16:18:43 2024 +0900

    fix typo

commit e01e148
Merge: e9f3a62 7983d3d
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 16:17:52 2024 +0900

    Merge branch 'dev' into lora-plus

commit e9f3a62
Merge: 3fd8cdc c1ba0b4
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 16:17:27 2024 +0900

    Merge branch 'dev' into lora-plus

commit 7983d3d
Merge: c1ba0b4 bee8cee
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Sun May 12 15:09:39 2024 +0900

    Merge pull request kohya-ss#1319 from kohya-ss/fused-backward-pass

    Fused backward pass

commit bee8cee
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 15:08:52 2024 +0900

    update README for fused optimizer

commit f3d2cf2
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 15:03:02 2024 +0900

    update README for fused optimizer

commit 6dbc23c
Merge: 607e041 c1ba0b4
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 14:21:56 2024 +0900

    Merge branch 'dev' into fused-backward-pass

commit c1ba0b4
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 14:21:10 2024 +0900

    update readme

commit 607e041
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sun May 12 14:16:41 2024 +0900

    chore: Refactor optimizer group

commit 793aeb9
Author: AngelBottomless <aria1th@naver.com>
Date:   Tue May 7 18:21:31 2024 +0900

    fix get_trainable_params in controlnet-llite training

commit b56d5f7
Author: Kohya S <ykumeykume@gmail.com>
Date:   Mon May 6 21:35:39 2024 +0900

    add experimental option to fuse params to optimizer groups

commit 017b82e
Author: Kohya S <ykumeykume@gmail.com>
Date:   Mon May 6 15:05:42 2024 +0900

    update help message for fused_backward_pass

commit 2a359e0
Merge: 0540c33 4f203ce
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Mon May 6 15:01:56 2024 +0900

    Merge pull request kohya-ss#1259 from 2kpr/fused_backward_pass

    Adafactor fused backward pass and optimizer step, lowers SDXL (@ 1024 resolution) VRAM usage to BF16(10GB)/FP32(16.4GB)

commit 3fd8cdc
Author: Kohya S <ykumeykume@gmail.com>
Date:   Mon May 6 14:03:19 2024 +0900

    fix dylora loraplus

commit 7fe8150
Author: Kohya S <ykumeykume@gmail.com>
Date:   Mon May 6 11:09:32 2024 +0900

    update loraplus on dylora/lofa_fa

commit 52e64c6
Author: Kohya S <ykumeykume@gmail.com>
Date:   Sat May 4 18:43:52 2024 +0900

    add debug log

commit 58c2d85
Author: Kohya S <ykumeykume@gmail.com>
Date:   Fri May 3 22:18:20 2024 +0900

    support block dim/lr for sdxl

commit 8db0cad
Author: Dave Lage <rockerboo@gmail.com>
Date:   Thu May 2 18:08:28 2024 -0400

    Add caption_separator to output for subset

commit dbb7bb2
Author: Dave Lage <rockerboo@gmail.com>
Date:   Thu May 2 17:39:35 2024 -0400

    Fix caption_separator missing in subset schema

commit 969f82a
Author: Kohya S <ykumeykume@gmail.com>
Date:   Mon Apr 29 20:04:25 2024 +0900

    move loraplus args from args to network_args, simplify log lr desc

commit 834445a
Merge: 0540c33 68467bd
Author: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date:   Mon Apr 29 18:05:12 2024 +0900

    Merge pull request kohya-ss#1233 from rockerBOO/lora-plus

    Add LoRA+ support

commit fdbb03c
Author: frodo821 <sakaic2003@gmail.com>
Date:   Tue Apr 23 14:29:05 2024 +0900

    removed unnecessary `torch` import on line 115

    as per kohya-ss#1290

commit 040e26f
Author: Cauldrath <bnjmnhanes@gmail.com>
Date:   Sun Apr 21 13:46:31 2024 -0400

    Regenerate failed file
    If a latent file fails to load, print out the path and the error, then return false to regenerate it

commit 5cb145d
Author: 青龍聖者@bdsqlsz <qinglongshengzhe@gmail.com>
Date:   Sat Apr 20 21:56:24 2024 +0800

    Update train_util.py

commit b886d0a
Author: Maatra <ccharest93@hotmail.com>
Date:   Sat Apr 20 14:36:47 2024 +0100

    Cleaned typing to be in line with accelerate hyperparameters type resctrictions

commit 4477116
Author: 青龍聖者@bdsqlsz <qinglongshengzhe@gmail.com>
Date:   Sat Apr 20 21:26:09 2024 +0800

    fix train controlnet

commit 2c9db5d
Author: Maatra <ccharest93@hotmail.com>
Date:   Sat Apr 20 14:11:43 2024 +0100

    passing filtered hyperparameters to accelerate

commit fc37437
Author: Cauldrath <bnjmnhanes@gmail.com>
Date:   Thu Apr 18 23:29:01 2024 -0400

    Allow negative learning rate
    This can be used to train away from a group of images you don't want
    As this moves the model away from a point instead of towards it, the change in the model is unbounded
    So, don't set it too low. -4e-7 seemed to work well.

commit feefcf2
Author: Cauldrath <bnjmnhanes@gmail.com>
Date:   Thu Apr 18 23:15:36 2024 -0400

    Display name of error latent file
    When trying to load stored latents, if an error occurs, this change will tell you what file failed to load
    Currently it will just tell you that something failed without telling you which file

commit 64916a3
Author: Zovjsra <4703michael@gmail.com>
Date:   Tue Apr 16 16:40:08 2024 +0800

    add disable_mmap to args

commit 4f203ce
Author: 2kpr <96332338+2kpr@users.noreply.github.com>
Date:   Sun Apr 14 09:56:58 2024 -0500

    Fused backward pass

commit 68467bd
Author: rockerBOO <rockerboo@gmail.com>
Date:   Thu Apr 11 17:33:19 2024 -0400

    Fix unset or invalid LR from making a param_group

commit 75833e8
Author: rockerBOO <rockerboo@gmail.com>
Date:   Mon Apr 8 19:23:02 2024 -0400

    Fix default LR, Add overall LoRA+ ratio, Add log

    `--loraplus_ratio` added for both TE and UNet
    Add log for lora+

commit 1933ab4
Author: rockerBOO <rockerboo@gmail.com>
Date:   Wed Apr 3 12:46:34 2024 -0400

    Fix default_lr being applied

commit c769160
Author: rockerBOO <rockerboo@gmail.com>
Date:   Mon Apr 1 15:43:04 2024 -0400

    Add LoRA-FA for LoRA+

commit f99fe28
Author: rockerBOO <rockerboo@gmail.com>
Date:   Mon Apr 1 15:38:26 2024 -0400

    Add LoRA+ support
nana0304 pushed a commit to nana0304/sd-scripts that referenced this pull request Jun 4, 2025
nana0304 pushed a commit to nana0304/sd-scripts that referenced this pull request Jun 4, 2025