Fix gradients synchronization for multi-GPUs training by Isotr0py · Pull Request #989 · kohya-ss/sd-scripts · GitHub

Conversation

@Isotr0py
Contributor

@Isotr0py Isotr0py commented Dec 6, 2023

  • Related issue: About GPU utilization #965

  • Fix [Bug] Gradients not synchronized  #924

  • Remove train_util.transform_if_model_is_DDP, which caused the gradient sync bug on DDP. The model is now unwrapped manually with accelerator.unwrap_model().

  • Clean up the accelerator.prepare() code in train_network.py

  • Manually sync the LoRA network gradients in train_network.py.

For train_network.py, since we apply the LoRA by replacing the forward() method instead of replacing the module itself, the DDP-wrapped LoRA network won't synchronize gradients automatically when accelerator.backward(loss) is called. Hence we need to use accelerator.reduce() to sync the gradients manually, as sketched below.
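A minimal sketch of that idea, assuming a typical accelerate training loop where network is the LoRA network and accelerator, optimizer, and loss already exist (names are illustrative, not the exact PR code):

    # After backward(), all-reduce each LoRA parameter's gradient across processes
    # so every rank applies the same update. DDP would normally do this itself,
    # but it cannot see parameters that are only used via a patched forward().
    accelerator.backward(loss)

    if accelerator.num_processes > 1:
        for param in network.parameters():
            if param.grad is not None:
                param.grad = accelerator.reduce(param.grad, reduction="mean")

    optimizer.step()
    optimizer.zero_grad(set_to_none=True)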

@Isotr0py Isotr0py marked this pull request as draft December 6, 2023 08:47
@Isotr0py Isotr0py marked this pull request as ready for review December 6, 2023 08:53
@Isotr0py
Contributor Author

Isotr0py commented Dec 6, 2023

Gradients and parameters should now be synchronized after the all-reduce.
[Screenshot: verification output, 2023-12-06]

@Isotr0py Isotr0py changed the title Fix gradients synchronization Fix gradients synchronization for multi-GPUs training Dec 6, 2023
@kohya-ss
Owner

kohya-ss commented Dec 7, 2023

Thank you for this! I haven't tested multi-GPU training, but this PR seems to be very important.

@deepdelirious

This seems to have broken sampling:

Traceback (most recent call last):
  File "/workspace/train.py", line 44, in <module>
    train(args)
  File "/workspace/sdxl_train.py", line 610, in train
    sdxl_train_util.sample_images(
  File "/workspace/library/sdxl_train_util.py", line 367, in sample_images
    return train_util.sample_images_common(SdxlStableDiffusionLongPromptWeightingPipeline, *args, **kwargs)
  File "/workspace/library/train_util.py", line 4788, in sample_images_common
    latents = pipeline(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/library/sdxl_lpw_stable_diffusion.py", line 926, in __call__
    dtype = self.unet.dtype
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'dtype'

@kohya-ss
Owner

@deepdelirious I've updated the dev branch. I didn't test with multiple GPUs, but I think it will fix the sampling.
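The usual remedy for that traceback is to unwrap the DDP container before the sampling code reads model attributes. A hedged sketch of that idea, using illustrative names and a stand-in module (not necessarily the exact dev-branch change):

    import torch.nn as nn
    from accelerate import Accelerator

    accelerator = Accelerator()

    # Stand-in for the real SDXL UNet (illustrative only).
    unet = nn.Sequential(nn.Linear(8, 8))
    unet = accelerator.prepare(unet)  # under multi-GPU this returns a DDP wrapper

    # A DDP wrapper forwards calls to forward(), but not arbitrary attributes,
    # so self.unet.dtype in the pipeline raises AttributeError. Unwrap first:
    unet_for_sampling = accelerator.unwrap_model(unet)
    dtype = next(unet_for_sampling.parameters()).dtype  # safe way to read the dtype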

@FurkanGozukara

I don't see these in the code. Where are they?

ddp_bucket_view
ddp_gradient_as_bucket_view
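For context, and not as a claim about where this PR defines them: in an accelerate-based script, options like gradient_as_bucket_view are usually forwarded to torch DDP through a kwargs handler rather than appearing as standalone flags. A minimal sketch:

    from accelerate import Accelerator
    from accelerate.utils import DistributedDataParallelKwargs

    # Sketch only: forward gradient_as_bucket_view to
    # torch.nn.parallel.DistributedDataParallel via accelerate's kwargs handler.
    ddp_kwargs = DistributedDataParallelKwargs(gradient_as_bucket_view=True)
    accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])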

nana0304 pushed a commit to nana0304/sd-scripts that referenced this pull request Jun 4, 2025
* delete DDP wrapper

* fix train_db vae and train_network

* fix train_db vae and train_network unwrap

* network grad sync

---------

Co-authored-by: Kohya S <52813779+kohya-ss@users.noreply.github.com>
