Enable distributed sample image generation on multi-GPU environment #1061
Conversation
Modifying to attempt to enable multi-GPU inference
additional VRAM checking; refactor check_vram_usage to return a string for use with accelerator.print (a sketch follows below)
remove sample image debug outputs
simplify per process prompt
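The check_vram_usage refactor mentioned above is not shown in this excerpt; a minimal sketch of a helper that returns a VRAM summary string suitable for accelerator.print might look like the following (hypothetical implementation, not the actual train_util.py code):

```python
import torch


def check_vram_usage(device_index: int = 0) -> str:
    """Return current VRAM usage as a string instead of printing it,
    so the caller can route it through accelerator.print().
    Hypothetical sketch; the real helper lives in train_util.py."""
    if not torch.cuda.is_available():
        return "CUDA is not available"
    allocated_gib = torch.cuda.memory_allocated(device_index) / 1024**3
    reserved_gib = torch.cuda.memory_reserved(device_index) / 1024**3
    return (
        f"VRAM (device {device_index}): "
        f"allocated {allocated_gib:.2f} GiB, reserved {reserved_gib:.2f} GiB"
    )
```

Returning a string rather than printing directly lets the caller decide which process reports, e.g. `accelerator.print(check_vram_usage())`.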
I am still skeptical about distributed sample generation. For large-scale training we would probably have separate resources to evaluate the saved models, and on Colab and Kaggle we would probably rather spend the extra steps on training than on sample output... That said, the code is much simpler now. May I merge this PR? There are still a few points of concern, so I would rewrite the code after merging, even if it is redundant, for the sake of clarity. I would appreciate your understanding and would ask you to test the changes again.
Sure! Of course you are free to modify the code as you like! It's your code that I modified in the first place.
I merged this to the new branch. Thank you again for this PR!
Hey, I did a test run in a Kaggle environment on textual inversion, and on the first sample run I encountered a VRAM OOM when it got to the latents-to-image step. As mentioned in #1019, the workaround I came up with was to insert a call to torch.cuda.empty_cache() after the latents have been generated and before the latents are converted into images, like below:
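The snippet referenced here is not included in this excerpt; a minimal sketch of the described workaround, with a hypothetical helper name and a diffusers-style VAE call assumed, is:

```python
import torch


def decode_latents_with_cache_clear(vae, latents, scaling_factor=0.18215):
    """Hypothetical helper sketching the workaround described above.

    The CUDA cache is emptied after the latents have been generated and
    before they are decoded into images, so the VRAM-heavy VAE decode
    is less likely to OOM on a small GPU."""
    # Free cached allocations left over from the denoising loop.
    torch.cuda.empty_cache()
    with torch.no_grad():
        # diffusers-style AutoencoderKL: decode(...) returns an object with .sample
        images = vae.decode(latents / scaling_factor).sample
    return images
```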
Training LoRAs on Kaggle works fine, though.
Thank you for testing! I've added torch.cuda.empty_cache() at that position.
Yesterday I helped one of my Patreon supporters who has dual RTX 4090s on Linux. With 1 GPU the training speed was 1.2 it/s; with 2 GPUs it dropped to 2 s/it, so it literally became slower than a single card cumulatively.
Revert bitsandbytes-windows update
…ohya-ss#1061)
* Update train_util.py: modifying to attempt to enable multi-GPU inference
* Update train_util.py: additional VRAM checking; refactor check_vram_usage to return a string for use with accelerator.print
* Update train_util.py: remove sample image debug outputs
* Cleanup of debugging outputs
* Adopt more elegant coding
* Update train_util.py: fix leftover debugging code; attempt to refactor inference into a separate function
* Refactor generation of the distributed prompt list into generate_per_device_prompt_list()
* Clean up missing variables
* Fix syntax error
* True random sample image generation: reinitialize the random seed to true random if a seed was set
* Simplify per-process prompt
* (plus many intermediate "Update train_util.py" / "Update train_network.py" commits)
---------
Co-authored-by: Aarni Koskela <akx@iki.fi>
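As a rough illustration of the "true random sample image generation" commits listed above, a hedged sketch (hypothetical helper name, not the actual train_util.py code) might reseed like this:

```python
import random

import torch


def reseed_for_sampling(seed_was_set: bool) -> None:
    """Hypothetical sketch: if training ran with a fixed seed, draw a
    fresh non-deterministic seed before sample image generation so the
    samples are not identical on every run."""
    if seed_was_set:
        # torch.seed() seeds torch's RNGs with a non-deterministic value
        # and returns that value.
        new_seed = torch.seed()
        random.seed(new_seed)  # keep Python's RNG in step with torch
```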
Modified the sample_images_common function to make use of the Accelerator PartialState to feed the list of sample image prompts to all available GPUs.
Tested working fine in a single-GPU Google Colab environment and a dual-GPU Kaggle environment.
A possible side effect of using multiple GPUs to generate sample images is that the file creation times may not match the order of the prompts in the original prompt file. I attempted some mitigation by splitting the prompts passed to each GPU process in the order that the GPU processes are called. However, if the sample image prompts use different samplers and/or numbers of steps, this would likely break the workaround, as generation times would drift out of sync.
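A minimal sketch of this prompt distribution, assuming accelerate's PartialState.split_between_processes and a hypothetical generate_one callback, could look like:

```python
from accelerate import PartialState

distributed_state = PartialState()


def sample_images_distributed(prompts, generate_one):
    """Hypothetical sketch: split_between_processes hands each process a
    contiguous slice of the prompt list, in process-index order, and each
    process generates only its own share."""
    with distributed_state.split_between_processes(prompts) as local_prompts:
        for prompt in local_prompts:
            generate_one(prompt, device=distributed_state.device)
```

Because each rank receives a contiguous chunk in rank order, file creation times stay roughly aligned with prompt order as long as all prompts take similar time to generate.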
It might be possible to artificially force synchronization by making each sample image process wait for all other processes to complete the image generation step before continuing to the next sample image, using accelerator.wait_for_everyone(), but I imagine efficient use of GPU time is more important than sample images perfectly sorted by file creation time.
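If one did want to force that ordering anyway, a hedged sketch (assuming an accelerate Accelerator object and the same hypothetical generate_one callback) could insert a barrier after each image:

```python
from accelerate import Accelerator

accelerator = Accelerator()


def sample_images_in_lockstep(local_prompts, generate_one):
    """Hypothetical sketch: barrier after every image so all processes
    stay in lockstep and file creation times follow prompt order, at the
    cost of GPUs idling while they wait for each other."""
    for prompt in local_prompts:
        generate_one(prompt)
        # Every process blocks here until all processes have finished
        # the current image before any of them starts the next one.
        accelerator.wait_for_everyone()
```

This only works cleanly if every process receives the same number of prompts; otherwise the barrier counts diverge and the processes can hang.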