Mastering Image Generation
with Stable Diffusion
Raphaël Semeteys
bbTeX
23/01/2025
Use Case
Locally generate accurate images of Yoga Poses
• Images of yoga poses must be precise
• Photography is not always the best option
• Images or photos from the internet cannot be reused
How could Generative AI help?
Stable Diffusion
From German Labs to London-based Startup
• Collaboration of several companies and German Labs
• Latent Diffusion Model with embedding space in 2021
• CLIP-guided diffusion
• LAION dataset
• Runway and EleutherAI participation
• Stability AI
• Compute donation to the project
• Hired most of initial researchers
• Now official maintainer of Stable Diffusion models
“Open” Licenses
• Responsible AI: OpenRAIL
• Version 3.5: enterprises with $1M+ annual revenue must pay for a commercial license
Stable Diffusion
Very dynamic contributing Communities
Models
• Fine-tuning: Custom Models for
specific styles or themes
• Refiners, Upscalers, ControlNets
• Model extensions (LoRA)
Tools
• User-Friendly Interfaces:
Automatic1111 Web UI, ComfyUI
• Fine-tuning tools: Dreambooth,
Kohya SS
Sharing Communities
• Portals to share models, prompts, images and tutorials: Hugging Face, Civit.ai…
• Stable Horde: crowdsourced distributed cluster of generation workers
ComfyUI: GUI for local Stable Diffusion workflows
• Intuitive, modular and customizable
• Flexible node-based workflows
• Text-to-Image Generation
• Image-to-Image Processing
• Custom Node Management
• Community-Driven
• GPL 3 License
• Contributions, plugins, doc
Let’s start with a simple demo
Generate an image of a girl
doing a yoga pose
Demo – Text-to-Image (T2I)
What does Stable Diffusion do?
• Starts with Random Noise: Begins with a noisy,
unrecognizable image
• Refines Step-by-Step: Gradually removes noise, adding
details
• Learns from Real Images: Uses patterns from trained
images
• Text-Guided Creation: Follows prompts like “girl doing yoga in a park”
• Denoising Process: Clarifies image layer by layer
• Final Image Output: Produces a clear, detailed image
matching the prompt
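For illustration, the same process expressed as a minimal Python sketch with the Hugging Face diffusers library (the demos use ComfyUI; the model id is a placeholder for any SD 1.5 checkpoint and a CUDA GPU is assumed):

import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion 1.5 checkpoint (placeholder model id)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The pipeline starts from random latent noise and denoises it step by step,
# guided by the text prompt.
image = pipe("girl doing yoga in a park", num_inference_steps=25).images[0]
image.save("yoga.png")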
Models
• Most used Stability.ai Models
• SD 1.5
• SDXL
• Fine-tuned Models
• Specialized: style, subject
• Shared by communities (like civitai.com)
Prompts
• CLIP Model (Contrastive Language–Image Pretraining)
• Connect descriptive text and images
• Help generate images matching specific prompts
• Can handle a wide range of prompts
• Developed by OpenAI in 2021
• Usable under the MIT license
• Trained on 400M image-text pairs from the Internet
• Positive & Negative
• Prompts can be full sentences or short, comma-separated keywords
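A hedged sketch of positive and negative prompts, reusing the pipe object from the earlier diffusers example (prompt wording is illustrative):

# The negative prompt lists concepts the CLIP-guided denoising should steer away from;
# guidance_scale (CFG) controls how strongly the prompts are followed.
image = pipe(
    prompt="girl doing yoga in a park, photorealistic, natural light",
    negative_prompt="blurry, deformed hands, extra limbs, low quality",
    guidance_scale=7.5,
).images[0]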
Embeddings (Textual Inversions)
• Vector representations of text
• “Instructions” for image generation
• Style, theme, texture, pose, character features, etc.
• Small files containing additional concepts
• To be injected in prompts
• Community provides many presets
• Must match the Stable Diffusion version of the base model
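A minimal sketch of loading a textual-inversion embedding with diffusers, reusing the earlier pipe (file name and trigger token are hypothetical; the embedding must be trained for the same SD version as the base model):

# Register the embedding under a new token, then reference that token in the prompt
pipe.load_textual_inversion("embeddings/ghibli_style.safetensors", token="<ghibli-style>")
image = pipe("girl doing yoga in a park, <ghibli-style>").images[0]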
Demo – T2I + Embedding (Textual Inversion)
Embeddings (Textual Inversions)
Example styles: No Embedding (baseline), Ghibli, Fantasy, Comic, 3D Render, Analog Film, Cinematic, Cyberpunk, Digital Art, Vector Art
Latent Space
• Latent Space
• Abstract, compressed representation of the image
• Handles encoded features such as shapes, colors,
textures and general structure
• Manipulation of embedding vectors
• Iterative and refining generation
• Random noise is introduced into the latent space
• At each step the model adjusts the features to match
the prompt
• VAE (Variational Autoencoder)
• Converts image pixels ↔ latent space (encode / decode)
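A sketch of the VAE round trip with diffusers, reusing the earlier pipe (file name is a placeholder); it shows the 8x spatial compression of a 512x512 image into 4x64x64 latents:

import torch
from PIL import Image
from diffusers.image_processor import VaeImageProcessor

input_image = Image.open("yoga_reference.png").convert("RGB").resize((512, 512))

processor = VaeImageProcessor()
pixels = processor.preprocess(input_image).to("cuda", dtype=torch.float16)

# Encode: pixel space -> latent space
latents = pipe.vae.encode(pixels).latent_dist.sample()
latents = latents * pipe.vae.config.scaling_factor  # scaling expected by the UNet

# Decode: latent space -> pixel space
decoded = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
image = processor.postprocess(decoded.detach())[0]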
Denoising Process
• Seed
• Random seed used to create initial noise
• Fixing it isolates the impact of the other parameters
• Samplers
• Algorithms guiding the iterative image generation
• Differ in Speed and Quality
• Schedulers
• Control how noise is removed at each step
• Also impact speed and quality; Karras is well balanced
• Other Parameters
• #steps, CFG (adherence to prompt), %denoising
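A sketch of these parameters in diffusers, reusing the earlier pipe: fix the seed through a generator, swap in a DPM++ sampler with a Karras schedule, then set steps and CFG:

import torch
from diffusers import DPMSolverMultistepScheduler

# Sampler/scheduler choice: DPM++ multistep with Karras sigmas
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

generator = torch.Generator("cuda").manual_seed(42)  # fixed seed -> same initial noise
image = pipe(
    "girl doing yoga in a park",
    num_inference_steps=30,  # number of denoising steps
    guidance_scale=7.0,      # CFG: adherence to the prompt
    generator=generator,
).images[0]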
I can tweak generation
but I don’t control the pose…
Example poses: Camel pose, Tree pose, Lotus pose, Shoulder Stand pose
Text-to-Image generation is not enough!
Let’s move on to Image-to-Image
Generate an image of a girl
doing a yoga pose based on
an existing image
Demo – Image-to-Image (I2I)
Image-to-Image Generation
• Input Image
• Replace the Empty Latent Image with a real one
• Need a VAE Encode (from the model)
• Play with % denoising (comparison images: denoise 0.55 vs. 0.70)
• Prompt has less impact
• Increasing CFG only reduces quality (comparison images: CFG 20 vs. CFG 8)
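A minimal image-to-image sketch with diffusers (model id and file name are placeholders); the strength parameter plays the role of the % denoising knob above:

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

i2i = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("tree_pose_reference.png").convert("RGB").resize((512, 512))
image = i2i(
    prompt="girl doing the tree yoga pose in a park, photorealistic",
    image=init,      # the input image is VAE-encoded, then partially re-noised
    strength=0.55,   # low values stay close to the input image
    guidance_scale=8.0,
).images[0]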
ControlNets
• Specialized Neural Networks
• Additional control and guidance to primary model
• Use reference images to transfer structural information
or inject features
→ Hybrid approach with both text and visual references
• Control methods
• Structural: pose, edge detection, segmentation, depth
• Texture & Detail: scribble/sketch, stylization from edges
• Content & Layout: bounding boxes, inpainting masks
• Abstract & Style: color maps, textural fields
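A sketch of pose-based ControlNet conditioning with diffusers (public SD 1.5 OpenPose ControlNet; the pose image is assumed to be an already preprocessed skeleton; several ControlNets can be passed as a list for combined conditioning):

import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe_cn = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pose = Image.open("camel_pose_openpose.png")  # preprocessed pose skeleton (placeholder file)
image = pipe_cn(
    "girl doing the camel yoga pose in a park",
    image=pose,
    controlnet_conditioning_scale=1.0,  # strength of the structural guidance
).images[0]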
Demo – I2I + ControlNet
Preprocessors for ControlNets
Examples: Initial Image, Line Art, Color Map, Open Pose, Segmentation, Depth Map, Scribble, Straight Lines
More abstract input images
• Design poses in 3D with image export
• Use of JustSketchMe tool (webapp & PWA)
• Design poses based on my own knowledge
• Several angles of view
• (waiting for 3D GenAI Models)
Demo – T2I + 2 ControlNets
How can I achieve greater
consistency for the character?
Create images featuring the
same facial identity
LoRA
• Low-Rank Adaptation
• Lightweight Model Adaptation
• Update a small subset of model parameters
• Very efficient
• Small file size, uses significantly less memory
• Faster Training
• Usage
• Specific styles, poses, characters, or concepts
• Triggered by keywords in the prompt
• Many LoRAs are provided by the community
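A sketch of loading a community LoRA with diffusers, reusing the earlier pipe (file name, scale and trigger word are hypothetical):

# Load the LoRA weights and fuse them into the model at a chosen strength;
# the trigger keyword in the prompt activates the learned concept.
pipe.load_lora_weights("loras/haj3r_character.safetensors")
pipe.fuse_lora(lora_scale=0.8)
image = pipe("haj3r woman doing the lotus yoga pose in a park").images[0]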
Demo – ControlNet + LoRA
Embeddings + 2 ControlNets + 2 LoRAs
Embeddings + ControlNet + LoRA
Create our own Haj3r LoRA
Dreambooth
• SDXL base
• 28 input images
• 2 epochs
• Google Colab Notebook
• 1h30 remote training
Kohya_ss tool
• PowerPuffMix base
• 15 input images
• 20 epochs
• 3h30 local training
• Embeddings
Embeddings + 2 ControlNets + Haj3r LoRA
Demo – I2I + 2 ControlNets + Haj3r LoRA + Transparency
Embeddings + ControlNet + Haj3r LoRA
FaceID + FaceDetailer
• Image Prompt Adapters
• Enable generation guided by an image prompt
• Pre-trained adapter networks, separate from the SD base model
• A sort of one-image LoRA
• FaceID IPAdapter
• Face recognition model instead of CLIP
• LoRA to improve ID consistency
• FaceDetailer
• Face enhancement tool (eyes, nose, lips, expression)
• Post-processing AI model
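A sketch with the generic SD 1.5 IP-Adapter in diffusers, reusing the earlier pipe (the FaceID variant used in the demo additionally relies on an InsightFace face embedding instead of CLIP image features, and FaceDetailer is a separate post-processing step):

from PIL import Image

pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # strength of the image prompt

face = Image.open("reference_face.png")  # placeholder reference image
image = pipe(
    "girl doing the lotus yoga pose in a park",
    ip_adapter_image=face,
).images[0]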
Demo – 2 ControlNets + FaceID + FaceDetailer
Embeddings + ControlNet + FaceID + FaceDetailer
Easier to change model: Cheyenne v2
Easier to change persona
Summary
• t2i
• t2i + embeddings
• t2i + i2i
• t2i + i2i + embeddings + ControlNet
• t2i + i2i + embeddings + ControlNet + LoRA
• t2i + i2i + embeddings + ControlNet + Haj3r LoRA + FaceDetailer
• t2i + embeddings + ControlNet + FaceID + FaceDetailer
Conclusion
Image Generation is both
Science & Art
A lot of parameters to tune
Add input & components to control output
My use case
is implementable
Precise and homogeneous images
Cherry on the cake: more inclusivity
Yoga Sūtra II.46
The posture should be Stable and Comfortable
The Yoga of Image Generation
Thank you
raphiki.github.io
