Mastering Image Generation
with Stable Diffusion
Raphaël Semeteys
bbTeX
23/01/2025
Use Case
Locally generate accurate images of Yoga Poses
• Images of yoga poses must be precise
• Photography is not always the best option
• Images or photos from the internet cannot be reused
How could Generative AI help?
Stable Diffusion
From German Labs to London-based Startup
• Collaboration of several companies and German Labs
• Latent Diffusion Model with embedding space in 2021
• CLIP-guided diffusion
• LAION dataset
• Runway and EleutherAI participation
• Stability AI
• Compute donation to the project
• Hired most of initial researchers
• Now official maintainer of Stable Diffusion models
“Open” Licenses
• Responsible AI: OpenRAIL
• Version 3.5: enterprises with $1M+ annual revenue must pay for a commercial license
Stable Diffusion
Very dynamic contributing Communities
Models
• Fine-tuning: Custom Models for
specific styles or themes
• Refiners, Upscalers, ControlNets
• Model extensions (LoRA)
Tools
• User-Friendly Interfaces:
Automatic1111 Web UI, ComfyUI
• Fine-tuning tools: Dreambooth,
Kohya SS
Sharing Communities
• Portals to share models, prompts, images and tutorials: Hugging Face, Civit.ai…
• Stable Horde: crowdsourced distributed cluster of generation workers
ComfyUI: GUI for local Stable Diffusion workflows
• Intuitive, modular and customizable
• Flexible node-based workflows
• Text-to-Image Generation
• Image-to-Image Processing
• Custom Node Management
• Community-Driven
• GPL 3 License
• Contributions, plugins, doc
Let’s start with a simple demo
Generate an image of a girl
doing a yoga pose
Demo – Text-to-Image (T2I)
What does Stable Diffusion do?
• Starts with Random Noise: Begins with a noisy,
unrecognizable image
• Refines Step-by-Step: Gradually removes noise, adding
details
• Learns from Real Images: Uses patterns from trained
images
• Text-Guided Creation: Follows prompts like “girl doing yoga in a park”
• Denoising Process: Clarifies image layer by layer
• Final Image Output: Produces a clear, detailed image
matching the prompt
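For illustration, the same process expressed as a minimal Python sketch with the Hugging Face diffusers library (the demos use ComfyUI; the model id is a placeholder for any SD 1.5 checkpoint and a CUDA GPU is assumed):

import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion 1.5 checkpoint (placeholder model id)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The pipeline starts from random latent noise and denoises it step by step,
# guided by the text prompt.
image = pipe("girl doing yoga in a park", num_inference_steps=25).images[0]
image.save("yoga.png")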
Models
• Most used Stability.ai Models
• SD 1.5
• SDXL
• Fine-tuned Models
• Specialized: style, subject
• Shared by communities (like civitai.com)
Prompts
• CLIP Model (Contrastive Language–Image Pretraining)
• Connect descriptive text and images
• Help generate images matching specific prompts
• Can handle a wide range of prompts
• Developed by OpenAI in 2021
• Usable under the MIT license
• Trained on 400M image-text pairs from the Internet
• Positive & Negative
• Prompts can be full sentences or short, comma-separated keywords
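A hedged sketch of positive and negative prompts, reusing the pipe object from the earlier diffusers example (prompt wording is illustrative):

# The negative prompt lists concepts the CLIP-guided denoising should steer away from;
# guidance_scale (CFG) controls how strongly the prompts are followed.
image = pipe(
    prompt="girl doing yoga in a park, photorealistic, natural light",
    negative_prompt="blurry, deformed hands, extra limbs, low quality",
    guidance_scale=7.5,
).images[0]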
Embeddings (Textual Inversions)
• Vector representations of text
• “Instructions” for image generation
• Style, theme, texture, pose, character features, etc.
• Small files containing additional concepts
• To be injected in prompts
• Community provides many presets
• Must match the Stable Diffusion version of the base model
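A minimal sketch of loading a textual-inversion embedding with diffusers, reusing the earlier pipe (file name and trigger token are hypothetical; the embedding must be trained for the same SD version as the base model):

# Register the embedding under a new token, then reference that token in the prompt
pipe.load_textual_inversion("embeddings/ghibli_style.safetensors", token="<ghibli-style>")
image = pipe("girl doing yoga in a park, <ghibli-style>").images[0]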
Demo – T2I + Embedding (Textual Inversion)
Embeddings (Textual Inversions)
Example styles: No Embedding (baseline), Ghibli, Fantasy, Comic, 3D Render, Analog Film, Cinematic, Cyberpunk, Digital Art, Vector Art
Latent Space
• Latent Space
• Abstract, compressed representation of the image
• Handles encoded features such as shapes, colors,
textures and general structure
• Manipulation of embedding vectors
• Iterative and refining generation
• Random noise is introduced into the latent space
• At each step the model adjusts the features to match
the prompt
• VAE (Variational Autoencoder)
• Converts image pixels ↔ latent space (encode / decode)
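A sketch of the VAE round trip with diffusers, reusing the earlier pipe (file name is a placeholder); it shows the 8x spatial compression of a 512x512 image into 4x64x64 latents:

import torch
from PIL import Image
from diffusers.image_processor import VaeImageProcessor

input_image = Image.open("yoga_reference.png").convert("RGB").resize((512, 512))

processor = VaeImageProcessor()
pixels = processor.preprocess(input_image).to("cuda", dtype=torch.float16)

# Encode: pixel space -> latent space
latents = pipe.vae.encode(pixels).latent_dist.sample()
latents = latents * pipe.vae.config.scaling_factor  # scaling expected by the UNet

# Decode: latent space -> pixel space
decoded = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
image = processor.postprocess(decoded.detach())[0]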
Denoising Process
• Seed
• Random seed used to create initial noise
• Fixing it isolates the impact of the other parameters
• Samplers
• Algorithms guiding the iterative image generation
• Differ in Speed and Quality
• Schedulers
• Control how noise is removed at each step
• Also impact speed and quality; Karras is well balanced
• Other Parameters
• #steps, CFG (adherence to prompt), %denoising
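A sketch of these parameters in diffusers, reusing the earlier pipe: fix the seed through a generator, swap in a DPM++ sampler with a Karras schedule, then set steps and CFG:

import torch
from diffusers import DPMSolverMultistepScheduler

# Sampler/scheduler choice: DPM++ multistep with Karras sigmas
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

generator = torch.Generator("cuda").manual_seed(42)  # fixed seed -> same initial noise
image = pipe(
    "girl doing yoga in a park",
    num_inference_steps=30,  # number of denoising steps
    guidance_scale=7.0,      # CFG: adherence to the prompt
    generator=generator,
).images[0]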
I can tweak generation
but I don’t control the pose…
Example poses: Camel pose, Tree pose, Lotus pose, Shoulder Stand pose
Text-to-Image generation is not enough!
Let’s move on to Image-to-Image
Generate an image of a girl
doing a yoga pose based on
an existing image
Demo – Image-to-Image (I2I)
Image-to-Image Generation
• Input Image
• Replace the Empty Latent Image with a real one
• Need a VAE Encode (from the model)
• Play with % denoising (comparison images: denoise 0.55 vs. 0.70)
• Prompt has less impact
• Increasing CFG only reduces quality (comparison images: CFG 20 vs. CFG 8)
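A minimal image-to-image sketch with diffusers (model id and file name are placeholders); the strength parameter plays the role of the % denoising knob above:

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

i2i = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("tree_pose_reference.png").convert("RGB").resize((512, 512))
image = i2i(
    prompt="girl doing the tree yoga pose in a park, photorealistic",
    image=init,      # the input image is VAE-encoded, then partially re-noised
    strength=0.55,   # low values stay close to the input image
    guidance_scale=8.0,
).images[0]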
ControlNets
• Specialized Neural Networks
• Additional control and guidance to primary model
• Use reference images to transfer structural information
or inject features
→ Hybrid approach with both text and visual references
• Control methods
• Structural: pose, edge detection, segmentation, depth
• Texture & Detail: scribble/sketch, stylization from edges
• Content & Layout: bounding boxes, inpainting masks
• Abstract & Style: color maps, textural fields
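A sketch of pose-based ControlNet conditioning with diffusers (public SD 1.5 OpenPose ControlNet; the pose image is assumed to be an already preprocessed skeleton; several ControlNets can be passed as a list for combined conditioning):

import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe_cn = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pose = Image.open("camel_pose_openpose.png")  # preprocessed pose skeleton (placeholder file)
image = pipe_cn(
    "girl doing the camel yoga pose in a park",
    image=pose,
    controlnet_conditioning_scale=1.0,  # strength of the structural guidance
).images[0]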
Demo – I2I + ControlNet
Preprocessors for ControlNets
Examples: Initial Image, Line Art, Color Map, Open Pose, Segmentation, Depth Map, Scribble, Straight Lines
More abstract input images
• Design poses in 3D with image export
• Use of JustSketchMe tool (webapp & PWA)
• Design poses based on my own knowledge
• Several angles of view
• (waiting for 3D GenAI Models)
Demo – T2I + 2 ControlNets
How can I achieve greater
consistency for the character?
Create images featuring the
same facial identity
LoRA
• Low-Rank Adaptation
• Lightweight Model Adaptation
• Update a small subset of model parameters
• Very efficient
• Small file size, uses significantly less memory
• Faster Training
• Usage
• Specific styles, poses, characters, or concepts
• Triggered by keywords in the prompt
• Many LoRAs are provided by the community
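A sketch of loading a community LoRA with diffusers, reusing the earlier pipe (file name, scale and trigger word are hypothetical):

# Load the LoRA weights and fuse them into the model at a chosen strength;
# the trigger keyword in the prompt activates the learned concept.
pipe.load_lora_weights("loras/haj3r_character.safetensors")
pipe.fuse_lora(lora_scale=0.8)
image = pipe("haj3r woman doing the lotus yoga pose in a park").images[0]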
Demo – ControlNet + LoRA
Embeddings + 2 ControlNets + 2 LoRAs
Embeddings + ControlNet + LoRA
Create our own Haj3r LoRA
Dreambooth
• SDXL base
• 28 input images
• 2 epochs
• Google Colab Notebook
• 1h30 remote training
Kohya_ss tool
• PowerPuffMix base
• 15 input images
• 20 epochs
• 3h30 local training
• Embeddings
Embeddings + 2 ControlNets + Haj3r LoRA
Demo – I2I + 2 ControlNets + Haj3r LoRA + Transparency
Embeddings + ControlNet + Haj3r LoRA
FaceID + FaceDetailer
• Image Prompt Adapters
• Enable generation guided by an image prompt
• Pre-trained adapter networks, separate from the SD base model
• A sort of one-image LoRA
• FaceID IPAdapter
• Face recognition model instead of CLIP
• LoRA to improve ID consistency
• FaceDetailer
• Face enhancement tool (eyes, nose, lips, expression)
• Post-processing AI model
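A sketch with the generic SD 1.5 IP-Adapter in diffusers, reusing the earlier pipe (the FaceID variant used in the demo additionally relies on an InsightFace face embedding instead of CLIP image features, and FaceDetailer is a separate post-processing step):

from PIL import Image

pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # strength of the image prompt

face = Image.open("reference_face.png")  # placeholder reference image
image = pipe(
    "girl doing the lotus yoga pose in a park",
    ip_adapter_image=face,
).images[0]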
Demo – 2 ControlNets + FaceID + FaceDetailer
Embeddings + ControlNet + FaceID + FaceDetailer
Easier to change model: Cheyenne v2
Easier to change persona
Summary
• t2i
• t2i + embeddings
• t2i + i2i
• t2i + i2i + embeddings + ControlNet
• t2i + i2i + embeddings + ControlNet + LoRA
• t2i + i2i + embeddings + ControlNet + Haj3r LoRA + FaceDetailer
• t2i + embeddings + ControlNet + FaceID + FaceDetailer
Conclusion
Image Generation is both
Science & Art
A lot of parameters to tune
Add input & components to control output
My use case
is implementable
Precise and homogeneous images
Cherry on the cake: more inclusivity
Yoga Sūtra II.46
The posture should be Stable and Comfortable
The Yoga of Image Generation
Thank you
raphiki.github.io
