
Stable Diffusion

What's the deal with all these pictures?
These pictures were generated by Stable Diffusion, a recent diffusion generative model. It can turn text prompts (e.g. "an astronaut riding a horse") into images. It can also do a variety of other things!

You may also have seen DALL·E 2, which works in a similar way.


Why should we care?
• Could be a model of imagination.
• Similar techniques could be used to generate any number of things (e.g. neural data).
• It's cool!

[Example prompt: "a lovely cat running in the desert in Van Gogh style, trending art."]
How does it work?
It's complicated… but here's the high-level idea.

[Example prompt: "Batman eating pizza in a diner"]
What do we need?
1. Method of learning to generate new stuff given many examples

[Figure: example pictures of people vs. a "bad stick figure drawing"]
What do we need?
2. Way to link text and images ("cool professor person")
3. Way to compress images (for speed in training and generation)
What do we need?
4. Way to add in good image-related inductive biases…
…since when you're generating something new, you need a way to safely go beyond the images you've seen before.

What is Inductive Bias?
• Inductive bias refers to the set of assumptions or preferences that a machine learning algorithm brings to the table when learning from data. These assumptions guide the algorithm in making predictions about unseen data.
• Why it's important: without inductive bias, a machine learning model would be unable to generalize from its training data; it would have no way to choose among the infinitely many hypotheses that could explain the data.
What do we need?
1. Method of learning to generate new stuff → forward/reverse diffusion
2. Way to link text and images → text–image representation model
3. Way to compress images → autoencoder
4. Way to add in good inductive biases → U-Net + 'attention' architecture

Making a 'good' generative model is about making all these parts work together well!
Stable Diffusion in Action
Cartoon animated with Stable Diffusion (img2img) + After Effects:
https://www.reddit.com/r/StableDiffusion/comments/xcjj7u/sd_img2img_after_effects_i_generated_2_images_and/
Outline
• Build Stable Diffusion "from scratch"
• Principle of diffusion models (sampling, learning)
• Diffusion for images – UNet architecture
• Understanding prompts – words as vectors, CLIP
• Let words modulate diffusion – conditional diffusion, cross attention
• Diffusion in latent space – AutoEncoderKL
• Training on a massive dataset – LAION-5B
Principle of Diffusion Models
Learning to generate by iterative denoising.

"Creating noise from data is easy; creating data from noise is generative modeling."
-- Yang Song
Diffusion models
• Forward diffusion (noising)
  • $x_0 \to x_1 \to \cdots \to x_T$
  • Take a data distribution $x_0 \sim p(x)$ and turn it into noise by diffusion: $x_T \sim \mathcal{N}(0, \sigma^2 I)$
• Reverse diffusion (denoising)
  • $x_T \to x_{T-1} \to \cdots \to x_0$

[Figure: the chain $x_0, x_1, \ldots, x_{T-1}, x_T$, progressively noised]
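To make the noising step concrete, here is a minimal sketch of the forward process in its common discrete (DDPM-style) form; the schedule values and tensor shapes below are illustrative, not Stable Diffusion's exact settings:

```python
import torch

# Forward diffusion in closed form (DDPM-style, variance-preserving):
# x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # noise schedule (illustrative)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # abar_t

def q_sample(x0, t, eps=None):
    """Sample x_t ~ q(x_t | x_0) without simulating intermediate steps."""
    if eps is None:
        eps = torch.randn_like(x0)
    abar = alphas_cumprod[t]
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

x0 = torch.randn(1, 3, 64, 64)   # stand-in for a data sample
xt = q_sample(x0, t=500)         # halfway along the noising chain
```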
Math Formalism
• For a forward diffusion process
$$d\mathbf{x} = f(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}$$
there is a backward diffusion process that reverses time:
$$d\mathbf{x} = \left[f(\mathbf{x}, t) - g(t)^2 \nabla_x \log p(\mathbf{x}, t)\right] dt + g(t)\,d\bar{\mathbf{w}}$$
• If we know the time-dependent score function $\nabla_x \log p(\mathbf{x}, t)$,
• then we can reverse the diffusion process.
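As a sketch of what "reversing the diffusion" means numerically: assuming a variance-exploding SDE ($f = 0$) and a hypothetical `score_fn` approximating $\nabla_x \log p(x, t)$, Euler–Maruyama integration of the reverse SDE looks roughly like this:

```python
import torch

def reverse_sde_sample(score_fn, shape, n_steps=500, sigma=25.0):
    """Integrate the reverse SDE dx = -g(t)^2 * score(x, t) dt + g(t) dw
    backward from t = 1 to t = 0 (variance-exploding case, f = 0)."""
    x = torch.randn(shape) * sigma          # start from the noise distribution
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt                    # time runs from 1 down to 0
        g = sigma ** t                      # g(t) for the VE SDE (illustrative)
        x = x + (g ** 2) * score_fn(x, t) * dt \
              + g * (dt ** 0.5) * torch.randn_like(x)
    return x
```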
Modelling the Score Function over the Image Domain: Introducing UNet

Convolutional Neural Network
• CNN parametrizes a function over images
• Motivation
  • Features are translation invariant
  • Extract features at different scales / abstraction levels
• Key modules
  • Convolution
  • Downsampling (max-pool)

[Figure: VGG. Features of larger scale (larger receptive field) sit at a higher abstraction level.]
CNN + inverted CNN ⇒ UNet
• An inverted CNN (generator) can generate images.
• CNN + inverted CNN could model an image → image function.

[Figure: downsampling via convolution, then upsampling via transposed convolution]
UNet: a natural architecture for image-to-image functions
• Skip connections transport information at the same resolution.

[Figure: downsampling (encoder) side and upsampling (decoder) side, joined by skip connections]
Key Ingredients of UNet
• Convolution operation
  • Saves parameters; spatially invariant
• Down/up sampling
  • Multiscale / hierarchy
  • Learn modulations at multiple scales and abstraction levels
• Skip connections
  • No bottleneck
  • Route features of the same scale directly
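A toy PyTorch sketch of these ingredients (one down level, one up level, one skip connection; real diffusion UNets stack many such levels and add time/text conditioning):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal UNet: encoder, downsample, upsample, skip connection."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                        # downsampling side
        self.mid = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(ch, ch, 2, stride=2)  # upsampling side
        # The decoder sees upsampled features concatenated with the skip.
        self.dec = nn.Conv2d(ch * 2, 3, 3, padding=1)

    def forward(self, x):
        h = self.enc(x)                     # full-resolution features
        m = self.mid(self.down(h))          # low-resolution features
        u = self.up(m)                      # back to full resolution
        return self.dec(torch.cat([u, h], dim=1))  # skip: route same-scale features

out = TinyUNet()(torch.randn(1, 3, 64, 64))  # same spatial shape in and out
```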
Note: Add Time Dependency
• The score function is time-dependent.
  • Target: $s(x, t) = \nabla_x \log p(x, t)$
• Add time dependency
  • Assume the time dependency is spatially homogeneous:
  • add one scalar value per channel, $f(t)$, to the conv feature tensor ($\oplus$).
  • Parametrize $f(t)$ by an MLP / linear map over a Fourier basis of the time embedding, $[\sin(\omega_i t), \cos(\omega_i t), \ldots]$.
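A sketch of this conditioning, assuming the usual sinusoidal (Fourier) time embedding; the layer sizes are illustrative:

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Fourier features [sin(w_i t), cos(w_i t)] with log-spaced frequencies."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# An MLP maps the embedding to one scalar per channel, which is added to the
# conv feature tensor -- a spatially homogeneous modulation f(t).
time_mlp = nn.Sequential(nn.Linear(128, 256), nn.SiLU(), nn.Linear(256, 64))

feat = torch.randn(8, 64, 32, 32)            # conv feature tensor
t = torch.randint(0, 1000, (8,))             # diffusion timesteps
emb = time_mlp(timestep_embedding(t, 128))   # (8, 64): one scalar per channel
feat = feat + emb[:, :, None, None]          # broadcast over spatial dims
```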
How to understand prompts?
Language / multimodal transformers: CLIP!
Word as Vectors: Language Model 101
• Unlike pixels, the meaning of a word is not explicit in its characters.
• A word can be represented as an index in a dictionary,
  • but an index is also meaningless by itself.
• Represent words in a vector space.
  • Vector geometry ⇒ semantic relations.

Words in a sentence:  I    love  cats  and   dogs  .
Token indices:        328  793   3989  537   3255  269
(each index is then mapped to a word vector)
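In code, this is just an embedding lookup; the vocabulary size and dimension below are CLIP-like but illustrative:

```python
import torch
import torch.nn as nn

# Token indices are arbitrary integers; an embedding table maps each index
# to a learned vector whose geometry can encode semantic relations.
tokens = torch.tensor([328, 793, 3989, 537, 3255, 269])  # "I love cats and dogs ."
embed = nn.Embedding(num_embeddings=49408, embedding_dim=768)
word_vectors = embed(tokens)   # (6, 768): one vector per token
```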
Word Vectors in Context: RNN / Transformers
• The meaning of a word depends on context; it is not always the same.
  • "I book a ticket to buy that book."
• Word vectors should therefore depend on context.
• Transformers let each word "absorb" influence from other words to become "contextualized".

[Figure: N stacked Transformer blocks]

More on attention later…
Learning Word Vectors: GPT & BERT & CLIP
• Self-supervised learning of word representations
  • Predicting missing / next words in a sentence (BERT, GPT)
  • Contrastive learning, matching image and text (CLIP)
• A downstream classifier can decode: part of speech, sentiment, …

[Figure: MLM — Sentence-Transformers documentation (sbert.net)]

Joint Representation for Vision and Language: CLIP
• Learn a joint embedding space for text captions and images.
  • Maximize representation similarity between an image and its caption.
  • Minimize it for all other pairs.

[Figure: text Transformer and Vision Transformer encoders, from the CLIP paper]
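A common way to write this objective is the symmetric contrastive (InfoNCE) loss below; this is a sketch of the idea, not CLIP's exact training code:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Matched (image, caption) pairs sit on the diagonal of the similarity
    matrix; cross-entropy pushes them up and all other pairs down."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) cosine similarities
    labels = torch.arange(len(logits))          # i-th image <-> i-th caption
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

loss = clip_loss(torch.randn(16, 512), torch.randn(16, 512))
```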
Choice of text encoding
• Encoder in Stable Diffusion: pre-trained CLIP ViT-L/14 text encoder
• Word vectors can also be randomly initialized and learned online.
• Representing other conditional signals (see the loading sketch below)
  • Object categories (e.g. shark, trout, etc.): one vector per class
  • Face attributes (e.g. {female, blonde hair, with glasses, …}, {male, short hair, dark skin}): a set of vectors, one per attribute
• Time to be creative!!
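For reference, one way to load that CLIP ViT-L/14 text encoder is through the Hugging Face `transformers` port; the checkpoint name `openai/clip-vit-large-patch14` is an assumption to check against your setup:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Stable Diffusion pads/truncates prompts to 77 tokens.
batch = tokenizer(["an astronaut riding a horse"], padding="max_length",
                  max_length=77, return_tensors="pt")
with torch.no_grad():
    word_vectors = encoder(**batch).last_hidden_state  # (1, 77, 768)
```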
How does text affect diffusion?
Incoming: Cross Attention

Origin of Attention: Machine Translation (Seq2Seq)
Original sentence: "I love cats and dogs."
French translation: "J'adore les chats et les chiens."

[Figure: encoder hidden states $e_1 \ldots e_6$ (word vectors) feeding decoder hidden states $h_1, h_2, h_3$]

• Use attention to retrieve useful info from a batch of vectors.
From Dictionary to Attention: Hard Indexing
• Keys: 1, 2, 3; values: $v_1, v_2, v_3$
  • `dic = {1: v1, 2: v2, 3: v3}`
• Query: 2, i.e. `dic[2]`
  • Find 2 among the keys.
  • Get the corresponding value $v_2$.
• Retrieving values as a matrix–vector product
  • One-hot vector over the keys:
$$[v_1\ v_2\ v_3] \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} = v_2$$
From Dictionary to Attention: Soft Indexing
• Soft indexing
  • Define an attention distribution $a$ over the keys.
  • Matrix–vector product:
$$[v_1\ v_2\ v_3] \begin{bmatrix} 0.1 \\ 0.8 \\ 0.1 \end{bmatrix} = 0.1\,v_1 + 0.8\,v_2 + 0.1\,v_3$$
• The distribution is based on the similarity of query and key.
QKV attention
• Query: what I need (J'adore: "I want a subject pronoun & verb")
• Key: what the target provides (I: "Here is the subject")
• Value: the information to be retrieved (a latent related to Je or J')
• All are linear projections of the "word vectors":
  • Query $q_i = W_q h_i$
  • Key $k_j = W_k e_j$
  • Value $v_j = W_v e_j$
Attention mechanism
• Compute the inner product (similarity) of key $k_j$ and query $q_i$.
• SoftMax the normalized scores to get the attention distribution:
$$a_{ij} = \mathrm{SoftMax}_j\!\left(\frac{k_j^\top q_i}{\sqrt{d}}\right), \qquad \sum_j a_{ij} = 1$$
• Use the attention distribution to take a weighted average of the values $v_j$:
$$c_i = \sum_j a_{ij} v_j$$
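These two formulas amount to a few lines of PyTorch (single-head and unbatched, for clarity):

```python
import torch

def attention(q, k, v):
    """Scaled dot-product attention.
    q: (n_q, d), k: (n_k, d), v: (n_k, d_v) -> contexts (n_q, d_v)."""
    scores = q @ k.t() / (q.shape[-1] ** 0.5)  # similarity of queries and keys
    a = torch.softmax(scores, dim=-1)          # attention distribution; rows sum to 1
    return a @ v                               # weighted average of the values

c = attention(torch.randn(3, 64), torch.randn(5, 64), torch.randn(5, 32))
```

Hard indexing is the limiting case where each row of `a` becomes one-hot.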
Visualizing the Attention Matrix $a_{ij}$
• French → English translation
• "Learnt to pay attention":
  • "la zone économique européenne" → "the European Economic Area"
  • "a été signé" → "was signed"
Cross & Self Attention
• Cross attention
  • Tokens in one language pay attention to tokens in another
    (e.g. decoder hidden states $h_1, h_2, h_3$ for "J'adore les chats" attending to the English word vectors).
• Self attention ($e_i = h_i$)
  • Tokens in a language pay attention to each other.
  • "A robot must obey the order given it."

https://jalammar.github.io/illustrated-gpt2/
Note: Feed-Forward Network
• Attention is usually followed by a 2-layer MLP and normalization.
• It learns a nonlinear transform.
Text2Image as Translation
• Source language: words (sequence dimension)
• Target language: images (spatial dimensions × channel dimension)
• Word vectors ↔ the encoded latent state of the image: patch vectors!

[Example prompt: "A ballerina chasing her cat running on the grass in the style of Monet"]
Text2Image as Translation
• Cross attention: image to words
• Self attention: image to image

[Same prompt and diagram, with attention arrows between the word vectors and the image latent state]
Spatial Transformer
• Rearrange the spatial tensor into a sequence.
• Cross attention
• Self attention
• FFN
• Rearrange back into a spatial tensor (same shape).
Tips: Implementing attention with the `einops` lib
• `einops.rearrange` function
  • Shift the order of axes
  • Split / combine dimensions
• `torch.einsum` function
  • Multiply & sum tensors along axes
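Putting the last two slides together, here is a sketch of the cross-attention step of a spatial transformer using `einops.rearrange` and `torch.einsum`; the projection matrices and sizes are illustrative:

```python
import torch
from einops import rearrange

def spatial_cross_attention(feat, words, W_q, W_k, W_v):
    """Rearrange the spatial tensor to a sequence, attend to word vectors,
    rearrange back (image-to-words cross attention)."""
    B, C, H, W = feat.shape
    x = rearrange(feat, "b c h w -> b (h w) c")     # spatial -> sequence
    q = torch.einsum("bnc,cd->bnd", x, W_q)         # queries from image patches
    k = torch.einsum("bmc,cd->bmd", words, W_k)     # keys from words
    v = torch.einsum("bmc,cd->bmd", words, W_v)     # values from words
    a = torch.softmax(torch.einsum("bnd,bmd->bnm", q, k) / q.shape[-1] ** 0.5, -1)
    out = torch.einsum("bnm,bmd->bnd", a, v)        # retrieve word info per patch
    return rearrange(out, "b (h w) d -> b d h w", h=H, w=W)  # back to spatial

feat = torch.randn(1, 64, 8, 8)     # image feature tensor
words = torch.randn(1, 77, 768)     # word vectors from the text encoder
Wq, Wk, Wv = torch.randn(64, 32), torch.randn(768, 32), torch.randn(768, 32)
out = spatial_cross_attention(feat, words, Wq, Wk, Wv)  # (1, 32, 8, 8)
```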
UNet = Giant Sandwich of Spatial Transformers + ResBlocks (Conv layers)

[Block diagram: the down path stacks ResBlock + SpatialTransformer pairs with DownSample layers between resolutions; the up path mirrors it with ResBlock + SpatialTransformer pairs and UpSample layers.]
Spatial Transformer + ResBlock (Conv layer)
• The latent tensor (4, 64, 64) flows through alternating ResBlocks and Spatial Transformers.
• The time embedding modulates the ResBlocks; the word vectors ($L_{seq}$, 768) modulate the Spatial Transformers.
• Alternating time and word modulation
• Alternating local and nonlocal operations
Diffusion in Latent Space
Adding in an AutoEncoder

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. CVPR.
Diffusion in latent space
Motivation:
• Natural images are high dimensional,
• but have many redundant details that could be compressed / statistically filled in.

[Figure: the same image downsampled, d = 2352 vs. d = 97200]

• Division of labor
  • Diffusion model → generate a low-resolution sketch
  • AutoEncoder → fill out high-resolution details
• Train a VAE model to compress images into a latent space.
  • $x \to z \to \hat{x}$, with $x, \hat{x}$ of shape [3, 512, 512] and $z$ of shape [4, 512/f, 512/f]
• Train diffusion models in the latent space of $z$.
Spatial Compression Tradeoff
• LDM-{f}: f = spatial downsampling factor
• Higher f leads to faster sampling, with degraded image quality (FID ↑).
• Fewer sampling steps lead to faster sampling, with lower quality (FID ↑).

[Figure: FID vs. sampling throughput on CelebA-HQ (faces) and ImageNet]
Spatial Compression Tradeoff
• LDM-{f}: f = spatial downsampling factor
• Too little compression (f = 1, 2) or too much compression (f = 32) makes diffusion hard to train.
Details in Stable Diffusion
• In Stable Diffusion, the spatial downsampling factor is f = 8:
  • $x$ is a (3, 512, 512) image tensor
  • $z$ is a (4, 64, 64) latent tensor
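A round trip through such an autoencoder can be sketched with the `diffusers` library; the checkpoint name below (`stabilityai/sd-vae-ft-mse`, one published SD VAE) and the SD v1 latent scaling factor 0.18215 are assumptions to check against your setup:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

x = torch.randn(1, 3, 512, 512)                       # stand-in for an image
with torch.no_grad():
    z = vae.encode(x).latent_dist.sample() * 0.18215  # (1, 4, 64, 64) latent
    x_hat = vae.decode(z / 0.18215).sample            # (1, 3, 512, 512) reconstruction
```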
Regularizing the Latent Space
• KL regularizer
  • As in a VAE, make the latent distribution close to a Gaussian.
• VQ regularizer
  • Quantize the latent representation into a set of discrete tokens.
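For the KL case, the regularizer has the familiar closed form for a diagonal Gaussian posterior (a sketch; the reduction and loss weighting vary by implementation):

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * torch.sum(mu ** 2 + logvar.exp() - 1.0 - logvar, dim=1)

kl = kl_to_standard_normal(torch.randn(8, 4096), torch.randn(8, 4096))
```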
Let the GPUs roar!
Training data & details

Large Data Training
• SD is trained on ~2 billion image–caption (English) pairs.
• Scraped from the web, filtered by CLIP.
• https://laion.ai/blog/laion-5b/
Diffusion Process Visualized

Meaning of the latent space
• The latent state contains a "sketch version" of the image.

[Figure: visualizing the first three latent channels, $z[0{:}3, :, :]$]