Diffusion model based image generation and
model maintenance
Aryansh Saxena, Bijayan Ray, Krishanu Bandyopadhyay
May 8, 2025
Contents
Dataset preparation
Diffusion model description (simple unet)
Diffusion model description
Configurations
Model training
Results
Dataset preparation I
▶ A custom dataset class named FlowerDataset is defined,
inheriting from torch.utils.data.Dataset.
▶ The constructor takes the directory path of the images and an
optional transform.
▶ It lists all files in the given directory that have a .jpg
extension.
▶ Class names are extracted from each file name by splitting at
the underscore (assuming file names are formatted as
class_index.jpg).
▶ A sorted list of unique class names is created, and a mapping
from class names to numeric indices is generated.
▶ The __len__ method returns the number of image files in the
dataset.
Dataset preparation II
▶ The __getitem__ method opens an image file, converts it to
RGB format, applies the transform (if provided), extracts the
class label from the file name, maps it to its corresponding
index, and returns the image and its label.
▶ An instance of the dataset is created using the image
directory /kaggle/input/flower-image-dataset/flowers
and a specified transform.
▶ A DataLoader is created with the dataset, a specified batch
size, shuffling enabled, and 2 worker processes for data
loading.
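A minimal PyTorch sketch of the dataset pipeline described above; the directory path, the class_index.jpg naming convention, and the batch size of 128 with 2 workers follow the slides, while the transform and variable names are illustrative assumptions.

import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class FlowerDataset(Dataset):
    def __init__(self, img_dir, transform=None):
        self.img_dir = img_dir
        self.transform = transform
        # keep only the .jpg files in the directory
        self.files = [f for f in os.listdir(img_dir) if f.endswith(".jpg")]
        # class name = part of the file name before the underscore
        classes = sorted({f.split("_")[0] for f in self.files})
        self.class_to_idx = {c: i for i, c in enumerate(classes)}

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        fname = self.files[idx]
        img = Image.open(os.path.join(self.img_dir, fname)).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        label = self.class_to_idx[fname.split("_")[0]]
        return img, label

# illustrative transform: resize to the configured image size and convert to a tensor
transform = transforms.Compose([transforms.Resize((56, 56)), transforms.ToTensor()])
train_ds = FlowerDataset("/kaggle/input/flower-image-dataset/flowers", transform)
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=2)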
Diffusion model description (simple unet) I
▶ The model is a U-Net-style architecture with both time and
text conditioning, suitable for denoising in diffusion models.
The input is a noisy image x ∈ R^{B×C×H×W}, a timestep
t ∈ N^B, and an optional text embedding e_text ∈ R^{B×d_text}.
▶ Time embedding block:
▶ A learnable positional embedding layer maps timestep t to an
embedding vector:
Emb(t) ∈ R^{B×d_time}
where Emb : N → R^{d_time} is a learned embedding function.
▶ The time embedding is passed through a multilayer perceptron
(MLP) consisting of:
▶ A linear layer: W_1 ∈ R^{4d_time×d_time}
▶ SiLU activation: SiLU(x) = x · σ(x)
▶ Another linear layer: W_2 ∈ R^{d_time×4d_time}
▶ Output (a code sketch follows):
e_t = W_2 · SiLU(W_1 · Emb(t)) ∈ R^{B×d_time}
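A minimal sketch of this time-embedding block, assuming a learned nn.Embedding over the T = 1000 timesteps and the d_time → 4·d_time → d_time MLP shapes given above; module names are illustrative.

import torch
import torch.nn as nn

class TimeEmbedding(nn.Module):
    def __init__(self, num_timesteps=1000, d_time=128):
        super().__init__()
        self.emb = nn.Embedding(num_timesteps, d_time)   # Emb : N -> R^{d_time}
        self.mlp = nn.Sequential(
            nn.Linear(d_time, 4 * d_time),               # W_1
            nn.SiLU(),                                   # SiLU(x) = x * sigmoid(x)
            nn.Linear(4 * d_time, d_time),               # W_2
        )

    def forward(self, t):                                # t: (B,) integer timesteps
        return self.mlp(self.emb(t))                     # e_t: (B, d_time)

e_t = TimeEmbedding()(torch.randint(0, 1000, (8,)))      # -> torch.Size([8, 128])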
Diffusion model description (simple unet) II
▶ Text conditioning:
▶ If available, the text embedding e_text ∈ R^{B×d_text} is linearly
projected:
e′_text = W_text · e_text ∈ R^{B×d_time}
▶ Combined with the time embedding as:
e_cond = e_t + e′_text
▶ Residual block structure:
▶ Input: x ∈ R^{B×C_in×H×W} and e_cond ∈ R^{B×d_time}
▶ Time embedding projection:
e_proj = W_t · e_cond ∈ R^{B×C_out}
▶ Two convolutional layers with GroupNorm and SiLU:
▶ First layer: GroupNorm → SiLU → Conv2d(C_in → C_out)
▶ Second layer: GroupNorm → SiLU → Conv2d(C_out → C_out)
Diffusion model description (simple unet) III
▶ After the second convolution, add the time embedding:
h = h + e_proj[:, :, None, None]
▶ Residual connection:
▶ If C_in ≠ C_out: apply a 1 × 1 convolution.
▶ Otherwise, use the identity.
▶ Final output (sketched in code below):
y = h + ResConv(x)
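A minimal sketch of this time-conditioned residual block; the 3 × 3 kernels and the choice of 8 groups for GroupNorm are assumptions not stated on the slides.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, c_in, c_out, d_time=128, groups=8):
        super().__init__()
        self.time_proj = nn.Linear(d_time, c_out)                  # W_t
        g_in = groups if c_in % groups == 0 else 1                 # fall back for small C_in
        self.block1 = nn.Sequential(
            nn.GroupNorm(g_in, c_in), nn.SiLU(),
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1))
        self.block2 = nn.Sequential(
            nn.GroupNorm(groups, c_out), nn.SiLU(),
            nn.Conv2d(c_out, c_out, kernel_size=3, padding=1))
        # 1x1 convolution on the skip path only when the channel count changes
        self.res_conv = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x, e_cond):
        h = self.block1(x)
        h = self.block2(h)
        h = h + self.time_proj(e_cond)[:, :, None, None]           # add e_proj after conv 2
        return h + self.res_conv(x)                                # y = h + ResConv(x)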
▶ Encoder pathway (downsampling):
▶ Input x passes through the first residual block:
d_1 = ResBlock_1(x, e_cond)
▶ Max pooling with a 2 × 2 kernel:
d′_1 = MaxPool(d_1)
Diffusion model description (simple unet) IV
▶ Pass through the second residual block:
d_2 = ResBlock_2(d′_1, e_cond)
▶ Another downsampling step:
d′_2 = MaxPool(d_2)
▶ Bottleneck:
▶ Apply one residual block at the lowest resolution:
b = ResBlock_bot(d′_2, e_cond)
▶ Decoder pathway (upsampling):
▶ Upsample the bottleneck output:
u′_1 = Upsample(b)
▶ Concatenate with d_2 along the channel dimension:
u″_1 = Concat(u′_1, d_2) ∈ R^{B×(C_b+C_{d_2})×H×W}
Diffusion model description (simple unet) V
▶ Apply a residual block:
u_1 = ResBlock_3(u″_1, e_cond)
▶ Repeat the upsampling:
u′_2 = Upsample(u_1)
▶ Concatenate with d_1 and apply the final residual block:
u_2 = ResBlock_4(Concat(u′_2, d_1), e_cond)
▶ Output head (the sketch below assembles all of these pieces):
▶ GroupNorm → SiLU → Conv2d with kernel size 1 × 1:
x̂ = Conv2d_{1×1}(SiLU(GroupNorm(u_2))) ∈ R^{B×C×H×W}
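The following sketch assembles the pieces above into the simple U-Net. It reuses the TimeEmbedding and ResidualBlock sketches, so those must be defined first; the channel counts 64/128/256 are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleUNet(nn.Module):
    def __init__(self, in_ch=3, d_time=128, d_text=512):
        super().__init__()
        self.time_emb = TimeEmbedding(1000, d_time)
        self.text_proj = nn.Linear(d_text, d_time)             # W_text
        self.down1 = ResidualBlock(in_ch, 64, d_time)
        self.down2 = ResidualBlock(64, 128, d_time)
        self.bot = ResidualBlock(128, 256, d_time)
        self.up1 = ResidualBlock(256 + 128, 128, d_time)
        self.up2 = ResidualBlock(128 + 64, 64, d_time)
        self.head = nn.Sequential(nn.GroupNorm(8, 64), nn.SiLU(), nn.Conv2d(64, in_ch, 1))

    def forward(self, x, t, e_text=None):
        e_cond = self.time_emb(t)                               # e_t
        if e_text is not None:
            e_cond = e_cond + self.text_proj(e_text)            # e_cond = e_t + e'_text
        d1 = self.down1(x, e_cond)
        d2 = self.down2(F.max_pool2d(d1, 2), e_cond)
        b = self.bot(F.max_pool2d(d2, 2), e_cond)
        u1 = self.up1(torch.cat([F.interpolate(b, scale_factor=2), d2], dim=1), e_cond)
        u2 = self.up2(torch.cat([F.interpolate(u1, scale_factor=2), d1], dim=1), e_cond)
        return self.head(u2)                                    # same shape as the input x

x_hat = SimpleUNet()(torch.randn(2, 3, 56, 56),
                     torch.randint(0, 1000, (2,)),
                     torch.randn(2, 512))                       # -> torch.Size([2, 3, 56, 56])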
Diffusion model description I
▶ Sinusoidal Positional Embedding:
▶ The time embedding is generated using sine and cosine
functions.
▶ Frequencies are defined as:
f_i = 10000^{−2i/d}
where i is the dimension index and d is the total
dimensionality.
▶ The time embedding for timestep t is computed as:
emb(t) = [sin(t · f_1), cos(t · f_1), . . . , sin(t · f_{d/2}), cos(t · f_{d/2})]
▶ This embedding encodes timestep information, enabling the
network to understand temporal dependencies during training.
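A minimal sketch of this sinusoidal embedding; it concatenates the sine and cosine halves rather than interleaving them, which is a common implementation detail and does not change the information carried.

import math
import torch
import torch.nn as nn

class SinusoidalPosEmb(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.dim = dim

    def forward(self, t):                                          # t: (B,) timesteps
        half = self.dim // 2
        i = torch.arange(half, device=t.device).float()
        freqs = torch.exp(-math.log(10000.0) * 2 * i / self.dim)   # f_i = 10000^{-2i/d}
        args = t.float()[:, None] * freqs[None, :]                 # (B, d/2)
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (B, d)

emb = SinusoidalPosEmb(128)(torch.arange(4))                       # -> torch.Size([4, 128])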
▶ Conditional Dropout:
▶ Dropout is applied during training to regularize the model.
▶ A mask is created from the dropout probability p, and the
selected conditioning features are zeroed out accordingly.
Diffusion model description II
▶ Dropout is not applied during inference; the model behaves
deterministically during testing.
▶ If the dropout probability p is zero, the input embeddings
remain unchanged.
▶ This technique prevents overfitting by forcing the model to rely
less on specific features during training.
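A minimal sketch of the conditional-dropout step, under the assumption that whole conditioning embeddings (rather than individual features) are zeroed per sample; the function name is illustrative.

import torch

def conditional_dropout(e_text, p=0.1, training=True):
    # e_text: (B, d_text) conditioning embeddings
    if not training or p == 0.0:
        return e_text                                   # unchanged at inference or when p = 0
    keep = (torch.rand(e_text.size(0), 1, device=e_text.device) >= p).float()
    return e_text * keep                                # dropped rows become all-zero embeddings

e = conditional_dropout(torch.randn(4, 512), p=0.1)     # some rows may be zeroed during training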
▶ Residual Blocks:
▶ A residual block consists of a series of operations designed to
improve gradient flow.
▶ It contains a group normalization layer followed by a
convolutional layer.
▶ After the convolution, a non-linearity (SiLU) is applied.
▶ A second convolutional layer is applied to refine the features.
▶ The input x is added back to the output of the convolutions
(skip connection):
output = ResidualBlock(x) = Conv_2(SiLU(Conv_1(Norm(x)))) + x
Diffusion model description III
▶ Skip connections mitigate the vanishing gradient problem,
improving model training.
▶ Cross Attention:
▶ Cross-attention allows the model to condition on text
information while processing image features.
▶ The query Q is derived from the image features, while the key
K and value V are derived from the text embeddings.
▶ The attention scores are computed by taking the dot product
between the query and key, scaled by √d, where d is the
dimension of the key:
attn_scores = QK^T / √d
▶ Softmax is applied to the attention scores to normalize them.
▶ The attention output is computed as the weighted sum of the
value matrix V:
attn_output = Softmax(attn_scores) V
Diffusion model description IV
▶ This output is used to refine the image features by
incorporating text-driven information.
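A minimal sketch of the cross-attention just described, with queries from the flattened image features and keys/values from the text embedding; the attention dimension and the treatment of the text embedding as a length-L token sequence (L = 1 for a single embedding) are assumptions.

import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, channels, d_text=512, d_attn=128):
        super().__init__()
        self.to_q = nn.Linear(channels, d_attn)
        self.to_k = nn.Linear(d_text, d_attn)
        self.to_v = nn.Linear(d_text, d_attn)
        self.proj = nn.Linear(d_attn, channels)                # back to the expected channels

    def forward(self, x, e_text):                              # x: (B, C, H, W), e_text: (B, L, d_text)
        B, C, H, W = x.shape
        q = self.to_q(x.flatten(2).transpose(1, 2))            # (B, HW, d_attn)
        k, v = self.to_k(e_text), self.to_v(e_text)            # (B, L, d_attn)
        scores = q @ k.transpose(1, 2) / (q.size(-1) ** 0.5)   # QK^T / sqrt(d)
        out = self.proj(scores.softmax(dim=-1) @ v)            # weighted sum of the values
        return x + out.transpose(1, 2).reshape(B, C, H, W)     # refine the image features

y = CrossAttention(64)(torch.randn(2, 64, 14, 14), torch.randn(2, 1, 512))  # same shape as x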
▶ Self Attention:
▶ Self-attention allows each pixel in the image to attend to every
other pixel, capturing long-range dependencies.
▶ The query, key, and value matrices are derived from the image
features.
▶ Attention scores are computed in the same manner as in
cross-attention:
attn_scores = QK^T / √d
▶ Softmax is applied to the scores, and the output is computed
as:
attn_output = Softmax(attn_scores) V
▶ The output is reshaped back into the original image
dimensions and passed through a projection layer.
▶ The projection layer ensures that the output matches the
expected number of channels.
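A matching self-attention sketch, showing the reshape from (B, C, H, W) to a sequence of H·W pixel tokens and back, followed by the projection layer; the attention dimension is an assumption.

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, channels, d_attn=128):
        super().__init__()
        self.to_qkv = nn.Linear(channels, 3 * d_attn)
        self.proj = nn.Linear(d_attn, channels)                # match the expected channel count

    def forward(self, x):                                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)                  # (B, HW, C): each pixel is a token
        q, k, v = self.to_qkv(tokens).chunk(3, dim=-1)
        scores = q @ k.transpose(1, 2) / (q.size(-1) ** 0.5)   # QK^T / sqrt(d)
        out = self.proj(scores.softmax(dim=-1) @ v)            # (B, HW, C)
        return x + out.transpose(1, 2).reshape(B, C, H, W)     # back to image dimensions

y = SelfAttention(64)(torch.randn(2, 64, 14, 14))              # same shape as the input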
Diffusion model description V
▶ DownBlock:
▶ The DownBlock is a part of the encoder, which reduces the
spatial dimensions of the image.
▶ It consists of a residual block followed by an optional
cross-attention layer.
▶ After the residual and attention layers, a second residual block
is applied to refine the features.
▶ The final operation in the DownBlock is a convolution with
stride 2, which reduces the image dimensions by half.
▶ The output consists of the processed feature map and the
downsampled version of the feature map.
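A minimal DownBlock sketch following this description; it reuses the ResidualBlock and CrossAttention sketches above, and the way the time/text conditioning is threaded through the sub-blocks is an assumption.

import torch.nn as nn

class DownBlock(nn.Module):
    def __init__(self, c_in, c_out, d_time=128, d_text=512, use_attn=True):
        super().__init__()
        self.res1 = ResidualBlock(c_in, c_out, d_time)
        self.attn = CrossAttention(c_out, d_text) if use_attn else None   # optional
        self.res2 = ResidualBlock(c_out, c_out, d_time)
        self.down = nn.Conv2d(c_out, c_out, kernel_size=3, stride=2, padding=1)  # halves H and W

    def forward(self, x, e_cond, e_text=None):
        h = self.res1(x, e_cond)
        if self.attn is not None and e_text is not None:
            h = self.attn(h, e_text)
        h = self.res2(h, e_cond)
        return h, self.down(h)             # (skip features, downsampled features)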
▶ UpBlock:
▶ The UpBlock is part of the decoder and is responsible for
increasing the spatial resolution of the feature map.
▶ The first operation is a transposed convolution, which
upsamples the feature map.
Diffusion model description VI
▶ The upsampled feature map is concatenated with the
corresponding skip connection from the encoder to retain
fine-grained spatial details.
▶ The concatenated feature map is passed through a residual
block and an optional cross-attention layer.
▶ A second residual block is applied to further refine the features.
▶ The final output of the UpBlock is the refined feature map
after upsampling.
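A matching UpBlock sketch, again reusing the ResidualBlock and CrossAttention sketches above; the transposed-convolution kernel size and the conditioning plumbing are assumptions.

import torch
import torch.nn as nn

class UpBlock(nn.Module):
    def __init__(self, c_in, c_skip, c_out, d_time=128, d_text=512, use_attn=True):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_in, c_in, kernel_size=2, stride=2)   # doubles H and W
        self.res1 = ResidualBlock(c_in + c_skip, c_out, d_time)
        self.attn = CrossAttention(c_out, d_text) if use_attn else None     # optional
        self.res2 = ResidualBlock(c_out, c_out, d_time)

    def forward(self, x, skip, e_cond, e_text=None):
        h = torch.cat([self.up(x), skip], dim=1)    # restore resolution, keep fine-grained detail
        h = self.res1(h, e_cond)
        if self.attn is not None and e_text is not None:
            h = self.attn(h, e_text)
        return self.res2(h, e_cond)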
▶ TextGuidedUNet:
▶ The TextGuidedUNet extends the standard UNet by
incorporating text-guided conditioning during image
generation.
▶ The model consists of an encoder-decoder architecture with
additional modules for time embedding and text-based
attention.
▶ In the encoder, two DownBlocks are used to process the image
and progressively downsample it.
Diffusion model description VII
▶ The bottleneck includes residual blocks and self-attention to
capture global dependencies.
▶ The decoder uses two UpBlocks to restore the spatial
resolution of the image.
▶ Skip connections are used to preserve high-resolution details
from the encoder.
▶ The final output is generated through a 1 × 1 convolutional
layer, reducing the feature map to the desired output channels.
▶ Forward Pass:
▶ During the forward pass, the model first computes time
embeddings for the input timestep t using the sinusoidal
positional embedding.
▶ The text embedding is conditioned using the conditional
dropout mechanism, which randomly drops certain features
during training.
▶ The image is processed through the encoder (comprising
DownBlocks) and the bottleneck, where residual and
self-attention layers are applied.
Diffusion model description VIII
▶ The feature map is passed through the decoder (comprising
UpBlocks), where cross-attention and residual processing are
used.
▶ Skip connections from the encoder are concatenated with the
upsampled feature maps during decoding to retain fine-grained
spatial information.
▶ The final output is produced by a 1 × 1 convolution, which
maps the features back to the image space (e.g., RGB
channels).
Configurations I
▶ The computation device is set to use CUDA if available;
otherwise, it falls back to CPU.
▶ The batch size for training is set to 128.
▶ The number of training epochs is 50.
▶ The learning rate is set to 2 × 10^{−4}.
▶ The input image size is set to 56 pixels.
▶ The number of image channels is 3, corresponding to RGB
images.
▶ The number of diffusion timesteps T is set to 1000.
▶ The dimensionality of the time embedding is 128.
▶ The dimensionality of the text embedding is 512.
▶ The conditional dropout probability is set to 0.1.
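The same configuration collected as constants (the constant names are illustrative):

import torch

DEVICE       = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE   = 128
EPOCHS       = 50
LR           = 2e-4
IMG_SIZE     = 56
IMG_CHANNELS = 3
T            = 1000      # diffusion timesteps
TIME_DIM     = 128       # time-embedding dimensionality
TEXT_DIM     = 512       # text-embedding dimensionality
COND_DROPOUT = 0.1       # conditional dropout probability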
Model training I
▶ Initialization:
▶ Initialize empty lists:
loss_hist ← [ ], acc_hist ← [ ]
▶ Loop over epochs:
for epoch ∈ {0, 1, . . . , epochs − 1}
▶ Set the model to training mode:
model.train()
This switches training-specific layers such as dropout or batch
normalization (if present) into their training behaviour.
▶ Initialize epoch statistics:
▶ Total loss accumulator: total_loss ← 0
▶ Total MSE accumulator for pseudo-accuracy: total_mse ← 0
Model training II
▶ Iterate over training batches:
for (imgs, labels) ∈ train_loader
▶ Transfer the input images to the device:
imgs ← imgs.to(device)
▶ Compute the batch size:
b ← imgs.size(0)
▶ Sample random diffusion timesteps:
t ∼ Uniform({0, . . . , T − 1})^b
▶ Generate the noise tensor:
ϵ ∼ N(0, I), ϵ ∈ R^{B×C×H×W}
Model training III
▶ Compute ᾱ_t from the precomputed α_cumprod:
ᾱ_t = extract(α_cumprod, t, imgs.shape)
▶ Corrupt the input image using the forward diffusion process:
x_t = √(ᾱ_t) · x + √(1 − ᾱ_t) · ϵ
▶ Get the text embeddings for the labels:
e_text = label_embs[labels]
▶ Predict the noise:
ϵ̂ = model(x_t, t, e_text)
▶ Compute the loss:
L_MSE = (1 / (B·C·H·W)) Σ_{i=1}^{B} ∥ϵ̂_i − ϵ_i∥²
where B is the batch size, C the number of channels, H the
height and W the width.
Model training IV
▶ Backpropagation and optimization:
▶ Zero the gradients: opt.zero_grad()
▶ Compute the gradients: L_MSE.backward()
▶ Update the weights: opt.step()
▶ Accumulate the total loss:
total_loss += L_MSE · b
▶ Estimate the denoised image x_0 using the reparameterization:
x̂_0 = (x_t − √(1 − ᾱ_t) · ϵ̂) / √(ᾱ_t)
▶ Compute the pseudo-accuracy = 1 − MSE, where MSE is:
MSE = (1 / (B·C·H·W)) Σ_{i=1}^{B} ∥x̂_0^{(i)} − x^{(i)}∥²
Model training V
▶ Accumulate the MSE:
total_mse += MSE · b
▶ End of epoch:
▶ Normalize the accumulated loss and accuracy:
epoch_loss = total_loss / N,  epoch_acc = 1 − total_mse / N
where N = len(train_ds)
▶ Append to the history:
loss_hist.append(epoch_loss), acc_hist.append(epoch_acc)
▶ Log the accuracy to a file (a code sketch of the full loop follows):
Append “Epoch#i : Accuracy ≈ x” to accuracy_log.txt
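A sketch of this training loop; it assumes the configuration constants sketched earlier, and that model, opt, alphas_cumprod (a length-T tensor on DEVICE), label_embs, train_loader and train_ds already exist. The helper extract gathers the per-sample ᾱ_t values and reshapes them for broadcasting.

import torch
import torch.nn.functional as F

def extract(a, t, shape):
    # gather a[t] for each sample and reshape to (B, 1, 1, 1) for broadcasting
    return a.gather(0, t).view(t.size(0), *([1] * (len(shape) - 1)))

loss_hist, acc_hist = [], []
for epoch in range(EPOCHS):
    model.train()
    total_loss, total_mse = 0.0, 0.0
    for imgs, labels in train_loader:
        imgs = imgs.to(DEVICE)
        b = imgs.size(0)
        t = torch.randint(0, T, (b,), device=DEVICE)              # random diffusion timesteps
        eps = torch.randn_like(imgs)                              # Gaussian noise
        a_bar = extract(alphas_cumprod, t, imgs.shape)            # alpha_bar_t
        x_t = a_bar.sqrt() * imgs + (1 - a_bar).sqrt() * eps      # forward diffusion
        e_text = label_embs[labels]                               # text embedding for each label
        eps_hat = model(x_t, t, e_text)                           # predict the noise
        loss = F.mse_loss(eps_hat, eps)                           # mean over B*C*H*W

        opt.zero_grad()
        loss.backward()
        opt.step()

        total_loss += loss.item() * b
        x0_hat = (x_t - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()   # reparameterized x_0 estimate
        total_mse += F.mse_loss(x0_hat, imgs).item() * b

    N = len(train_ds)
    epoch_loss, epoch_acc = total_loss / N, 1 - total_mse / N
    loss_hist.append(epoch_loss)
    acc_hist.append(epoch_acc)
    with open("accuracy_log.txt", "a", encoding="utf-8") as f:
        f.write(f"Epoch#{epoch}: Accuracy ≈ {epoch_acc:.4f}\n")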
Results I
Some images generated from the diffusion model trained on the
flower dataset:
Results II
Results III
Some images generated from the diffusion model trained on the
MNIST dataset:
Results IV