Assignment 4
CSED105 Introduction to Artificial Intelligence
Instructor: Jungseul Ok
TAs: Sangwon Ryu, Sungjae Lee, Minhyeon Oh, Kyeongheung Yun,
Kyungmin Kim, Changhwan Sung
csed-105@postech.ac.kr
Due: 23:59, May 19, 2025
• Note that your hand-in must be your own work, and hand-written or e-hand-written.
• Submit a scanned copy of your answers to PLMS as a single PDF file.
• If your work is submitted within 12 hours after the deadline, it will be penalized with a 20% deduction of the score.
• Submissions delayed by more than 12 hours will NOT be accepted.
• Please ask questions about the assignment through the Q&A board on PLMS rather than by e-mail.
1 K-means Algorithm
Given a dataset $D = \{x^{(i)}\}_{i=1,2,\ldots,m}$ of $m$ data points in $\mathbb{R}$, we want to partition them into $K$ clusters using the K-means algorithm. Let $\mu_k \in \mathbb{R}$ denote the centroid of cluster $k \in \{1, 2, \ldots, K\}$. Indicating the assignment of data point $x^{(i)}$ by the one-hot vector $r^{(i)} = (r^{(i)}_1, \ldots, r^{(i)}_K) \in \{0,1\}^K$ such that $\sum_{k=1}^{K} r^{(i)}_k = 1$ and $r^{(i)}_k = 1$ only if $x^{(i)}$ is assigned to cluster $k$, the K-means algorithm aims at finding $\{r^{(i)}\}_{i=1,\ldots,m}$ from the following optimization problem:
$$\min_{\{r^{(i)} \in \{0,1\}^K\},\, \{\mu_k \in \mathbb{R}\}} \;\; \sum_{i=1}^{m} \sum_{k=1}^{K} \frac{1}{2}\, r^{(i)}_k \big\|x^{(i)} - \mu_k\big\|_2^2 \tag{1a}$$
$$\text{subject to} \;\; \sum_{k=1}^{K} r^{(i)}_k = 1 \quad \forall i \in \{1, \ldots, m\}, \tag{1b}$$
which the K-means algorithm solves by iteratively optimizing the $r^{(i)}$'s (assignment step) and the $\mu_k$'s (update step), each while fixing the other.
In what follows, we derive the assignment and update steps. First, to derive the update step optimizing the $\mu_k$'s given the $r^{(i)}$'s, we begin with the partial derivative of (1a) with respect to $\mu_k$:
$$\frac{\partial}{\partial \mu_k} \left( \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{K} r^{(i)}_j \big(x^{(i)} - \mu_j\big)^2 \right) = \sum_{i=1}^{m} r^{(i)}_k \big(\mu_k - x^{(i)}\big) = \left( \sum_{i=1}^{m} r^{(i)}_k \right) \mu_k - \sum_{i=1}^{m} r^{(i)}_k x^{(i)}.$$
Finding the $\mu_k$ that makes this derivative zero, we conclude that, given $\{r^{(i)}\}_{i=1,\ldots,m}$, the optimal choice of $\mu_k$ is
$$\mu_k = \frac{\sum_{i=1}^{m} r^{(i)}_k x^{(i)}}{\sum_{i=1}^{m} r^{(i)}_k}, \tag{2}$$
which is the mean of the points assigned to cluster $k$, since the denominator is the number of points assigned to cluster $k$ and the numerator is the sum of the $x^{(i)}$'s in cluster $k$.
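For illustration, here is a minimal NumPy sketch of the update step (2). The function name and the array-shape conventions are our own assumptions and need not match the provided Colab code:

    import numpy as np

    def update_centroids(X, R):
        # Update step (2): each centroid mu_k becomes the mean of the points
        # assigned to cluster k (assumes every cluster has at least one point).
        #   X : shape (m,)   -- the m data points x^(i)
        #   R : shape (m, K) -- the one-hot assignment vectors r^(i)
        counts = R.sum(axis=0)   # denominator of (2): number of points in each cluster
        sums = R.T @ X           # numerator of (2): sum of the x^(i) in each cluster
        return sums / counts     # mu_k for k = 1, ..., K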
(a) [5pt] For the assignment step, obtain the optimal $r^{(i)}$ for (1) given $\{\mu_k\}_{k=1,\ldots,K}$. You need to justify your answer.
(b) [10pt] The following Colab notebook includes an implementation of the K-means algorithm for a wine dataset:
https://colab.research.google.com/drive/1vTpwEycgACl0ndaPpZXMQ243KE1mgz5i?usp=sharing
Find appropriate values of $K$ and the iteration count for the clustering. Report your choice of $K$ and the iteration count, and attach a screenshot of the clustering result.
(c) [10pt] Is the K-means algorithm robust to outlier data points (i.e., points that lie far
from the main data distribution)? Justify your answer.
(d) [10pt] Can K-means still work well when the data forms clusters of different shapes (for example, one cluster might be a dense circle and another might be a sparse triangle)? Justify your answer.
2 Generative Model
Consider the MNIST dataset $D = \{x^{(i)}\}_{i=1}^{m}$ consisting of $m$ gray-scale images of handwritten digits, and a VAE model consisting of an encoder $f$ and a decoder $g$, which are both MLPs. In training, we encode $x^{(i)}$ using $f(x^{(i)}) = (\mu^{(i)}, (\sigma^{(i)})^2)$ to generate a 2-dimensional random latent code $z^{(i)} \sim \mathcal{N}(\mu^{(i)}, (\sigma^{(i)})^2)$, and decode $z^{(i)}$ to reconstruct $x^{(i)}$ using $g(z^{(i)}) = \hat{x}^{(i)}$.
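In practice, sampling $z^{(i)}$ from $\mathcal{N}(\mu^{(i)}, (\sigma^{(i)})^2)$ while keeping the encoder trainable is usually done with the reparameterization trick. The sketch below only illustrates that idea, assuming a PyTorch encoder that outputs the log-variance; the name sample_latent is hypothetical and may differ from the provided Colab code:

    import torch

    def sample_latent(mu, log_var):
        # Draw z ~ N(mu, sigma^2) via the reparameterization trick, so that
        # gradients can flow back into the encoder outputs mu and log_var.
        #   mu, log_var : tensors of shape (batch, 2), the encoder outputs
        std = torch.exp(0.5 * log_var)   # sigma
        eps = torch.randn_like(std)      # eps ~ N(0, I)
        return mu + eps * std            # z = mu + sigma * eps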
Our learning objective can be informally described as follows:
$$\min_{f,g} \; \frac{1}{m} \sum_{i=1}^{m} \underbrace{\mathrm{CE}\big(x^{(i)}; g(z^{(i)})\big)}_{\text{reconstruction error}} + \lambda \cdot \underbrace{\mathrm{KL}\big(\mathcal{N}(\mu^{(i)}, (\sigma^{(i)})^2);\, \mathcal{N}(0, I)\big)}_{\text{regularization}}, \tag{3}$$
where $I$ is the $2 \times 2$ identity matrix. The first term, $\mathrm{CE}(\cdot)$, measures the discrepancy between the original $x^{(i)}$ and the reconstruction $\hat{x}^{(i)} = g(z^{(i)})$, and the second term, $\mathrm{KL}(\cdot)$, measures the distance between the distribution of $z^{(i)}$ and the zero-mean normal distribution $\mathcal{N}(0, I)$. Noting that a random vector drawn from $\mathcal{N}(0, I)$ lies in $[-3, 3] \times [-3, 3]$ with probability more than 95%, the regularization term forces the encoder of the VAE to map most of the MNIST images to latent codes in $[-3, 3] \times [-3, 3]$.
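As a concrete reading of (3), the following sketch computes the two terms for a mini-batch, assuming a PyTorch implementation with binary cross-entropy as CE and the closed-form KL divergence between a diagonal Gaussian and $\mathcal{N}(0, I)$; the function name and tensor shapes are assumptions, not necessarily those of the provided Colab code:

    import torch
    import torch.nn.functional as F

    def vae_loss(x, x_hat, mu, log_var, lam=1.0):
        # Objective (3) for one mini-batch: reconstruction error + lambda * KL term.
        #   x, x_hat    : original and reconstructed images, shape (batch, 784), values in [0, 1]
        #   mu, log_var : encoder outputs, shape (batch, 2)
        #   lam         : the weight lambda in (3)
        recon = F.binary_cross_entropy(x_hat, x, reduction="sum")       # CE(x; g(z))
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())  # KL(N(mu, sigma^2); N(0, I))
        return (recon + lam * kl) / x.size(0)

Setting lam to 0 drops the regularization term entirely, which is the setting compared against λ = 1 in question (d) below.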
The following Colab notebook is an implementation of the VAE:
https://colab.research.google.com/drive/14EoXteAbyYA6PfM3ab5pq61GWhLrYny?usp=sharing
In order to change λ, you may modify the corresponding part of the code (highlighted with a yellow box in the notebook).
If you cannot see the result within 30 minutes, check whether you have enabled the GPU option in Colab.
(a) [10pt] Explain how a Generative Adversarial Network (GAN) generates new images
and describe the potential failure modes that may arise during this process (using
precise terminology).
(b) [5pt] Explain why a Variational Autoencoder (VAE) outperforms a standard Autoen-
coder (AE) in image generation, from the perspective of its modified loss function.
(c) [5pt] Train your own VAE from the Colab link above. Describe the procedure for generating plausible images of handwritten digits from the trained VAE.
(d) [10pt] Setting λ to 0 and to 1 in (3), i.e., λ = 0 and λ = 1, attach the generated images for latent codes $z \in [-3, 3] \times [-3, 3]$ for each λ. Explain which of the images, generated with the different values of λ, appear more plausible, and justify your answer.
(e) [10pt] If λ is set to a very large value (e.g., 1000), will performance improve? Justify
your answer.