This R code document contains code for implementing the Expectation-Maximization (EM) algorithm for Gaussian mixture models with 1, 2, and 3 clusters of data. The code includes functions for the EM steps, starting values, and plotting the results. It applies the EM algorithm to real datasets with 1 and 2 dimensions and to a simulated 3 cluster dataset.
Presentation introduces R code for the Expectation-Maximization (EM) algorithm for Gaussian mixtures, covering 1D, 2D data, and multiple clusters.
Explains the iterative EM algorithm process involving E-steps and M-steps to find maximum likelihood estimates, along with convergence detection using log-likelihood.
Presents an application of EM algorithm on the 1D Faithful dataset, including data loading, histogram plotting, log-likelihood function, and two EM iterations.
Further iterations of the EM algorithm on the 1D Faithful dataset are conducted; includes plotting of distributions using histogram and fitted curve for convergence checks.
Describes the application of the EM algorithm on the 2D structure of the Faithful dataset, involving data preparation, plotting eruption times and waiting times.
Develops further on 2D Gaussian mixtures using EM algorithm with multiple iterations (2, 5, 10, 20), demonstrating the fitting of distributions through contour plots.
Introduces simulated data for three clusters and extends the EM algorithm, detailing likelihood calculation and visualizing results for multiple iterations.
R Code
For
Expectation-Maximization (EM)
Algorithmfor Gaussian Mixtures
Avjinder Singh Kaler
This is the R code for EM algorithm. Here, R code is used for 1D, 2D and 3 clusters dataset.
One can modify this code and use for his own project.
2.
Expectation-Maximization (EM) isan iterative
algorithm for finding maximum likelihood estimates of
parameters in statistical models, where the model depends
on unobserved latent variables. The EM iteration alternates
between performing an expectation (E) step, which creates a
function for the expectation of the log-likelihood evaluated
using the current estimate for the parameters, and a
maximization (M) step, which computes parameters
maximizing the expected log-likelihood found on the E step.
Iterate these steps until convergence detect. In simple
language, each iteration consists of an E-step and an M-step.
Convergence is generally detected by computing the value of
the log-likelihood after each iteration and halting when it
appears not to be changing in a significant manner from one
iteration to the next.
3.
#FaithFUL Dataset 1D
#
#loaddata
#
data(faithful)
Y<-faithful$waiting
#
#plot histogram of waiting times
#
postscript("geyser.eps",width=5,height=3,onefile=F)
par(mar=c(2,2,1,1),mgp=c(1,0.5,0),cex=0.7,lwd=0.5,las=1)
hist(Y,breaks=seq(40,100,3),xlab="Waiting time between eruptions (min)",
xlim=c(40,100),main="",axes=F)
axis(1,seq(40,100,6),pos=0)
axis(2,seq(0,35,5),pos=40,las=1)
dev.off()
#
#log-likelihood function
#
ln<-function(p,Y) {
-sum(log(p[1]*dnorm(Y,p[2],sqrt(p[4]))+(1-p[1])*dnorm(Y,p[3],sqrt(p[5]))))
}
#
#EM algorithm
#
emstep<-function(Y,p) {
EZ<-p[1]*dnorm(Y,p[2],sqrt(p[4]))/
(p[1]*dnorm(Y,p[2],sqrt(p[4]))
+(1-p[1])*dnorm(Y,p[3],sqrt(p[5])))
p[1]<-mean(EZ)
p[2]<-sum(EZ*Y)/sum(EZ)
p[3]<-sum((1-EZ)*Y)/sum(1-EZ)
p[4]<-sum(EZ*(Y-p[2])^2)/sum(EZ)
p[5]<-sum((1-EZ)*(Y-p[3])^2)/sum(1-EZ)
p
}
emiteration<-function(Y,p,n=10) {
for (i in (1:n)) {
p<-emstep(Y,p)
}
p
}
#
#starting values
#
p<-c(0.5,40,90,16,16)
#
#2 iterations of EM algorithm
#
p<-emiteration(Y,p,2)
p
#check for convergence
p<-emstep(Y,p)
p
#
#plot histogram with fitted distribution
#
hist(Y,breaks=seq(40,100,3),xlab="Waiting time between eruptions (min)",
xlim=c(40,100),
main="",axes=F)
axis(1,seq(40,100,6),pos=0)
4.
axis(2,seq(0,35,5),pos=40,las=1)
x<-seq(40,100,0.1)
y<-p[1]*dnorm(x,p[2],sqrt(p[4]))+(1-p[1])*dnorm(x,p[3],sqrt(p[5]))
lines(x,y*3*length(Y))
#5 iterations ofEM algorithm
#log-likelihood function
#
ln<-function(p,Y) {
-sum(log(p[1]*dnorm(Y,p[2],sqrt(p[4]))+(1-p[1])*dnorm(Y,p[3],sqrt(p[5]))))
}
#
#EM algorithm
#
emstep<-function(Y,p) {
EZ<-p[1]*dnorm(Y,p[2],sqrt(p[4]))/
(p[1]*dnorm(Y,p[2],sqrt(p[4]))
+(1-p[1])*dnorm(Y,p[3],sqrt(p[5])))
p[1]<-mean(EZ)
p[2]<-sum(EZ*Y)/sum(EZ)
p[3]<-sum((1-EZ)*Y)/sum(1-EZ)
p[4]<-sum(EZ*(Y-p[2])^2)/sum(EZ)
p[5]<-sum((1-EZ)*(Y-p[3])^2)/sum(1-EZ)
p
}
emiteration<-function(Y,p,n=10) {
for (i in (1:n)) {
p<-emstep(Y,p)
}
p
}
#
#starting values
#
p<-c(0.5,40,90,16,16)
#
#5 iterations of EM algorithm
#
p<-emiteration(Y,p,5)
p
#check for convergence
p<-emstep(Y,p)
p
#
#plot histogram with fitted distribution
#
hist(Y,breaks=seq(40,100,3),xlab="Waiting time between eruptions (min)",
xlim=c(40,100),
main="",axes=F)
axis(1,seq(40,100,6),pos=0)
axis(2,seq(0,35,5),pos=40,las=1)
x<-seq(40,100,0.1)
y<-p[1]*dnorm(x,p[2],sqrt(p[4]))+(1-p[1])*dnorm(x,p[3],sqrt(p[5]))
lines(x,y*3*length(Y))
#10 iterations of EM algorithm
#log-likelihood function
#
ln<-function(p,Y) {
-sum(log(p[1]*dnorm(Y,p[2],sqrt(p[4]))+(1-p[1])*dnorm(Y,p[3],sqrt(p[5]))))
}
#
#EM algorithm
#
emstep<-function(Y,p) {
EZ<-p[1]*dnorm(Y,p[2],sqrt(p[4]))/
(p[1]*dnorm(Y,p[2],sqrt(p[4]))
+(1-p[1])*dnorm(Y,p[3],sqrt(p[5])))
5.
p[1]<-mean(EZ)
p[2]<-sum(EZ*Y)/sum(EZ)
p[3]<-sum((1-EZ)*Y)/sum(1-EZ)
p[4]<-sum(EZ*(Y-p[2])^2)/sum(EZ)
p[5]<-sum((1-EZ)*(Y-p[3])^2)/sum(1-EZ)
p
}
emiteration<-function(Y,p,n=10) {
for (iin (1:n)) {
p<-emstep(Y,p)
}
p
}
#
#starting values
#
p<-c(0.5,40,90,16,16)
#
#10 iterations of EM algorithm
#
p<-emiteration(Y,p,10)
p
#check for convergence
p<-emstep(Y,p)
p
#
#plot histogram with fitted distribution
#
hist(Y,breaks=seq(40,100,3),xlab="Waiting time between eruptions (min)",
xlim=c(40,100),
main="",axes=F)
axis(1,seq(40,100,6),pos=0)
axis(2,seq(0,35,5),pos=40,las=1)
x<-seq(40,100,0.1)
y<-p[1]*dnorm(x,p[2],sqrt(p[4]))+(1-p[1])*dnorm(x,p[3],sqrt(p[5]))
lines(x,y*3*length(Y))
#Faithful Dataset 2D
#load data
#
#
#Bivariate mixture model
#
data(faithful)
Y<-faithful$waiting
X<-faithful$eruptions
plot(Y,X,xlab="Waiting time between eruptions (min)",
ylab="Eruption times (min)",
xlim=c(40,100),ylim=c(0,6),
main="",axes=F,cex=0.7)
axis(1,seq(40,100,6),pos=0)
axis(2,seq(0,6,1),pos=40,las=1)
#
#density of bivariate normal
#
dbinorm<-function(x,m,s){
1/sqrt(det(2*pi*s))*exp(-0.5*t(x-m)%*%solve(s)%*%(x-m))
}
f.hat<-function(x,p) {
p$p*dbinorm(x,p$m1,p$s1)+(1-p$p)*dbinorm(x,p$m2,p$s2)
}
Estep<-function(x,p) {
p$p*dbinorm(x,p$m1,p$s1)/