This page contains source code relating to chapter 9 of Bishop’s Pattern Recognition and Machine Learning (2009).

This chapter is about Mixture Models and Expectation-Maximization.

This first function generates a dataset from a mixture of Gaussians:

# generates n datapoints from a mixture of K d-dimensional Gaussians
# k  : vector with the class of each datapoint
# mu : Kxd matrix of means (one row per component)
# sig: dxdxK array of dxd covariance matrices
gen.mix <- function(n, k, mu, sig) {
  library(MASS)

  d <- length(mu[1,])  # number of dimensions
  result <- matrix(rep(NA,n*d), ncol=d)
  colnames(result) <- paste0("X",1:d)
  
  for(i in 1:n) {
    result[i,] <- mvrnorm(1, mu = mu[k[i],], Sigma=sig[,,k[i]])
  }
  
  result
}

Here’s a dataset drawn from a mixture of three Gaussians:

set.seed(101)
n <- 360

mu <- matrix(c(14.0,4.0,
               15.0,5.0,
               16.5,5.0), ncol=2, byrow=T)

sigs <- array(rep(NA,2*2*3), c(2,2,3))  # 3D array: one 2x2 covariance matrix per Gaussian
sigs[,,1] <- matrix(c(.25, .21, .21,.25), nrow=2, byrow=TRUE)
sigs[,,2] <- matrix(c(.25,-.21,-.21,.25), nrow=2, byrow=TRUE)
sigs[,,3] <- matrix(c(.25, .21, .21,.25), nrow=2, byrow=TRUE)

pi      <- c(.2,.5,.3)                           # mixing coeffs
classes <- sample(1:3, n, replace=TRUE, prob=pi)

d <- gen.mix(n, classes, mu, sigs)
plot(d, col=c("red","green","blue")[classes], xlab="X1", ylab="X2", pch=19)

K-means clustering (section 9.1)

Given a dataset \(x=(x_1,\ldots,x_N)\) of \(D\)-dimensional datapoints and a number \(K\) of classes/clusters, we wish to assign each datapoint \(x_n\) to a cluster \(k\) such that points in the same cluster are closer to each other than to points in other clusters.

Each cluster \(k\) has a center \(\mu_k\) and the goal is to find an assignment for all datapoints that minimizes the sum of the distances of the datapoints to their respective cluster centers.

This can be described by the objective function

\[J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} \| x_n - \mu_k \|^2\]

to be minimized. \(r_{nk}\) is an indicator variable: for each datapoint \(x_n\), the vector \(r_n = (r_{n1},\ldots,r_{nK})\) uses a 1-of-\(K\) coding scheme, ie, it is all zeros except for a single one that marks the datapoint’s cluster. Eg, for \(K=4\), \(r_n = (0,0,1,0)\) means that \(x_n\) belongs to the third cluster.
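
Just to make the objective concrete, here is a small sketch that evaluates \(J\) for a given assignment (the names X, rnk and mu are hypothetical, matching the conventions of the k-means code below):

# J = sum_n sum_k r_nk * ||x_n - mu_k||^2
# X: Nxd data matrix, rnk: NxK indicator matrix, mu: Kxd matrix of centers
compute_J <- function(X, rnk, mu) {
  sum(sapply(1:ncol(rnk), function(k)
    sum(rnk[,k] * rowSums(sweep(X, 2, mu[k,])^2))))
}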

One such algorithm is k-means, which minimizes \(J\) by alternating two steps: first minimize \(J\) wrt the indicators \(r_{nk}\) keeping the centers fixed, and then, using these new assignments, minimize \(J\) wrt the centers \(\mu_k\).

K-means is guaranteed to converge, since each step decreases \(J\), but only to a local minimum: it does not guarantee a globally optimal solution.

k_means <- function(dataset, K, max_iter=100) {
  
  # get the dataset classification given the current indicators
  get_classes <- function(rnk)
    apply(rnk,1,function(row) which.max(row))
  
  d      <- ncol(dataset)                                # number of dimensions
  N      <- nrow(dataset)                                # number of samples
  ranges <- sapply(1:d, function (i) range(dataset[,i])) # the ranges for each dimension

  # generate K initial random cluster centers (each center is a row vector)
  mu <- t(replicate(K,sapply(1:d, function(i) runif(1,ranges[1,i], ranges[2,i]))))
  
  # indicators (each row consists of 0...1...0, ie, it's a 1-of-K coding scheme)
  rnk <- matrix(rep(0,K*N), ncol=K)
  old_classes <- get_classes(rnk)
  
  for(it in 1:max_iter) {
    
    # update indicators for each datapoint
    for(n in 1:N) {
      distances <- sapply(1:K, function(k) norm(as.matrix(dataset[n,]-mu[k,]),"F"))
      rnk[n,]   <- rep(0,K)
      rnk[n,which.min(distances)] <- 1
    }
    
    classes <- get_classes(rnk)
    if (all(old_classes == classes)) # convergence achieved?
      break
    else 
      old_classes <- classes
    
    # update centers given the updated indicators
    for(k in 1:K) {
      mu[k,]  <- rnk[,k] %*% dataset / sum(rnk[,k])
    }
  }
  
  list(mu=mu, pred=classes)
}

Let’s try the algorithm:

set.seed(101)
result <- k_means(d,3)  # set clustering to 3 classes

plot(d, col=c("red","green","blue")[result$pred], xlab="X1", ylab="X2", pch=19)
points(result$mu, pch=3, lwd=4)  # plot the centers
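
Since the initial centers are random, different runs may converge to different local minima of \(J\). A simple remedy (a sketch, not part of Bishop’s text) is to run k_means several times and keep the solution with the lowest \(J\):

# run several random restarts and keep the solution with the lowest J
restart_k_means <- function(dataset, K, n_starts=10) {
  best <- NULL; best_J <- Inf
  for (s in 1:n_starts) {
    res <- k_means(dataset, K)
    J   <- sum(rowSums((dataset - res$mu[res$pred, , drop=FALSE])^2))  # objective J
    if (J < best_J) { best_J <- J; best <- res }
  }
  best
}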

Mixture of Gaussians (section 9.2)

If we decide to model the previous dataset as a mixture of \(K\) Gaussians, then for a given datapoint \(x\):

\[p(x) = \sum_{k=1}^K p(K=k)\,p(x|K=k) = \sum_{k=1}^K \pi_k \mathcal{N}(x|\mu_k,\Sigma_k)\]

is a superposition of \(K\) Gaussians. Each density \(\mathcal{N}(x|\mu_k,\Sigma_k)\) is a component of the mixture with its own mean and covariance matrix. The parameters \(\pi_k\) are called the mixing coefficients, such that \(\sum_k \pi_k = 1\) and \(\pi_k \geq 0\) (\(\pi_k\) are probabilities).

An important quantity is \(\gamma_k(x) \equiv p(K=k|x)\), called the responsibility of Gaussian \(k\) for datapoint \(x\). By Bayes’ theorem,

\[\gamma_k(x) \equiv p(K=k|x) = \frac{p(K=k)p(x|K=k)}{\sum_i p(K=i)p(x|K=i)} = \frac{\pi_k \mathcal{N}(x|\mu_k,\Sigma_k)}{\sum_i \pi_i \mathcal{N}(x|\mu_i,\Sigma_i)}\]
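
This expression translates directly into R using dmvnorm from the mvtnorm package (a sketch; here evaluated with the true generating parameters pi, mu and sigs defined above):

library(mvtnorm)

# responsibilities gamma_k(x) for every datapoint in X
# pi: mixing coefficients, mu: Kxd means, sigs: dxdxK covariance matrices
responsibilities <- function(X, pi, mu, sigs) {
  dens <- sapply(1:length(pi), function(k) pi[k] * dmvnorm(X, mu[k,], sigs[,,k]))  # NxK
  dens / rowSums(dens)  # normalize each row so it sums to one
}

head(round(responsibilities(d, pi, mu, sigs), 3))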

The parameters of this model are \(\pi \equiv \{\pi_1,\ldots,\pi_K\}\), \(\mu \equiv \{ \mu_1,\ldots,\mu_K \}\) and \(\Sigma \equiv \{\Sigma_1,\ldots,\Sigma_K\}\).

The log-likelihood of the dataset \(X \equiv \{x_1,\ldots,x_N\}\) given the parameters is

\[\log p(X|\pi,\mu,\Sigma) = \sum_{n=1}^N \log \left\{ \sum_{k=1}^K \pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k) \right\}\]
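
This log-likelihood is easy to evaluate numerically (a sketch, reusing dmvnorm as above); it is a useful quantity to monitor when running EM later:

# incomplete-data log-likelihood log p(X | pi, mu, Sigma)
log_lik_mix <- function(X, pi, mu, sigs) {
  dens <- sapply(1:length(pi), function(k) pi[k] * dmvnorm(X, mu[k,], sigs[,,k]))  # NxK
  sum(log(rowSums(dens)))
}

log_lik_mix(d, pi, mu, sigs)  # value under the true parameters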

Notice that there is no closed-form analytic solution for the MLE (in fact, Bishop notes that direct maximization can run into singularities when a Gaussian component collapses onto a single datapoint and its variance goes to zero, pgs. 433-435). This is a good example where the Expectation-Maximization (EM) algorithm can find a numerical solution.

EM for Gaussian Mixtures (section 9.2.2)

We have an incomplete dataset \(X\) with no information about which density produced each datapoint. Let’s call the parameters \(\theta = \{\pi,\mu,\Sigma\}\).

Using our previous dataset as an eg:

plot(d, col="black", xlab="X1", ylab="X2", pch=19)

We want to find adequate parameter values \(\theta\) for this model.

So, we introduce a K-dimensional binary random variable \(Z\), where each realization \(z\) is a vector of zeros except for one coordinate that is one (eg for \(K=5\), \(z=(0,0,0,1,0)\)). So, there are \(K\) possible different states of \(z\). Let’s call \(z_k\) the k-th coordinate of \(z\).

The marginal distribution is specified in terms of the mixing coefficients \[p(z_k=1) = \pi_k \iff p(z) = \prod_{k=1}^K \pi_k^{z_k}\] since all the other coordinates of \(z\) are zero.

The conditional distribution of \(x\) given a value \(z\) is \[p(x|z_k=1) = \mathcal{N}(x|\mu_k,\Sigma_k)\] which can be also stated as \[p(x|z) = \prod_{k=1}^K \mathcal{N}(x|\mu_k,\Sigma_k)^{z_k}\]

With these two distributions we can compute the joint distribution \(p(x,z)\)

\[p(x,z) = p(x|z)p(z) = \prod_{k=1}^K ( \pi_k \mathcal{N}(x|\mu_k,\Sigma_k) )^{z_k}\]

The marginal distribution of \(x\) becomes

\[p(x) = \sum_z p(z)p(x|z) = \sum_{k=1}^K \pi_k \mathcal{N}(x|\mu_k,\Sigma_k)\]

which corresponds to the Gaussian mixture model from the previous section. This means that this new model with latent variables \(z\) is equivalent to the original model.
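
In fact, this is exactly how the dataset was generated at the top of the page: first draw \(z\) from \(p(z)\), then draw \(x\) from the Gaussian selected by \(z\). A sketch of this ancestral sampling (equivalent to the sample() plus gen.mix() calls above):

# ancestral sampling: z ~ Categorical(pi), then x | z ~ N(mu_z, Sigma_z)
sample_mixture <- function(n, pi, mu, sigs) {
  z <- sample(1:length(pi), n, replace=TRUE, prob=pi)                 # latent classes
  x <- t(sapply(z, function(k) MASS::mvrnorm(1, mu[k,], sigs[,,k])))  # observations
  list(x=x, z=z)
}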

The advantage is that, while the original model has no closed-form MLE, this latent-variable formulation can be handled by the EM algorithm!

The likelihood of the complete dataset \(\{X,Z\}\) is

\[p(X,Z|\theta) = \prod_{n=1}^N \prod_{k=1}^K ( \pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k) )^{z_{nk}}\]

so the log-likelihood:

\[\log p(X,Z|\theta) = \sum_{n=1}^N \sum_{k=1}^K z_{nk} \left\{ \log \pi_k + \log \mathcal{N}(x_n|\mu_k,\Sigma_k) \right\}\] where \(z_{nk}\) is the \(k\)-th coordinate of the latent vector \(z_n\) of datapoint \(x_n\). Compare with the previous log-likelihood of the incomplete dataset \(X\); this one is much simpler to maximize.
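
Since we generated the dataset ourselves, we actually know the latent assignments, so we can evaluate this complete-data log-likelihood directly (a sketch, using the classes vector from the data generation above and dmvnorm from mvtnorm, as before):

# complete-data log-likelihood log p(X,Z|theta) for known assignments z
log_lik_complete <- function(X, z, pi, mu, sigs) {
  sum(sapply(1:nrow(X), function(n)
    log(pi[z[n]]) + dmvnorm(X[n,], mu[z[n],], sigs[,,z[n]], log=TRUE)))
}

log_lik_complete(d, classes, pi, mu, sigs)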

Using the formulas for marginal \(p(z)\) and the conditional \(p(x|z)\), and Bayes theorem, we can find the posterior distribution

\[p(Z|X,\theta) \propto \prod_{n=1}^N \prod_{k=1}^K ( \pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k) )^{z_{nk}}\]

The E-step of EM computes the expected value of the complete-data log-likelihood under this posterior,

\[E_{Z|X,\theta} [ \log p(X,Z|\theta) ] = \sum_{n=1}^N \sum_{k=1}^K \gamma_k(x_n) \left\{ \log \pi_k + \log \mathcal{N}(x_n|\mu_k,\Sigma_k) \right\}\]

For that we need to compute each responsibility \(\gamma_k(x_n)\) using the current parameter values \(\theta\). Recall the expression:

\[\gamma_k(x) \equiv p(K=k|x) = \frac{p(K=k)p(x|K=k)}{\sum_i p(K=i)p(x|K=i)} = \frac{\pi_k \mathcal{N}(x|\mu_k,\Sigma_k)}{\sum_i \pi_i \mathcal{N}(x|\mu_i,\Sigma_i)}\]

The M-step fixes the responsibilities and maximizes the previous expected value wrt \(\theta\). These updates have closed forms (we do not need to compute the expected value explicitly):

  • \(\mu_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_k(x_n) x_n\)

  • \(\Sigma_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_k(x_n) (x_n - \mu_k)(x_n - \mu_k)^T\)

  • \(\pi_k = \frac{N_k}{N}\)

  • \(N_k = \sum_{n=1}^N \gamma_k(x_n)\) is the effective number of points assigned to cluster \(k\)

After each iteration (E-step plus M-step) we check for convergence (below, by how much the responsibilities changed), and if not converged, run another iteration.
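
For reference, the M-step updates can also be written in vectorized R. The following sketch (with hypothetical arguments X and gammak) is equivalent to the loops inside the em_gaussian_mix function below:

# vectorized M-step: given the NxK responsibility matrix gammak, return
# the updated mixing coefficients, means and covariance matrices
m_step <- function(X, gammak) {
  N <- nrow(X); K <- ncol(gammak); D <- ncol(X)
  Nk  <- colSums(gammak)          # effective number of points per cluster
  pik <- Nk / N
  muk <- (t(gammak) %*% X) / Nk   # KxD matrix of means
  Sigmas <- array(NA, c(D,D,K))
  for (k in 1:K) {
    Xc <- sweep(X, 2, muk[k,])    # center the data around mu_k
    Sigmas[,,k] <- crossprod(Xc * gammak[,k], Xc) / Nk[k]
  }
  list(pik=pik, muk=muk, Sigmas=Sigmas)
}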

library(mvtnorm)

em_gaussian_mix <- function(dataset, K, max_iter=100, epsilon=1e-3) {

  # get the dataset classification given the current indicators
  get_classes <- function(gammak)
    apply(gammak,1,function(row) which.max(row))
  
  d      <- ncol(dataset)                                # number of dimensions
  N      <- nrow(dataset)                                # number of samples
  ranges <- sapply(1:d, function (i) range(dataset[,i])) # the ranges for each dimension
  
  # initial values
  pik <- rep(1/K,K)
  muk <- t(replicate(K,sapply(1:d, function(i) runif(1,ranges[1,i], ranges[2,i]))))
  Sigmas <- array(NA, c(d,d,K))       # one dxd covariance matrix per component
  for (k in 1:K)
    Sigmas[,,k] <- diag(d)
  gammak <- matrix(rep(0,K*N),ncol=K) # the responsibilities
  old_gammak <- gammak

  # EM steps
  for(it in 1:max_iter) {
    
    # Expectation step: compute responsibilities
    
    for (k in 1:K) {
      gammak[,k] <- apply(dataset, 1, 
                            function(xi) {
                               pik[k] * dmvnorm(xi,muk[k,], Sigmas[,,k])
                            })
    }
    gammak <- t(apply(gammak, 1, function(row) row/sum(row)))
    
    if (sum(abs(gammak - old_gammak)) < epsilon) # convergence achieved?
      break
    else 
      old_gammak <- gammak
    
    # Maximization step: maximize the expected value wrt parameters theta
      
    Nk  <- sapply(1:K, function (k) sum(gammak[,k]))
    pik <- Nk/N
    for (k in 1:K) {
      muk[k,]     <- apply(gammak[,k] * dataset,2,sum) / Nk[k]
      Sigmas[,,k] <- diag(d) * 0 # reset
      for(n in 1:N) {
        Sigmas[,,k] <- Sigmas[,,k] + 
                       gammak[n,k]* (dataset[n,]-muk[k,])%*%t(dataset[n,]-muk[k,])
      }
      Sigmas[,,k] <- Sigmas[,,k] / Nk[k]  
    }
  }
  
  list(mu=muk, Sigmas=Sigmas, gammak=gammak, pred=get_classes(gammak))
}

Let’s try the algorithm. In this first plot we assign each datapoint to the class with the highest responsibility:

set.seed(101)
result <- em_gaussian_mix(d,3)  # set clustering to 3 classes

plot(d, col=c("red","green","blue")[result$pred], xlab="X1", ylab="X2", pch=19)

In the next plot, each dot’s color is a combination of the three responsibilities using the RGB scheme. Eg, a point shared between the second and third Gaussians will appear in shades of green and blue, and so on…

plot(d, col=rgb(result$gammak[,1], result$gammak[,2], result$gammak[,3]), xlab="X1", ylab="X2", pch=19)
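
As with the k-means result, we can overlay the estimated component means on this plot:

points(result$mu, pch=3, lwd=4)  # plot the estimated means of the mixture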