Refs:

+ Ulrike von Luxburg (2007), A Tutorial on Spectral Clustering, Statistics and Computing 17(4), 395-416.

The goal of spectral clustering is to cluster data that is connected but not necessarily clustered within convex boundaries.

Let’s generate a dataset that is nasty for standard clustering:

# install.packages("mlbench")
library(mlbench)

set.seed(111)
obj <- mlbench.spirals(100,1,0.025)  # 100 points on two entangled spirals, with a bit of noise
my.data <-  4 * obj$x                # rescale the coordinates
plot(my.data)

Spectral clustering needs a similarity or affinity measure \(s(x,y)\) that determines how close points \(x\) and \(y\) are to each other.

Let’s denote by \(S\) the similarity matrix whose entries \(S_{ij} = s(x_i,x_j)\) give the similarity between observations \(x_i\) and \(x_j\).

Common similarity measures are:

+ Euclidean distance (negated, squared): \(s(x_i,x_j) = -\|x_i - x_j\|^2\)
+ Gaussian kernel: \(s(x_i,x_j) = \exp(-\alpha \|x_i - x_j\|^2)\)

With the Gaussian kernel, if two points are close then \(S_{ij} \approx 1\), and if they are far apart then \(S_{ij} \approx 0\). This measure can also have a hard cut-off at a certain distance \(r\), i.e., \(\|x_i - x_j\|^2 \geq r \implies S_{ij} = 0\).
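
As a minimal sketch of this hard-cut variant (the function name s.cut and the cut-off value r=4 are illustrative choices, not used in the rest of this post):

# truncated Gaussian kernel: similarities are zeroed beyond squared distance r
s.cut <- function(x1, x2, alpha=1, r=4) {
  d2 <- sum((x1 - x2)^2)                # squared Euclidean distance
  if (d2 >= r) 0 else exp(-alpha * d2)
}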

We expect points from different clusters to be far apart. But two points from the same cluster may also be far apart; in that case there should be a sequence of points from the same cluster forming a “path” between them.

Let’s compute \(S\) for this dataset using the Gaussian kernel:

s <- function(x1, x2, alpha=1) {
  # Gaussian-type kernel; note that it uses the (unsquared) Euclidean norm
  exp(- alpha * norm(as.matrix(x1-x2), type="F"))
}

make.similarity <- function(my.data, similarity) {
  N <- nrow(my.data)
  S <- matrix(rep(NA,N^2), ncol=N)
  for(i in 1:N) {
    for(j in 1:N) {
      S[i,j] <- similarity(my.data[i,], my.data[j,])
    }
  }
  S
}

S <- make.similarity(my.data, s)
S[1:8,1:8]
##            [,1]        [,2]       [,3]       [,4]        [,5]        [,6]
## [1,] 1.00000000 0.064179185 0.74290158 0.63193426 0.098831073 0.094897744
## [2,] 0.06417919 1.000000000 0.06938066 0.04276431 0.214229495 0.275123731
## [3,] 0.74290158 0.069380663 1.00000000 0.61054893 0.089569089 0.088641808
## [4,] 0.63193426 0.042764307 0.61054893 1.00000000 0.062517586 0.059982837
## [5,] 0.09883107 0.214229495 0.08956909 0.06251759 1.000000000 0.776556494
## [6,] 0.09489774 0.275123731 0.08864181 0.05998284 0.776556494 1.000000000
## [7,] 0.56549848 0.048335201 0.66577557 0.71959220 0.059731778 0.059016049
## [8,] 0.03355084 0.008796359 0.04342047 0.04426067 0.005091154 0.005548028
##            [,7]        [,8]
## [1,] 0.56549848 0.033550836
## [2,] 0.04833520 0.008796359
## [3,] 0.66577557 0.043420466
## [4,] 0.71959220 0.044260673
## [5,] 0.05973178 0.005091154
## [6,] 0.05901605 0.005548028
## [7,] 1.00000000 0.058059785
## [8,] 0.05805979 1.000000000
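
A quick sanity check (not needed by the algorithm): since \(s(x,x)=\exp(0)=1\) and the kernel is symmetric in its arguments, \(S\) should have ones on its diagonal and be a symmetric matrix:

all(diag(S) == 1)  # should be TRUE
isSymmetric(S)     # should be TRUE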

The next step is to compute an affinity matrix \(A\) based on \(S\). \(A\) must be made of non-negative values and be symmetric.

This is usually done by applying a k-nearest-neighbor filter, building the graph that connects only the closest points of the dataset. To keep the matrix symmetric, if \(x_j\) is among the nearest neighbors of \(x_i\), then both \(A_{ij}\) and \(A_{ji}\) are kept.

Let’s compute one:

make.affinity <- function(S, n.neighboors=2) {
  N <- length(S[,1])

  if (n.neighboors >= N) {  # fully connected
    A <- S
  } else {
    A <- matrix(rep(0,N^2), ncol=N)
    for(i in 1:N) { # for each row
      # connect only to the points with the largest similarities
      best.similarities <- sort(S[i,], decreasing=TRUE)[1:n.neighboors]
      for (s in best.similarities) {
        j <- which(S[i,] == s)
        A[i,j] <- S[i,j]
        A[j,i] <- S[i,j] # to make an undirected graph, i.e., the matrix becomes symmetric
      }
    }
  }
  A  
}

A <- make.affinity(S, 3)  # use 3 neighbors (the point itself counts as one)
A[1:8,1:8]
##           [,1] [,2]      [,3]      [,4]      [,5]      [,6]      [,7] [,8]
## [1,] 1.0000000    0 0.7429016 0.6319343 0.0000000 0.0000000 0.0000000    0
## [2,] 0.0000000    1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000    0
## [3,] 0.7429016    0 1.0000000 0.0000000 0.0000000 0.0000000 0.6657756    0
## [4,] 0.6319343    0 0.0000000 1.0000000 0.0000000 0.0000000 0.7195922    0
## [5,] 0.0000000    0 0.0000000 0.0000000 1.0000000 0.7765565 0.0000000    0
## [6,] 0.0000000    0 0.0000000 0.0000000 0.7765565 1.0000000 0.0000000    0
## [7,] 0.0000000    0 0.6657756 0.7195922 0.0000000 0.0000000 1.0000000    0
## [8,] 0.0000000    0 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000    1

With this affinity matrix, clustering becomes a graph-partition problem in which connected graph components are interpreted as clusters. The graph must be partitioned such that edges connecting different clusters have low weights, while edges within the same cluster have high weights. Spectral clustering tries to construct this type of graph.
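
To make the “connected components as clusters” intuition concrete, here is a short check using the igraph package (an extra dependency assumed only for this check): build the graph encoded by \(A\) and count its connected components.

# install.packages("igraph")
library(igraph)

g <- graph_from_adjacency_matrix(A, mode="undirected", weighted=TRUE, diag=FALSE)
components(g)$no                              # number of connected components (2 here)
table(components(g)$membership, obj$classes)  # compare components with the true spirals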

We also need a degree matrix \(D\), where each diagonal value is the degree of the respective vertex (i.e., the sum of its row in \(A\)) and all other entries are zero:

D <- diag(apply(A, 1, sum)) # sum rows
D[1:8,1:8]
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
## [1,] 2.374836 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [2,] 0.000000 2.597451 0.000000 0.000000 0.000000 0.000000 0.000000
## [3,] 0.000000 0.000000 2.408677 0.000000 0.000000 0.000000 0.000000
## [4,] 0.000000 0.000000 0.000000 2.351526 0.000000 0.000000 0.000000
## [5,] 0.000000 0.000000 0.000000 0.000000 2.523175 0.000000 0.000000
## [6,] 0.000000 0.000000 0.000000 0.000000 0.000000 2.519936 0.000000
## [7,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.170424
## [8,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
##          [,8]
## [1,] 0.000000
## [2,] 0.000000
## [3,] 0.000000
## [4,] 0.000000
## [5,] 0.000000
## [6,] 0.000000
## [7,] 0.000000
## [8,] 2.302241

Next we compute the unnormalized graph Laplacian, \(U = D - A\), and/or a normalized version \(L\). The normalized Laplacian has several variants; two common ones are the symmetric version \(L_{sym} = I - D^{-1/2} A D^{-1/2}\) and the random-walk version \(L_{rw} = I - D^{-1} A\).

Herein we use the unnormalized \(U\) (for other algorithms check Luxburg’s tutorial).

U <- D - A
round(U[1:12,1:12],1)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
##  [1,]  1.4  0.0 -0.7 -0.6  0.0  0.0  0.0  0.0  0.0   0.0   0.0   0.0
##  [2,]  0.0  1.6  0.0  0.0  0.0  0.0  0.0  0.0  0.0   0.0   0.0   0.0
##  [3,] -0.7  0.0  1.4  0.0  0.0  0.0 -0.7  0.0  0.0   0.0   0.0   0.0
##  [4,] -0.6  0.0  0.0  1.4  0.0  0.0 -0.7  0.0  0.0   0.0   0.0   0.0
##  [5,]  0.0  0.0  0.0  0.0  1.5 -0.8  0.0  0.0  0.0   0.0   0.0   0.0
##  [6,]  0.0  0.0  0.0  0.0 -0.8  1.5  0.0  0.0  0.0   0.0   0.0   0.0
##  [7,]  0.0  0.0 -0.7 -0.7  0.0  0.0  2.2  0.0  0.0  -0.8   0.0   0.0
##  [8,]  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.3  0.0   0.0   0.0   0.0
##  [9,]  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.5   0.0   0.0   0.0
## [10,]  0.0  0.0  0.0  0.0  0.0  0.0 -0.8  0.0  0.0   1.6  -0.8   0.0
## [11,]  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  -0.8   1.5  -0.8
## [12,]  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   0.0  -0.8   1.5
# L <- diag(nrow(my.data)) - solve(D) %*% A  # random-walk normalized Laplacian I - D^{-1} A
# round(L[1:12,1:12],1)
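
One well-known property of the unnormalized Laplacian, visible in the matrix above, is that every row sums to zero (the degree on the diagonal cancels the off-diagonal affinities). A quick numerical check:

max(abs(rowSums(U)))  # ~0, up to floating-point error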

# matrix power operator: computes M^power (M must be diagonalizable)
"%^%" <- function(M, power)
  with(eigen(M), vectors %*% (values^power * solve(vectors)))

# L <- (D %^% (-1/2)) %*% A %*% (D %^% (-1/2))  # D^{-1/2} A D^{-1/2}; the symmetric normalized Laplacian is I minus this
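
As a quick, purely illustrative check of the operator defined above: raising \(D\) to the power \(-1\) should reproduce its ordinary inverse.

max(abs((D %^% (-1)) - solve(D)))  # ~0, up to floating-point error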

Assuming we want \(k\) clusters, the next step is to find the eigenvectors associated with the \(k\) smallest eigenvalues of \(U\). Since eigen() returns eigenvalues in decreasing order, these correspond to the last \(k\) columns of the eigenvector matrix:

k   <- 2
evL <- eigen(U, symmetric=TRUE)
Z   <- evL$vectors[,(ncol(evL$vectors)-k+1):ncol(evL$vectors)]

The \(i^{th}\) row of \(Z\) gives the transformed coordinates of observation \(x_i\). Let’s check that the transformed observations are well separated:

plot(Z, col=obj$classes, pch=20) # notice that the 50 points of each cluster lie on top of each other
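
Since the eigenvectors of \(U\) with (near-)zero eigenvalues are constant on each connected component, the rows of \(Z\) collapse onto essentially two distinct points; a quick check of that claim:

nrow(unique(round(Z, 4)))  # should be 2 (up to numerical precision)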

So, in this transformed space it becomes easy for standard k-means clustering to find the appropriate clusters:

library(stats)
km <- kmeans(Z, centers=k, nstart=5)
plot(my.data, col=km$cluster)
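
As a quick sanity check (cluster numbers may be permuted with respect to the true labels):

table(km$cluster, obj$classes)  # each k-means cluster should match one spiral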

If we don’t know how many clusters there are, the eigenvalue spectrum has a gap that gives us the value of \(k\):

signif(evL$values,2) # eigenvalues are in decreasing order
##   [1] 3.3e+00 3.3e+00 3.0e+00 3.0e+00 2.9e+00 2.9e+00 2.8e+00 2.8e+00
##   [9] 2.7e+00 2.7e+00 2.7e+00 2.7e+00 2.6e+00 2.6e+00 2.6e+00 2.5e+00
##  [17] 2.5e+00 2.5e+00 2.5e+00 2.5e+00 2.4e+00 2.4e+00 2.4e+00 2.3e+00
##  [25] 2.3e+00 2.3e+00 2.3e+00 2.3e+00 2.2e+00 2.2e+00 2.2e+00 2.1e+00
##  [33] 2.1e+00 2.0e+00 2.0e+00 2.0e+00 2.0e+00 1.9e+00 1.9e+00 1.8e+00
##  [41] 1.8e+00 1.7e+00 1.7e+00 1.7e+00 1.6e+00 1.6e+00 1.6e+00 1.5e+00
##  [49] 1.5e+00 1.4e+00 1.4e+00 1.4e+00 1.3e+00 1.3e+00 1.2e+00 1.2e+00
##  [57] 1.1e+00 1.1e+00 1.1e+00 1.0e+00 9.7e-01 9.4e-01 8.8e-01 8.6e-01
##  [65] 7.9e-01 7.7e-01 7.1e-01 6.9e-01 6.2e-01 6.1e-01 5.5e-01 5.3e-01
##  [73] 4.7e-01 4.6e-01 4.1e-01 4.0e-01 3.4e-01 3.3e-01 2.8e-01 2.8e-01
##  [81] 2.3e-01 2.2e-01 1.8e-01 1.8e-01 1.4e-01 1.3e-01 1.0e-01 9.9e-02
##  [89] 7.0e-02 6.9e-02 4.4e-02 4.4e-02 2.5e-02 2.5e-02 1.1e-02 1.1e-02
##  [97] 2.8e-03 2.7e-03 2.8e-16 1.1e-16
plot(1:10, rev(evL$values)[1:10], log="y")
abline(v=2.25, col="red", lty=2) # there are just 2 clusters as expected
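
In this particular example the affinity graph splits into separate connected components, so another way to read off \(k\) is to count the (numerically) zero eigenvalues of \(U\): their multiplicity equals the number of connected components.

sum(abs(evL$values) < 1e-10)  # 2 (numerically) zero eigenvalues, i.e., 2 components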

In practice this gap can be hard to find. The problem of converting the real-valued spectral solution into discrete cluster assignments (which includes choosing \(k\)) is known as rounding.

Using the R kernlab package

Of course all this is already implemented in R:

library(kernlab)

sc <- specc(my.data, centers=2)
plot(my.data, col=sc, pch=4)            # estimated classes (x)
points(my.data, col=obj$classes, pch=5) # true classes (<>)