K-Means

  1. Pick an initial set of K centroids (randomly or by any other means).
  2. Assign each data point to the closest centroid according to the given distance function.
  3. Move each centroid to the mean of its assigned data points. Go back to (2) until the memberships no longer change and the centroid positions are stable.
  4. Output the centroids.
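
A minimal from-scratch sketch of these four steps, assuming Euclidean distance and the arithmetic mean (simple.kmeans is an illustrative name, not a package function):

simple.kmeans <- function(data, k, max.iter=100) {
  data       <- as.matrix(data)
  centroids  <- data[sample(nrow(data), k), , drop=FALSE] # (1) random initial centroids
  membership <- rep(0, nrow(data))
  for (round in 1:max.iter) {
    # (2) distance from every point to every centroid; assign to the closest
    d <- as.matrix(dist(rbind(centroids, data)))[-(1:k), 1:k]
    new.membership <- apply(d, 1, which.min)
    if (all(new.membership == membership)) break # memberships stable: stop
    membership <- new.membership
    # (3) move each centroid to the mean of its members
    # (empty clusters are not handled in this sketch)
    for (j in 1:k)
      centroids[j,] <- colMeans(data[membership==j, , drop=FALSE])
  }
  list(cluster=membership, centers=centroids)    # (4) output
}
# usage: simple.kmeans(iris[,1:4], 3)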

Notice that K-Means requires the definition of:

  + the distance function
  + the mean function
  + the number of centroids \(K\)

K-Means is \(O(nkr)\), where \(n\) is the number of points, \(r\) is the number of rounds and \(k\) the number of centroids.

The result of each run is non-deterministic, since it depends on the initial choice of centroids. The usual practice is to run K-Means multiple times and pick the result of the best run, i.e., the one that minimizes the average distance of each point to its assigned centroid.

library(stats) # kmeans() is in the stats package (loaded by default)
set.seed(101)
km <- kmeans(iris[,1:4], 3) # cluster the four numeric columns into 3 groups
plot(iris[,1], iris[,2], col=km$cluster)            # color points by assigned cluster
points(km$centers[,c(1,2)], col=1:3, pch=19, cex=2) # overlay the centroids

table(km$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1      0         48        14
##   2     50          0         0
##   3      0          2        36
# Another round:
set.seed(900)
km <- kmeans(iris[,1:4], 3)
plot(iris[,1], iris[,2], col=km$cluster)
points(km$centers[,c(1,2)], col=1:3, pch=19, cex=2)

table(predicted=km$cluster, true=iris$Species)
##          true
## predicted setosa versicolor virginica
##         1      0         46        50
##         2     17          4         0
##         3     33          0         0
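
The two runs above produce different partitions. Instead of looping by hand, kmeans() supports this directly: the nstart argument runs several random initializations and keeps the one with the smallest total within-cluster sum of squares:

km.best <- kmeans(iris[,1:4], 3, nstart=25) # 25 random starts, the best is kept
km.best$tot.withinss # total within-cluster sum of squares of the best run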

Hierarchical Clustering

In this approach, we compare all pairs of points/clusters and merge the pair with the smallest distance.

  1. Compute the distance between every pair of points/clusters.
     + The distance between two points is given directly by the distance function.
     + The distance between a point A and a cluster B involves a choice, such as the min/max/avg distance between A and the points in B.
     + The distance between a cluster A and a cluster B can be computed over all point pairs (one from A, one from B) by picking the min/max/avg of their distances; these choices map to hclust's method argument, as shown after this list.
  2. Merge the two closest points/clusters into one cluster. Go back to (1) until only one big cluster remains.
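
In R these linkage choices correspond to the method argument of hclust(): "single" (min), "complete" (max) and "average" (avg):

d.example   <- dist(iris[,1:4]) # pairwise Euclidean distances
hc.single   <- hclust(d.example, method="single")   # min of the pairwise distances
hc.complete <- hclust(d.example, method="complete") # max of the pairwise distances
hc.average  <- hclust(d.example, method="average")  # average of the pairwise distances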

In hierarchical clustering the complexity is at least \(O(n^2)\), since all pairwise distances must be computed (naive implementations of the merging loop are \(O(n^3)\)). The output is a tree of merging steps, and it doesn’t require us to specify \(K\) or a mean function. Given its high complexity, hierarchical clustering is typically used when the number of points is not too large.

m <- matrix(1:15,5,3)
dist(m) # computes the distances between the rows of m (since there are 3 columns, these are Euclidean distances between 3-D points)
##          1        2        3        4
## 2 1.732051                           
## 3 3.464102 1.732051                  
## 4 5.196152 3.464102 1.732051         
## 5 6.928203 5.196152 3.464102 1.732051
dist(m,method="manhattan") # using the Manhattan metric
##    1  2  3  4
## 2  3         
## 3  6  3      
## 4  9  6  3   
## 5 12  9  6  3
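
To verify what dist() computes, compare against the formulas directly:

sqrt(sum((m[1,]-m[2,])^2)) # 1.732051, the Euclidean entry for rows 1 and 2
sum(abs(m[1,]-m[2,]))      # 3, the Manhattan entry for rows 1 and 2
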
set.seed(101)
sampleiris <- iris[sample(1:150, 40),] # get samples from iris dataset
# each observation has 4 variables, ie, they are interpreted as 4-D points
distance   <- dist(sampleiris[,-5], method="euclidean") 
cluster    <- hclust(distance, method="average")
plot(cluster, hang=-1, labels=sampleiris$Species)

Other ways to present the information:

plot(as.dendrogram(cluster), edgePar=list(col="darkgreen", lwd=2), horiz=T) 

str(as.dendrogram(cluster)) # Prints dendrogram structure as text.
## --[dendrogram w/ 2 branches and 40 members at h = 3.95]
##   |--[dendrogram w/ 2 branches and 13 members at h = 0.819]
##   |  |--[dendrogram w/ 2 branches and 6 members at h = 0.426]
##   |  |  |--[dendrogram w/ 2 branches and 2 members at h = 0.141]
##   |  |  |  |--leaf "9" 
##   |  |  |  `--leaf "39" 
##   |  |  `--[dendrogram w/ 2 branches and 4 members at h = 0.28]
##   |  |     |--leaf "30" 
##   |  |     `--[dendrogram w/ 2 branches and 3 members at h = 0.244]
##   |  |        |--leaf "7" 
##   |  |        `--[dendrogram w/ 2 branches and 2 members at h = 0.141]
##   |  |           |--leaf "48" 
##   |  |           `--leaf "3" 
##   |  `--[dendrogram w/ 2 branches and 7 members at h = 0.574]
##   |     |--leaf "37" 
##   |     `--[dendrogram w/ 2 branches and 6 members at h = 0.563]
##   |        |--[dendrogram w/ 2 branches and 2 members at h = 0.387]
##   |        |  |--leaf "6" 
##   |        |  `--leaf "20" 
##   |        `--[dendrogram w/ 2 branches and 4 members at h = 0.452]
##   |           |--[dendrogram w/ 2 branches and 2 members at h = 0.265]
##   |           |  |--leaf "44" 
##   |           |  `--leaf "24" 
##   |           `--[dendrogram w/ 2 branches and 2 members at h = 0.3]
##   |              |--leaf "28" 
##   |              `--leaf "50" 
##   `--[dendrogram w/ 2 branches and 27 members at h = 2.63]
##      |--[dendrogram w/ 2 branches and 3 members at h = 0.912]
##      |  |--leaf "110" 
##      |  `--[dendrogram w/ 2 branches and 2 members at h = 0.819]
##      |     |--leaf "106" 
##      |     `--leaf "118" 
##      `--[dendrogram w/ 2 branches and 24 members at h = 1.6]
##         |--[dendrogram w/ 2 branches and 13 members at h = 1.03]
##         |  |--[dendrogram w/ 2 branches and 4 members at h = 0.752]
##         |  |  |--leaf "103" 
##         |  |  `--[dendrogram w/ 2 branches and 3 members at h = 0.494]
##         |  |     |--leaf "140" 
##         |  |     `--[dendrogram w/ 2 branches and 2 members at h = 0.361]
##         |  |        |--leaf "117" 
##         |  |        `--leaf "148" 
##         |  `--[dendrogram w/ 2 branches and 9 members at h = 1.01]
##         |     |--[dendrogram w/ 2 branches and 3 members at h = 0.557]
##         |     |  |--leaf "51" 
##         |     |  `--[dendrogram w/ 2 branches and 2 members at h = 0.374]
##         |     |     |--leaf "77" 
##         |     |     `--leaf "55" 
##         |     `--[dendrogram w/ 2 branches and 6 members at h = 0.769]
##         |        |--leaf "135" 
##         |        `--[dendrogram w/ 2 branches and 5 members at h = 0.482]
##         |           |--[dendrogram w/ 2 branches and 2 members at h = 0.361]
##         |           |  |--leaf "124" 
##         |           |  `--leaf "128" 
##         |           `--[dendrogram w/ 2 branches and 3 members at h = 0.361]
##         |              |--leaf "84" 
##         |              `--[dendrogram w/ 2 branches and 2 members at h = 0]
##         |                 |--leaf "102" 
##         |                 `--leaf "143" 
##         `--[dendrogram w/ 2 branches and 11 members at h = 1.3]
##            |--[dendrogram w/ 2 branches and 8 members at h = 0.632]
##            |  |--leaf "92" 
##            |  `--[dendrogram w/ 2 branches and 7 members at h = 0.559]
##            |     |--leaf "85" 
##            |     `--[dendrogram w/ 2 branches and 6 members at h = 0.445]
##            |        |--leaf "93" 
##            |        `--[dendrogram w/ 2 branches and 5 members at h = 0.408]
##            |           |--leaf "56" 
##            |           `--[dendrogram w/ 2 branches and 4 members at h = 0.345]
##            |              |--leaf "62" 
##            |              `--[dendrogram w/ 2 branches and 3 members at h = 0.198]
##            |                 |--leaf "89" 
##            |                 `--[dendrogram w/ 2 branches and 2 members at h = 0.141]
##            |                    |--leaf "97" 
##            |                    `--leaf "100" 
##            `--[dendrogram w/ 2 branches and 3 members at h = 0.858]
##               |--leaf "80" 
##               `--[dendrogram w/ 2 branches and 2 members at h = 0.721]
##                  |--leaf "99" 
##                  `--leaf "61"
cluster$labels[cluster$order] # Prints the row labels in the order they appear in the tree.
##  [1] "9"   "39"  "30"  "7"   "48"  "3"   "37"  "6"   "20"  "44"  "24" 
## [12] "28"  "50"  "110" "106" "118" "103" "140" "117" "148" "51"  "77" 
## [23] "55"  "135" "124" "128" "84"  "102" "143" "92"  "85"  "93"  "56" 
## [34] "62"  "89"  "97"  "100" "80"  "99"  "61"

It’s possible to prune the resulting tree. In the next examples we cut by the number of clusters:

par(mfrow=c(1,2))
group.3 <- cutree(cluster, k = 3)  # prune the tree by 3 clusters
table(group.3, sampleiris$Species) # compare with known classes
##        
## group.3 setosa versicolor virginica
##       1      0         15         9
##       2     13          0         0
##       3      0          0         3
plot(sampleiris[,c(1,2)], col=group.3, pch=19, cex=2.5, main="3 clusters")
points(sampleiris[,c(1,2)], col=sampleiris$Species, pch=19, cex=1)
group.6 <- cutree(cluster, k = 6)  # we can prune by more clusters
table(group.6, sampleiris$Species)
##        
## group.6 setosa versicolor virginica
##       1      0          8         0
##       2     13          0         0
##       3      0          0         3
##       4      0          4         5
##       5      0          3         0
##       6      0          0         4
plot(sampleiris[,c(1,2)], col=group.6, pch=19, cex=2.5, main="6 clusters")
points(sampleiris[,c(1,2)], col=sampleiris$Species, pch=19, cex=1) # the little points are the true classes

par(mfrow=c(1,1))

It is also possible to cut by the height of the original tree:

plot(cluster, hang=-1, labels=sampleiris$Species)
abline(h=0.9,lty=3,col="red")

height.0.9 <- cutree(cluster, h = 0.9)
table(height.0.9, sampleiris$Species) # compare with known classes
##           
## height.0.9 setosa versicolor virginica
##          1      0          8         0
##          2     13          0         0
##          3      0          0         2
##          4      0          3         0
##          5      0          1         5
##          6      0          3         0
##          7      0          0         1
##          8      0          0         4
plot(sampleiris[,c(1,2)], col=height.0.9, pch=19, cex=2.5, main="8 clusters")
points(sampleiris[,c(1,2)], col=sampleiris$Species, pch=19, cex=1)

And what if we don’t know the number of clusters?

# Calculate the dissimilarity between observations using the Euclidean distance.
# Note: iris still includes the Species factor, which dist() coerces to NA
# (hence the warning below); effectively only the four numeric columns are used
# (rescaled to compensate for the NA column). dist(iris[,-5]) avoids the warning.
dist.iris <- dist(iris, method="euclidean")
## Warning in dist(iris, method = "euclidean"): NAs introduced by coercion
# Compute a hierarchical cluster analysis on the distance matrix using the complete linkage method 
h.iris <- hclust(dist.iris, method="complete") 
h.iris
## 
## Call:
## hclust(d = dist.iris, method = "complete")
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 150
head(h.iris$merge, n=10)
##       [,1] [,2]
##  [1,] -102 -143
##  [2,]   -8  -40
##  [3,]   -1  -18
##  [4,]  -10  -35
##  [5,] -129 -133
##  [6,]  -11  -49
##  [7,]   -5  -38
##  [8,]  -20  -22
##  [9,]  -30  -31
## [10,]  -58  -94

A negative number indicates a single observation being merged, whereas a positive number refers to the step at which that cluster was built (see ?hclust). For example, the first row above merges observations 102 and 143; later steps can refer to the resulting cluster as 1.
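
For instance, a small recursive helper (illustrative, not part of stats) lists the observations that make up the cluster built at a given step:

members <- function(merge, step) {
  # negative entries are observations; positive entries refer to earlier steps
  unlist(lapply(merge[step,], function(x) if (x < 0) -x else members(merge, x)))
}
members(h.iris$merge, 1) # observations 102 and 143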

plot(h.iris)

What is an appropriate number of clusters according to this plot? A common choice is to cut the tree at the largest difference of heights between two merging steps. The height values are contained in the output of the hclust function:

h.iris.heights <- h.iris$height # height values
h.iris.heights[1:10]
##  [1] 0.0000000 0.1118034 0.1118034 0.1118034 0.1118034 0.1118034 0.1581139
##  [8] 0.1581139 0.1581139 0.1581139
subs <- round(h.iris.heights - c(0,h.iris.heights[-length(h.iris.heights)]), 3) # difference between consecutive merge heights
which.max(subs)
## [1] 149

Since the largest jump occurred at the last step of the merging process, this heuristic suggests two clusters (here, of course, we know there are three species).
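
To apply that choice, cut the tree into two clusters (a sketch; output omitted):

group.2 <- cutree(h.iris, k=2)
table(group.2, iris$Species) # typically setosa separates from the other two species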

Other fancy stuff

# Cuts dendrogram at specified level and draws rectangles around the resulting clusters
plot(cluster); rect.hclust(cluster, k=5, border="red")

cc <- cor(t(iris[,-5]), method="spearman") # correlation between observations (cc avoids masking base c())
d  <- as.dist(1-cc)                        # correlation-based dissimilarity
mycl  <- cutree(cluster, h=1)
subcl <- names(mycl[mycl==3]) # which observations are considered class 3
subd  <- as.dist(as.matrix(d)[subcl,subcl])
subhr <- hclust(subd, method = "complete")
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R") # Import tree coloring function.
# In this example the dendrogram for the above object is colored with the imported 'dendroCol()' function based on the identifiers provided in its 'keys' argument. If 'xPar' is set to 'nodePar' then the labels are colored instead of the leaves.
dend_colored <- dendrapply(as.dendrogram(cluster), dendroCol, keys=subcl, xPar="edgePar", bgr="red", fgr="blue", lwd=2, pch=20) 
par(mfrow = c(1, 3))
# Plots the colored tree in different formats. The last command shows how one can zoom into the tree with the 'xlim and ylim' arguments, which is possible since R 2.8.
plot(dend_colored, horiz=T)
plot(dend_colored, horiz=T, type="tr")
plot(dend_colored, horiz=T, edgePar=list(lwd=2), xlim=c(3,0), ylim=c(1,3))