K-Means

  1. Pick an initial set of K centroids (randomly or by any other means).
  2. Assign each data point to the closest centroid according to the given distance function.
  3. Recompute each centroid as the mean of all its assigned data points. Go back to (2) until the memberships no longer change and the centroid positions are stable.
  4. Output the centroids.

Notice that in K-Means we require the definition of:

  + the distance function
  + the mean function
  + the number of centroids \(K\)

K-Means runs in \(O(nkr)\) time, where \(n\) is the number of points, \(r\) the number of rounds and \(k\) the number of centroids.

Each run is non-deterministic. The usual practice is to run K-Means several times and keep the best run, i.e. the one that minimizes the average distance of each point to its assigned centroid.
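
To make steps (1)-(4) concrete, here is a minimal from-scratch sketch (not the implementation used below). It assumes Euclidean distance, the arithmetic mean and that no cluster ever becomes empty; the names kmeans_sketch, X and K are just illustrative:

kmeans_sketch <- function(X, K, max.rounds=100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), K), , drop=FALSE]                       # (1) pick K random points as centroids
  for (r in 1:max.rounds) {
    d <- as.matrix(dist(rbind(centers, X)))[-(1:K), 1:K]               # point-to-centroid distances
    members <- max.col(-d)                                             # (2) assign each point to its closest centroid
    new.centers <- apply(X, 2, function(v) tapply(v, members, mean))   # (3) centroid = mean of its members
    if (isTRUE(all.equal(unname(new.centers), unname(centers)))) break # stable -> stop
    centers <- new.centers
  }
  list(cluster=members, centers=centers)                               # (4) output
}
set.seed(101)
km.sketch <- kmeans_sketch(iris[,1:4], 3)
table(km.sketch$cluster, iris$Species)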

library(stats)
set.seed(101)
km <- kmeans(iris[,1:4], 3)
plot(iris[,1], iris[,2], col=km$cluster)
points(km$centers[,c(1,2)], col=1:3, pch=19, cex=2)

table(km$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1      0         48        14
##   2     50          0         0
##   3      0          2        36
# Another round:
set.seed(900)
km <- kmeans(iris[,1:4], 3)
plot(iris[,1], iris[,2], col=km$cluster)
points(km$centers[,c(1,2)], col=1:3, pch=19, cex=2)

table(predicted=km$cluster, true=iris$Species)
##          true
## predicted setosa versicolor virginica
##         1      0         46        50
##         2     17          4         0
##         3     33          0         0

Hierarchical Clustering

This approach compares all pairs of points/clusters and merges the pair with the smallest distance.

  1. Compute the distance between every pair of points/clusters:
     1. The distance between two points is given directly by the distance function.
     2. The distance between a point A and a cluster B may involve several choices (such as the min/max/avg distance between A and the points in B).
     3. The distance between a cluster A and a cluster B may be computed by first taking the distances of all point pairs (one point from A, the other from B) and then picking the min/max/avg of these pairwise distances.
  2. Merge the two closest points/clusters into one cluster. Go back to (1) until only one big cluster remains.

Hierarchical clustering has complexity \(O(n^2)\), and its output is a tree of merging steps. It does not require us to specify \(K\) or a mean function. Because of this high complexity, hierarchical clustering is typically used when the number of points is not too large.
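
Before the worked example, a quick illustrative comparison of three common linkage choices on the same distance matrix; single/complete/average correspond to the min/max/avg options mentioned above:

d.iris <- dist(iris[,1:4])  # Euclidean distances between all pairs of observations
par(mfrow=c(1,3))
for (m in c("single", "complete", "average"))
  plot(hclust(d.iris, method=m), labels=FALSE, main=paste(m, "linkage"))
par(mfrow=c(1,1))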

m <- matrix(1:15,5,3)
dist(m) # computes the distance between rows of m (since there are 3 columns, this is the Euclidean distance between three-dimensional points)
##          1        2        3        4
## 2 1.732051                           
## 3 3.464102 1.732051                  
## 4 5.196152 3.464102 1.732051         
## 5 6.928203 5.196152 3.464102 1.732051
dist(m,method="manhattan") # using the manhattan metric
##    1  2  3  4
## 2  3         
## 3  6  3      
## 4  9  6  3   
## 5 12  9  6  3
set.seed(101)
sampleiris <- iris[sample(1:150, 40),] # get samples from iris dataset
# each observation has 4 variables, ie, they are interpreted as 4-D points
distance   <- dist(sampleiris[,-5], method="euclidean") 
cluster    <- hclust(distance, method="average")
plot(cluster, hang=-1, labels=sampleiris$Species)

Other ways to present the information:

plot(as.dendrogram(cluster), edgePar=list(col="darkgreen", lwd=2), horiz=T) 

str(as.dendrogram(cluster)) # Prints dendrogram structure as text.
## --[dendrogram w/ 2 branches and 40 members at h = 3.95]
##   |--[dendrogram w/ 2 branches and 13 members at h = 0.819]
##   |  |--[dendrogram w/ 2 branches and 6 members at h = 0.426]
##   |  |  |--[dendrogram w/ 2 branches and 2 members at h = 0.141]
##   |  |  |  |--leaf "9" 
##   |  |  |  `--leaf "39" 
##   |  |  `--[dendrogram w/ 2 branches and 4 members at h = 0.28]
##   |  |     |--leaf "30" 
##   |  |     `--[dendrogram w/ 2 branches and 3 members at h = 0.244]
##   |  |        |--leaf "7" 
##   |  |        `--[dendrogram w/ 2 branches and 2 members at h = 0.141]
##   |  |           |--leaf "48" 
##   |  |           `--leaf "3" 
##   |  `--[dendrogram w/ 2 branches and 7 members at h = 0.574]
##   |     |--leaf "37" 
##   |     `--[dendrogram w/ 2 branches and 6 members at h = 0.563]
##   |        |--[dendrogram w/ 2 branches and 2 members at h = 0.387]
##   |        |  |--leaf "6" 
##   |        |  `--leaf "20" 
##   |        `--[dendrogram w/ 2 branches and 4 members at h = 0.452]
##   |           |--[dendrogram w/ 2 branches and 2 members at h = 0.265]
##   |           |  |--leaf "44" 
##   |           |  `--leaf "24" 
##   |           `--[dendrogram w/ 2 branches and 2 members at h = 0.3]
##   |              |--leaf "28" 
##   |              `--leaf "50" 
##   `--[dendrogram w/ 2 branches and 27 members at h = 2.63]
##      |--[dendrogram w/ 2 branches and 3 members at h = 0.912]
##      |  |--leaf "110" 
##      |  `--[dendrogram w/ 2 branches and 2 members at h = 0.819]
##      |     |--leaf "106" 
##      |     `--leaf "118" 
##      `--[dendrogram w/ 2 branches and 24 members at h = 1.6]
##         |--[dendrogram w/ 2 branches and 13 members at h = 1.03]
##         |  |--[dendrogram w/ 2 branches and 4 members at h = 0.752]
##         |  |  |--leaf "103" 
##         |  |  `--[dendrogram w/ 2 branches and 3 members at h = 0.494]
##         |  |     |--leaf "140" 
##         |  |     `--[dendrogram w/ 2 branches and 2 members at h = 0.361]
##         |  |        |--leaf "117" 
##         |  |        `--leaf "148" 
##         |  `--[dendrogram w/ 2 branches and 9 members at h = 1.01]
##         |     |--[dendrogram w/ 2 branches and 3 members at h = 0.557]
##         |     |  |--leaf "51" 
##         |     |  `--[dendrogram w/ 2 branches and 2 members at h = 0.374]
##         |     |     |--leaf "77" 
##         |     |     `--leaf "55" 
##         |     `--[dendrogram w/ 2 branches and 6 members at h = 0.769]
##         |        |--leaf "135" 
##         |        `--[dendrogram w/ 2 branches and 5 members at h = 0.482]
##         |           |--[dendrogram w/ 2 branches and 2 members at h = 0.361]
##         |           |  |--leaf "124" 
##         |           |  `--leaf "128" 
##         |           `--[dendrogram w/ 2 branches and 3 members at h = 0.361]
##         |              |--leaf "84" 
##         |              `--[dendrogram w/ 2 branches and 2 members at h = 0]
##         |                 |--leaf "102" 
##         |                 `--leaf "143" 
##         `--[dendrogram w/ 2 branches and 11 members at h = 1.3]
##            |--[dendrogram w/ 2 branches and 8 members at h = 0.632]
##            |  |--leaf "92" 
##            |  `--[dendrogram w/ 2 branches and 7 members at h = 0.559]
##            |     |--leaf "85" 
##            |     `--[dendrogram w/ 2 branches and 6 members at h = 0.445]
##            |        |--leaf "93" 
##            |        `--[dendrogram w/ 2 branches and 5 members at h = 0.408]
##            |           |--leaf "56" 
##            |           `--[dendrogram w/ 2 branches and 4 members at h = 0.345]
##            |              |--leaf "62" 
##            |              `--[dendrogram w/ 2 branches and 3 members at h = 0.198]
##            |                 |--leaf "89" 
##            |                 `--[dendrogram w/ 2 branches and 2 members at h = 0.141]
##            |                    |--leaf "97" 
##            |                    `--leaf "100" 
##            `--[dendrogram w/ 2 branches and 3 members at h = 0.858]
##               |--leaf "80" 
##               `--[dendrogram w/ 2 branches and 2 members at h = 0.721]
##                  |--leaf "99" 
##                  `--leaf "61"
cluster$labels[cluster$order] # Prints the row labels in the order they appear in the tree.
##  [1] "9"   "39"  "30"  "7"   "48"  "3"   "37"  "6"   "20"  "44"  "24" 
## [12] "28"  "50"  "110" "106" "118" "103" "140" "117" "148" "51"  "77" 
## [23] "55"  "135" "124" "128" "84"  "102" "143" "92"  "85"  "93"  "56" 
## [34] "62"  "89"  "97"  "100" "80"  "99"  "61"

It’s possible to prune the resulting tree. In the next examples we cut by number of clusters:

par(mfrow=c(1,2))
group.3 <- cutree(cluster, k = 3)  # prune the tree by 3 clusters
table(group.3, sampleiris$Species) # compare with known classes
##        
## group.3 setosa versicolor virginica
##       1      0         15         9
##       2     13          0         0
##       3      0          0         3
plot(sampleiris[,c(1,2)], col=group.3, pch=19, cex=2.5, main="3 clusters")
points(sampleiris[,c(1,2)], col=sampleiris$Species, pch=19, cex=1)
group.6 <- cutree(cluster, k = 6)  # we can prune by more clusters
table(group.6, sampleiris$Species)
##        
## group.6 setosa versicolor virginica
##       1      0          8         0
##       2     13          0         0
##       3      0          0         3
##       4      0          4         5
##       5      0          3         0
##       6      0          0         4
plot(sampleiris[,c(1,2)], col=group.6, pch=19, cex=2.5, main="6 clusters")
points(sampleiris[,c(1,2)], col=sampleiris$Species, pch=19, cex=1) # the little points are the true classes

par(mfrow=c(1,1))

It is also possible to cut by height of the original tree:

plot(cluster, hang=-1, labels=sampleiris$Species)
abline(h=0.9,lty=3,col="red")

height.0.9 <- cutree(cluster, h = 0.9)
table(height.0.9, sampleiris$Species) # compare with known classes
##           
## height.0.9 setosa versicolor virginica
##          1      0          8         0
##          2     13          0         0
##          3      0          0         2
##          4      0          3         0
##          5      0          1         5
##          6      0          3         0
##          7      0          0         1
##          8      0          0         4
plot(sampleiris[,c(1,2)], col=height.0.9, pch=19, cex=2.5, main="cut at height 0.9")
points(sampleiris[,c(1,2)], col=sampleiris$Species, pch=19, cex=1)

And what if we don’t know the number of clusters? (ref)

# Calculate the dissimilarity between observations using the Euclidean distance
# (note: the Species factor is included here, which triggers the coercion warning below)
dist.iris <- dist(iris, method="euclidean")
## Warning in dist(iris, method = "euclidean"): NAs introduced by coercion
# Compute a hierarchical cluster analysis on the distance matrix using the complete linkage method 
h.iris <- hclust(dist.iris, method="complete") 
h.iris
## 
## Call:
## hclust(d = dist.iris, method = "complete")
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 150
head(h.iris$merge, n=10)
##       [,1] [,2]
##  [1,] -102 -143
##  [2,]   -8  -40
##  [3,]   -1  -18
##  [4,]  -10  -35
##  [5,] -129 -133
##  [6,]  -11  -49
##  [7,]   -5  -38
##  [8,]  -20  -22
##  [9,]  -30  -31
## [10,]  -58  -94

A minus sign in front of a number indicates a single observation being merged, whereas a plain (positive) number indicates the step at which the corresponding cluster was built (check ??hclust).
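
The first merges shown above only join single observations (all entries are negative). Looking at the last merge steps shows positive entries as well, i.e. references to clusters built at earlier steps:

tail(h.iris$merge, n=3) # the final merges combine clusters formed in earlier steps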

plot(h.iris)

What is an appropriate number of clusters according to this plot? A common choice is to cut the tree at the largest gap in height between two consecutive merges. The height values are contained in the output of the hclust function:

h.iris.heights <- h.iris$height # height values
h.iris.heights[1:10]
##  [1] 0.0000000 0.1118034 0.1118034 0.1118034 0.1118034 0.1118034 0.1581139
##  [8] 0.1581139 0.1581139 0.1581139
subs <- round(h.iris.heights - c(0,h.iris.heights[-length(h.iris.heights)]), 3) # difference between consecutive heights
which.max(subs)
## [1] 149

Since the largest jump occurred at the last step of the merging process, this criterion suggests two clusters (here we happen to know there are three species).
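
As a quick check of this suggestion, we can cut the tree into two clusters and compare with the known species:

group.2 <- cutree(h.iris, k=2)  # cut at the largest height jump, i.e. two clusters
table(group.2, iris$Species)    # compare with the known classes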

Other fancy stuff

# Cuts dendrogram at specified level and draws rectangles around the resulting clusters
plot(cluster); rect.hclust(cluster, k=5, border="red")

c <- cor(t(iris[,-5]), method="spearman"); 
d <- as.dist(1-c);
mycl  <- cutree(cluster, h=1);
subcl <- names(mycl[mycl==3]) # which observations are considered class 3
subd  <- as.dist(as.matrix(d)[subcl,subcl])
subhr <- hclust(subd, method = "complete")
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R") # Import tree coloring function.
# In this example the dendrogram for the above object is colored with the imported 'dendroCol()' function based on the identifiers provided in its 'keys' argument. If 'xPar' is set to 'nodePar' then the labels are colored instead of the leaves.
dend_colored <- dendrapply(as.dendrogram(cluster), dendroCol, keys=subcl, xPar="edgePar", bgr="red", fgr="blue", lwd=2, pch=20) 
par(mfrow = c(1, 3))
# Plots the colored tree in different formats. The last command shows how one can zoom into the tree with the 'xlim and ylim' arguments, which is possible since R 2.8.
plot(dend_colored, horiz=T)
plot(dend_colored, horiz=T, type="tr")
plot(dend_colored, horiz=T, edgePar=list(lwd=2), xlim=c(3,0), ylim=c(1,3)) 

par(mfrow = c(1, 1))
# This example shows how one can manually color tree elements.
z <- as.dendrogram(cluster)
attr(z[[2]][[2]],"edgePar") <- list(col="blue", lwd=4, pch=NA)
attr(z[[2]][[1]],"edgePar") <- list(col="red", lwd=3, lty=3, pch=NA)
plot(z, horiz=T) 

Fuzzy C-Means

Unlike K-Means, where each data point belongs to exactly one cluster, in fuzzy c-means each data point has a degree of membership in every cluster. The goal is to find the membership fractions that minimize the expected distance to each centroid. Details here.

The parameter m controls the degree of fuzziness. The output includes a matrix that assigns each data point a degree of membership to each centroid.
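
For reference, the standard fuzzy c-means objective (in its usual textbook form, not specific to the e1071 implementation) minimizes, over the memberships \(u_{ij}\) and the centroids \(c_j\),

\[J_m = \sum_{i=1}^{n}\sum_{j=1}^{K} u_{ij}^{\,m}\,\|x_i - c_j\|^2, \qquad \text{subject to } \sum_{j=1}^{K} u_{ij} = 1 \text{ for each } i,\]

where \(m > 1\) is the fuzziness parameter passed to cmeans below.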

library(e1071)
result <- cmeans(iris[,-5], centers=3, iter.max=100, m=2, method="cmeans")  # 3 clusters
plot(iris[,1], iris[,2], col=result$cluster)
points(result$centers[,c(1,2)], col=1:3, pch=19, cex=2)

result$membership[1:5,] # degree of membership for each observation to each cluster:
##              1           2           3
## [1,] 0.9966236 0.002304387 0.001072022
## [2,] 0.9758510 0.016650680 0.007498337
## [3,] 0.9798249 0.013760264 0.006414830
## [4,] 0.9674254 0.022466586 0.010108035
## [5,] 0.9944703 0.003761751 0.001767930
table(iris$Species, result$cluster)
##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0 47  3
##   virginica   0 13 37

Multi-Gaussian with Expectation-Maximization

Generally in machine learning we want to learn a set of parameters that maximize the likelihood of observing our training data. But what if there are hidden variables in our data that we have not observed? Expectation-Maximization (EM) is a very common technique: given the current parameters, it estimates the probability distribution of those hidden variables, computes the expected likelihood, and then finds the parameters that maximize this expected likelihood. It can be illustrated as follows:

<img src="p2.png" height="33%" width="33%">
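
For a mixture of Gaussians the two steps can be stated as follows (the usual textbook form, not taken from the figure). The E-step computes, with the current parameters, the responsibility of component \(k\) for observation \(x_i\):

\[\gamma_{ik} = \frac{\pi_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)}\]

The M-step then updates \(\pi_k\), \(\mu_k\) and \(\Sigma_k\) as the responsibility-weighted proportions, means and covariances, and the two steps are repeated until the expected log-likelihood converges. The mclust package implements this for several covariance structures: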

library(mclust)
## Package 'mclust' version 5.2
## Type 'citation("mclust")' for citing this R package in publications.
mc <- Mclust(iris[,1:4], 3)
summary(mc)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust VEV (ellipsoidal, equal shape) model with 3 components:
## 
##  log.likelihood   n df       BIC       ICL
##       -186.0736 150 38 -562.5514 -566.4577
## 
## Clustering table:
##  1  2  3 
## 50 45 55
plot(mc, what=c("classification"), dimens=c(1,2))

plot(mc, what=c("classification"), dimens=c(3,4))

table(iris$Species, mc$classification)
##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0 45  5
##   virginica   0  0 50

Density-based Clustering

In density-based clustering, a cluster extends along regions of high density.

Two parameters are important: “eps” defines the radius of the neighborhood of each point, and “minpts” is the minimum number of neighbors required within the “eps” radius.

The basic algorithm, called DBSCAN, proceeds as follows:

  1. First scan: for each point, compute the distance to all other points. Increment a neighbor count for every distance smaller than “eps”.
  2. Second scan: for each point, mark it as a core point if its neighbor count is greater than “minpts”.
  3. Third scan: for each core point, if it is not already assigned to a cluster, create a new cluster and assign it to this core point as well as to all of its neighbors within the “eps” radius.

Unlike the other methods, density-based clustering can leave some points as outliers (data points that do not belong to any cluster). On the other hand, it can detect clusters of arbitrary shape (they do not have to be circular at all).
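
As a rough sketch of the first two scans described above (using eps = 0.6 and minpts = 4 purely for illustration; this is not the fpc implementation used next):

X    <- as.matrix(iris[,1:4])
D    <- as.matrix(dist(X))     # first scan: distance between every pair of points
cnt  <- rowSums(D <= 0.6) - 1  # neighbour count within eps (excluding the point itself)
core <- cnt > 4                # second scan: core if the neighbour count exceeds minpts
table(core)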

library(fpc)
set.seed(121)
sampleiris <- iris[sample(1:150, 40),] # get samples from iris dataset
# eps is radius of neighborhood, MinPts is no of neighbors within eps
cluster <- dbscan(sampleiris[,-5], eps=0.6, MinPts=4)
# black points are outliers, triangles are core points and circles are boundary points
plot(cluster, sampleiris)

plot(cluster, sampleiris[,c(1,4)])

# Notice points in cluster 0 are unassigned outliers
table(cluster$cluster, sampleiris$Species)
##    
##     setosa versicolor virginica
##   0      0          4         5
##   1      0          3         8
##   2      0          6         0
##   3     14          0         0

QT Clustering

Quality Threshold (QT) clustering requires a threshold distance within a cluster and a minimum number of elements per cluster.

For each data point, find all its candidate points, i.e. those within the threshold distance of that point. After computing the candidates for every data point, choose the point with the largest number of candidates and form a cluster from it. The points belonging to this cluster are then removed, and the procedure is repeated on the reduced set until no more clusters satisfying the minimum-size criterion can be formed.
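
A rough sketch of this procedure (the function and argument names are illustrative; this is not how flexclust implements it):

qt_sketch <- function(X, radius, min.size=2) {
  D <- as.matrix(dist(X))
  cluster <- rep(NA_integer_, nrow(D)); remaining <- seq_len(nrow(D)); id <- 0
  repeat {
    # candidate set of each remaining point: its neighbours within the threshold distance
    cand <- lapply(remaining, function(i) remaining[D[i, remaining] <= radius])
    best <- cand[[which.max(lengths(cand))]]  # the point with most candidates forms a cluster
    if (length(best) < min.size) break        # stop when no cluster meets the minimum size
    id <- id + 1; cluster[best] <- id
    remaining <- setdiff(remaining, best)     # remove its members and repeat
    if (length(remaining) == 0) break
  }
  cluster                                     # NA marks points left unassigned
}
table(qt_sketch(iris[,-5], radius=2), useNA="ifany")  # radius 2, as in the flexclust call below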

Pros:

  + Quality guaranteed: only clusters that pass a user-defined quality threshold are returned.
  + The number of clusters is not a parameter.
  + All possible clusters are considered.

Cons:

  + Computationally intensive and time consuming: increasing the minimum cluster size or the number of data points can greatly increase the computation time.
  + The threshold distance and the minimum number of elements per cluster are parameters.

library(flexclust) 
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
cl1 <- qtclust(iris[,-5], radius=2) # Uses 2 as the maximum distance of the points to the cluster centers.
cl2 <- qtclust(iris[,-5], radius=1) # Uses 1 as the maximum distance of the points to the cluster centers.
par(mfrow=c(1,2))
plot(iris[,c(1,2)], col=predict(cl1), xlab="", ylab="")
plot(iris[,c(1,2)], col=predict(cl2), xlab="", ylab="")

par(mfrow=c(1,1))
table(attributes(cl1)$cluster, iris$Species) # not very good...
##    
##     setosa versicolor virginica
##   1      0         49        44
##   2     50          1         0
##   3      0          0         6
table(attributes(cl2)$cluster, iris$Species) 
##    
##     setosa versicolor virginica
##   1      0         21        39
##   2     48          0         0
##   3      0         29         1
##   4      0          0        10

Self-Organizing Map (SOM)

ref: http://www.jstatsoft.org/v21/i05

Self-organizing map (SOM), also known as a Kohonen network, is an artificial neural network algorithm for unsupervised learning. The approach iteratively assigns the items in a data matrix to a specified number of representatives and then updates each representative by the mean of its assigned data points, also adjusting the neighbouring representatives on the map grid. Widely used R packages for SOM clustering and visualization are: class (part of R), SOM and kohonen.

library(kohonen) 
## Loading required package: class
## Loading required package: MASS
## 
## Attaching package: 'kohonen'
## The following object is masked from 'package:mclust':
## 
##     map
set.seed(101)
train.obs <- sample(nrow(iris), 50) # get the training set observations
train.set <- scale(iris[train.obs,][,-5]) # check info about scaling data below
test.set  <- scale(iris[-train.obs, ][-5],
               center = attr(train.set, "scaled:center"),
               scale  = attr(train.set, "scaled:scale"))
som.iris <- som(train.set, grid = somgrid(5, 5, "hexagonal"))
plot(som.iris)

som.prediction <- 
  predict(som.iris, newdata = test.set,
          trainX = train.set,
          trainY = classvec2classmat(iris[,5][train.obs]))

table(iris[,5][-train.obs], som.prediction$prediction)
##             
##              setosa versicolor virginica
##   setosa         31          0         0
##   versicolor      0         27         3
##   virginica       0          4        33

k-Nearest Neighbour

ref: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/kNN

Loosely related is the k-Nearest Neighbour algorithm. This is a classification procedure without model training!

Let s be a sample from the test set and I be the set of classified observations. Then:

  + Compute the distance between ‘s’ and each instance in ‘I’.
  + Sort the distances in increasing order and pick the first ‘k’ elements.
  + Compute and return the most frequent class among the ‘k’ nearest neighbors, optionally weighting each instance’s class by the inverse of its distance to ‘s’.
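
A minimal sketch of these steps with an unweighted vote (the names knn_sketch, s, I and k are illustrative):

knn_sketch <- function(s, I, labels, k=7) {
  d  <- sqrt(colSums((t(as.matrix(I)) - as.numeric(unlist(s)))^2))  # distance from 's' to each row of 'I'
  nn <- order(d)[1:k]                                               # the k nearest neighbours
  names(which.max(table(labels[nn])))                               # most frequent class among them
}
knn_sketch(iris[1,1:4], iris[-1,1:4], iris$Species[-1])  # classify the first observation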

We use the kknn package. The kknn function uses the Minkowski Distance as its metric:

\[\Big( \sum_i |x_i - y_i|^p \Big)^{1/p}\] is the distance between vectors \(x = (x_1, x_2\ldots x_n)\) and \(y = (y_1, y_2\ldots y_n)\).

When \(p=2\) we have as a special case the Euclidean distance, and when \(p=1\) we have the Manhattan distance.
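
A tiny check with dist(), which also implements the Minkowski metric through its p argument:

x <- c(1, 2, 3); y <- c(4, 6, 8)
dist(rbind(x, y), method="minkowski", p=2)  # equals the Euclidean distance
dist(rbind(x, y), method="minkowski", p=1)  # equals the Manhattan distance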

library(kknn)
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'caret'
## The following object is masked from 'package:kknn':
## 
##     contr.dummy
# split the iris data into a training set (75%) and a test set
inTrain   <- createDataPartition(y=iris$Species, p=0.75, list=FALSE) 
known.set <- iris[inTrain,]
test.set  <- iris[-inTrain,]

iris.kknn <- kknn(Species ~ ., known.set, test.set[,-5], 
                  distance = 1, k = 7, scale = TRUE,
                  kernel = "triangular") 
# the kernel param specifies how to weight the neighbors according to their distances 
# kernel = "rectangular" does not weight (check help for more options)

# here is some useful information from the returned object:
iris.kknn$prob[10:20,]
##       setosa versicolor  virginica
##  [1,]      1  0.0000000 0.00000000
##  [2,]      1  0.0000000 0.00000000
##  [3,]      1  0.0000000 0.00000000
##  [4,]      0  0.9166251 0.08337486
##  [5,]      0  0.9631746 0.03682539
##  [6,]      0  0.7727362 0.22726375
##  [7,]      0  1.0000000 0.00000000
##  [8,]      0  1.0000000 0.00000000
##  [9,]      0  1.0000000 0.00000000
## [10,]      0  1.0000000 0.00000000
## [11,]      0  1.0000000 0.00000000
iris.kknn$fitted.values
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] versicolor versicolor versicolor versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] virginica  virginica  virginica  virginica  virginica  virginica 
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica
# Let's test the performance of the classification:
table(test.set$Species, fitted(iris.kknn))
##             
##              setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         12         0
##   virginica       0          0        12
pairs(test.set[,-5], 
      pch = as.character(as.numeric(test.set$Species)), 
      col = c("green3", "red")[(test.set$Species != iris.kknn)+1])

The package is also able to perform cross-validation:

set.seed(101)
# 10-fold cross-validation with k=7 neighbors
iris.cv <- cv.kknn(Species ~ ., iris, kcv=10, kernel="triangular")
iris.cv # 6% for the mean error with a sd of 2.4%
## [[1]]
##     y yhat
## 1   1    1
## 2   1    1
## 3   1    1
## 4   1    1
## 5   1    1
## 6   1    1
## 7   1    1
## 8   1    1
## 9   1    1
## 10  1    1
## 11  1    1
## 12  1    1
## 13  1    1
## 14  1    1
## 15  1    1
## 16  1    1
## 17  1    1
## 18  1    1
## 19  1    1
## 20  1    1
## 21  1    1
## 22  1    1
## 23  1    1
## 24  1    1
## 25  1    1
## 26  1    1
## 27  1    1
## 28  1    1
## 29  1    1
## 30  1    1
## 31  1    1
## 32  1    1
## 33  1    1
## 34  1    1
## 35  1    1
## 36  1    1
## 37  1    1
## 38  1    1
## 39  1    1
## 40  1    1
## 41  1    1
## 42  1    1
## 43  1    1
## 44  1    1
## 45  1    1
## 46  1    1
## 47  1    1
## 48  1    1
## 49  1    1
## 50  1    1
## 51  2    2
## 52  2    2
## 53  2    2
## 54  2    2
## 55  2    2
## 56  2    2
## 57  2    2
## 58  2    2
## 59  2    2
## 60  2    2
## 61  2    2
## 62  2    2
## 63  2    2
## 64  2    2
## 65  2    2
## 66  2    2
## 67  2    2
## 68  2    2
## 69  2    2
## 70  2    2
## 71  2    3
## 72  2    2
## 73  2    2
## 74  2    2
## 75  2    2
## 76  2    2
## 77  2    2
## 78  2    2
## 79  2    2
## 80  2    2
## 81  2    2
## 82  2    2
## 83  2    2
## 84  2    3
## 85  2    2
## 86  2    2
## 87  2    2
## 88  2    2
## 89  2    2
## 90  2    2
## 91  2    2
## 92  2    2
## 93  2    2
## 94  2    2
## 95  2    2
## 96  2    2
## 97  2    2
## 98  2    2
## 99  2    2
## 100 2    2
## 101 3    3
## 102 3    3
## 103 3    3
## 104 3    3
## 105 3    3
## 106 3    3
## 107 3    2
## 108 3    3
## 109 3    3
## 110 3    3
## 111 3    3
## 112 3    3
## 113 3    3
## 114 3    3
## 115 3    3
## 116 3    3
## 117 3    3
## 118 3    3
## 119 3    3
## 120 3    2
## 121 3    3
## 122 3    3
## 123 3    3
## 124 3    3
## 125 3    3
## 126 3    3
## 127 3    3
## 128 3    3
## 129 3    3
## 130 3    3
## 131 3    3
## 132 3    3
## 133 3    3
## 134 3    2
## 135 3    2
## 136 3    3
## 137 3    3
## 138 3    3
## 139 3    2
## 140 3    3
## 141 3    3
## 142 3    3
## 143 3    3
## 144 3    3
## 145 3    3
## 146 3    3
## 147 3    3
## 148 3    3
## 149 3    3
## 150 3    2
## 
## [[2]]
## [1] 1
# Another method for leave-one-out cross-validation
iris.cv2 <- train.kknn(Species ~ ., iris, nn=10, kernel="triangular")
plot(iris.cv2, type="b")

Annex: Scaling datasets

# Sample data matrix.
set.seed(101)
y <- matrix(rnorm(100,20,5), 20, 5, 
            dimnames=list(paste0("g", 1:20), paste0("t", 1:5))) 
head(y)
##          t1       t2        t3       t4       t5
## g1 18.36982 19.18122 22.412294 18.70099 29.26074
## g2 22.76231 23.54261 23.791069 12.94413 25.55838
## g3 16.62528 18.66010  8.403363 16.79321 17.44312
## g4 21.07180 12.68039 17.702476 20.56229 17.28059
## g5 21.55385 23.72218 14.473082 22.11302 11.35536
## g6 25.86983 12.94805 22.014641 21.93418 22.35375
apply(y,2,mean) # check mean and sd of each column
##       t1       t2       t3       t4       t5 
## 19.51341 19.97489 18.83587 19.59978 21.14628
apply(y,2,sd)
##       t1       t2       t3       t4       t5 
## 4.333922 4.865144 4.869446 4.524513 4.896093

We use the function scale() to center and/or scale the data. With its default settings, the function returns columns that have a mean close to zero and a standard deviation of one.

To scale a matrix m by rows, use t(scale(t(m))).

apply(scale(y,scale=FALSE),2,mean) # just centers, sd remains
##            t1            t2            t3            t4            t5 
## -1.598548e-15 -8.881784e-16  8.893168e-17  1.421073e-15  9.770179e-16
apply(scale(y,scale=FALSE),2,sd)
##       t1       t2       t3       t4       t5 
## 4.333922 4.865144 4.869446 4.524513 4.896093
yscaled.cols <- scale(y)     # scale and center columns
yscaled.cols
##              t1         t2          t3           t4          t5
## g1  -0.26387079 -0.1631327  0.73446325 -0.198648360  1.65733269
## g2   0.74964351  0.7333236  1.01761146 -1.471018301  0.90114582
## g3  -0.66640154 -0.2702466 -2.14244131 -0.620301943 -0.75634973
## g4   0.35957833 -1.4993379 -0.23275522  0.212732483 -0.78954536
## g5   0.47080525  0.7702328 -0.89595069  0.555473021 -1.99974101
## g6   1.46666664 -1.4443224  0.65280043  0.515944968  0.24661813
## g7   0.82616543  0.4851763  0.82325786 -0.671623234 -0.22862037
## g8  -0.01778632 -0.1174654 -0.48594471  0.253007511  1.14253268
## g9   1.17023997  0.4853524 -0.05879892  0.024748442  0.50534193
## g10 -0.14529790  0.5171054 -1.28459287  0.005770029  1.35137610
## g11  0.71963171  0.9249059 -0.94202536  1.757031114  1.11947761
## g12 -0.80472959  0.2920519 -0.04276069  1.878634879 -0.26911410
## g13  1.75946063  1.0409647  0.83246433  1.362801669 -0.60279677
## g14 -1.57998038 -2.1254085 -1.19528556  0.002697621 -0.96957085
## g15 -0.16078505  1.2279968  1.00820991 -1.921631917  0.05387814
## g16 -0.11077790 -0.7392909 -0.84030066 -1.058014251 -1.04142430
## g17 -0.86807903  0.1778022  0.40888417  0.422738207 -0.68846829
## g18  0.17972511  0.9510079  1.39916960 -1.323790446  1.15983972
## g19 -0.83106361 -1.7127776  1.44426030  0.241333945  0.27388967
## g20 -2.25314448  0.4660622 -0.20026533  0.032114566 -1.06580171
## attr(,"scaled:center")
##       t1       t2       t3       t4       t5 
## 19.51341 19.97489 18.83587 19.59978 21.14628 
## attr(,"scaled:scale")
##       t1       t2       t3       t4       t5 
## 4.333922 4.865144 4.869446 4.524513 4.896093
apply(yscaled.cols, 2, mean) # should be zero (or close)
##            t1            t2            t3            t4            t5 
## -3.774541e-16 -1.769418e-16  2.602085e-17  3.220828e-16  1.994932e-16
apply(yscaled.cols, 2, sd)   # should be one
## t1 t2 t3 t4 t5 
##  1  1  1  1  1
yscaled.rows <- t(scale(t(y))) # scale and center rows
yscaled.rows
##              t1         t2          t3          t4         t5
## g1  -0.70146922 -0.5244426  0.18049073 -0.62921639  1.6746375
## g2   0.20805780  0.3637707  0.41335183 -1.75120706  0.7660267
## g3   0.25412664  0.7512123 -1.75440719  0.29515064  0.4539176
## g4   0.95978103 -1.5474389 -0.04691905  0.80754754 -0.1729706
## g5   0.53856095  0.9398118 -0.77173722  0.64203660 -1.3486721
## g6   1.00886709 -1.6814042  0.20622960  0.18947708  0.2768304
## g7   0.77188434  0.4958809  0.68119141 -1.60497773 -0.3439789
## g8  -0.29589986 -0.3045803 -1.07796965  0.04895326  1.6294966
## g9   1.10200904  0.2245639 -1.25282818 -0.79938451  0.7256398
## g10 -0.25023860  0.4014515 -1.38904991 -0.11614524  1.3539823
## g11 -0.08934988  0.2577665 -1.66880274  0.83705838  0.6633278
## g12 -1.05284092  0.1324959 -0.47852097  1.61227896 -0.2134129
## g13  0.95433210  0.3532205 -0.26231438  0.56122020 -1.6064584
## g14 -0.41751910 -1.2088171 -0.32626655  1.39562259  0.5569801
## g15 -0.23182819  0.9941868  0.61535180 -1.59166953  0.2139591
## g16  1.62544279  0.1005058 -0.83797717 -0.79851497 -0.0894564
## g17 -1.45035288  0.6054853  0.60022738  0.87719322 -0.6325530
## g18 -0.35271513  0.4458710  0.63996413 -1.59099957  0.8578796
## g19 -0.60885588 -1.3715211  1.16969530  0.24497448  0.5657072
## g20 -1.55574784  1.0864663  0.15982994  0.55834304 -0.2488915
## attr(,"scaled:center")
##       g1       g2       g3       g4       g5       g6       g7       g8 
## 21.58501 21.71970 15.58502 17.85951 18.64350 21.02409 20.97238 20.55881 
##       g9      g10      g11      g12      g13      g14      g15      g16 
## 21.76062 20.26872 23.10650 20.79551 23.80567 14.26540 20.16531 16.20313 
##      g17      g18      g19      g20 
## 19.34120 22.19566 19.32024 17.10492 
## attr(,"scaled:scale")
##       g1       g2       g3       g4       g5       g6       g7       g8 
## 4.583515 5.011152 4.093492 3.346897 5.403934 4.803152 2.748557 3.793453 
##       g9      g10      g11      g12      g13      g14      g15      g16 
## 2.563062 5.534801 5.307861 4.530343 3.492611 3.830964 5.817778 1.741174 
##      g17      g18      g19      g20 
## 2.475240 5.396222 5.598357 4.728566
apply(yscaled.rows, 1, mean) # should be zero (or close)
##            g1            g2            g3            g4            g5 
## -1.665335e-17 -2.942091e-16  6.661338e-17  6.938894e-18 -2.886580e-16 
##            g6            g7            g8            g9           g10 
##  2.775558e-17 -5.217886e-16  1.859624e-16  3.053113e-16  2.414735e-16 
##           g11           g12           g13           g14           g15 
##  1.387779e-16 -3.164271e-16 -4.107608e-16  1.113476e-17  2.498002e-16 
##           g16           g17           g18           g19           g20 
## -8.104804e-16  2.664535e-16  1.332159e-16  3.887949e-17  0.000000e+00
apply(yscaled.rows, 1, sd)   # should be one
##  g1  g2  g3  g4  g5  g6  g7  g8  g9 g10 g11 g12 g13 g14 g15 g16 g17 g18 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## g19 g20 
##   1   1