Refs:

Introduction

Consider a binary classification problem, where we are given input vectors \(x_i\) (the input space) and labels (aka targets or classes) \(y_i = \pm 1\).

x1s <- c(.5,1,1,2,3,3.5,     1,3.5,4,5,5.5,6)
x2s <- c(3.5,1,2.5,2,1,1.2,  5.8,3,4,5,4,1)
ys <- c(rep(+1,6),          rep(-1,6))
my.data <- data.frame(x1=x1s, x2=x2s, type=as.factor(ys))
my.data
##     x1  x2 type
## 1  0.5 3.5    1
## 2  1.0 1.0    1
## 3  1.0 2.5    1
## 4  2.0 2.0    1
## 5  3.0 1.0    1
## 6  3.5 1.2    1
## 7  1.0 5.8   -1
## 8  3.5 3.0   -1
## 9  4.0 4.0   -1
## 10 5.0 5.0   -1
## 11 5.5 4.0   -1
## 12 6.0 1.0   -1
plot(my.data[,-3],col=(ys+3)/2, pch=19); abline(h=0,v=0,lty=3)

In SVM the error is minimized by maximizing the margin \(\gamma\), i.e., the minimal distance between the hyperplane separating the two classes and the closest datapoints of each class (called support vectors).

The separating hyperplane can be written as \[w.x + b = 0\] where \(b\) is the bias and \(w\) is the weight vector.

\(w\) is the normal to the hyperplane and \(\frac{|b|}{\|w\|}\) is the perpendicular distance from the hyperplane to the origin.

The class of a new observation \(x_i\) is decided by \[D(x_i) = sign(w.x_i + b)\]

\(D()\) is invariant under a positive scaling \(w \rightarrow \lambda w\), \(b \rightarrow \lambda b\). We fix \(\lambda\) such that the closest datapoints satisfy \[w.x + b = 1\] or \[w.x + b = -1\] depending on which side they lie on. This defines what is called the canonical hyperplane.

If \(x_1, x_2\) are support vectors on opposite sides, then subtracting the previous two equations gives: \[w.(x_1 - x_2) = 2\]

If we project the vector \(x_1 - x_2\) onto the normal to the hyperplane, i.e., \(\frac{w}{\|w\|}\), we get twice the size of the margin:

\[(x_1 - x_2) . \frac{w}{\|w\|} = \frac{w.(x_1 - x_2)}{\|w\|} = \frac{2}{\|w\|}\]

So \(\gamma = \frac{1}{\|w\|}\).

We want to maximize the margin \(\gamma\), so we need to maximize \(\frac{1}{\|w\|}\), i.e., minimize \(\|w\|\), which is the same as minimizing \(\frac{1}{2}\|w\|^2\).

But this minimization has constraints: the vector \(w\) must be such that all the conditions \[y_i (w.x_i + b) \geq 1\] hold, so that the original dataset is classified correctly. The constraint is satisfied with equality only at the support vectors; for every other datapoint the left-hand side is greater than one.
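Putting the objective and the constraints together, this is a standard quadratic program (the hard-margin primal problem):

\[\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w.x_i + b) \geq 1, \;\; i = 1, \ldots, n\]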

Solving this constrained optimization requires Lagrange multipliers, a derivation we will not follow here (check the references above).
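Skipping the details, introducing one multiplier \(\alpha_i \geq 0\) per constraint leads to the well-known dual problem, stated only in terms of dot products between datapoints:

\[\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \; x_i . x_j \quad \text{subject to} \quad \alpha_i \geq 0, \;\; \sum_i \alpha_i y_i = 0\]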

This optimization will provide us with values \(\alpha_i\) such that: \[w = \sum_i \alpha_i y_i x_i\] \[\sum_i \alpha_i y_i = 0\]

where \(\alpha_i > 0\) if \(x_i\) is a support vector, and \(\alpha_i = 0\) otherwise.

Denoting by \(S\) the set of indices of the support vectors, the bias is given by: \[b = \frac{1}{|S|} \sum_{s \in S} \Big( y_s - x_s . \sum_{m \in S} \alpha_m y_m x_m \Big)\]

With \(b\) and \(w\) the hyperplane is defined and we get our SVM!

library(e1071)
svm.model <- svm(type ~ ., data=my.data, type='C-classification', kernel='linear',scale=FALSE)

plot(my.data[,-3],col=(ys+3)/2, pch=19, xlim=c(-1,6), ylim=c(-1,6)); abline(h=0,v=0,lty=3)
points(my.data[svm.model$index,c(1,2)],col="blue",cex=2) # show the support vectors

# get the parameters of the hyperplane
w <- t(svm.model$coefs) %*% svm.model$SV
b <- -svm.model$rho
# in this 2D case the hyperplane is the line w[1,1]*x1 + w[1,2]*x2 + b = 0
abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="blue", lty=3)
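As a sanity check, the formulas above can be verified numerically against the fitted object. This is a small sketch that relies on e1071's documented components (svm.model$coefs holds \(\alpha_i y_i\), svm.model$SV the support vectors, svm.model$index their row numbers) and on the fact that here the internal \(\pm 1\) coding used by svm() coincides with ys; note also that svm() fits a soft-margin SVM with default cost=1, so the hard-margin identities hold only approximately:

S   <- svm.model$SV                 # the support vectors x_s
y.s <- ys[svm.model$index]          # their labels y_s
1/sqrt(sum(w^2))                    # the margin gamma = 1/||w||
mean(y.s - S %*% t(w))              # b from the formula above; should be close to b = -rho
y.s * (S %*% t(w) + b)              # should all be close to 1 (the support vector constraints)
ys * (as.matrix(my.data[,1:2]) %*% t(w) + b)  # should be >= 1 (approximately) for every datapoint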

And let’s make a prediction:

observations <- data.frame(x1=c(1,3.5),x2=c(4,3.5))

plot(my.data[,-3],col=(ys+3)/2, pch=19, xlim=c(-1,6), ylim=c(-1,6)); abline(h=0,v=0,lty=3)
points(observations[1,], col="green", pch=19)
points(observations[2,], col="blue", pch=19)
abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="blue", lty=3)

predict(svm.model, observations) # both predictions are correct
##  1  2 
##  1 -1 
## Levels: -1 1
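We can also reproduce the decision rule \(D(x) = sign(w.x + b)\) by hand, using the w and b extracted above (which class maps to the positive sign is an internal choice of svm(); here it happens to coincide with the labels):

manual <- as.matrix(observations) %*% t(w) + b
sign(manual)   # compare with the predictions above
# e1071 can also report its own decision values:
attr(predict(svm.model, observations, decision.values=TRUE), "decision.values")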

Another example with the iris dataset:

data(iris)
svm.model <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris, kernel = "linear")
# the + are support vectors
plot(iris$Sepal.Length, iris$Sepal.Width, col = as.integer(iris[, 5]), 
     pch = c("o","+")[(1:150 %in% svm.model$index) + 1], cex = 2, 
     xlab = "Sepal length", ylab = "Sepal width")

plot(svm.model, iris, Sepal.Width ~ Sepal.Length)

svm.pred  <- predict(svm.model, iris[,-5]) 
table(pred = svm.pred, true = iris[,5]) # show the confusion matrix
##             true
## pred         setosa versicolor virginica
##   setosa         49          0         0
##   versicolor      1         38        15
##   virginica       0         12        35
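As a quick summary of the confusion matrix, the training accuracy is the proportion of observations on its diagonal:

conf <- table(pred = svm.pred, true = iris[,5])
sum(diag(conf)) / sum(conf)    # training accuracy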

Non-linearly separable data

Notice that the \(x_i\) always appear inside a dot product. This means that the results do not depend on the dimension of the input space. So, one approach is to increase the dimension of the data using a mapping \(\phi\), turning each \(x_i\) into \(\phi(x_i)\), such that the new data may be linearly separable: \[x_i . x_j \rightarrow \phi(x_i) . \phi(x_j)\]

Here’s an example of how a 2D cloud of points that is not linearly separable in the plane can be separated by adding a height dimension in 3D:
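A minimal sketch of this idea (the dataset and the lifting coordinate are chosen here just for illustration): points inside a disc versus points outside it cannot be separated by a line in 2D, but adding the squared distance to the origin as a third coordinate makes them separable by a plane.

set.seed(101)
n  <- 200
x1 <- runif(n, -1, 1); x2 <- runif(n, -1, 1)
r2 <- x1^2 + x2^2
keep <- r2 < 0.25 | r2 > 0.7           # leave a gap so the two classes are well separated
d2 <- data.frame(x1=x1, x2=x2, type=as.factor(ifelse(r2 < 0.25, +1, -1)))[keep,]
d3 <- cbind(d2[,1:2], x3 = d2$x1^2 + d2$x2^2, type = d2$type)   # lift to 3D with the height x3
# a linear SVM cannot separate the two classes in the original 2D space ...
table(pred = predict(svm(type ~ ., data=d2, kernel='linear'), d2), true = d2$type)
# ... but in 3D a plane x3 = constant separates them (training errors should essentially vanish)
table(pred = predict(svm(type ~ ., data=d3, kernel='linear'), d3), true = d3$type)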

This higher dimensional space is called a feature space and it must be a Hilbert space (i.e., the concept of dot product applies).

The function \[K(x_i,x_j) = \phi(x_i) . \phi(x_j)\] is called a kernel (in fact we just need to know the kernel, not the mapping \(\phi\)). The kernel is, therefore, the inner product between mapped pairs of points in feature space.
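As a classical example, in 2D the quadratic kernel \(K(x,z) = (x.z)^2\) corresponds to the explicit mapping

\[\phi(u,v) = (u^2, \sqrt{2}\,uv, v^2)\]

since, writing \(x = (u_1,v_1)\) and \(z = (u_2,v_2)\), \(\phi(x).\phi(z) = u_1^2 u_2^2 + 2 u_1 v_1 u_2 v_2 + v_1^2 v_2^2 = (u_1 u_2 + v_1 v_2)^2 = (x.z)^2\). In practice we never need to construct \(\phi\) explicitly; evaluating \(K\) is enough.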

Different choices of kernel define different Hilbert spaces to use.

Some popular kernels:

- linear: \(K(x_i,x_j) = x_i . x_j\)
- polynomial: \(K(x_i,x_j) = (a \; x_i . x_j + b)^d\)
- Gaussian radial basis function (RBF): \(K(x_i,x_j) = \exp(-a \|x_i - x_j\|^2)\)
- sigmoid: \(K(x_i,x_j) = \tanh(a \; x_i . x_j + b)\)

where \(a, b\) are parameters defining the kernel’s behaviour (and \(d\) the polynomial degree).

Kernels can also be defined by algorithms, not only by functions.

The decision function becomes \[D(z) = sign\big(\sum_i \alpha_i y_i K(x_i,z) + b\big)\]

An R example using the polynomial kernel \((0.1 \; x_i . x_j + 1)^8\) (in svm() this corresponds to gamma=0.1, coef0=1, degree=8):

svm.model <- svm(type ~ ., data=my.data, type='C-classification', kernel='polynomial', degree=8, gamma=0.1, coef0=1, scale=FALSE)

plot(my.data[,-3],col=(ys+3)/2, pch=19, xlim=c(-1,6), ylim=c(-1,6)); abline(h=0,v=0,lty=3)
points(my.data[svm.model$index,c(1,2)],col="blue",cex=2) # show the support vectors

svm.pred <- predict(svm.model, my.data[,-3])
table(pred = svm.pred, true = my.data[,3]) 
##     true
## pred -1 1
##   -1  6 0
##   1   0 6
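As in the linear case, the kernelized decision function \(D(z) = sign\big(\sum_i \alpha_i y_i K(x_i,z) + b\big)\) can be evaluated by hand from the fitted object. A sketch, relying on svm.model$coefs holding \(\alpha_i y_i\), svm.model$SV the support vectors, svm.model$rho \(= -b\), and on scale=FALSE keeping the support vectors in the original units:

K <- function(x, z) (0.1 * sum(x * z) + 1)^8     # the polynomial kernel used above
D <- function(z) sign(sum(svm.model$coefs * apply(svm.model$SV, 1, K, z=z)) - svm.model$rho)
apply(my.data[,1:2], 1, D)   # should reproduce the classes (up to the internal sign convention)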

Using the same kernel with the iris dataset:

svm.model <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris, kernel = 'polynomial', degree=8, gamma=0.1, coef0=1)
plot(svm.model, iris, Sepal.Width ~ Sepal.Length)

svm.pred  <- predict(svm.model, iris[,-5]) 
table(pred = svm.pred, true = iris[,5]) # show the confusion matrix
##             true
## pred         setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         37        15
##   virginica       0         13        35
# not great, but we only used two attributes. If we use all four:
svm.model <- svm(Species ~ ., data = iris, kernel = 'polynomial', degree=8, gamma=0.1, coef0=1)
plot(svm.model, iris, Sepal.Width ~ Sepal.Length, 
     slice = list(Petal.Width = 3, Petal.Length = 2.5)) # showing a 2D slice of the 4D space
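The confusion matrix of this four-attribute model can be computed in the same way to check the improvement:

svm.pred <- predict(svm.model, iris[,-5])
table(pred = svm.pred, true = iris[,5])   # confusion matrix for the model using all four attributes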