## Introduction


Gaussian Discriminant Analysis (GDA) is a generative method: given data $$x$$ and class $$y$$, we learn the joint distribution $$p(x,y)$$ and use it to predict $$p(y|x)$$.

The general steps for a generative model are:

1. Model $$p(y)$$ and $$p(x|y)$$

2. Learn parameters of the model by maximizing $$p(x,y)=p(x|y)p(y)$$

3. With $p(y|x) = \frac{p(x|y)p(y)}{p(x)} = \frac{p(x|y)p(y)}{\sum_{y_i\in Y}p(x|y_i)p(y_i)} \propto p(x|y)p(y)$ predict $$p(y|x)$$ by $\arg \max_y p(y|x) = \arg \max_y \frac{p(x|y)p(y)}{p(x)} = \arg \max_y p(x|y)p(y)$

Let’s consider two classes for $$y$$, $$y_0 = 0, y_1 = 1$$.

Step 1. In GDA we model $$p(y)$$ as:

$y \sim \text{Bernoulli}(\phi) \implies p(y|\phi) = \phi^y(1-\phi)^{1-y}$

and model $$p(x|y)$$ by a multivariate Gaussian with a given covariance matrix $$\Sigma$$:

$x|y_i \sim \mathcal{N}(\mu_i,\Sigma) \implies p(x|y_i) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} (x-\mu_i)^T \Sigma^{-1} (x-\mu_i) \Big)$

Step 2. Express the joint likelihood of the data $$x=(x^1,x^2,\ldots,x^m), y=(y^1,y^2,\ldots,y^m)$$:

$p(x,y|\phi,\mu_0,\mu_1,\Sigma) = \prod_{i=1}^m p(x^i,y^i) = \prod_{i=1}^m p(x^i|y^i)p(y^i)$

and maximize the expression in $$\phi,\mu_0,\mu_1,\Sigma$$.

The solutions are:

$\phi = \frac{\sum_i I(y^i = 1)}{m}$

$\mu_k = \frac{\sum_i I(y^i = k)\,x^i}{\sum_i I(y^i = k)}$

$\Sigma = \frac{1}{m} \Big[ \sum_{i:\,y^i=0} (x^i-\mu_0)(x^i-\mu_0)^T + \sum_{i:\,y^i=1} (x^i-\mu_1)(x^i-\mu_1)^T \Big]$

Step 3. Given the estimated parameters, predict via $\arg \max_y p(x|y)p(y)$. With two classes this amounts to checking which of $$p(x|y=0)p(y=0)$$ and $$p(x|y=1)p(y=1)$$ is larger.
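A minimal sketch of these three steps on toy data (assuming the mvtnorm package for the multivariate Gaussian density; all names here are illustrative):

library(mvtnorm)   # dmvnorm() gives the multivariate Gaussian density
set.seed(101)

# toy data: rows of X are samples, y holds the 0/1 class labels
X <- rbind(matrix(rnorm(40, mean=0), ncol=2),
           matrix(rnorm(40, mean=2), ncol=2))
y <- rep(c(0,1), each=20)

# Step 2: maximum-likelihood estimates
phi <- mean(y == 1)
mu0 <- colMeans(X[y==0,])
mu1 <- colMeans(X[y==1,])
Xc  <- rbind(sweep(X[y==0,], 2, mu0), sweep(X[y==1,], 2, mu1))  # center each class
Sigma <- t(Xc) %*% Xc / nrow(X)                                 # shared covariance (LDA assumption)

# Step 3: classify a new point by arg max_y p(x|y)p(y)
x.new <- c(1, 1)
p0 <- dmvnorm(x.new, mean=mu0, sigma=Sigma) * (1-phi)
p1 <- dmvnorm(x.new, mean=mu1, sigma=Sigma) * phi
ifelse(p1 > p0, 1, 0)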

Here we assumed that $$\Sigma$$ is the same for both class-conditional densities; this is what makes the method Linear Discriminant Analysis. If each class has its own covariance matrix, it is called Quadratic Discriminant Analysis.

## Linear Discriminant Analysis


Linear Discriminant Analysis (LDA) finds a linear combination of features that separates different classes. This combination can be used to perform classification or for dimensionality reduction before classification (using another method).

> LDA is closely related to […] regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. In [regression analysis] however, the dependent variable is a numerical quantity, while for LDA it is a categorical variable (i.e. the class label). Logistic regression and probit regression are more similar to LDA, as they also explain a categorical variable. These other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method. (Wikipedia)

Consider a classification problem with $$K$$ classes. Each class $$C_i$$ has $$N_i$$ m-dimensional samples. Let $$X$$ be a matrix where each column corresponds to one sample. The goal is to obtain a transformation of $$X$$ to $$X'$$ by projecting the samples of $$X$$ onto a subspace of dimension at most $$K-1$$.

In two dimensions (2 classes) we want to project the data onto the line that maximizes the separability of the classes (images from Aly Farag's LDA Tutorial).

LDA is a parametric method, i.e., it assumes some distribution over the data, in this case unimodal Gaussian class-conditional likelihoods. If the distribution is significantly non-Gaussian, LDA will perform badly, since it cannot preserve any complex structure in the data. However, by assuming a distribution it is possible to assign probabilities to group assignments (as shown in the next section).

If the discriminatory information is not in the mean but in the variance, LDA will fail.

## The Basic Idea

Consider a sample $$x^T = (x_1, \ldots, x_m)$$, where each $$x_i$$ is a random variable.

We assume that $$x$$ follows a Multivariate Normal Distribution (MVN) with mean $$\mu$$ (an m-dimensional vector) and covariance matrix $$\Sigma$$ (an $$m \times m$$ matrix), i.e., $$x \sim N(\mu,\Sigma)$$.

The pdf is:

$f(x|\mu,\Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp \Big[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \Big]$

Notice that the set $$\{x:f(x|\mu,\Sigma)>C\}$$ is (the interior of) an m-dimensional ellipsoid centered at the point $$\mu$$, where $$\Sigma$$ determines the ellipsoid's shape.

E.g., for $$\mu=(0,0)$$ and $$\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 3 \end{pmatrix}$$ the contours of the density are tilted ellipses:
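A minimal sketch to draw this contour plot in R (assuming the mvtnorm package is available):

library(mvtnorm)

mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 3), ncol=2)

# evaluate the density on a grid and draw its level curves
xs <- seq(-4, 4, length.out=100)
ys <- seq(-6, 6, length.out=100)
grid <- as.matrix(expand.grid(x1=xs, x2=ys))
z <- matrix(dmvnorm(grid, mean=mu, sigma=Sigma), nrow=length(xs))
contour(xs, ys, z, xlab="x1", ylab="x2")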

Def: The Mahalanobis distance of a point $$x$$ to mean $$\mu$$ is

$D = \sqrt{ (x-\mu)^T \Sigma^{-1} (x-\mu) }$

Two points have the same Mahalanobis distance if they are on the same ellipsoid centered on $$\mu$$.
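In R this is available as `stats::mahalanobis()`, which returns the squared distance; a quick check against the definition above:

mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 3), ncol=2)
x     <- c(1, 2)

sqrt(mahalanobis(x, center=mu, cov=Sigma))    # D, via the built-in (which returns D^2)
sqrt(t(x-mu) %*% solve(Sigma) %*% (x-mu))     # D, computed directly from the formula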

Now suppose that given a point $$x$$ we wish to find the closest mean $$\mu_k$$ (of class $$k$$) as measured by the Mahalanobis distance. A point is closer to $$\mu_1$$ than to $$\mu_2$$ when

$(x-\mu_1)^T \Sigma^{-1}_1 (x-\mu_1) < (x-\mu_2)^T \Sigma^{-1}_2 (x-\mu_2)$

This is a quadratic expression in $$x$$.

If we assume that all covariance matrices are equal, $$\Sigma = \Sigma_k, k=1\ldots K$$, the expression simplifies to $-2\mu_1^T\Sigma^{-1}x + \mu_1^T\Sigma^{-1}\mu_1 < -2\mu_2^T\Sigma^{-1}x + \mu_2^T\Sigma^{-1}\mu_2$, which is linear in $$x$$.
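To see why, expand each quadratic form (using the symmetry of $$\Sigma^{-1}$$):

$(x-\mu_k)^T \Sigma^{-1} (x-\mu_k) = x^T\Sigma^{-1}x - 2\mu_k^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k$

The term $$x^T\Sigma^{-1}x$$ appears on both sides of the inequality and cancels, leaving only terms linear in $$x$$.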

Whether we obtain linear or quadratic discriminant analysis (LDA or QDA) depends on the assumptions we make. In LDA the per-class covariance matrices are pooled into a single one in order to obtain that linear expression; this makes the model simpler, but all class groups share the same shape. In QDA each class keeps its own covariance matrix, so the ellipsoids' shapes vary.

To classify a new sample $$x$$, we compute the posterior $$P(x \in C_k|x)$$ for every class and assign the class with the largest posterior probability. The class-conditional likelihood $$p(x|x \in C_k)$$ is the MVN seen above. The only other ingredient needed to apply Bayes' theorem is a prior probability $$P(x \in C_k)$$, which is usually just the proportion of samples belonging to class $$C_k$$. Another option is a vague prior $$P(x \in C_k) = 1/K$$.
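A minimal sketch of this Bayes computation for two classes (hypothetical means `mu1`, `mu2`, shared covariance `Sigma`, class-proportion priors; assumes the mvtnorm package):

library(mvtnorm)

# hypothetical parameters for two classes sharing one covariance matrix
mu1   <- c(0, 0)
mu2   <- c(2, 2)
Sigma <- matrix(c(1, 0.3, 0.3, 1), ncol=2)
prior <- c(0.6, 0.4)   # e.g. the class proportions in the training set

x.new <- c(1.5, 1.2)
lik <- c(dmvnorm(x.new, mean=mu1, sigma=Sigma),
         dmvnorm(x.new, mean=mu2, sigma=Sigma))

posterior <- lik * prior / sum(lik * prior)  # Bayes' theorem, normalized over the classes
posterior
which.max(posterior)                         # assign the class with the largest posterior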

In LDA the decision boundary between classes $$C_i$$ and $$C_j$$ is given by $\log \frac{P(x \in C_i|x)}{P(x \in C_j|x)} = 0$
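With the shared covariance and class priors $$\pi_i, \pi_j$$, this log-ratio works out to be linear in $$x$$, so the boundary is a hyperplane:

$\log \frac{P(x \in C_i|x)}{P(x \in C_j|x)} = \log\frac{\pi_i}{\pi_j} - \frac{1}{2}(\mu_i+\mu_j)^T\Sigma^{-1}(\mu_i-\mu_j) + x^T\Sigma^{-1}(\mu_i-\mu_j)$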

## LDA for Classification

library(MASS)

X <- matrix(c(2,3,3,4,4,5,5,6,5,7,2,1,3,2,4,2,4,3,6,4,7,6,1,6,2,7,1,7), ncol=2, byrow=T)
Y <- c(1,1,1,1,1,2,2,2,2,2,2,3,3,3);
plot(X,col=Y, pch=19)

idx_class1 <- which(Y==1)  # indexes of the observations for each class
idx_class2 <- which(Y==2)

my.data <- data.frame(X1=X[,1], X2=X[,2], Y=Y)
model <- lda(Y ~ . , data=my.data)
model
## Call:
## lda(Y ~ ., data = my.data)
##
## Prior probabilities of groups:
##         1         2         3
## 0.3571429 0.4285714 0.2142857
##
## Group means:
##         X1       X2
## 1 3.800000 5.000000
## 2 4.333333 3.000000
## 3 1.333333 6.666667
##
## Coefficients of linear discriminants:
##          LD1       LD2
## X1 -2.025094 0.3695336
## X2  1.963192 0.2946290
##
## Proportion of trace:
##    LD1    LD2
## 0.9985 0.0015
projected.data <- X %*% model$scaling
plot(projected.data, col=Y, pch=19)

# now let's predict the class of a new sample
new.data <- data.frame(X1=c(2), X2=c(5.1)) # new data point should be class 3
prediction <- predict(model, new.data)

model$prior          # the prior probabilities before the new data point
##         1         2         3
## 0.3571429 0.4285714 0.2142857
prediction$posterior # the posterior probs for the classes given this new data point
##           1            2          3
## 1 0.9397416 2.147726e-14 0.06025842

# Let's plot it with the color for the class it predicts:
plot(X, col=Y, pch=19)
points(new.data$X1, new.data$X2, pch=17, col=prediction$class)
# Plot the means:
points(model$means, pch="+", col=1:3)

Let's see the boundaries:

library(klaR) # Classification and visualization package
partimat(x=my.data[,-3], grouping=as.factor(my.data[,3]), method="lda",
         col.mean=1, image.colors=c("lightgrey","red","green"))

Let's do the same for QDA:

model <- qda(Y ~ . , data=my.data)
new.data <- data.frame(X1=c(2), X2=c(5.1)) # new data point should be class 3
prediction <- predict(model, new.data)
prediction$posterior
##              1            2        3
## 1 5.403558e-05 8.261899e-14 0.999946
partimat(x=my.data[,-3], grouping=as.factor(my.data[,3]), method="qda",
col.mean=1, image.colors = c("lightgrey","red","green"))

## LDA for Dimensionality Reduction before Classification

Here the objective is dimensionality reduction but, unlike PCA, the goal is to preserve as much class-discriminatory information as possible.
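To see the difference on a self-contained example, one can compare the first PCA direction (which ignores the classes) with the first LDA direction on the iris data; a minimal sketch (the `X.iris`/`Y.iris` names are just illustrative):

library(MASS)

X.iris <- scale(as.matrix(iris[,-5]))   # features, scaled
Y.iris <- iris$Species

pca.proj <- prcomp(X.iris)$x[,1]        # 1st principal component (ignores the classes)
lda.proj <- as.vector(X.iris %*% lda(X.iris, grouping=Y.iris)$scaling[,1])  # 1st linear discriminant

par(mfrow=c(1,2))
plot(pca.proj, col=unclass(Y.iris), pch=19, main="PCA projection")
plot(lda.proj, col=unclass(Y.iris), pch=19, main="LDA projection")
par(mfrow=c(1,1))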

Let's see an example (here `my.data` is assumed to be a data frame whose first column is the class label and whose columns 2 to 14 are numeric features, e.g. the UCI wine dataset):

set.seed(101)
Y <- my.data[,1]                            # classes
X <- as.data.frame(scale(my.data[,2:14]))   # scaling data (ie, mean=0 & sd=1 for all Xi)
names(X) <- paste0("X",1:13)
head(cbind(signif(X,3),Y), n=10)
##       X1      X2     X3     X4      X5    X6    X7     X8     X9      X10
## 1  1.510 -0.5610  0.231 -1.170  1.9100 0.807 1.030 -0.658  1.220  0.25100
## 2  0.246 -0.4980 -0.826 -2.480  0.0181 0.567 0.732 -0.818 -0.543 -0.29200
## 3  0.196  0.0212  1.110 -0.268  0.0881 0.807 1.210 -0.497  2.130  0.26800
## 4  1.690 -0.3460  0.487 -0.807  0.9280 2.480 1.460 -0.979  1.030  1.18000
## 5  0.295  0.2270  1.840  0.451  1.2800 0.807 0.661  0.226  0.400 -0.31800
## 6  1.480 -0.5160  0.304 -1.290  0.8580 1.560 1.360 -0.176  0.662  0.73000
## 7  1.710 -0.4170  0.304 -1.470 -0.2620 0.327 0.491 -0.497  0.680  0.08280
## 8  1.300 -0.1670  0.888 -0.567  1.4900 0.487 0.481 -0.417 -0.596 -0.00349
## 9  2.250 -0.6230 -0.716 -1.650 -0.1920 0.807 0.952 -0.577  0.680  0.06120
## 10 1.060 -0.8830 -0.352 -1.050 -0.1220 1.090 1.120 -1.140  0.453  0.93300
##       X11   X12     X13 Y
## 1   0.361 1.840  1.0100 1
## 2   0.405 1.110  0.9630 1
## 3   0.317 0.786  1.3900 1
## 4  -0.426 1.180  2.3300 1
## 5   0.361 0.448 -0.0378 1
## 6   0.405 0.336  2.2300 1
## 7   0.274 1.360  1.7200 1
## 8   0.449 1.360  1.7400 1
## 9   0.536 0.336  0.9470 1
## 10  0.230 1.320  0.9470 1
train <- sample(1:dim(X)[1],50)  # define the training set indexes

# First test with a knn
library(class)
prediction <- knn(X[train,], X[-train,], Y[train], k=1)
table1 <- table(old=Y[-train], new=prediction)
table1
##    new
## old  1  2  3
##   1 40  0  0
##   2  6 42  2
##   3  0  1 37

This is the result of applying KNN.

The next test first reduces the dataset's dimensionality with an LDA, then classifies:

library(MASS)
model <- lda(Y ~ . , data=cbind(X,Y), subset=train)
model$prior   # the prior probabilities used (by default, the class proportions for the training set)
##    1    2    3
## 0.38 0.42 0.20
model$scaling # the resulting projection matrix to project our dataset into a 2-dimensional vector space
##             LD1         LD2
## X1   0.10022270  0.94483391
## X2   0.09532697  0.53392351
## X3  -0.09876930  0.44206847
## X4   1.17749547  0.01822394
## X5  -0.27710709  0.04414024
## X6   0.48981986  0.09808563
## X7  -1.17527870 -0.60381590
## X8  -0.33679479 -0.19980790
## X9   0.07133650 -0.02637518
## X10  1.03977509  0.90506536
## X11  0.10599483 -0.02142074
## X12 -1.36310559 -0.10681806
## X13 -1.23185636  0.65014190
projected.data <- as.matrix(X) %*% model$scaling # make the projection to 2D space
head(projected.data)
##            LD1        LD2
## [1,] -5.792619  1.5313280
## [2,] -6.176574 -0.4178032
## [3,] -3.610735  1.1316789
## [4,] -4.510693  3.6465953
## [5,] -1.240909  0.5314009
## [6,] -4.821236  2.6848295
plot(projected.data, col=Y)

# Now let's reapply knn, this time on the projected data:
prediction <- knn(projected.data[train,], projected.data[-train,], Y[train], k=1)
table2 <- table(old=Y[-train], new=prediction)
table2
##    new
## old  1  2  3
##   1 40  0  0
##   2  1 45  4
##   3  0  0 38

We have greatly improved the test error, from 9 misclassifications down to just 5.

## LDA projections (iris eg)

library(MASS)
X <- scale(as.matrix(iris[,-5])) # better to scale to zero mean and unit variance before LDA
Y <- unclass(iris$Species)
iris1 <- data.frame(X=X, Y=Y)
colnames(iris1) <- colnames(iris)
head(iris1)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1   -0.8976739  1.01560199    -1.335752   -1.311052       1
## 2   -1.1392005 -0.13153881    -1.335752   -1.311052       1
## 3   -1.3807271  0.32731751    -1.392399   -1.311052       1
## 4   -1.5014904  0.09788935    -1.279104   -1.311052       1
## 5   -1.0184372  1.24503015    -1.335752   -1.311052       1
## 6   -0.5353840  1.93331463    -1.165809   -1.048667       1
model <- lda(Species ~ . , data=iris1, prior=c(1,1,1)/3)

plot(iris1[,"Sepal.Length"], iris1[,"Petal.Length"],
col=c("blue","green","red")[iris1$Species], pch=19, xlab="Sepal Length", ylab="Petal.Length") means <- model$means
points(means[,c(1,3)], pch=3, lwd=2, col="purple")

model
## Call:
## lda(Species ~ ., data = iris1, prior = c(1, 1, 1)/3)
##
## Prior probabilities of groups:
##         1         2         3
## 0.3333333 0.3333333 0.3333333
##
## Group means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1   -1.0111914   0.8504137   -1.3006301  -1.2507035
## 2    0.1119073  -0.6592236    0.2843712   0.1661774
## 3    0.8992841  -0.1911901    1.0162589   1.0845261
##
## Coefficients of linear discriminants:
##                     LD1         LD2
## Sepal.Length  0.6867795  0.01995817
## Sepal.Width   0.6688251  0.94344183
## Petal.Length -3.8857950 -1.64511887
## Petal.Width  -2.1422387  2.16413593
##
## Proportion of trace:
##    LD1    LD2
## 0.9912 0.0088
model$scaling # the parameters of the linear discriminant functions
##                     LD1         LD2
## Sepal.Length  0.6867795  0.01995817
## Sepal.Width   0.6688251  0.94344183
## Petal.Length -3.8857950 -1.64511887
## Petal.Width  -2.1422387  2.16413593

E.g., the first discriminant function is the linear combination of the variables

$0.687 \times \text{Sepal Length} + 0.669 \times \text{Sepal Width} - 3.886 \times \text{Petal Length} - 2.142 \times \text{Petal Width}$

We can use predict to calculate the value of each discriminant function for every sample in the data set:

pred <- predict(model, iris[,-5])
# The horizontal axes of the next plots are meaningless; they depend on the sample order of the dataset
head(pred$x[,1]) # the values of the first discriminant function for the dataset observations
##           1           2           3           4           5           6
## -0.02509743 -0.49686587 -0.11187726 -1.02459673 -0.02689287 -1.14571979
plot(pred$x[,1], col=c("blue","green","red")[iris1$Species], pch=19) # we can plot them

# Notice that the 2nd discriminant function does not separate the 2nd & 3rd classes that well
plot(pred$x[,2], col=c("blue","green","red")[iris1$Species], pch=19) # we can plot them

Let's use Sepal.Length and Petal.Length to visualize the projections given by both linear discriminant functions:

vec <- c(model$scaling[1,1], model$scaling[3,1])
v   <- vec / sqrt(sum(vec^2))  # make it a unit vector
lda1.points <- as.matrix(iris1[,c(1,3)]) %*% v %*% t(v) # to project points X onto the unit vector v, compute X v v^T

plot(iris1[,"Sepal.Length"], iris1[,"Petal.Length"],
col=c("blue","green","red")[iris1$Species], pch=19, xlab="Sepal Length", ylab="Petal.Length", , main="1st discriminant functions") segments(-vec[1],-vec[2],vec[1],vec[2]) # points(lda1.points , col=c("blue","green","red")[iris1$Species], pch=18) # draw projection point
for(i in 1:nrow(iris1)) {
segments(iris1[i,1], iris1[i,3], lda1.points[i,1], lda1.points[i,2],
lty=2, col=c("blue","green","red")[iris1[i,]$Species]) } vec <- c(model$scaling[1,2], model$scaling[3,2]) v <- vec / sqrt(sum(vec^2)) lda2.points <- as.matrix(iris1[,c(1,3)]) %*% v %*% t(v) plot(iris1[,"Sepal.Length"], iris1[,"Petal.Length"], col=c("blue","green","red")[iris1$Species], pch=19,
xlab="Sepal Length", ylab="Petal.Length", ,  main="2nd discriminant functions")
segments(-2*vec[1],-2*vec[2],2*vec[1],2*vec[2])
# points(lda2.points , col=c("blue","green","red")[iris1$Species], pch=18) # draw projection point for(i in 1:nrow(iris1)) { segments(iris1[i,1], iris1[i,3], lda2.points[i,1], lda2.points[i,2], lty=2, col=c("blue","green","red")[iris1[i,]$Species])
}