## Introduction


Gaussian Discriminant Analysis (GDA) is a generative method: given data $$x$$ and class $$y$$, we learn the joint distribution $$p(x,y)$$ and use it to predict $$p(y|x)$$.

The general steps for a generative model are:

1. Model $$p(y)$$ and $$p(x|y)$$

2. Learn parameters of the model by maximizing $$p(x,y)=p(x|y)p(y)$$

3. With $p(y|x) = \frac{p(x|y)p(y)}{p(x)} = \frac{p(x|y)p(y)}{\sum_{y_i\in Y}p(x|y_i)p(y_i)} \propto p(x|y)p(y)$ predict $$p(y|x)$$ by $\arg \max_y p(y|x) = \arg \max_y \frac{p(x|y)p(y)}{p(x)} = \arg \max_y p(x|y)p(y)$

Let’s consider two classes for $$y$$, $$y_0 = 0, y_1 = 1$$.

Step 1. In GDA we model $$p(y)$$ as:

$y \sim \text{Bernoulli}(\phi) \implies p(y|\phi) = \phi^y(1-\phi)^{1-y}$

and model $$p(x|y)$$ by a multivariate Gaussian with a given covariance matrix $$\Sigma$$:

$x|y_i \sim \mathcal{N}(\mu_i,\Sigma) \implies p(x|y_i) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} (x-\mu_i)^T \Sigma^{-1} (x-\mu_i) \Big)$

Step 2. Express the joint likelihood of the data $$x=(x^1,x^2,\ldots,x^m), y=(y^1,y^2,\ldots,y^m)$$:

$p(x,y|\phi,\mu_0,\mu_1,\Sigma) = \prod_{i=1}^m p(x^i,y^i) = \prod_{i=1}^m p(x^i|y^i)p(y^i)$

and maximize the expression in $$\phi,\mu_0,\mu_1,\Sigma$$.

The solutions are:

$\phi = \frac{\sum_i I(y^i = 1)}{m}$

$\mu_k = \frac{\sum_i I(y^i = k)\,x^i}{\sum_i I(y^i = k)}$

$\Sigma = \frac{1}{m} \Big[ \sum_{i:\,y^i=0} (x^i-\mu_0)(x^i-\mu_0)^T + \sum_{i:\,y^i=1} (x^i-\mu_1)(x^i-\mu_1)^T \Big]$

Step 3. Given the estimated parameters, predict via $\arg \max_y p(x|y)p(y)$. With two classes this amounts to checking which of $$p(x|y=0)p(y=0)$$ and $$p(x|y=1)p(y=1)$$ is larger.
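A minimal sketch of these three steps on toy data (assuming the mvtnorm package for the multivariate Gaussian density; all names here are illustrative):

library(mvtnorm)   # dmvnorm() gives the multivariate Gaussian density
set.seed(101)

# toy data: rows of X are samples, y holds the 0/1 class labels
X <- rbind(matrix(rnorm(40, mean=0), ncol=2),
           matrix(rnorm(40, mean=2), ncol=2))
y <- rep(c(0,1), each=20)

# Step 2: maximum-likelihood estimates
phi <- mean(y == 1)
mu0 <- colMeans(X[y==0,])
mu1 <- colMeans(X[y==1,])
Xc  <- rbind(sweep(X[y==0,], 2, mu0), sweep(X[y==1,], 2, mu1))  # center each class
Sigma <- t(Xc) %*% Xc / nrow(X)                                 # shared covariance (LDA assumption)

# Step 3: classify a new point by arg max_y p(x|y)p(y)
x.new <- c(1, 1)
p0 <- dmvnorm(x.new, mean=mu0, sigma=Sigma) * (1-phi)
p1 <- dmvnorm(x.new, mean=mu1, sigma=Sigma) * phi
ifelse(p1 > p0, 1, 0)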

Here we assumed that $$\Sigma$$ is the same for both class-conditional densities; this is what makes the method Linear Discriminant Analysis. If each class has its own covariance matrix, it is called Quadratic Discriminant Analysis.

## Linear Discriminant Analysis


Linear Discriminant Analysis (LDA) finds a linear combination of features that separates different classes. This combination can be used to perform classification or for dimensionality reduction before classification (using another method).

> LDA is closely related to […] regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. In [regression analysis] however, the dependent variable is a numerical quantity, while for LDA it is a categorical variable (i.e. the class label). Logistic regression and probit regression are more similar to LDA, as they also explain a categorical variable. These other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method. (Wikipedia)

Consider a classification problem with $$K$$ classes. Each class $$C_i$$ has $$N_i$$ m-dimensional samples. Let $$X$$ be a matrix where each column corresponds to one sample. The goal is to obtain a transformation of $$X$$ to $$X'$$ by projecting the samples of $$X$$ onto a subspace of dimension at most $$K-1$$.

In two dimensions (2 classes) we want to project the data onto the line that maximizes the separability of the classes (images from Aly Farag's LDA Tutorial).

LDA is a parametric method, i.e., it assumes some distribution over the data, in this case unimodal Gaussian class-conditional likelihoods. If the distribution is significantly non-Gaussian, LDA will perform badly, since it cannot preserve any complex structure in the data. However, by assuming a distribution it is possible to assign probabilities to group assignments (as shown in the next section).

If the discriminatory information is not in the mean but in the variance, LDA will fail.

## The Basic Idea

Consider a sample $$x^T = (x_1, \ldots, x_m)$$, where each $$x_i$$ is a random variable.

We assume that $$x$$ follows a Multivariate Normal Distribution (MVN) with mean $$\mu$$ (an m-dimensional vector) and covariance matrix $$\Sigma$$ (an $$m \times m$$ matrix), i.e., $$x \sim N(\mu,\Sigma)$$.

The pdf is:

$f(x|\mu,\Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp \Big[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \Big]$

Notice that the set $$\{x:f(x|\mu,\Sigma)>C\}$$ is (the interior of) an m-dimensional ellipsoid centered at the point $$\mu$$, where $$\Sigma$$ determines the ellipsoid's shape.

E.g., for $$\mu=(0,0)$$ and $$\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 3 \end{pmatrix}$$ the contours of the density are tilted ellipses:
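A minimal sketch to draw this contour plot in R (assuming the mvtnorm package is available):

library(mvtnorm)

mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 3), ncol=2)

# evaluate the density on a grid and draw its level curves
xs <- seq(-4, 4, length.out=100)
ys <- seq(-6, 6, length.out=100)
grid <- as.matrix(expand.grid(x1=xs, x2=ys))
z <- matrix(dmvnorm(grid, mean=mu, sigma=Sigma), nrow=length(xs))
contour(xs, ys, z, xlab="x1", ylab="x2")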

Def: The Mahalanobis distance of a point $$x$$ to mean $$\mu$$ is

$D = \sqrt{ (x-\mu)^T \Sigma^{-1} (x-\mu) }$

Two points have the same Mahalanobis distance if they are on the same ellipsoid centered on $$\mu$$.
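In R this is available as `stats::mahalanobis()`, which returns the squared distance; a quick check against the definition above:

mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 3), ncol=2)
x     <- c(1, 2)

sqrt(mahalanobis(x, center=mu, cov=Sigma))    # D, via the built-in (which returns D^2)
sqrt(t(x-mu) %*% solve(Sigma) %*% (x-mu))     # D, computed directly from the formula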

Now suppose that given a point $$x$$ we wish to find the closest mean $$\mu_k$$ (of class $$k$$) as measured by the Mahalanobis distance. A point is closer to $$\mu_1$$ than to $$\mu_2$$ when

$(x-\mu_1)^T \Sigma^{-1}_1 (x-\mu_1) < (x-\mu_2)^T \Sigma^{-1}_2 (x-\mu_2)$

This is a quadratic expression in $$x$$.

If we assume that all covariance matrices are equal, $$\Sigma = \Sigma_k, k=1\ldots K$$, the expression simplifies to $-2\mu_1^T\Sigma^{-1}x + \mu_1^T\Sigma^{-1}\mu_1 < -2\mu_2^T\Sigma^{-1}x + \mu_2^T\Sigma^{-1}\mu_2$, which is linear in $$x$$.
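To see why, expand each quadratic form (using the symmetry of $$\Sigma^{-1}$$):

$(x-\mu_k)^T \Sigma^{-1} (x-\mu_k) = x^T\Sigma^{-1}x - 2\mu_k^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k$

The term $$x^T\Sigma^{-1}x$$ appears on both sides of the inequality and cancels, leaving only terms linear in $$x$$.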

Whether we obtain linear or quadratic discriminant analysis (LDA or QDA) depends on the assumptions we make. In LDA the per-class covariance matrices are pooled into a single one in order to obtain that linear expression; this makes the model simpler, but all class groups share the same shape. In QDA each class keeps its own covariance matrix, so the ellipsoids' shapes vary.

To classify a new sample $$x$$, we compute the posterior $$P(x \in C_k|x)$$ for every class and assign the class with the largest posterior probability. The class-conditional likelihood $$p(x|x \in C_k)$$ is the MVN seen above. The only other ingredient needed to apply Bayes' theorem is a prior probability $$P(x \in C_k)$$, which is usually just the proportion of samples belonging to class $$C_k$$. Another option is a vague prior $$P(x \in C_k) = 1/K$$.
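A minimal sketch of this Bayes computation for two classes (hypothetical means `mu1`, `mu2`, shared covariance `Sigma`, class-proportion priors; assumes the mvtnorm package):

library(mvtnorm)

# hypothetical parameters for two classes sharing one covariance matrix
mu1   <- c(0, 0)
mu2   <- c(2, 2)
Sigma <- matrix(c(1, 0.3, 0.3, 1), ncol=2)
prior <- c(0.6, 0.4)   # e.g. the class proportions in the training set

x.new <- c(1.5, 1.2)
lik <- c(dmvnorm(x.new, mean=mu1, sigma=Sigma),
         dmvnorm(x.new, mean=mu2, sigma=Sigma))

posterior <- lik * prior / sum(lik * prior)  # Bayes' theorem, normalized over the classes
posterior
which.max(posterior)                         # assign the class with the largest posterior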

In LDA the decision boundary between classes $$C_i$$ and $$C_j$$ is given by $\log \frac{P(x \in C_i|x)}{P(x \in C_j|x)} = 0$
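With the shared covariance and class priors $$\pi_i, \pi_j$$, this log-ratio works out to be linear in $$x$$, so the boundary is a hyperplane:

$\log \frac{P(x \in C_i|x)}{P(x \in C_j|x)} = \log\frac{\pi_i}{\pi_j} - \frac{1}{2}(\mu_i+\mu_j)^T\Sigma^{-1}(\mu_i-\mu_j) + x^T\Sigma^{-1}(\mu_i-\mu_j)$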

## LDA for Classification

library(MASS)

X <- matrix(c(2,3,3,4,4,5,5,6,5,7,2,1,3,2,4,2,4,3,6,4,7,6,1,6,2,7,1,7), ncol=2, byrow=T)
Y <- c(1,1,1,1,1,2,2,2,2,2,2,3,3,3);
plot(X,col=Y, pch=19)

idx_class1 <- which(Y==1)  # indexes of the observations for each class
idx_class2 <- which(Y==2)

my.data <- data.frame(X1=X[,1], X2=X[,2], Y=Y)
model <- lda(Y ~ . , data=my.data)
model
## Call:
## lda(Y ~ ., data = my.data)
##
## Prior probabilities of groups:
##         1         2         3
## 0.3571429 0.4285714 0.2142857
##
## Group means:
##         X1       X2
## 1 3.800000 5.000000
## 2 4.333333 3.000000
## 3 1.333333 6.666667
##
## Coefficients of linear discriminants:
##          LD1       LD2
## X1 -2.025094 0.3695336
## X2  1.963192 0.2946290
##
## Proportion of trace:
##    LD1    LD2
## 0.9985 0.0015
projected.data <- X %*% model$scaling
plot(projected.data, col=Y, pch=19)

# now let's predict the class of a new sample
new.data <- data.frame(X1=c(2), X2=c(5.1)) # new data point should be class 3
prediction <- predict(model, new.data)

model$prior          # the prior probabilities before the new data point
##         1         2         3
## 0.3571429 0.4285714 0.2142857
prediction$posterior # the posterior probs for the classes given this new data point
##           1            2          3
## 1 0.9397416 2.147726e-14 0.06025842

# Let's plot it with the color for the class it predicts:
plot(X, col=Y, pch=19)
points(new.data$X1, new.data$X2, pch=17, col=prediction$class)
# Plot the means:
points(model$means, pch="+", col=1:3)

Let's see the boundaries:

library(klaR) # Classification and visualization package
partimat(x=my.data[,-3], grouping=as.factor(my.data[,3]), method="lda",
         col.mean=1, image.colors=c("lightgrey","red","green"))

Let's do the same for QDA:

model <- qda(Y ~ . , data=my.data)
new.data <- data.frame(X1=c(2), X2=c(5.1)) # new data point should be class 3
prediction <- predict(model, new.data)
prediction$posterior
##              1            2        3
## 1 5.403558e-05 8.261899e-14 0.999946
partimat(x=my.data[,-3], grouping=as.factor(my.data[,3]), method="qda",
col.mean=1, image.colors = c("lightgrey","red","green"))

## LDA for Dimensionality Reduction before Classification

Here the objective is dimensionality reduction but, unlike PCA, the goal is to preserve as much class-discriminatory information as possible.
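To see the difference on a self-contained example, one can compare the first PCA direction (which ignores the classes) with the first LDA direction on the iris data; a minimal sketch (the `X.iris`/`Y.iris` names are just illustrative):

library(MASS)

X.iris <- scale(as.matrix(iris[,-5]))   # features, scaled
Y.iris <- iris$Species

pca.proj <- prcomp(X.iris)$x[,1]        # 1st principal component (ignores the classes)
lda.proj <- as.vector(X.iris %*% lda(X.iris, grouping=Y.iris)$scaling[,1])  # 1st linear discriminant

par(mfrow=c(1,2))
plot(pca.proj, col=unclass(Y.iris), pch=19, main="PCA projection")
plot(lda.proj, col=unclass(Y.iris), pch=19, main="LDA projection")
par(mfrow=c(1,1))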

Let's see an example (here `my.data` is assumed to be a data frame whose first column is the class label and whose columns 2 to 14 are numeric features, e.g. the UCI wine dataset):

set.seed(101)
Y <- my.data[,1]                            # classes
X <- as.data.frame(scale(my.data[,2:14]))   # scaling data (ie, mean=0 & sd=1 for all Xi)
names(X) <- paste0("X",1:13)
head(cbind(signif(X,3),Y), n=10)
##       X1      X2     X3     X4      X5    X6    X7     X8     X9      X10
## 1  1.510 -0.5610  0.231 -1.170  1.9100 0.807 1.030 -0.658  1.220  0.25100
## 2  0.246 -0.4980 -0.826 -2.480  0.0181 0.567 0.732 -0.818 -0.543 -0.29200
## 3  0.196  0.0212  1.110 -0.268  0.0881 0.807 1.210 -0.497  2.130  0.26800
## 4  1.690 -0.3460  0.487 -0.807  0.9280 2.480 1.460 -0.979  1.030  1.18000
## 5  0.295  0.2270  1.840  0.451  1.2800 0.807 0.661  0.226  0.400 -0.31800
## 6  1.480 -0.5160  0.304 -1.290  0.8580 1.560 1.360 -0.176  0.662  0.73000
## 7  1.710 -0.4170  0.304 -1.470 -0.2620 0.327 0.491 -0.497  0.680  0.08280
## 8  1.300 -0.1670  0.888 -0.567  1.4900 0.487 0.481 -0.417 -0.596 -0.00349
## 9  2.250 -0.6230 -0.716 -1.650 -0.1920 0.807 0.952 -0.577  0.680  0.06120
## 10 1.060 -0.8830 -0.352 -1.050 -0.1220 1.090 1.120 -1.140  0.453  0.93300
##       X11   X12     X13 Y
## 1   0.361 1.840  1.0100 1
## 2   0.405 1.110  0.9630 1
## 3   0.317 0.786  1.3900 1
## 4  -0.426 1.180  2.3300 1
## 5   0.361 0.448 -0.0378 1
## 6   0.405 0.336  2.2300 1
## 7   0.274 1.360  1.7200 1
## 8   0.449 1.360  1.7400 1
## 9   0.536 0.336  0.9470 1
## 10  0.230 1.320  0.9470 1
train <- sample(1:dim(X)[1],50)  # define the training set indexes

# First test with a knn
library(class)
prediction <- knn(X[train,], X[-train,], Y[train], k=1)
table1 <- table(old=Y[-train], new=prediction)
table1
##    new
## old  1  2  3
##   1 40  0  0
##   2  6 42  2
##   3  0  1 37

This is the result of applying KNN.

The next test first reduces the dataset's dimensionality with an LDA, then classifies:

library(MASS)
model <- lda(Y ~ . , data=cbind(X,Y), subset=train)
model$prior   # the prior probabilities used (by default, the class proportions for the training set)
##    1    2    3
## 0.38 0.42 0.20
model$scaling # the resulting projection matrix to project our dataset into a 2-dimensional vector space
##             LD1         LD2
## X1   0.10022270  0.94483391
## X2   0.09532697  0.53392351
## X3  -0.09876930  0.44206847
## X4   1.17749547  0.01822394
## X5  -0.27710709  0.04414024
## X6   0.48981986  0.09808563
## X7  -1.17527870 -0.60381590
## X8  -0.33679479 -0.19980790
## X9   0.07133650 -0.02637518
## X10  1.03977509  0.90506536
## X11  0.10599483 -0.02142074
## X12 -1.36310559 -0.10681806
## X13 -1.23185636  0.65014190
projected.data <- as.matrix(X) %*% model$scaling # make the projection to 2D space
head(projected.data)
##            LD1        LD2
## [1,] -5.792619  1.5313280
## [2,] -6.176574 -0.4178032
## [3,] -3.610735  1.1316789
## [4,] -4.510693  3.6465953
## [5,] -1.240909  0.5314009
## [6,] -4.821236  2.6848295
plot(projected.data, col=Y)

# Now let's reapply knn, this time on the projected data:
prediction <- knn(projected.data[train,], projected.data[-train,], Y[train], k=1)
table2 <- table(old=Y[-train], new=prediction)
table2
##    new
## old  1  2  3
##   1 40  0  0
##   2  1 45  4
##   3  0  0 38

We have greatly improved the test error, from 9 misclassifications down to just 5.

## LDA projections (iris eg)

library(MASS)
X <- scale(as.matrix(iris[,-5])) # better to scale to zero mean and unit variance before LDA
Y <- unclass(iris$Species)
iris1 <- data.frame(X=X, Y=Y)
colnames(iris1) <- colnames(iris)
head(iris1)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1   -0.8976739  1.01560199    -1.335752   -1.311052       1
## 2   -1.1392005 -0.13153881    -1.335752   -1.311052       1
## 3   -1.3807271  0.32731751    -1.392399   -1.311052       1
## 4   -1.5014904  0.09788935    -1.279104   -1.311052       1
## 5   -1.0184372  1.24503015    -1.335752   -1.311052       1
## 6   -0.5353840  1.93331463    -1.165809   -1.048667       1
model <- lda(Species ~ . , data=iris1, prior=c(1,1,1)/3)

plot(iris1[,"Sepal.Length"], iris1[,"Petal.Length"],
col=c("blue","green","red")[iris1$Species], pch=19, xlab="Sepal Length", ylab="Petal.Length") means <- model$means
points(means[,c(1,3)], pch=3, lwd=2, col="purple")

model
## Call:
## lda(Species ~ ., data = iris1, prior = c(1, 1, 1)/3)
##
## Prior probabilities of groups:
##         1         2         3
## 0.3333333 0.3333333 0.3333333
##
## Group means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1   -1.0111914   0.8504137   -1.3006301  -1.2507035
## 2    0.1119073  -0.6592236    0.2843712   0.1661774
## 3    0.8992841  -0.1911901    1.0162589   1.0845261
##
## Coefficients of linear discriminants:
##                     LD1         LD2
## Sepal.Length  0.6867795  0.01995817
## Sepal.Width   0.6688251  0.94344183
## Petal.Length -3.8857950 -1.64511887
## Petal.Width  -2.1422387  2.16413593
##
## Proportion of trace:
##    LD1    LD2
## 0.9912 0.0088
model$scaling # the parameters of the linear discriminant functions
##                     LD1         LD2
## Sepal.Length  0.6867795  0.01995817
## Sepal.Width   0.6688251  0.94344183
## Petal.Length -3.8857950 -1.64511887
## Petal.Width  -2.1422387  2.16413593

E.g., the first discriminant function is the linear combination of the variables

$0.687 \times \text{Sepal Length} + 0.669 \times \text{Sepal Width} - 3.886 \times \text{Petal Length} - 2.142 \times \text{Petal Width}$

We can use predict to calculate the value of each discriminant function for every sample in the data set:

pred <- predict(model, iris[,-5])
# The horizontal axes of the next plots are meaningless; they depend on the sample order of the dataset
head(pred$x[,1]) # the values of the first discriminant function for the dataset observations
##           1           2           3           4           5           6
## -0.02509743 -0.49686587 -0.11187726 -1.02459673 -0.02689287 -1.14571979
plot(pred$x[,1], col=c("blue","green","red")[iris1$Species], pch=19) # we can plot them

# Notice that the 2nd discriminant function does not separate the 2nd & 3rd classes that well
plot(pred$x[,2], col=c("blue","green","red")[iris1$Species], pch=19) # we can plot them

Let's use Sepal.Length and Petal.Length to visualize the projections given by both linear discriminant functions:

vec <- c(model$scaling[1,1], model$scaling[3,1])
v   <- vec / sqrt(sum(vec^2))  # make it a unit vector
lda1.points <- as.matrix(iris1[,c(1,3)]) %*% v %*% t(v) # to project points X onto the unit vector v, compute X v v^T

plot(iris1[,"Sepal.Length"], iris1[,"Petal.Length"],
col=c("blue","green","red")[iris1$Species], pch=19, xlab="Sepal Length", ylab="Petal.Length", , main="1st discriminant functions") segments(-vec[1],-vec[2],vec[1],vec[2]) # points(lda1.points , col=c("blue","green","red")[iris1$Species], pch=18) # draw projection point
for(i in 1:nrow(iris1)) {
segments(iris1[i,1], iris1[i,3], lda1.points[i,1], lda1.points[i,2],
lty=2, col=c("blue","green","red")[iris1[i,]$Species]) } vec <- c(model$scaling[1,2], model$scaling[3,2]) v <- vec / sqrt(sum(vec^2)) lda2.points <- as.matrix(iris1[,c(1,3)]) %*% v %*% t(v) plot(iris1[,"Sepal.Length"], iris1[,"Petal.Length"], col=c("blue","green","red")[iris1$Species], pch=19,
xlab="Sepal Length", ylab="Petal.Length", ,  main="2nd discriminant functions")
segments(-2*vec[1],-2*vec[2],2*vec[1],2*vec[2])
# points(lda2.points , col=c("blue","green","red")[iris1$Species], pch=18) # draw projection point for(i in 1:nrow(iris1)) { segments(iris1[i,1], iris1[i,3], lda2.points[i,1], lda2.points[i,2], lty=2, col=c("blue","green","red")[iris1[i,]$Species])
}