Explaining Naïve Bayes

Given features \(F_1, F_2, \ldots F_n\) and a class \(C\), we wish to know

\[p(C|F_1, F_2, \ldots F_n)\]

By Bayes’ Theorem:

\[p(C|F_1, F_2, \ldots F_n) = \frac{p(C)p(F_1, F_2, \ldots F_n|C)}{p(F_1, F_2, \ldots F_n)} \propto p(C)p(F_1, F_2, \ldots F_n|C)\]

The numerator \(p(C)p(F_1, F_2, \ldots F_n|C)\) is equal to the joint probability \(p(C,F_1, F_2, \ldots F_n)\), which expands by the chain rule:

\[ \begin{array}{lcl} p(C,F_1, F_2, \ldots F_n) & = & p(C) p(F_1, F_2, \ldots F_n|C) \\ & = & p(C) p(F_1|C) p(F_2, \ldots F_n|C,F_1) \\ & = & p(C) p(F_1|C) p(F_2|C,F_1) p(F_3, \ldots F_n|C,F_1,F_2)\\ & = & \ldots \\ & = & p(C) p(F_1|C) p(F_2|C,F_1) \ldots p(F_n|C,F_1\ldots F_{n-1}) \end{array} \]

Naive Bayes ‘naively’ assumes that every feature \(F_i\) is conditionally independent of every other feature \(F_j\) (\(i \neq j\)) given \(C\), i.e.:

\[p(F_i|C,F_j) = p(F_i|C), \quad i \neq j\]

This greatly simplifies the previous chain rule:

\[p(C) p(F_1|C) p(F_2|C,F_1) \ldots p(F_n|C,F_1\ldots F_{n-1}) = p(C)p(F_1|C)p(F_2|C)\ldots p(F_n|C) = p(C) \prod_i p(F_i|C)\]

And so:

\[p(C|F_1, F_2, \ldots F_n) \propto p(C) \prod_i p(F_i|C)\]

An example: here is aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973, classified by admission and sex.

data(UCBAdmissions)
margin.table(UCBAdmissions,1)
## Admit
## Admitted Rejected 
##     1755     2771
df <- as.data.frame(UCBAdmissions)
df[df$Dept=="A",] # just look at Dept. A
##      Admit Gender Dept Freq
## 1 Admitted   Male    A  512
## 2 Rejected   Male    A  313
## 3 Admitted Female    A   89
## 4 Rejected Female    A   19

The estimated probabilities of being admitted to dept. A, for a male and for a female applicant, are: \[\hat{p}(admitted | male, dept.A) = 512 / (512+313) \approx 0.62 \] \[\hat{p}(admitted | female, dept.A) = 89 / (89+19) \approx 0.82 \]
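
These estimates can be checked directly from the Dept. A frequencies above (a quick sketch in R):

512 / (512+313)  # estimated p(admitted | male, dept.A)
## [1] 0.6206061
89 / (89+19)     # estimated p(admitted | female, dept.A)
## [1] 0.8240741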

Let A = admission (admitted/rejected); G = gender and D = department.

Naive Bayes assumes that \[p(A|G,D) \propto p(A)p(G|A)p(D|A)\]

Knowing the gender, G=g, and the department, D=d, we wish to predict the value of A (i.e., was the applicant admitted?) that maximizes \(p(A|g,d)\):

\[\operatorname*{arg\,max}_A ~ p(A|g,d) = \operatorname*{arg\,max}_A ~ p(A) p(g|A) p(d|A)\]

To achieve this, we estimate \(p(A)\), \(p(G|A)\) and \(p(D|A)\) from the data for each possible value of \(A\) and perform the calculations.

Let’s say that G=‘female’ and D=‘department A’, and we wish to know whether A is ‘admitted’ or ‘rejected’.

We need to query the table for some numbers:

sum (df$Freq) # total number of applications (admitted+rejected)
## [1] 4526
sum (df[df$Gender=="Female",]$Freq) # number of females
## [1] 1835
sum (df[df$Admit=="Admitted",]$Freq) # number of admitted on all depts
## [1] 1755
sum (df[df$Admit=="Rejected",]$Freq) # number of rejected on all depts
## [1] 2771
sum (df[df$Admit=="Admitted" & df$Gender=="Female",]$Freq) # number of admitted females
## [1] 557
sum (df[df$Admit=="Rejected" & df$Gender=="Female",]$Freq) # number of rejected females
## [1] 1278
sum (df[df$Admit=="Admitted" & df$Dept=="A",]$Freq) # number of admitted in dept 'A'
## [1] 601
sum (df[df$Admit=="Rejected" & df$Dept=="A",]$Freq) # number of rejected in dept 'A'
## [1] 332

Then, for the ‘admitted’ hypothesis, we need \(p(admitted)\), \(p(female|admitted)\) and \(p(dept.A|admitted)\); a quick sketch from the counts above:
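
1755 / 4526  # p(admitted)
## [1] 0.3877596
557 / 1755   # p(female|admitted)
## [1] 0.3173789
601 / 1755   # p(dept.A|admitted)
## [1] 0.3424501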

And, for the ‘rejected’ hypothesis, we need \(p(rejected)\), \(p(female|rejected)\) and \(p(dept.A|rejected)\):
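
2771 / 4526  # p(rejected)
## [1] 0.6122404
1278 / 2771  # p(female|rejected)
## [1] 0.4612053
332 / 2771   # p(dept.A|rejected)
## [1] 0.1198123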

This means that \(p(admitted|female,'A') \propto 0.39 \times 0.32 \times 0.34 = 0.042432\)

Which is larger than \(p(rejected|female,'A') \propto 0.61 \times 0.46 \times 0.12 = 0.033672\)

Our prediction is that a female applicant will be admitted to Dept. A. The odds ratio is \(1.26:1\) in favor of admittance (i.e., about a 56% chance of admittance).
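
In R, these last two computations look like this (a small sketch using the rounded probabilities above; the variable names are only illustrative):

p.admit  <- 0.39 * 0.32 * 0.34   # proportional to p(admitted|female,'A')
p.reject <- 0.61 * 0.46 * 0.12   # proportional to p(rejected|female,'A')
p.admit / p.reject               # odds ratio, about 1.26:1
p.admit / (p.admit + p.reject)   # normalized chance of admittance, about 0.56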

Using the R functions

library(e1071) 
# create a classifier using Gender and Dept as predictors, and Admit as the class for each observation
classifier <- naiveBayes(Admit ~ Gender + Dept, data = UCBAdmissions)
test <- unique(df[,c(-1,-4)])                # get the pairs gender+department
test$Prediction <- predict(classifier, test) # apply the prediction for those pairs (and add to a new column to data frame 'test')
test                                         # check results
##    Gender Dept Prediction
## 1    Male    A   Admitted
## 3  Female    A   Admitted
## 5    Male    B   Admitted
## 7  Female    B   Admitted
## 9    Male    C   Rejected
## 11 Female    C   Rejected
## 13   Male    D   Rejected
## 15 Female    D   Rejected
## 17   Male    E   Rejected
## 19 Female    E   Rejected
## 21   Male    F   Rejected
## 23 Female    F   Rejected

It is possible to access the conditional probabilities from the $tables attribute:

classifier$tables$Gender["Admitted","Female"]  # p(female|admitted)
## [1] 0.3173789
classifier$tables$Dept["Admitted","A"]         # p(dept.A|admitted)
## [1] 0.3424501
classifier$tables$Gender["Rejected","Female"]  # p(female|rejected)
## [1] 0.4612053
classifier$tables$Dept["Rejected","A"]         # p(dept.A|rejected)
## [1] 0.1198123

And also the a priori (class prior) distribution:

classifier$apriori / sum(classifier$apriori)
## Admit
##  Admitted  Rejected 
## 0.3877596 0.6122404

Continuous attributes

If attribute \(F_i\) is continuous, a typical approach is to assume that, given the class \(C=c\), it follows a Normal distribution: \(F_i|C=c \sim N(\mu_{i,c}, \sigma_{i,c}^2)\), where \(\mu_{i,c} = mean(F_i|C=c)\) and \(\sigma_{i,c} = sd(F_i|C=c)\) are estimated from the training data.
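
Written out, the class-conditional density used for a continuous attribute is then the usual Gaussian density:

\[p(F_i = x|C=c) = \frac{1}{\sqrt{2\pi}\,\sigma_{i,c}} \exp\left(-\frac{(x-\mu_{i,c})^2}{2\sigma_{i,c}^2}\right)\]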

TODO: more examples at http://en.wikipedia.org/wiki/Naive_Bayes_classifier

The standard iris dataset includes continuous attributes:

data(iris) # load iris dataset
pairs(iris[1:4], main="Iris Data (red=setosa,green=versicolor,blue=virginica)", 
      pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])

head(iris,n=12)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
library(e1071) 
# create a naive Bayes classifier using the first 4 columns as the data and the last
# column as the class for each observation (naiveBayes is a supervised learning algorithm)
classifier <- naiveBayes(iris[,1:4], iris[,5])

Use predict(), which receives the classifier object plus some observations (iris[,-5] is the original data without its class column) and returns a suggested class for each observation.

predicted.classes <- predict(classifier, iris[,-5])
head(predicted.classes,n=12)
##  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## [11] setosa setosa
## Levels: setosa versicolor virginica

The function table() then presents a confusion matrix comparing the suggested class vector with the real class vector. In this case the classification was very good, but this is a biased result since we are using the same data for both the training and the test sets.

table(predicted.classes, iris[,5], dnn=list('predicted','actual'))
##             actual
## predicted    setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         3
##   virginica       0          3        47
classifier$apriori / sum(classifier$apriori) # the prior distribution for the classes
## iris[, 5]
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333
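
As a quick check, the overall accuracy on the training data can also be computed directly from the predictions (a one-line sketch):

mean(predicted.classes == iris[,5])  # fraction of observations classified correctly
## [1] 0.96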

Since the predictor variables here are all continuous, the naive Bayes classifier estimates three Gaussian (Normal) distributions for each predictor variable: one for each value of the class variable Species. In the tables below, the first column is the mean and the second column is the standard deviation.

classifier$tables$Petal.Length
##             Petal.Length
## iris[, 5]     [,1]      [,2]
##   setosa     1.462 0.1736640
##   versicolor 4.260 0.4699110
##   virginica  5.552 0.5518947

Let’s plot these distributions:

plot(0:3, xlim=c(0.5,7), col="red", ylab="density", type="n", xlab="Petal Length",
     main="Petal length distribution for each species")
curve(dnorm(x, classifier$tables$Petal.Length[1,1], classifier$tables$Petal.Length[1,2]), add=TRUE, col="red")
curve(dnorm(x, classifier$tables$Petal.Length[2,1], classifier$tables$Petal.Length[2,2]), add=TRUE, col="blue")
curve(dnorm(x, classifier$tables$Petal.Length[3,1], classifier$tables$Petal.Length[3,2]), add=TRUE, col ="green")
legend("topright", c("setosa", "versicolor", "virginica"), col = c("red","blue","green"), lwd=1)

These values can also be computed directly from the dataset. Say, for Petal.Length within the setosa class:

mean(iris[iris$Species=="setosa",]$Petal.Length)
## [1] 1.462
sd(iris[iris$Species=="setosa",]$Petal.Length)
## [1] 0.173664
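
The per-class means and standard deviations of all four attributes can also be obtained in one call each (a sketch using base R’s aggregate(); output omitted):

aggregate(. ~ Species, data = iris, FUN = mean)  # per-class means
aggregate(. ~ Species, data = iris, FUN = sd)    # per-class standard deviations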

So, let’s say we want to find, without using the R predict() function, all three possible values of \(p(C|observation)\):

observation <- data.frame(Sepal.Length = 5.0, 
                          Sepal.Width  = 3.2, 
                          Petal.Length = 1.5, 
                          Petal.Width  = 0.3)  # this observation lies within Setosa cluster

For setosa class:\[p(C=setosa|observation) \propto p(C=setosa) p(Sepal.Length=5.0 | C=setosa) p(Sepal.Width=3.2 | C=setosa) p(Petal.Length=1.5 | C=setosa) p(Petal.Width=0.3 | C=setosa)\]

A similar computation is needed for the other two classes. Let’s implement them:

iris.classes <- c("setosa","versicolor","virginica")
iris.attributes <- names(iris)[-5]

means     <- rep(0,length(iris.attributes))
sds       <- rep(0,length(iris.attributes))
densities <- rep(0,length(iris.attributes))

p.Cs <- c(0,0,0)              # p(C=class)
p.Cs_observation <- c(0,0,0)  # p(C=class | observation)

for (c in 1:length(iris.classes)) {
  p.Cs[c] <- nrow(iris[iris$Species==iris.classes[c],]) / nrow(iris) 

  for(i in 1:length(iris.attributes)) {
    means[i] <- sapply(iris[iris$Species==iris.classes[c],][i], mean)
    sds[i]   <- sapply(iris[iris$Species==iris.classes[c],][i], sd)
    densities[i] <- dnorm(as.numeric(observation[i]), means[i], sds[i])
  }
  
  p.Cs_observation[c] <- p.Cs[c] * prod(densities) # the final value for each class
}

names(p.Cs_observation) <- c("setosa","versicolor","virginica")
p.Cs_observation
##       setosa   versicolor    virginica 
## 2.466810e+00 1.953732e-15 4.920209e-23
p.Cs_observation / sum(p.Cs_observation) # normalize
##       setosa   versicolor    virginica 
## 1.000000e+00 7.920075e-16 1.994563e-23

This means that our naive Bayes classifier would predict ‘setosa’ with very high posterior probability.

Using predict() on the naiveBayes() classifier we get the same prediction:

# type="raw" shows the probabilities
predict(classifier, observation, type="raw")
##      setosa   versicolor    virginica
## [1,]      1 7.920075e-16 1.994563e-23