Given features \(F_1, F_2, \ldots F_n\) and a class \(C\), we wish to know
\[p(C|F_1, F_2, \ldots F_n)\]
By Bayes’ Theorem:
\[p(C|F_1, F_2, \ldots F_n) = \frac{p(C)p(F_1, F_2, \ldots F_n|C)}{p(F_1, F_2, \ldots F_n)} \propto p(C)p(F_1, F_2, \ldots F_n|C)\]
The numerator \(p(C)p(F_1, F_2, \ldots F_n|C)\) is equal to the joint probability \(p(C,F_1, F_2, \ldots F_n)\), which, by the chain rule, expands as:
\[ \begin{array}{lcl} p(C,F_1, F_2, \ldots F_n) & = & p(C)\, p(F_1, F_2, \ldots F_n|C) \\ & = & p(C)\, p(F_1|C)\, p(F_2, \ldots F_n|C,F_1) \\ & = & p(C)\, p(F_1|C)\, p(F_2|C,F_1)\, p(F_3, \ldots F_n|C,F_1,F_2)\\ & = & \ldots \\ & = & p(C)\, p(F_1|C)\, p(F_2|C,F_1) \ldots p(F_n|C,F_1\ldots F_{n-1}) \end{array} \]
Naive Bayes ‘naively’ assumes that every feature \(F_i\) is conditionally independent of every other \(F_j\) (\(i \neq j\)) given \(C\), ie:
\[p(F_i|C,F_j) = p(F_i|C), i \neq j\]
This greatly simplifies the previous chain rule:
\[p(C) p(F_1|C) p(F_2|C,F_1) \ldots p(F_n|C,F_1\ldots F_{n-1}) = p(C)p(F_1|C)p(F_2|C)\ldots p(F_n|C) = p(C) \prod_i p(F_i|C)\]
And so:
\[p(C|F_1, F_2, \ldots F_n) \propto p(C) \prod_i p(F_i|C)\]
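In practice the proportionality constant is recovered by normalizing over all possible classes (this is exactly the 'normalize' step used in the worked examples below):
\[p(C=c\,|\,F_1, F_2, \ldots F_n) = \frac{p(C=c) \prod_i p(F_i|C=c)}{\sum_{c'} p(C=c') \prod_i p(F_i|C=c')}\]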
An example: here is aggregate data on applicants to graduate school at Berkeley in 1973, for the six largest departments, classified by admission and sex.
data(UCBAdmissions)
margin.table(UCBAdmissions,1)
## Admit
## Admitted Rejected
## 1755 2771
df <- as.data.frame(UCBAdmissions)
df[df$Dept=="A",] # just look at Dept. A
## Admit Gender Dept Freq
## 1 Admitted Male A 512
## 2 Rejected Male A 313
## 3 Admitted Female A 89
## 4 Rejected Female A 19
The estimated probabilities of being admitted to Dept. A, for males and females, are: \[\hat{p}(admitted | male, dept.A) = 512 / (512+313) \approx 0.62 \] \[\hat{p}(admitted | female, dept.A) = 89 / (89+19) \approx 0.82 \]
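These proportions can also be computed directly from the data frame (a quick check reusing the df defined above):
deptA <- df[df$Dept=="A",]
# estimated p(admitted | male, dept.A), about 0.62
sum(deptA[deptA$Admit=="Admitted" & deptA$Gender=="Male",]$Freq)   / sum(deptA[deptA$Gender=="Male",]$Freq)
# estimated p(admitted | female, dept.A), about 0.82
sum(deptA[deptA$Admit=="Admitted" & deptA$Gender=="Female",]$Freq) / sum(deptA[deptA$Gender=="Female",]$Freq)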
Let A = admission, G = gender and D = department.
Naive Bayes assumes that \[p(A|G,D) \propto p(A)p(G|A)p(D|A)\]
Knowing the gender, G=g, and the department, D=d, we wish to predict the value of A (ie, was the applicant admitted?) that maximizes p(A|g,d):
\[\operatorname*{arg\,max}_A ~ p(A|g,d) = \operatorname*{arg\,max}_A ~ p(A)\, p(g|A)\, p(d|A)\]
To achieve this, we estimate \(p(A)\), \(p(G|A)\) and \(p(D|A)\) from the data for each possible value of \(A\) and perform the calculations.
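As a compact alternative to querying the counts one by one (which we do step by step below), margin.table() and prop.table() give the same three tables directly from the original contingency table:
prop.table(margin.table(UCBAdmissions, 1))           # p(A)
prop.table(margin.table(UCBAdmissions, c(1,2)), 1)   # p(G|A): each row (Admitted/Rejected) sums to 1
prop.table(margin.table(UCBAdmissions, c(1,3)), 1)   # p(D|A): each row sums to 1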
Let’s say that G=‘female’ and D=‘department A’, and we wish to know whether A is ‘admitted’ or ‘rejected’.
We need to query the table for some numbers:
sum (df$Freq) # total number of applicants (admitted+rejected)
## [1] 4526
sum (df[df$Gender=="Female",]$Freq) # number of females
## [1] 1835
sum (df[df$Admit=="Admitted",]$Freq) # number of admitted on all depts
## [1] 1755
sum (df[df$Admit=="Rejected",]$Freq) # number of rejected on all depts
## [1] 2771
sum (df[df$Admit=="Admitted" & df$Gender=="Female",]$Freq) # number of admitted females
## [1] 557
sum (df[df$Admit=="Rejected" & df$Gender=="Female",]$Freq) # number of rejected females
## [1] 1278
sum (df[df$Admit=="Admitted" & df$Dept=="A",]$Freq) # number of admitted in dept 'A'
## [1] 601
sum (df[df$Admit=="Rejected" & df$Dept=="A",]$Freq) # number of rejected in dept 'A'
## [1] 332
Then:
\[\hat{p}(admitted) = 1755/4526 \approx 0.39, \quad \hat{p}(female|admitted) = 557/1755 \approx 0.32, \quad \hat{p}(dept.A|admitted) = 601/1755 \approx 0.34\]
And:
\[\hat{p}(rejected) = 2771/4526 \approx 0.61, \quad \hat{p}(female|rejected) = 1278/2771 \approx 0.46, \quad \hat{p}(dept.A|rejected) = 332/2771 \approx 0.12\]
This means that \(p(admitted|female,dept.A) \propto 0.39 \times 0.32 \times 0.34 = 0.042432\), which is larger than \(p(rejected|female,dept.A) \propto 0.61 \times 0.46 \times 0.12 = 0.033672\).
Our prediction is that a female applicant to Dept. A will be admitted. The odds are \(1.26:1\) in favor of admittance (ie, about a 56% chance of admittance).
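These numbers can be reproduced in R from the counts above (a quick check; with unrounded proportions the odds come out slightly below 1.26):
p.adm <- 1755/4526   # p(admitted)
p.rej <- 2771/4526   # p(rejected)
score.adm <- p.adm * (557/1755)  * (601/1755)   # p(admitted) p(female|admitted) p(dept.A|admitted)
score.rej <- p.rej * (1278/2771) * (332/2771)   # p(rejected) p(female|rejected) p(dept.A|rejected)
score.adm / score.rej                  # odds in favor of admittance (about 1.25)
score.adm / (score.adm + score.rej)    # posterior probability of admittance (about 0.55)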
The e1071 package provides naiveBayes(), which creates a classifier given observation data and the class for each observation, and predict(), which receives the classifier and some observations and returns a vector with their predicted classes.
library(e1071)
# creates a classifier using Gender and Dept as data, and Admittance as the observation class
classifier <- naiveBayes(Admit ~ Gender + Dept, data = UCBAdmissions)
test <- unique(df[,c(-1,-4)]) # get the pairs gender+department
test$Prediction <- predict(classifier, test) # apply the prediction for those pairs (and add to a new column to data frame 'test')
test # check results
## Gender Dept Prediction
## 1 Male A Admitted
## 3 Female A Admitted
## 5 Male B Admitted
## 7 Female B Admitted
## 9 Male C Rejected
## 11 Female C Rejected
## 13 Male D Rejected
## 15 Female D Rejected
## 17 Male E Rejected
## 19 Female E Rejected
## 21 Male F Rejected
## 23 Female F Rejected
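To see the posterior probabilities behind these labels, predict() also accepts type="raw" (the same option used at the end of this page):
predict(classifier, test[,c("Gender","Dept")], type="raw")  # p(Admitted) and p(Rejected) for each pair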
It is possible to access the conditional probabilities from the $tables attribute:
classifier$tables$Gender["Admitted","Female"] # p(female|admitted)
## [1] 0.3173789
classifier$tables$Dept["Admitted","A"] # p(dept.A|admitted)
## [1] 0.3424501
classifier$tables$Gender["Rejected","Female"] # p(female|rejected)
## [1] 0.4612053
classifier$tables$Dept["Rejected","A"] # p(dept.A|rejected)
## [1] 0.1198123
And also the prior (apriori) distribution:
classifier$apriori / sum(classifier$apriori)
## Admit
## Admitted Rejected
## 0.3877596 0.6122404
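Combining these stored quantities by hand reproduces the classifier’s decision for a female applicant to Dept. A (a small sketch; the raw apriori counts only matter up to the normalizing constant):
s.adm <- classifier$apriori["Admitted"] * classifier$tables$Gender["Admitted","Female"] * classifier$tables$Dept["Admitted","A"]
s.rej <- classifier$apriori["Rejected"] * classifier$tables$Gender["Rejected","Female"] * classifier$tables$Dept["Rejected","A"]
as.numeric(s.adm / (s.adm + s.rej))   # posterior probability of admittance, matching the hand computation above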
If attribute \(F_i\) is continuous, a typical approach is to assume that \(p(F_i|C=c) \sim \mathcal{N}(\text{mean}(F_i|C=c),\, \text{sd}(F_i|C=c))\).
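Written out (with \(\mu_{ic}\) and \(\sigma_{ic}\) denoting the sample mean and standard deviation of \(F_i\) among the observations of class \(c\)), the class-conditional density is:
\[p(F_i = x\,|\,C = c) = \frac{1}{\sqrt{2\pi}\,\sigma_{ic}} \exp\!\left(-\frac{(x - \mu_{ic})^2}{2\sigma_{ic}^2}\right)\]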
More worked examples can be found at http://en.wikipedia.org/wiki/Naive_Bayes_classifier.
The standard iris dataset includes continuous attributes:
data(iris) # load iris dataset
pairs(iris[1:4], main="Iris Data (red=setosa,green=versicolor,blue=virginica)",
pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])
head(iris,n=12)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
library(e1071)
# create a Naive Bayes classifier using the first 4 columns as the data and the last column as the class for each observation (naiveBayes is a supervised learning algorithm)
classifier <- naiveBayes(iris[,1:4], iris[,5])
Use predict(), which receives the classifier object plus some observations (iris[,-5] is the original data without its class column) and returns a suggested class for each observation.
predicted.classes <- predict(classifier, iris[,-5])
head(predicted.classes,n=12)
## [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## [11] setosa setosa
## Levels: setosa versicolor virginica
Then table() builds a confusion matrix between the predicted class vector and the true class vector. In this case the classification is very good, but this is a biased result, since we are using the same data for training and testing.
table(predicted.classes, iris[,5], dnn=list('predicted','actual'))
## actual
## predicted setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 47 3
## virginica 0 3 47
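To address the bias just mentioned, one option is to evaluate on a held-out test set (a minimal sketch; the seed and the 1/3 split are arbitrary choices):
set.seed(101)                         # arbitrary seed, just for reproducibility
test.idx <- sample(nrow(iris), 50)    # hold out 1/3 of the observations
holdout.classifier <- naiveBayes(iris[-test.idx, 1:4], iris[-test.idx, 5])
table(predict(holdout.classifier, iris[test.idx, 1:4]), iris[test.idx, 5], dnn=list('predicted','actual'))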
classifier$apriori / sum(classifier$apriori) # the prior distribution for the classes
## iris[, 5]
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
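The same prior follows directly from the class counts in the data:
table(iris[,5]) / nrow(iris)   # each of the three species appears 50 times out of 150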
Since the predictor variables here are all continuous, the naive Bayes classifier fits a Gaussian (Normal) distribution for each predictor variable and each value of the class variable Species (three per predictor). In each table, the first column is the mean and the second column is the standard deviation.
classifier$tables$Petal.Length
## Petal.Length
## iris[, 5] [,1] [,2]
## setosa 1.462 0.1736640
## versicolor 4.260 0.4699110
## virginica 5.552 0.5518947
Let’s plot these distributions:
plot(0:3, xlim=c(0.5,7), col="red", ylab="density",type="n", xlab="Petal Length",main="Petal length distribution for each species")
curve(dnorm(x, classifier$tables$Petal.Length[1,1], classifier$tables$Petal.Length[1,2]), add=TRUE, col="red")
curve(dnorm(x, classifier$tables$Petal.Length[2,1], classifier$tables$Petal.Length[2,2]), add=TRUE, col="blue")
curve(dnorm(x, classifier$tables$Petal.Length[3,1], classifier$tables$Petal.Length[3,2]), add=TRUE, col ="green")
legend("topright", c("setosa", "versicolor", "virginica"), col = c("red","blue","green"), lwd=1)
These values can also be computed directly from the dataset, say for Petal.Length:
mean(iris[iris$Species=="setosa",]$Petal.Length)
## [1] 1.462
sd(iris[iris$Species=="setosa",]$Petal.Length)
## [1] 0.173664
So, let’s say we want to compute by hand, without using the R functions above, all three possible values of \(p(C|observation)\):
observation <- data.frame(Sepal.Length = 5.0,
Sepal.Width = 3.2,
Petal.Length = 1.5,
Petal.Width = 0.3) # this observation lies within Setosa cluster
For the setosa class:\[p(C=setosa|observation) \propto p(C=setosa) p(Sepal.Length=5.0 | C=setosa) p(Sepal.Width=3.2 | C=setosa) p(Petal.Length=1.5 | C=setosa) p(Petal.Width=0.3 | C=setosa)\]
A similar computation is needed for the other two classes. Let’s implement them:
iris.classes <- c("setosa","versicolor","virginica")
iris.attributes <- names(iris)[-5]
means <- rep(0,length(iris.attributes))
sds <- rep(0,length(iris.attributes))
densities <- rep(0,length(iris.attributes))
p.Cs <- c(0,0,0) # p(C=class)
p.Cs_observation <- c(0,0,0) # p(C=class | observation)
for (c in 1:length(iris.classes)) {
  p.Cs[c] <- nrow(iris[iris$Species==iris.classes[c],]) / nrow(iris)
  for(i in 1:length(iris.attributes)) {
    means[i]     <- sapply(iris[iris$Species==iris.classes[c],][i], mean)
    sds[i]       <- sapply(iris[iris$Species==iris.classes[c],][i], sd)
    densities[i] <- dnorm(as.numeric(observation[i]), means[i], sds[i])
  }
  p.Cs_observation[c] <- p.Cs[c] * prod(densities) # the final value for each class
}
names(p.Cs_observation) <- c("setosa","versicolor","virginica")
p.Cs_observation
## setosa versicolor virginica
## 2.466810e+00 1.953732e-15 4.920209e-23
p.Cs_observation / sum(p.Cs_observation) # normalize
## setosa versicolor virginica
## 1.000000e+00 7.920075e-16 1.994563e-23
This means that our naive Bayes classifier predicts ‘setosa’ with very high probability.
Using naiveBayes() we get the same prediction:
# type="raw" shows the probabilities
predict(classifier, observation, type="raw")
## setosa versicolor virginica
## [1,] 1 7.920075e-16 1.994563e-23
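With only four attributes these products of densities are harmless, but with many attributes they can underflow to zero; a common remedy is to sum log-densities instead (a sketch reusing the objects defined above, giving the same normalized posteriors):
log.scores <- sapply(1:length(iris.classes), function(c) {
  class.data <- iris[iris$Species==iris.classes[c],]
  log(p.Cs[c]) + sum(sapply(1:length(iris.attributes), function(i)
    dnorm(as.numeric(observation[i]), mean(class.data[[i]]), sd(class.data[[i]]), log=TRUE)))
})
names(log.scores) <- iris.classes
exp(log.scores - max(log.scores)) / sum(exp(log.scores - max(log.scores)))  # normalized posteriors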