Probability Calculus

Some useful rules derived from the Kolmogorov axioms for random variables:

Notation for Independence between two random variables:

\[A \perp B \iff P(A,B) = P(A) P(B)\]

We define conditional independence as:

\[A \perp B | C \iff P(A,B|C) = P(A|C) P(B|C)\]
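As a quick numeric illustration of the first definition, here is a minimal sketch with a made-up joint distribution over two boolean variables:

# a joint table P(A,B) where A and B happen to be independent
joint <- matrix(c(0.28, 0.42, 0.12, 0.18), nrow=2)
p.A <- rowSums(joint)   # marginal of A
p.B <- colSums(joint)   # marginal of B
# independence holds iff the joint equals the product of the marginals
all.equal(joint, outer(p.A, p.B))   # TRUE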

The model

A Bayesian network is a probabilistic graphical model (a type of statistical model) that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG) (Wikipedia).

E.g., consider a net over five boolean variables where \(C\) has parents \(A\) and \(B\), and \(D\) and \(E\) each have the single parent \(C\).

Instead of having to specify the overall joint probability via the chain rule:

\[P(A,B,C,D,E) = P(A|B,C,D,E) P(B|C,D,E) P(C|D,E) P(D|E) P(E)\]

The model represented by this graph assumes that the joint probability can be factored by:

\[P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|C) P(E|C)\]

which is much simpler (\(2^5-1=31\) parameters (probability values) for the full joint vs. 10 parameters for the distribution proposed by this Bayes network).
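As a quick sanity check of these counts, here is a sketch assuming all variables are boolean, so the CPT of a node with \(k\) parents needs \(2^k\) free parameters:

# free parameters in the CPT of a boolean node with k boolean parents: 2^k
cpt.params <- function(n.parents) 2^n.parents
# factorization P(A) P(B) P(C|A,B) P(D|C) P(E|C): parent counts are 0,0,2,1,1
sum(cpt.params(c(0, 0, 2, 1, 1)))   # 10
2^5 - 1                             # 31 for the unconstrained joint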

These models help us make inferences when new information is given.

But first let's do some probability calculus around simple Bayes networks.

Cancer example

The classic cancer example, where there are two different tests \(T_1\) and \(T_2\). We draw into the graph the causal relations that our model implies. In this case we assume that having cancer (random variable \(C\)) causes the tests to come out positive, not the other way around.

This code produces the next diagram. The majority of the following graph scripts will be hidden for presentation purposes.

library(igraph)

cancer.bnet <- graph.formula(C--+T1, C--+T2)
V(cancer.bnet)$size <- 50
V(cancer.bnet)$color <- "white"
E(cancer.bnet)$color <- "black"
  
# each pair defines the (x,y) of each node
layout <- matrix(c(5, 5,
                   0, 0,
                   10,0), ncol=2, byrow=TRUE) 
# this layout can be found by an interactive graphing screen that allows you to 
# manually move around the vertices and arrange them as you want
# use:
#  tkplot(graph.obj)
#  (& after the manipulation:) layout <- tkplot.getcoords(1)

plot(cancer.bnet, layout=layout, edge.arrow.size=1, vertex.label.color=1)

Assume that the following probabilities are given:

+ \(P(C) = 0.01\)
+ \(P(+|C) = 0.9\)
+ \(P(-|\neg C) = 0.8\)

Let's compute some useful probabilities (to simplify, \(P(+)\) means \(P(T=+)\)):

\[P(+,C) = P(+|C)P(C) = 0.9 \times 0.01 = 0.009\] \[P(-,C) = P(-|C)P(C) = 0.1 \times 0.01 = 0.001\] \[P(+,\neg C) = P(+|\neg C)P(\neg C) = 0.2 \times 0.99 = 0.198\] \[P(-,\neg C) = P(-|\neg C)P(\neg C) = 0.8 \times 0.99 = 0.792\]

Notice that these last four joint probabilities cover all hypotheses, so they must sum to 1 (which they do).

Also \(P(+) = P(+,C) + P(+,\neg C) = 0.207\) and \(P(-) = P(-,C) + P(-,\neg C) = 0.793\), which the results confirm.

Let’s apply Bayes’ rule to find \(P(C|+)\) and \(P(C|-)\):

\[P(C|+) = \frac{P(+|C) P(C)}{P(+)} = \frac{0.9 \times 0.01}{0.207} \approx 0.0435\] \[P(C|-) = \frac{P(-|C) P(C)}{P(-)} = \frac{0.1 \times 0.01}{0.793} \approx 0.00126\]
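These computations are easy to replicate in R; a minimal sketch using only the CPT values assumed above:

p.C      <- 0.01   # P(C)
p.pos.C  <- 0.9    # P(+|C)
p.pos.nC <- 0.2    # P(+|~C)
p.pos <- p.pos.C*p.C + p.pos.nC*(1-p.C)   # P(+) = 0.207 (total probability)
p.pos.C*p.C / p.pos                       # P(C|+) ~ 0.0435
(1-p.pos.C)*p.C / (1-p.pos)               # P(C|-) ~ 0.00126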

Bayes' rule can be interpreted as a learning/inference mechanism, i.e., what is the new probability of having cancer after we have seen the evidence, i.e., that the test has been positive. In this interpretation, the probabilities are named as follows: \(P(C)\) is the prior, \(P(+|C)\) the likelihood, \(P(+)\) the evidence, and \(P(C|+)\) the posterior.

We can also infer when we have information about both tests (herein \(+_1+_2\) means \(T_1=+ \land T_2=+\)), namely \(P(C|+_1,+_2)\):

\[P(C|+_1+_2) = \frac{P(+_1+_2|C)P(C)}{P(+_1+_2)} \propto P(+_1+_2|C)P(C) = P(+_1|C)P(+_2|C)P(C) = 0.9 \times 0.9 \times 0.01 = 0.0081\]

Here we defer the computation of \(P(+_1+_2)\), which is a constant (that's why the posterior is proportional to the right-hand expression). The factorization \(P(+_1+_2|C) = P(+_1|C)P(+_2|C)\) uses the conditional independence of the two tests given \(C\), discussed below.

If we do the same thing for \(\neg C\): \[P(\neg C|+_1+_2) \propto P(+_1|\neg C)P(+_2|\neg C)P(\neg C) = 0.2 \times 0.2 \times 0.99 = 0.0396\]

Since \(P(C|+_1+_2) + P(\neg C|+_1+_2) = 1\), it's enough to normalize both results to obtain the true probabilities:

\[P(C|+_1+_2) = \frac{0.0081}{0.0081+0.0396} \approx 0.1698 \qquad P(\neg C|+_1+_2) = \frac{0.0396}{0.0081+0.0396} \approx 0.8302\]
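In R this normalization is a one-liner; a sketch reusing the two unnormalized values:

unnorm <- c(C=0.0081, notC=0.0396)   # P(. | +1,+2) up to the constant P(+1,+2)
unnorm / sum(unnorm)                 # ~ 0.1698 and 0.8302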

Other queries can be computed using the same ideas.

Notice that, given the cancer result, the two tests do not depend on each other; they are conditionally independent given \(C\), i.e., \(T_1 \perp T_2|C\), which means \(P(+_1|C,+_2) = P(+_1|C)\).

So, \(P(+_1|C,+_2) = P(+_1|C) = 0.9\)

When a variable is known (i.e., it is evidence), the respective node in the Bayes net is shaded:

V(cancer.bnet)[1]$color <- "grey"
plot(cancer.bnet, layout=layout, edge.arrow.size=1, vertex.label.color=1)

But if \(C\) is not known, then \(P(+_1|+_2)\) is a different event and has a different probability (the first step is the application of the total probability rule):

\[P(+_1|+_2) = P(+_1|+_2,C) P(C|+_2) + P(+_1|+_2,\neg C) P(\neg C|+_2) = \ldots = 0.2304\]
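A quick numeric check of this computation (a sketch; \(P(C|+_2) = 0.009/0.207\) comes from the joint probabilities above):

p.C.pos2 <- 0.009 / 0.207              # P(C|+2) ~ 0.0435
0.9*p.C.pos2 + 0.2*(1 - p.C.pos2)      # P(+1|+2) ~ 0.2304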

Happy example

This new situation models the event of someone being happy, \(H\), with two possible causes: it is sunny, \(S\), or she got a raise, \(R\). All random variables take only boolean values (as before).

Our model is the v-structure \(S \rightarrow H \leftarrow R\):

Our initial data:

+ \(P(S) = 0.7\)
+ \(P(R) = 0.01\)
+ \(P(H|S,R) = 1\)
+ \(P(H|\neg S,R) = 0.9\)
+ \(P(H|S,\neg R) = 0.7\)
+ \(P(H|\neg S,\neg R) = 0.1\)

What value is \(P(R|S)\)? Without extra information, \(R \perp S\), so \(P(R|S) = P(R) = 0.01\).

From these we can compute:

\[P(H|R) = P(H|R,S)P(S) + P(H|R,\neg S)P(\neg S) = 1 \times 0.7 + 0.9 \times 0.3 = 0.97\] \[P(H|S) = P(H|S,R)P(R) + P(H|S,\neg R)P(\neg R) = 1 \times 0.01 + 0.7 \times 0.99 = 0.703\] \[P(H|\neg S) = P(H|\neg S,R)P(R) + P(H|\neg S,\neg R)P(\neg R) = 0.9 \times 0.01 + 0.1 \times 0.99 = 0.108\] \[P(H) = P(H|S)P(S) + P(H|\neg S)P(\neg S) = 0.703 \times 0.7 + 0.108 \times 0.3 = 0.5245\]

(we used \(R \perp S\) to replace \(P(S|R)\) by \(P(S)\), and so on.)

Let’s next compute \(P(R|H)\):

\[P(R|H) = \frac{P(H|R) P(R)}{P(H)} = \frac{0.97 \times 0.01}{0.5245} \approx 0.0185\]

And what about \(P(R|H,S)\)? Herein, compared to \(P(R|S)\), there's extra information: we know she is happy.

\[P(R|H,S) = \frac{P(H|R,S) P(R|S)}{P(H|S)} = \frac{1 \times 0.01}{0.703} \approx 0.0142\]

Notice how the hypothesis \(R\), i.e., that she got a raise, fell after we learned \(S\), i.e., that it was sunny (from 0.0185 down to 0.0142). This is what it means for one cause to explain away another.

The effect also happens in reverse. If we know \(\neg S\) then

\[P(R|H,\neg S) = \frac{P(H|R,\neg S) P(R|\neg S)}{P(H|\neg S)} = \frac{0.9 \times 0.01}{0.108} \approx 0.0833\]

so the raise hypothesis got stronger (roughly an 8-fold increase over the prior \(P(R)=0.01\)) to explain the fact that she is happy.
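All these numbers can be checked with a few lines of R; a minimal sketch using only the CPT values listed above:

p.S <- 0.7; p.R <- 0.01
p.H.SR <- 1; p.H.nSR <- 0.9; p.H.SnR <- 0.7; p.H.nSnR <- 0.1
p.H.R  <- p.H.SR*p.S  + p.H.nSR*(1-p.S)    # P(H|R)  = 0.97  (uses R indep. S)
p.H.S  <- p.H.SR*p.R  + p.H.SnR*(1-p.R)    # P(H|S)  = 0.703
p.H.nS <- p.H.nSR*p.R + p.H.nSnR*(1-p.R)   # P(H|-S) = 0.108
p.H    <- p.H.S*p.S   + p.H.nS*(1-p.S)     # P(H)    = 0.5245
p.H.R*p.R/p.H        # P(R|H)    ~ 0.0185
p.H.SR*p.R/p.H.S     # P(R|H,S)  ~ 0.0142
p.H.nSR*p.R/p.H.nS   # P(R|H,-S) ~ 0.0833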

In terms of dependence we have shown that, even though \(R \perp S\), once we know \(H\) a dependence appears:

\[R \cancel{\perp} S | H\]

D-Separation

Given a Bayes net \(BN\) and a set of known nodes, we can infer whether two nodes of \(BN\) are conditionally dependent.

Patterns of dependence (called active triplets):

Here the following dependences occur: in a causal chain \(A \rightarrow C \rightarrow B\) with \(C\) unknown, \(A \cancel{\perp} B\); with an unknown common cause \(A \leftarrow C \rightarrow B\), \(A \cancel{\perp} B\); and with a known common effect \(A \rightarrow C \leftarrow B\), \(A \cancel{\perp} B | C\) (the explaining-away pattern seen above).

The next graph shows another dependence pattern, namely \(A \cancel{\perp} B | D\): if \(D\) is a known descendant of a common effect \(C\), knowing \(D\) also activates the triplet:

Patterns of independence (called inactive triplets):

Here the following independences occur: in a causal chain \(A \rightarrow C \rightarrow B\) with \(C\) known, \(A \perp B | C\); with a known common cause \(A \leftarrow C \rightarrow B\), \(A \perp B | C\); and with a common effect \(A \rightarrow C \leftarrow B\) where neither \(C\) nor any of its descendants is known, \(A \perp B\).
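These patterns can be checked mechanically. As a sketch, the bnlearn package (used later in this document to learn structure) provides a dsep function; here it confirms the explaining-away pattern of the happy example:

library(bnlearn)
g <- model2network("[S][R][H|S:R]")  # the S -> H <- R v-structure
dsep(g, "R", "S")        # TRUE:  R and S are independent a priori
dsep(g, "R", "S", "H")   # FALSE: given H, they become dependent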

Package gRain

Refs:

+ http://www.jstatsoft.org/v46/i10/paper
+ http://cran.r-project.org/web/packages/gRain/index.html

gRain is an R package for probability propagation in Bayes networks. Let's see how to use it by going through several examples.

Let's first use the cancer Bayes net from above, assuming the previous conditional probability tables (CPTs):

library(gRain)
## Loading required package: gRbase
# if there's a problem with RBGL package do
#  source("http://bioconductor.org/biocLite.R")
#  biocLite("RBGL")

# the possible values (all nodes are boolean vars)
yn <- c("yes","no")

# specify the CPTs
node.C <- cptable(~ C, values=c(1, 99), levels=yn)
node.T1 <- cptable(~ T1 + C, values=c(9,1,2,8), levels=yn)
node.T2 <- cptable(~ T2 + C, values=c(9,1,2,8), levels=yn)

# create an intermediate representation of the CPTs
plist <- compileCPT(list(node.C, node.T1, node.T2))
plist
## CPTspec with probabilities:
##  P( C )
##  P( T1 | C )
##  P( T2 | C )
plist$C
## C
##  yes   no 
## 0.01 0.99
plist$T1
##      C
## T1    yes  no
##   yes 0.9 0.2
##   no  0.1 0.8
# create network
bn.cancer <- grain(plist)
summary(bn.cancer)
## Independence network: Compiled: FALSE Propagated: FALSE 
##  Nodes : chr [1:3] "C" "T1" "T2"

This object can be queried:

# The marginal probability for each variable:
querygrain(bn.cancer, nodes=c("C", "T1", "T2"), type="marginal")
## $C
## C
##  yes   no 
## 0.01 0.99 
## 
## $T1
## T1
##   yes    no 
## 0.207 0.793 
## 
## $T2
## T2
##   yes    no 
## 0.207 0.793
# The joint probability P(C,T1):
querygrain(bn.cancer, nodes=c("C","T1"), type="joint")
##      T1
## C       yes    no
##   yes 0.009 0.001
##   no  0.198 0.792
# P(T1=+ | T2=+):
#  1. add evidence to the net
bn.cancer.1 <- setFinding(bn.cancer, nodes=c("T2"), states=c("yes")) 
#  2. query the new net
querygrain(bn.cancer.1, nodes=c("T1"))
## $T1
## T1
##       yes        no 
## 0.2304348 0.7695652
# The probability of this evidence:
# print(getFinding(bn.cancer.1))
# The conditional P(not C | not T1)
bn.cancer.2 <- setFinding(bn.cancer, nodes=c("T1"), states=c("no")) 
querygrain(bn.cancer.2, nodes=c("C"))
## $C
## C
##         yes          no 
## 0.001261034 0.998738966
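We can also confirm the conditional independence \(T_1 \perp T_2 | C\) computed by hand earlier; a small sketch:

# once C is known, the result of T2 adds nothing about T1:
bn.cancer.3 <- setFinding(bn.cancer, nodes=c("C","T2"), states=c("yes","yes"))
querygrain(bn.cancer.3, nodes=c("T1"))   # P(T1=+ | C, T2=+) = 0.9 = P(T1=+ | C)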

Another way to use this package is to build the CPTs from a given dataset:

data("cad1")
head(cad1)
##      Sex   AngPec        AMI QWave QWavecode    STcode STchange SuffHeartF
## 1   Male     None NotCertain    No    Usable    Usable       No         No
## 2   Male Atypical NotCertain    No    Usable    Usable       No         No
## 3 Female     None   Definite    No    Usable    Usable       No         No
## 4   Male     None NotCertain    No    Usable Nonusable       No         No
## 5   Male     None NotCertain    No    Usable Nonusable       No         No
## 6   Male     None NotCertain    No    Usable Nonusable       No         No
##   Hypertrophi Hyperchol Smoker Inherit Heartfail CAD
## 1          No        No     No      No        No  No
## 2          No        No     No      No        No  No
## 3          No        No     No      No        No  No
## 4          No        No     No      No        No  No
## 5          No        No     No      No        No  No
## 6          No        No     No      No        No  No
# create the DAG
dag.cad <- dag(~ CAD:Smoker:Inherit:Hyperchol + 
                 AngPec:CAD + 
                 Heartfail:CAD + 
                 QWave:CAD)

# if Rgraphviz is not installed, execute at R's command line:
#   source("http://bioconductor.org/biocLite.R")
#   biocLite("Rgraphviz")
library(Rgraphviz)
## Loading required package: graph
## Loading required package: BiocGenerics
## Loading required package: parallel
## Loading required package: grid
plot(dag.cad)

# smooth is a small positive number to avoid zero entries in the CPTs
# (cf. Additive smoothing, http://en.wikipedia.org/wiki/Additive_smoothing)
bn.cad <- grain(dag.cad, data = cad1, smooth = 0.1)
# Let's ask some questions...
querygrain(bn.cad, nodes=c("CAD", "Smoker"), type="conditional") 
##       CAD
## Smoker        No       Yes
##    No  0.7172084 0.2827916
##    Yes 0.4933771 0.5066229
querygrain(bn.cad, nodes=c("CAD"), type="marginal") 
## $CAD
## CAD
##        No       Yes 
## 0.5418012 0.4581988
# ...and add some evidence
bn.cad.1 <- setFinding(bn.cad, nodes=c("Smoker"), states=c("Yes"))
querygrain(bn.cad.1, nodes=c("CAD"), type="marginal") 
## $CAD
## CAD
##        No       Yes 
## 0.4933771 0.5066229

For the next section we will use an example presented in the gRain paper referenced above, the classic chest clinic (Asia) network:

yn <- c("yes", "no")
a <- cptable(~ asia, values = c(1, 99), levels = yn)
t.a <- cptable(~ tub + asia, values = c(5, 95, 1, 99), levels = yn)
s <- cptable(~ smoke, values = c(5,5), levels = yn)
l.s <- cptable(~ lung + smoke, values = c(1, 9, 1, 99), levels = yn)
b.s <- cptable(~ bronc + smoke, values = c(6, 4, 3, 7), levels = yn)
x.e <- cptable(~ xray + either, values = c(98, 2, 5, 95), levels = yn)
d.be <- cptable(~ dysp + bronc + either, values = c(9, 1, 7, 3, 8, 2, 1, 9), levels = yn)
e.lt <- ortable(~ either + lung + tub, levels = yn)

bn.gin <- grain(compileCPT(list(a, t.a, s, l.s, b.s, e.lt, x.e, d.be)))
plot(bn.gin)

If the network is too big (not in this case, of course), finding the joint probabilities can take too long. One alternative is using simulation.

In this network the process would produce random values for ‘asia’ and ‘smoke’ and propagate those values to their children, until a new observation is created (i.e., a set of values for all the attributes in the joint probability). The process is then repeated \(N\) times. If \(N\) is large, the dataset should be a good approximation of the true joint probability.

Let’s see an example with the previous net:

sim.gin <- simulate(bn.gin, nsim=5000)
head(sim.gin, n=10)
##    asia tub smoke lung bronc either xray dysp
## 1    no  no    no   no    no     no   no   no
## 2    no  no    no   no    no     no  yes   no
## 3    no  no    no   no    no     no   no   no
## 4    no  no    no   no    no     no   no   no
## 5    no  no    no   no    no     no   no   no
## 6    no  no    no   no    no     no   no   no
## 7   yes  no   yes   no    no     no   no   no
## 8    no  no   yes   no    no     no   no   no
## 9    no  no    no   no    no     no   no   no
## 10   no  no    no   no    no     no   no   no
# say we want to know P(Lung,Bronc):
xtabs(~ lung+bronc, data=sim.gin) / nrow(sim.gin)
##      bronc
## lung     yes     no
##   yes 0.0268 0.0256
##   no  0.4192 0.5284
# we can also compute the true P(Lung,Bronc) from the bayes net, since we know their true CPTs:
querygrain(bn.gin, nodes = c("lung", "bronc"), type = "joint")
##      bronc
## lung     yes     no
##   yes 0.0315 0.0235
##   no  0.4185 0.5265
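One way to gauge the simulation error is the largest absolute deviation between the two tables, which should shrink as nsim grows; a small sketch:

sim.tab  <- xtabs(~ lung + bronc, data=sim.gin) / nrow(sim.gin)
true.tab <- querygrain(bn.gin, nodes=c("lung","bronc"), type="joint")
max(abs(sim.tab - true.tab))   # max deviation between simulated and exact joint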

We can also use the package to predict a set of responses from a set of explanatory variables:

new.observation <- data.frame(dysp = "yes",
                              either = "yes",
                              tub="no",
                              asia="no",
                              xray="yes",
                              smoke="yes")

predict(bn.gin, 
        newdata = new.observation,
        predictors=c("smoke", "asia", "tub" , "dysp", "xray"), 
        response = c("lung", "bronc"),
        type = "dist")
## $pred
## $pred$lung
##            yes        no
## [1,] 0.7744796 0.2255204
## 
## $pred$bronc
##            yes        no
## [1,] 0.7181958 0.2818042
## 
## 
## $pEvidence
## [1] 0.05084759
# This gives the most probable value for each variable; it does not imply that
# the joint configuration of these two values is the most probable one.
# $pEvidence is the probability of the given evidence (the explanatory variables)

Hidden Markov Models

The gRain package provides a service to produce repeated patterns, which is useful to model Hidden Markov Models (HMMs) with hidden states and observed states.

Let's make an HMM with

\[p(x,y) = p(x_0) \prod_{t=1}^5 p(x_t | x_{t-1}) p(y_t|x_t)\]

where \(x_t\) are the unobserved states, and \(y_t\) the observed ones (both boolean values).

The transition and emission probabilities are defined in the code below:

library(gRain)
yn <- c("yes", "no")
# Rainy at day i = TRUE is x_i = TRUE
x.x <- cptable(~ x[i] | x[i-1], values = c(60, 40, 20, 80), levels = yn)
# Happy at day i = TRUE is y_i = TRUE
y.x <- cptable(~ y[i] | x[i], values = c(40, 60, 90, 10), levels = yn)

inst <- repeatPattern(list(x.x, y.x), instances = 1:5)
x.0 <- cptable(~ x0, values = c(1, 0), levels = yn) # prior distribution: P(Rainy_0) = 1
hmm <- grain(compileCPT(c(list(x.0), inst)))

plot(hmm)

hmm.1 <- setFinding(hmm, nodes=c("y1"), states=c("yes")) # happy at day 1
querygrain(hmm.1, nodes=c("x1"), type="marginal") 
## $x1
## x1
## yes  no 
## 0.4 0.6

So, \(p(\text{Rainy at day 1} | \text{Happy at day 1}) = P(x_1|y_1) = 0.4\)

We can check this by doing the computations by hand. Since \(P(x_0 = \text{rain}) = 1\), we have \(P(x_1) = P(x_1|x_0) = 0.6\) and, by total probability,

\[P(y_1) = P(y_1|x_1)P(x_1) + P(y_1|\neg x_1)P(\neg x_1) = 0.4 \times 0.6 + 0.9 \times 0.4 = 0.6\]

So,

\[P(x_1|y_1) = \frac{P(y_1|x_1) P(x_1)}{P(y_1)} = \frac{0.4 \times 0.6}{0.6} = 0.4\]
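We can also pose richer queries to the same chain, e.g., filtering after two consecutive happy days; a sketch reusing the hmm object built above:

hmm.2 <- setFinding(hmm, nodes=c("y1","y2"), states=c("yes","yes"))
querygrain(hmm.2, nodes=c("x2"), type="marginal")   # P(x2 | y1=happy, y2=happy)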

Finding the structure of a Bayes Net

The R package bnlearn provides the capability of inferring the structure of a Bayes net given only a dataset.

In the next code sample, the dataset alarm has 20,000 observations generated from a known Bayes network (the classic ALARM network):

But let’s assume we only know the dataset:

library(bnlearn)

data(alarm)
head(alarm, n=1) 
##      CVP   PCWP  HIST TPR     BP   CO HRBP HREK HRSA    PAP   SAO2 FIO2
## 1 NORMAL NORMAL FALSE LOW NORMAL HIGH HIGH HIGH HIGH NORMAL NORMAL  LOW
##   PRSS ECO2 MINV    MVS   HYP   LVF   APL  ANES   PMB    INT  KINK DISC
## 1 HIGH ZERO HIGH NORMAL FALSE FALSE FALSE FALSE FALSE NORMAL FALSE TRUE
##      LVV   STKV CCHL  ERLO   HR  ERCA   SHNT    PVS   ACO2 VALV VLNG VTUB
## 1 NORMAL NORMAL HIGH FALSE HIGH FALSE NORMAL NORMAL NORMAL HIGH  LOW ZERO
##     VMCH
## 1 NORMAL
alarm.struct <- iamb(alarm, test = "x2")

The next diagram presents in red the edges of the true model (stored in dag.alarm, whose construction is hidden like the earlier graph scripts) that were found by the algorithm:

graphviz.plot(dag.alarm, highlight = list(arcs = arcs(alarm.struct)))