This page contains source code relating to section 4.4 of Bishop’s Pattern Recognition and Machine Learning (2006).

This section is about the Laplace approximation.

This technique approximates a distribution $$p$$ for which we know only the shape $$f$$ but not the normalization constant $$Z$$:

$p(z) = \frac{1}{Z} f(z)$

The approximation is done with a normal distribution $$q$$. To determine $$q$$ we need to find its mean and variance.

The approximation places the mean of $$q$$ at the mode of $$p$$. We can find this mode, $$z_0$$, knowing only the shape of $$p$$: take the first derivative, set it to zero, and solve:

$\frac{df}{dz} \rvert_{z=z_0} = 0$
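The normal form of $$q$$ comes from a second-order Taylor expansion of $$\log f$$ around the mode; the first-order term vanishes there because $$z_0$$ is a maximum:

$\log f(z) \simeq \log f(z_0) - \frac{A}{2} (z-z_0)^2, \qquad A = -\frac{d^2}{dz^2} \log f(z) \Big\rvert_{z=z_0}$

Exponentiating and normalizing gives the normal density $$q$$ used below.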

Let’s use a small example to clarify what we are doing here.

Let’s say that our distribution $$p$$ is a beta, then:

$p(\theta|y,n) = \frac{1}{Z} \theta^y (1-\theta)^{n-y}$

pretending we are unable to find $$Z$$ (in this case $$Z$$ is actually available, since the beta normalization constant is known).

Example: find the posterior of $$\theta$$ given $$n=18, y=10$$.

Shape $$f$$ is

$p(\theta) \propto f(\theta) = \theta^{10} (1-\theta)^8$
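As a sanity check, this mode can also be found analytically: setting the derivative of $$\log f(\theta) = y \log \theta + (n-y) \log(1-\theta)$$ to zero gives

$\frac{y}{\theta_0} - \frac{n-y}{1-\theta_0} = 0 \iff \theta_0 = \frac{y}{n} = \frac{10}{18} \approx 0.5556$

which is the value the numerical root-finding below should return.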

So, we calculate $$df/dz$$ (here $$z = \theta$$) and find its root $$z_0$$:

n <- 18
y <- 10 # 10 heads and 8 tails

df_dz   <- D(expression(theta^y * (1-theta)^(n-y)), "theta")
f_df_dz <- function(theta) eval(df_dz, envir=list(theta=theta, y=y, n=n))
z0      <- uniroot(f_df_dz, c(.001,.999))$root
z0
## [1] 0.5555318

With $$z_0$$, the normal approximation $$q(z)$$ is given by

$q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z-z_0)^2 \right\}$

$A = -\frac{d^2}{dz^2} \log f(z) \Big\rvert_{z=z_0}$

In R:

d2f_dz2 <- D(D(expression(log(theta^y * (1-theta)^(n-y))), "theta"), "theta")
A       <- -eval(d2f_dz2, envir=list(theta=z0, y=y, n=n))

q_z <- function(z) {
  sqrt(A/(2*pi)) * exp(-(A/2) * (z-z0)^2)
}

To check, let’s compare it with the true posterior $$\text{Beta}(11,9)$$, assuming prior $$\theta \sim \text{Beta}(1,1)$$:

curve(dbeta(x, y+1, n-y+1), 0, 1, col="blue", lwd=2,
      xlab=expression(theta), ylab="posterior")  # true posterior is a beta
curve(q_z, 0, 1, lwd=2, col="red", add=T)        # Laplace approximation is a normal
legend("topleft", c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)

Let’s wrap this example into a function:

# returns the approximating normal distribution of a beta-shaped posterior
lap_approx_beta <- function(y, n) {
  df_dz   <- D(expression(theta^y * (1-theta)^(n-y)), "theta")
  f_df_dz <- function(theta) eval(df_dz, envir=list(theta=theta, y=y, n=n))
  z0      <- uniroot(f_df_dz, c(.001,.999))$root

  d2f_dz2 <- D(D(expression(log(theta^y * (1-theta)^(n-y))), "theta"), "theta")
  A       <- -eval(d2f_dz2, envir=list(theta=z0, y=y, n=n))

  function(z) {
    sqrt(A/(2*pi)) * exp(-(A/2) * (z-z0)^2)
  }
}
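For this beta-shaped $$f$$, the curvature $$A$$ also has a closed form, which is a useful check on the numerical value:

$A = -\frac{d^2}{d\theta^2} \log f(\theta) \Big\rvert_{\theta_0} = \frac{y}{\theta_0^2} + \frac{n-y}{(1-\theta_0)^2} = \frac{n^3}{y(n-y)}$

using $$\theta_0 = y/n$$. For $$n=18, y=10$$ this gives $$A = 72.9$$, i.e. a standard deviation of $$\sqrt{1/A} \approx 0.117$$.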

y   <- 4
n   <- 6
q_z <- lap_approx_beta(y,n)

curve(dbeta(x, y+1, n-y+1), 0, 1, col="blue", lwd=2,
      xlab=expression(theta), ylab="posterior")
curve(q_z, 0, 1, lwd=2, col="red", add=T)
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)

## Approximating Bayesian Models

This section is based on Rasmus Bååth’s *Easy Laplace Approximation of Bayesian Models in R*.

Let’s assume the following model:

$y_i \sim \mathcal{N}(\mu, \sigma^2)$

with priors

$\mu \sim \mathcal{N}(0, 100)$

$\sigma \sim \text{LogNormal}(0,4)$

The posterior is:

$\begin{array}{lclr} p(\mu, \sigma | y) & \propto & p(y | \mu, \sigma) p(\mu) p(\sigma) & \color{blue}{\text{Bayes Theorem}~\&~\mu \perp \sigma}\\ \log p(\mu, \sigma | y) & \propto & \log \left[ p(y | \mu, \sigma) p(\mu) p(\sigma) \right] & \\ & = & \log \prod_i p(y_i | \mu, \sigma) + \log p(\mu) + \log p(\sigma) & \color{blue}{y_i~\text{iid}} \\ & = & \sum_i \log p(y_i | \mu, \sigma) + \log p(\mu) + \log p(\sigma) & \\ \end{array}$

The next R function computes this log posterior:

# Model
#   y_i   ~ Normal(mu, sigma)
#   mu    ~ Normal(0,100)
#   sigma ~ LogNormal(0,4)

# argument p receives the parameter values, it's a vector c(mu=..., sigma=...)
# argument y is the available data

model <- function(p, y) {
  log_lik         <- sum(dnorm(y, p["mu"], p["sigma"], log = T))  # log likelihood
  log_prior_mu    <- dnorm(p["mu"],     0, 100, log = T)
  log_prior_sigma <- dlnorm(p["sigma"], 0,   4, log = T)

  log_lik + log_prior_mu + log_prior_sigma
}
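A quick sanity check that the function runs (the values here are illustrative, not part of the original):

y_test <- rnorm(50, mean=10, sd=5)
model(c(mu=10, sigma=5), y_test)   # returns the log posterior up to a constant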

Now we wish to optimize the parameter values to maximize this log posterior:

# Laplace approximation
laplace_approx <- function(model, inits, ...) {
  fit <- optim(inits, model,
               control = list(fnscale = -1), # maximize instead of minimize
               hessian = TRUE, ...)          # the Hessian describes the curvature at the maximum
  param_mean    <- fit$par                   # the mean of the approximation is the posterior mode
  param_cov_mat <- solve(-fit$hessian)       # the covariance matrix is the inverse of the negative Hessian

  list(param_mean=param_mean, param_cov_mat=param_cov_mat)
}

Let’s try with some data:

y <- rnorm(n=50, mean=10, sd=5)

report <- laplace_approx(model, inits=c(mu=0, sigma=1), y=y)
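The approximate posterior standard deviations fall out of the covariance matrix (a derived quantity, not shown in the original):

sqrt(diag(report$param_cov_mat))   # approximate posterior sds of mu and sigma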

library(coda)
library(mvtnorm)
# using sampling to approx density
samples <- mcmc(rmvnorm(1e4, report$param_mean, report$param_cov_mat))
plot(samples[,1], trace=FALSE, main="density of mean")

# compare the true data-generating distribution with the fitted model at the approximated parameter means
curve(dnorm(x, 10, 5), from=-10, to=30, col="blue", lwd=2, xlab="", ylab="density")
curve(dnorm(x, report$param_mean["mu"], report$param_mean["sigma"]), from=-10, to=30, col="red", lwd=2, add=T)
legend("topleft", c("true distribution","laplace approx"), col=c("blue","red"), lty=1, lwd=2)

Another example (n is known):

$y_i \sim \text{Binomial}(n,\theta)$

with prior

$\theta \sim \text{Beta}(1, 1)$

model2 <- function(p, y, n) {
  log_lik         <- sum(dbinom(y, n, p["theta"], log = T))
  log_prior_theta <- dbeta(p["theta"], 1, 1, log = T)

  log_lik + log_prior_theta
}

Let’s also compare Bååth’s method with Bishop’s:

n <- 18
y <- 10 # 10 heads and 8 tails

report <- laplace_approx(model2, inits=c(theta=0.5), y=y, n=n) # using Bååth's method
z0  <- report$param_mean["theta"]
sd0 <- sqrt(report$param_cov_mat)

q_z <- lap_approx_beta(y,n)                                    # using Bishop's method

curve(dbeta(x, y+1, n-y+1), 0, 1, col="blue", lwd=2, xlab=expression(theta), ylab="posterior")
curve(q_z, 0, 1, lwd=4, col="red", add=T)
curve(dnorm(x, z0, sd0), 0, 1, lwd=2, col="yellow", add=T)  # Bååth's approximation, plotted over Bishop's
legend("topleft", c("true posterior","Bishop","Bååth"), col=c("blue","red","yellow"), lty=1, lwd=2)

As expected, they give the same result :-)
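We can also check the agreement numerically. With the flat $$\text{Beta}(1,1)$$ prior the posterior mode equals the MLE $$y/n$$, so both means should be near $$10/18 \approx 0.5556$$ (an illustrative check, not in the original):

c(baath_mean = as.numeric(z0), analytic_mode = y/n)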

## Reparametrization

We’ve seen that some approximations are not very good:

y <- 4
n <- 6

report <- laplace_approx(model2, inits=c(theta=0.5), y=y, n=n)
z0  <- report$param_mean["theta"]
sd0 <- sqrt(report$param_cov_mat)

thetas    <- seq(0,1,len=100)
post_like <- dbeta(thetas,y+1,n-y+1)
plot(thetas, post_like, col="blue", lwd=2, xlab=expression(theta), ylab="posterior", type="l")

laplace_like <- dnorm(thetas, z0, sd0)
dist_diff    <- max(laplace_like) / max(post_like)  # rescale so both curves peak at the same height (display only)
points(thetas, laplace_like/dist_diff, col="red", lwd=2, type="l")
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)

When the posterior starts to skew, the normal cannot fit it properly. This often happens due to parameter boundaries (here, $$\theta \in [0,1]$$). We can try to reparametrize $$\theta$$ so as to remove the boundary. Two typical transformations are $$\log$$, which maps positive numbers to the reals, and $$\text{logit}$$, which maps the interval $$(0,1)$$ to the reals.

In this example let’s redo the computation applying $$\text{logit}$$ to parameter $$\theta$$. We will not optimize over $$\theta$$ but over the unconstrained $$\psi = \text{logit}(\theta)$$. This implies that the parameter that ‘feeds’ the model distributions is $$\theta = \text{sigmoid}(\psi)$$:

logit   <- function(x) log(x/(1-x))
sigmoid <- function(x) 1/(1+exp(-x))  # logit & sigmoid are inverses
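A quick illustration of how logit removes the boundary, and that the two functions invert each other (values are illustrative):

logit(c(0.01, 0.5, 0.99))   ## [1] -4.59512  0.00000  4.59512
sigmoid(logit(0.3))         ## [1] 0.3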

model3 <- function(p, y, n) {
  log_lik         <- sum(dbinom(y, n, sigmoid(p["logit_theta"]), log = T))
  log_prior_theta <- dbeta(sigmoid(p["logit_theta"]), 1, 1, log = T)

  log_lik + log_prior_theta
}

Let’s test the model with the same data:

n <- 6
y <- 4

report <- laplace_approx(model3, inits=c(logit_theta=0.5), y=y, n=n)
z0  <- report$param_mean["logit_theta"]
sd0 <- sqrt(report$param_cov_mat)

thetas    <- seq(0,1,len=100)
post_like <- dbeta(thetas,y+1,n-y+1)
plot(thetas, post_like, col="blue", lwd=2, xlab=expression(theta), ylab="posterior", type="l")

# since the approximation lives on the logit scale, evaluate it at logit(theta)
thetas_logit <- logit(thetas)
laplace_like <- dnorm(thetas_logit, z0, sd0)
dist_diff    <- max(laplace_like) / max(post_like)
points(thetas, laplace_like/dist_diff, col="red", lwd=2, type="l")
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)

The results are now much better!
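To read the approximation back on the $$\theta$$ scale, one option (a sketch, not part of the original) is to sample from the normal on the logit scale and push the samples through the sigmoid:

psi_samples   <- rnorm(1e4, z0, sd0)   # samples of logit(theta)
theta_samples <- sigmoid(psi_samples)  # back-transformed to the theta scale
hist(theta_samples, breaks=50, freq=FALSE, xlab=expression(theta), main="")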