
Notes from the course given by Professor Maria Isabel Fraga Alves.

Refs: Maria L. Rizzo, Statistical Computing with R, Chapman & Hall/CRC.

Inverse Transformation

This is a method for generating random samples from a given distribution.

It is based on the following result called Probability Integral Transformation:

\[X \sim F_X \Rightarrow U = F_X(X) \sim \mathcal{U}(0,1)\]
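We can check this result empirically. The next snippet is a quick sketch (not part of the original notes): apply the exponential cdf to exponential samples and compare the histogram with the Uniform(0,1) density.

# sketch: if X ~ Exp(1) then U = F_X(X) = pexp(X) should be ~ Uniform(0,1)
x <- rexp(1e4, rate=1)
u <- pexp(x, rate=1)
hist(u, breaks=30, freq=FALSE, main="F_X(X) vs Uniform(0,1) density")
abline(h=1, col="red", lwd=2)   # the Uniform(0,1) density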

We define the inverse transformation of \(F_X\) as

\[F_X^{-1}(u) = \inf \{ x : F_X(x) = u \}, \quad 0 < u < 1\]

With \(U \sim \mathcal{U}(0,1)\),

\[P(F_X^{-1}(U) \le x) = P(\inf \{ t : F_X(t) = U \} \le x) = P(U \le F_X(x)) = F_U(F_X(x)) = F_X(x)\]

meaning that \(F_X^{-1}(U)\) has the same distribution as \(X\)!

So, the method needs the inverse function \(F_X^{-1}(u)\) to work.

For each value required, the method proceeds as follows:

  1. Generate a random \(u\) from \(\mathcal{U}(0,1)\)

  2. Return \(F_X^{-1}(u)\)

Here’s the algorithm in R:

# return n samples based on the inverse of the target cdf
inv.transform <- function(inv.f, n) {
  
# Non-vectorized version (for explanation purposes only)  
#
#   result.sample <- rep(NA,n)
#   
#   for (i in 1:n) {
#     u <- runif(1,0,1)              # step 1
#     result.sample[i] <- inv.f(u)   # step 2
#   }
#   
#   result.sample
  
  # Vectorized version
  inv.f(runif(n,0,1))
}

Eg: use this method to simulate a random variable whose pdf is \(f_X(x) = 3x^2, 0 < x < 1\)

The cdf is given by

\[F_X(x) = \int_0^x f_X(t) dt = x^3\]

so, the inverse is \(F_X^{-1}(u) = u^{1/3}\)

inv.f <- function(u) u^(1/3)

vals <- inv.transform(inv.f, 5e4)

# Checking if it went well
hist(vals, breaks=50, freq=FALSE, main=expression("Sample vs true Density [ f(x)=" ~3*x^2~"]"))
curve(3*x^2, 0, 1, col="red", lwd=2, add=T)

Let’s change the previous problem a bit: let’s assume the same shape but for \(-1 < x < 2\).

We would get:

\[f_X(x) = \frac{1}{3} x^2\] \[F_X(x) = \int_{-1}^x f_X(t) dt = \frac{x^3+1}{9}\] \[F_X^{-1}(u) = (9u-1)^{1/3}\]

So we need a new inverse function in R:

# The next R code is written this way because, in R, raising a negative number
# to the power 1/3 returns NaN (the principal cube root of a negative number is complex).
# Essentially this is
#   inv.f <- function(u) (9*u-1)^(1/3)
inv.f <- function(u) ifelse((9*u-1)>=0,  (9*u-1)^(1/3),  -(1-9*u)^(1/3))

vals <- inv.transform(inv.f, 5e4)

# Checking if it went well
hist(vals, breaks=50, freq=FALSE, main="Sample vs true Density")
curve((1/3)*x^2, -1, 2, col="red", lwd=2, add=T)

The next R code shows a way to generate an empirical cdf from some data (using the distr package), and how to use it to generate new random samples.

n <- 1e3
xs <- rnorm(n,.5,.2)    # a random sample

library(distr) 
emp.cdf <- DiscreteDistribution(xs)  # fit an empirical cdf
new.xs <- emp.cdf@r(250)             # generate new samples

# plot original sample vs new sample
plot(ecdf(xs), col="blue")  
plot(ecdf(new.xs), col="red", pch=".", add=T)
legend("topleft",c("empirical cdf","new sampling"), col=c("blue","red"), lwd=2) 

Another alternative is to use the density function, which uses kernels (Gaussian by default) to estimate the density of the sample:

n<-1e4
xs <- rnorm(n,.5,.2)

d      <- density(xs)
new_xs <- sample(d$x, replace=TRUE, prob=d$y) 

# plot original sample vs new sample
plot(density(xs), col="blue")
lines(density(new_xs), col="red")
legend("topleft",c("empirical pdf","new sampling"), col=c("blue","red"), lwd=2) 

Acceptance-Rejection Method

Problem: Generate \(X \sim f\) from an arbitrary pdf \(f\) (especially when it’s hard to sample from \(f\)).

We must find \(Y \sim g\) under the only restriction that \(f(x) \le c\,g(x)\) for every \(x\) with \(f(x) > 0\), for some constant \(c > 1\).

Instead of sampling from \(f\), which might be difficult, we sample from \(g\) and use \(c\,g(x)\) as an envelope.

For each value required, the method proceeds as follows:

  1. Generate a random \(y\) from \(g\)

  2. Generate a random \(u\) from \(U(0,1)\)

  3. If \(u < f(y)/(c\,g(y))\), return \(y\); otherwise reject \(y\) and go to step 1.

The algorithm in R:

# generate n samples from f using rejection sampling with g (rg samples from g)
accept.reject <- function(f, c, g, rg, n) { 
  n.accepts     <- 0
  result.sample <- rep(NA, n)
  
  while (n.accepts < n) {
    y <- rg(1)               # step 1
    u <- runif(1,0,1)        # step 2
    if (u < f(y)/(c*g(y))) { # step 3 (accept)
      n.accepts <- n.accepts+1
      result.sample[n.accepts] = y
    }
  }
  
  result.sample
}

From step 3,

\[P(accept|Y) = P(U < \frac{f(Y)}{c g(Y)} | Y) = \frac{f(Y)}{c g(Y)}\]

The total probability of acceptance is

\[P(accept) = \sum_y P(accept|Y=y) P(Y=y) = \sum_y \frac{f(y)}{c g(y)} g(y) = \frac{1}{c}\]

The number of iterations until acceptance follows a geometric distribution with mean \(c\): on average, each sample of \(X\) requires \(c\) iterations.

So, for this method to be efficient, \(Y\) should be easy to sample, and \(c\) should be as small as possible.

Eg: generate samples from the Beta(2,2) distribution, using the uniform as \(g\), since \(f(x) \le 2 \times g(x)\):

f  <- function(x) 6*x*(1-x)     # pdf of Beta(2,2), maximum density is 1.5
g  <- function(x) x/x           # g(x) = 1 but in vectorized version
rg <- function(n) runif(n,0,1)  # uniform, in this case
c  <- 2                         # c=2 since f(x) <= 2 g(x)

vals <- accept.reject(f, c, g, rg, 10000) 

# Checking if it went well
hist(vals, breaks=30, freq=FALSE, main="Sample vs true Density")
xs <- seq(0, 1, len=100)
lines(xs, dbeta(xs,2,2), col="red", lwd=2)
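As a quick empirical check of the \(P(accept) = 1/c\) result above, here is a sketch that counts how many proposals are needed to accept n samples, reusing the f, g, rg and c of the Beta(2,2) eg (the helper accept.rate.check is mine, not part of the notes):

# count how many proposals the sampler makes to accept n values;
# the ratio proposals/accepts estimates c = 1/P(accept)
accept.rate.check <- function(f, c, g, rg, n=1e4) {
  n.accepts   <- 0
  n.proposals <- 0
  while (n.accepts < n) {
    y <- rg(1)                # step 1
    u <- runif(1,0,1)         # step 2
    n.proposals <- n.proposals + 1
    if (u < f(y)/(c*g(y)))    # step 3
      n.accepts <- n.accepts + 1
  }
  n.proposals / n.accepts     # empirical estimate of c
}

accept.rate.check(f, c, g, rg)  # should be close to c=2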

Let’s visualize the method accepting (green dots) or rejecting (red dots) candidates over some specified segments:

xs <- seq(0, 1, len=100)
plot(xs, dbeta(xs,2,2), ylim=c(0,c*1.3), type="l", col="red", lwd=2, ylab="densities")
lines(xs, c*g(xs), type="l", col="blue", lwd=2)
legend("topleft",c("f(x)","c*g(x)"), col=c("red","blue"), lwd=2) 

draw.segment <- function(begin.segment, end.segment) {
  segments(c(begin.segment,end.segment,end.segment,begin.segment), c(0,0,c*1.025,c*1.025), 
           c(end.segment,end.segment,begin.segment,begin.segment), c(0,c*1.025,c*1.025,0))
  n.pts <- 100
  us <- runif(n.pts, 0, 1)
  ys <- begin.segment + rg(n.pts)*(end.segment-begin.segment)
  accepted <- us < f(ys)/(c*g(ys))
  points(ys, c*us, col=ifelse(accepted,"green","red"), pch=18)  
}

draw.segment(0.10, 0.20) 
draw.segment(0.45, 0.55)
draw.segment(0.90, 1.00)

The higher the density of \(f\) the more points are accepted, as one would expect.

However, if \(c\,g(x) \gg f(x)\) we will have lots of rejections, which decreases the quality of the simulation. Let’s see what happens, for the same number of points, when \(c=10\).

c <- 10

xs <- seq(0, 1, len=100)
plot(xs, dbeta(xs,2,2), ylim=c(0,c*1.25), type="l", col="red", lwd=2, ylab="densities")
lines(xs, c*g(xs), type="l", col="blue", lwd=2)
legend("topleft",c("f(x)","c*g(x)"), col=c("red","blue"), lwd=2) 

draw.segment(0.10, 0.20)
draw.segment(0.45, 0.55)
draw.segment(0.90, 1.00)

This number of points is not enough to get an estimate with the quality of the previous eg.

There is an R package, ars, which implements an optimized algorithm named Adaptive Rejection Sampling. It has the restriction that the original pdf must be log-concave.

Let’s try with the initial eg for the pdf \(f_X(x) = 3x^2\).

The function needs the log of the pdf, in this case \(\log(3x^2)\), and its first derivative:

\[\frac{d}{dx} \log(3x^2) = \frac{2}{x}\]

library(ars)

log.f <- function(x) log(3*x^2) # log(f(x))
log.f.dx <- function(x) 2/x     # d/dx log(f(x))

vals <- ars(1e4,                # how many points are required
            log.f, log.f.dx,    # the needed functions
            lb=TRUE, ub=TRUE,   # there are a lower and upper bounds for the pdf
            xlb=0, xub=1,       # and which are those bounds
            x=c(.1,.5,.9))      # some initial points inside the pdf

# Checking if it went well
hist(vals, breaks=30, freq=FALSE, main="Sample vs true Density")
xs <- seq(0, 1, len=100)
curve(3*x^2, 0, 1, col="red", lwd=2, add=T)

Monte Carlo Integration

Monte Carlo integration estimates a true integral using random sampling, relying on the Strong Law of Large Numbers.

The estimator of \[\theta = \int_a^b~g(x)~dx\] is computed as follows:

  1. Generate \(X_1, X_2, \ldots, X_n\) iid from Unif(a,b)
  2. Compute \(\overline{g(X)} = \frac{1}{n} \sum_{i=1}^n g(X_i)\)
  3. The estimation \(\hat{\theta} = (b-a)\overline{g(X)}\)

\(\hat{\theta}\) is itself a random variable, which by the Strong Law of Large Numbers: \(\hat{\theta} \rightarrow \theta\) as \(n \rightarrow \infty\)

in R:

# pre-condition: a < b
MC.simple.est <- function(g, a, b, n=1e4) {
  xi <- runif(n,a,b)      # step 1
  g.mean <- mean(g(xi))   # step 2
  (b-a)*g.mean            # step 3
}

The reason why this works:

\[\begin{array}{lcll} \int_a^b~g(x)~dx & = & \int_\mathbb{R} g(x) {\bf 1}_{[a,b]}(x)~dx & \color{blue}{ {\bf 1}_{[a,b]} = 1~\text{if}~x \in [a,b],~0~\text{otherwise} } \\ & = & (b-a) \int_\mathbb{R} g(x) f_U(x)~dx & \color{blue}{ f_U(x) = \frac{1}{b-a}~{\bf 1}_{[a,b]}(x), U \sim \text{Unif}(a,b) } \\ & = & (b-a)~E[g(U)] & \color{blue}{ E[g(X)] = \int g(x)f(x)~dx, X \sim f } \\ \end{array} \]

Eg: estimate \[\theta = \int_2^4 e^{-x}~dx = e^{-2} - e^{-4} = 0.1170\]

g <- function(x) exp(-x)

MC.simple.est(g, 2, 4)
## [1] 0.1161926

More generally, with pdf \(f\) over support \(A\), to estimate the integral

\[\theta = \int_A g(x)f(x)~dx\]

generate a random sample \(x_1,\ldots,x_n\) from the pdf \(f(x)\) and compute the mean of the sequence \(g(x_i)\), ie, \(\hat{\theta} = \frac{1}{n} \sum_i g(x_i)\).

With probability 1,

\[\lim_{n \to \infty} \hat{\theta} = \theta\]

The variance of this estimator is:

\[var(\hat{\theta}) = var \Big( \frac{1}{n} \sum_i~g(X_i) \Big) = \frac{1}{n^2} \sum_i var (g(X_i)) = \frac{var(g(X))}{n}\]
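As a small illustration (a sketch, not from the notes; the names g2, theta.hat and se.hat are mine): estimate \(E[X^2]\) for \(X \sim\) Exp(1), whose true value is \(2\), together with the estimated standard error:

# sketch: MC estimate of E[X^2], X ~ Exp(1), plus its estimated standard error
g2 <- function(x) x^2
n  <- 1e4
xi <- rexp(n, rate=1)             # sample from the pdf f (here Exp(1))
theta.hat <- mean(g2(xi))         # MC estimate of E[g(X)]
se.hat    <- sqrt(var(g2(xi))/n)  # estimated standard error var(g(X))/n
c(theta.hat, se.hat)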

Importance Sampling

There are two problems with the previous method: the uniform-sampling version does not apply to unbounded intervals, and even when we can sample from \(f\) the estimator performs poorly if \(g(x)\) is non-negligible only in a region that \(f\) rarely reaches (eg, rare events).

Eg: \(X \sim N(0,1)\), estimate \(P(X > 4.5)\)

# True value:
pnorm(4.5, lower.tail=FALSE)  # theta (could also be computed by 1-pnorm(4.5))
## [1] 3.397673e-06
# MC estimation:
n <- 1e4
indicators <- rnorm(n)>4.5
sum(indicators)/n             # hat.theta
## [1] 0

Simulating directly, there will be a positive hit only every 300k iterations!

One way to solve this is to consider other more appropriate densities.

This leads to a general method called importance sampling.

Instead of evaluating \(E_f[g(X)] = \int_A g(x)f(x)~dx\) directly, this method introduces a candidate or auxiliary density \(h(x)\),

\[E_f[g(X)] = \int_A g(x)~\frac{f(x)}{h(x)}~h(x)~dx = E_h\Big[\frac{g(X)f(X)}{h(X)}\Big]\]

So,

\[E_f[g(X)] \approx \frac{1}{n} \sum_{i=1}^{n}~g(y_i)\frac{f(y_i)}{h(y_i)}\]

where \(y_1, y_2, \ldots, y_n\) are generated by pdf \(h\).

The candidate \(h\) should be chosen so that, as far as possible, \(h(x)\) is approximately proportional to \(|g(x)|f(x)\).

The restrictions are that \(h(x) > 0\) wherever \(g(x)f(x) \neq 0\), and that it must be easy to sample from \(h\).

This is the same idea as in the accept-reject method.

The importance-sampling function in R:

# rh generates samples from candidate pdf
i.sampling <- function(f, g, h, rh, n=1e4) {
  ys <- rh(n)
  mean(g(ys)*f(ys)/h(ys))
}

Let’s use the previous eg of estimating \(P(X > 4.5)\).

Our candidate \(h\) will be the exponential pdf truncated at \(4.5\):

\[h(x) = \frac{e^{-x}}{\int_{4.5}^{\infty} e^{-x}~dx} = e^{-(x-4.5)}\]

xs <- seq(0.1,10,by=0.05)
plot(xs,dexp(xs),col="blue", type="l")   # the exponential pdf
lines(xs,exp(-(xs-4.5)),col="red",lwd=2) # the truncated pdf
abline(v=4.5,lty=2)

Here’s a plot of the target pdf \(g\) (in blue) and the candidate pdf \(h\) (in red):

g <- dnorm
h <- function(x) exp(-(x-4.5))

xs <- seq(4.5,20,by=0.05)
plot(xs,g(xs),col="blue", type="l", ylim=c(0,0.5e-4))
lines(xs,h(xs),col="red",lwd=2)

# True value:
pnorm(4.5, lower.tail=FALSE)  # theta (could also be computed by 1-pnorm(4.5))
## [1] 3.397673e-06
# do the Importance Sampling
f  <- function(x) x/x         # f(x) = 1, vectorized (so we estimate the integral of g over the support of h)
rh <- function(n) rexp(n)+4.5 # rexp() shifted to 4.5
i.sampling(f,g,h,rh)
## [1] 3.407674e-06

Another sampling run, choosing a Pareto distribution for \(h\):

library(VGAM)
h <- function(x) {
  dpareto(x, scale=4.5, shape=10)
}

xs <- seq(4.5,20,by=0.05)
plot(xs,g(xs),col="blue", type="l", ylim=c(0,0.25e-3))
lines(xs,h(xs),col="red",lwd=2)

rh <- function(n) rpareto(n, 4.5, 10)

i.sampling(f,g,h,rh)
## [1] 3.384464e-06

MC in Inference

In statistical inference there is a sample \(x_1,\ldots,x_n\) taken from a certain probabilistic model.

An important task is to compute a statistic \(\theta(x_1,\ldots,x_n)\), usually to estimate a model parameter. If we don’t know the sampling distribution of \(\theta\), we can study it using Monte Carlo methods.

Say, to estimate the mean \(E[\theta]\):

  1. For each \(i=1 \ldots N\):
    1. Generate a sample \(x_1,\ldots,x_n\)
    2. Compute \(\theta^{(i)} = \theta(x_1,\ldots,x_n)\)
  2. Compute \(E[\theta] = \frac{1}{N} \sum_i \theta^{(i)}\)
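A generic sketch of this recipe in R (the names MC.mean, sample.fn and stat.fn are mine, introduced only for illustration):

# sample.fn() returns one sample x_1,...,x_n; stat.fn computes theta from it
MC.mean <- function(sample.fn, stat.fn, N=1e4) {
  thetas <- replicate(N, stat.fn(sample.fn()))  # steps 1.1 and 1.2
  mean(thetas)                                  # step 2
}

# eg (same as the example below): E[|X1-X2|] with X1, X2 iid N(0,1)
MC.mean(function() rnorm(2), function(x) abs(x[1]-x[2]))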

Eg: given two standard normal iid random vars \(X_1, X_2\) estimate \(E[|X_1 - X_2|]\)

The analytical solution is found by:

\[\begin{array}{lcll} \int \int |x_1-x_2| f(x_1,x_2) dx_1 dx_2 & = & \int \int |x_1-x_2| f(x_1)f(x_2) dx_1 dx_2 & \color{blue}{ X_1 \perp X_2 } \\ & = & \int_{-\infty}^{+\infty} \int_{-\infty}^{\infty} |x_1-x_2| \frac{1}{\sqrt{2\pi}} e^{-x_1^2/2} \frac{1}{\sqrt{2\pi}} e^{-x_2^2/2} dx_1 dx_2 & \color{blue}{ X_1, X_2 \sim \mathcal{N}(0,1) } \\ & = & \int_{-\infty}^{+\infty} \int_{-\infty}^{\infty} |x_1-x_2| \frac{1}{2\pi} e^{-(x_1^2+x_2^2)/2} dx_1 dx_2 & \color{blue}{ } \\ & = & \frac{2}{\sqrt{\pi}} & \color{blue}{ } \\ & \approx & 1.12838 & \color{blue}{ } \\ \end{array} \]

note: Mathematica formula to compute this integral:

Integrate[(Abs[x - y]*Exp[(-x^2 - y^2)/2])/(2*Pi), {x, -Infinity, Infinity}, {y, -Infinity, Infinity}]

In R:

n <- 1e5
X1 <- rnorm(n,0,1)
X2 <- rnorm(n,0,1)

f <- function(x1,x2) abs(x1-x2)

thetas <- f(X1,X2)
mean(thetas)
## [1] 1.127228

The standard error of the mean \(\overline{X}\) of a sample of size \(n\) is \(\sqrt{var(X)/n}\). When the distribution is unknown, we replace \(var(X)\) by the variance of the empirical distribution of the sample \(x_1,\ldots,x_n\), which is

\[\widehat{var(X)} = \frac{1}{n} \sum_{i=1}^n (x_i-\overline{x})^2\]

And the standard error:

\[\widehat{se(\overline{x})} = \sqrt{\frac{1}{n^2}~\sum_{i=1}^n (x_i-\overline{x})^2} = \frac{1}{n} \sqrt{\sum_{i=1}^n (x_i-\overline{x})^2}\]

From the previous eg:

# standard error = sqrt( var(|X1-X2|)/n )
sqrt(var(thetas)/n)
## [1] 0.002694323

To estimate the cdf of \(\theta\),

\[F_T(x) = P(T \le x) \approx \widehat{F_T}(x) = \frac{1}{n} \sum_{i=1}^n {\bf 1}_{\theta^{(i)} \le x}\]

n <- 1e4
thetas <- rnorm(n,0,1)    # the sample

bins <- 50; xs <- seq(-3,3,len=bins)      # how many divisions of the x-axis we use

cdf.hat <- rep(NA,length(xs))
for (i in 1:bins) {
  cdf.hat[i] <- sum(thetas < xs[i]) / n   # compute the cdf estimation
}

# plot real cdf (in blue) vs estimated cdf (in red)
plot(xs,pnorm(xs,0,1), type="l", col="blue", lwd=2, xlab="x", ylab="cdf")
lines(xs,cdf.hat, col="red", lwd=2)
legend("topleft",c(expression(F[T](x)),expression(hat(F[T])(x))), col=c("blue","red"), lwd=2) 

Markov Chain Monte Carlo (MCMC) Integration

When we want to estimate \(E[g(\theta)]\) we can find the sample mean

\[\overline{g} = \frac{1}{m} \sum_{i=1}^m g(x_i)\]

where the sample \(x_1,\ldots,x_m\) is sampled from an appropriate density.

If \(x_1,\ldots,x_m\) are independent then by the laws of large numbers, the mean converges in probability to \(E[g(\theta)]\). This can be done by regular Monte Carlo integration.

However, it can be difficult to generate iid observations. Even if the observations are dependent, MC integration can still be applied, provided the generated (dependent) observations have a joint density roughly the same as the joint density of an iid random sample. To achieve this, Markov chains are used: they provide the sampler that generates the dependent observations from the target distribution.

The Metropolis-Hastings algorithm is an MCMC method that tries to achieve this. The main idea is to generate a Markov chain \(\{X_t \mid t=0,1,\ldots\}\) whose stationary distribution is the target distribution. The algorithm, given \(X_t\), knows how to compute \(X_{t+1}\): it generates a candidate point \(Y\) from a proposal distribution \(g(\cdot|X_t)\), which (usually) depends on the previous state. This point \(Y\) may or may not be accepted. If it is accepted, then \(X_{t+1} = Y\); otherwise the chain remains in the same place, \(X_{t+1} = X_t\).

The proposal distribution \(g\) must be chosen so that the generated chain converges to a stationary distribution, namely the target distribution \(f\). The proposal distribution is how we generate possibly good points for the target distribution. If the proposal is poorly chosen, the algorithm will produce lots of rejections and converging to the target distribution might take more time than we have available.

If the generation of candidates does not depend on the current state of the chain, the proposal distribution is independent of \(x_t\) (an independence sampler). In that case the candidate \(y\) is always accepted when \(f(y)/g(y) \geq f(x_t)/g(x_t)\); otherwise it is accepted with probability \(\frac{f(y)\,g(x_t)}{f(x_t)\,g(y)}\).

Here’s the R code (algorithm details can be found in Rizzo’s book, p. 248):

metropolis.hastings <- function(f,  # the target distribution
                                g,  # the proposal distribution
                                rg, # a sample from the proposal distribution
                                x0, # initial value for chain, in R it is x[1]
                                chain.size=1e5,  # chain size
                                burn.perc=0.1) { # burn in percentage
  
  x <- c(x0, rep(NA,chain.size-1))  # initialize chain
  
  for(i in 2:chain.size)   {
    y <- rg(x[i-1])                 # generate Y from g(.|xt) using sampler rg
    alpha <- min(1, f(y)*g(x[i-1],y)/(f(x[i-1])*g(y,x[i-1])))
    x[i] <- x[i-1] + (y-x[i-1])*(runif(1)<alpha)  # update step
  }
  
  # remove initial part of the chain before output result
  x[(burn.perc*chain.size) : chain.size] 
}

This first eg samples from a uniform distribution (the proposal distribution) to generate a sample from a Beta(2.7, 6.3) distribution:

a<-2.7; b<-6.3; size<-1e4

f  <- function(x)   dbeta(x,a,b)
rg <- function(x)   runif(1,0,1)
g  <- function(x,y) 1 # i.e., dunif(x,0,1)

X <- metropolis.hastings(f,g,rg,x0=runif(1,0,1),chain.size=size)

par(mfrow=c(1,2),mar=c(2,2,1,1))
hist(X,breaks=50,col="blue",main="Metropolis-Hastings",freq=FALSE)
curve(dbeta(x,a,b),col="sienna",lwd=2,add=TRUE)
hist(rbeta(size,a,b),breaks=50,col="grey",main="Direct Sampling",freq=FALSE)
curve(dbeta(x,a,b),col="sienna",lwd=2,add=TRUE)