Let’s repeat the essential code here:

library(BRugs)
## Welcome to BRugs connected to OpenBUGS version 3.2.3
run.model <- function(model, samples, data=list(), chainLength=10000, burnin=0.10, 
                      init.func, n.chains=1, thin=1) {
  
  writeLines(model, con="model.txt")  # Write the modelString to a file
  modelCheck( "model.txt" )           # Send the model to BUGS, which checks the model syntax
  if (length(data)>0)                 # If there's any data available...
    modelData(bugsData(data))         # ... BRugs puts it into a file and ships it to BUGS
  modelCompile(n.chains)              # BRugs command tells BUGS to compile the model
  
  if (missing(init.func)) {
    modelGenInits()                   # BRugs command tells BUGS to randomly initialize a chain
  } else {
    for (chain in 1:n.chains) {       # otherwise use user's init data
      modelInits(bugsInits(init.func))
    }
  }
  
  modelUpdate(chainLength*burnin)     # Burn-in period to be discarded
  samplesSet(samples)                 # BRugs tells BUGS to keep a record of the sampled values
  samplesSetThin(thin)                # Set thinning
  modelUpdate(chainLength)            # BRugs command tells BUGS to sample for chainLength iterations
}

For plotting with error bars:

plot.errors <- function(x, y, e) {
  plot(x, y, pch=19, ylim=c(min(y)*0.75, max(y)*1.2))  # data points
  segments(x, y-e, x, y+e)                             # vertical error bars
  width.bar <- mean(e)/10                              # half-width of the bar caps
  segments(x-width.bar, y-e, x+width.bar, y-e)         # lower caps
  segments(x-width.bar, y+e, x+width.bar, y+e)         # upper caps
}

Bayesian Linear Regression

Given a data set \(D = (x_1,y_1), \ldots, (x_N,y_N)\) where \(x \in \mathbb{R}^d, y \in \mathbb{R}\), a Bayesian Linear Regression models the problem in the following way:

Prior: \[w \sim \mathcal{N}(0, \sigma_w^2 I_d)\]

\(w\) is the vector \((w_1, \ldots, w_d)^T\), so the prior is a multivariate Gaussian; \(I_d\) is the \(d\times d\) identity matrix.

Likelihood: \[Y_i \sim \mathcal{N}(w^T x_i, \sigma^2)\]

We assume that \(Y_i \perp Y_j \mid w\) for \(i \neq j\).

For now we’ll use the precision instead of the variance, \(a = 1/\sigma^2\), and \(b = 1/\sigma_w^2\). We’ll also assume that \(a,b\) are known.

The prior can be stated as \[p(w) \propto \exp \Big\{ -\frac{b}{2} w^T w \Big\}\]

And the likelihood \[p(D|w) \propto \exp \Big\{ -\frac{a}{2} (y-Aw)^T (y-Aw) \Big\}\]

where \(y = (y_1,\ldots,y_N)^T\) and \(A\) is an \(N\times d\) matrix whose i-th row is \(x_i^T\).

The posterior is \[p(w|D) \propto p(D|w) p(w)\]

After some algebra we find that

\[p(w|D) = \mathcal{N}(w \mid \mu, \Lambda^{-1})\]

where (\(\Lambda\) is the precision matrix)

\[\Lambda = a A^T A + b I_d \] \[\mu = a \Lambda^{-1} A^T y\]

Notice that \(\mu\) is equal to the \(w_{MAP}\) of regular linear regression; this is because, for a Gaussian, the mean is equal to the mode.

Also, with a little algebra on \(\mu\) (recall that \(\Lambda = aA^TA+bI_d\)) we get the following equality:

\[\mu = (A^T A + \frac{b}{a} I_d)^{-1} A^T y\]

and compare with \(w_{MLE}\):

\[w_{MLE} = (A^T A)^{-1} A^T y\]

The extra term in \(\mu\) corresponds to the prior. This is the same expression as Ridge regression for the special case \(\lambda = \frac{b}{a}\). Ridge regression is more general in the sense that its penalty may correspond to an improper prior from the Bayesian perspective.
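To make the comparison concrete, here is a minimal numeric check on simulated data (the data and the values of \(a\) and \(b\) below are made up just for illustration): the posterior mean \(\mu\) coincides with the ridge solution for \(\lambda = b/a\), while \(w_{MLE}\) differs.

# Minimal numeric check (simulated data; a and b are arbitrary illustration values)
set.seed(1)
N <- 50; d <- 3
A      <- matrix(rnorm(N*d), ncol=d)          # design matrix, one x_i^T per row
w.true <- c(1, -2, 0.5)
a <- 1/0.25                                   # noise precision (sigma^2 = 0.25)
b <- 1/10                                     # prior precision (sigma_w^2 = 10)
y <- as.vector(A %*% w.true + rnorm(N, 0, sqrt(1/a)))

Lambda <- a * t(A) %*% A + b * diag(d)        # posterior precision
mu     <- a * solve(Lambda, t(A) %*% y)       # posterior mean (the MAP)
ridge  <- solve(t(A) %*% A + (b/a)*diag(d), t(A) %*% y)
w.mle  <- solve(t(A) %*% A, t(A) %*% y)

round(cbind(mu, ridge, w.mle), 4)             # mu and ridge agree; the MLE differs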

For the predictive posterior distribution:

\[p(y|x,D) = \int p(y|x,D,w) p(w|x,D) dw = \int p(y|x,w) p(w|D) dw\]

it is possible to calculate that

\[y|x,D \sim \mathcal{N}(\mu^Tx, \frac{1}{a} + x^T \Lambda^{-1}x)\]
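This closed form is easy to wrap in a small helper function (a sketch in the notation above; mu, Lambda and a are the quantities computed in the numeric check of the previous section):

# Predictive mean and sd of y | x, D under the model above (sketch)
predictive <- function(x.new, mu, Lambda, a) {
  x.new <- matrix(x.new, ncol=1)                              # column vector
  m <- as.numeric(t(mu) %*% x.new)                            # mu^T x
  v <- 1/a + as.numeric(t(x.new) %*% solve(Lambda) %*% x.new) # 1/a + x^T Lambda^-1 x
  c(mean=m, sd=sqrt(v))
}

predictive(A[1,], mu, Lambda, a)   # predictive distribution at the first training input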

Using BUGS

modelString = "
  model {
      for (i in 1:5) {
        y[i] ~ dnorm(mu[i], tau)
        mu[i] <- beta0 + beta1 * (x[i] - mean(x[]))
      }
  
      # Jeffreys priors
      beta0 ~ dflat()
      beta1 ~ dflat()
      tau   <- 1/sigma2
      log(sigma2) <- 2*log.sigma
      log.sigma ~ dflat()
  }
"

# data
x <- c(  8,  15,  22,  29,  36)  # day of measure
y <- c(177, 236, 285, 350, 376)  # weight in grams

data.list = list(
    x = x, 
    y = y  
)

# initializations
n.chains <- 1
log.sigmas <- c(0)
betas0 <- c(0)
betas1 <- c(0)

genInitFactory <- function()  {
  i <- 0
  function() {
    i <<- i + 1
    list( 
      log.sigma = log.sigmas[i],
      beta0 = betas0[i],
      beta1 = betas1[i]
    ) 
  }
}

run.model(modelString, samples=c("beta0", "beta1", "sigma2"), data=data.list, chainLength=15000,
          init.func=genInitFactory(), n.chains=n.chains)
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## model is initialized
## 1500 updates took 0 s
## monitor set for variable 'beta0'
## monitor set for variable 'beta1'
## monitor set for variable 'sigma2'
## 15000 updates took 0 s
samplesStats(c("beta0", "beta1", "sigma2"))
##           mean        sd  MC_error val2.5pc  median val97.5pc start sample
## beta0  284.900    8.2160  0.056790  270.100 284.900   300.300  1501  15000
## beta1    7.311    0.8012  0.006087    5.845   7.308     8.783  1501  15000
## sigma2 326.100 1015.0000 37.170000   37.950 144.300  1570.000  1501  15000
# Extract chain values:
beta0  <- samplesSample( "beta0" )
beta1  <- samplesSample( "beta1" )
sigma2 <- samplesSample( "sigma2" )

# Posterior prediction [from Kruschke - Doing Bayesian Data Analysis (2010)]
# Specify x values for which predicted y's are needed:
xPostPred <- seq( min(x)-5 , max(x)+5 , length=100 ) # just make a bunch of them
# Define matrix for recording posterior predicted y values at each x value.
# One row per x value, with each row holding random predicted y values.
postSampSize <- length(beta0)
yPostPred <- matrix( 0 , nrow=length(xPostPred) , ncol=postSampSize )
# Define matrix for recording HDI limits of posterior predicted y values:
yHDIlim <- matrix( 0 , nrow=length(xPostPred) , ncol=2 )
# Generate posterior predicted y values.
# This gets only one y value, at each x, for each step in the chain.
xM <- mean(xPostPred)
# generate values according to the model specified in BUGS:
# y[i] ~ dnorm(mu[i], tau)
# mu[i] <- beta0 + beta1 * (x[i] - mean(x[]))
for ( chainIdx in 1:postSampSize ) {
  yPostPred[,chainIdx] <- rnorm( length(xPostPred) ,  # rnorm(n, mean, sd)
                                 beta0[chainIdx] + beta1[chainIdx] * (xPostPred - xM),
                                 sqrt(sigma2[chainIdx]) )     # use this step's sigma2
}

source("HDIofMCMC.R") # call Kruschke's Highest Density Interval (HDI) script
for ( xIdx in 1:length(xPostPred) ) {  # get 95% HDI for each predicted x
    yHDIlim[xIdx,] <- HDIofMCMC( yPostPred[xIdx,] )
}
head(yHDIlim)
##          [,1]     [,2]
## [1,] 106.8501 186.2428
## [2,] 108.1011 188.3614
## [3,] 111.7592 190.7913
## [4,] 116.8344 193.5995
## [5,] 117.2369 195.2489
## [6,] 120.2614 197.5548
# Display data with HDIs of posterior predictions.
plot( x , y, xlim=c(8,36) , ylim=c(100,500), type="n", ylab=expression(hat(y)),
      main="Data with 95% HDI & Mean of Posterior Predictions")
polygon(c(xPostPred,rev(xPostPred)), c(yHDIlim[,1],rev(yHDIlim[,2])), col="lightgray")  # HDI's
points(x,y, pch=19)
lines( xPostPred , apply(yPostPred,1,mean) , col="red" ) # mean of the posterior predictions (the regression line)

Checking the analytical solution vs. the simulated one:

a <- 1/var(x)
b <- 1/20
A <- matrix(x, ncol=1)                           # d=1, ie, 1D samples

Lambda <- a * t(A) %*% A + b * diag(ncol(A))
mu     <- a * solve(Lambda) %*% t(A) %*% y

i <- 31  # select one of the estimated values
hist(yPostPred[i,], prob=T, breaks=100) # plot the estimated histogram

new.x <- xPostPred[i]
new.x
## [1] 14.51515
ys <- seq(100,300,1)
lines(ys, dnorm(ys, t(mu)%*%new.x, sqrt(1/a+t(new.x)%*%solve(Lambda)%*%new.x)) , lwd=2, col="red")

TODO: this does not fit :-( A likely reason: the analytical model above has no intercept and uses the raw (uncentred) \(x\), while the BUGS model includes an intercept \(\beta_0\) and centres \(x\); moreover, \(a\) and \(b\) were set to ad hoc values rather than to the noise and prior precisions implied by the BUGS fit, so the two posteriors are not directly comparable.

Bayesian Regression with outliers

We got this data somehow (in fact, from here).

x <- c(0,  3,  9, 14, 15, 19, 20, 21, 30, 35, 40, 41, 42, 43, 54, 56, 67, 69, 72, 88)
y <- c(33, 68, 34, 34, 37, 71, 37, 44, 48, 49, 53, 49, 50, 48, 56, 60, 61, 63, 44, 71)
e <- c(3.6, 3.9, 2.6, 3.4, 3.8, 3.8, 2.2, 2.1, 2.3, 3.8, 2.2, 2.8, 3.9, 3.1, 3.4, 2.6, 3.4, 3.7, 2.0, 3.5) # error of y

plot.errors(x,y,e)

We wish to model this using a linear model:

\[\hat{y}(x|\theta) = \theta_0 + \theta_1 x\]

and assume that the likelihood for each point is modelled by a Gaussian:

\[p(x_i,y_i,e_i | \theta) \propto \exp \Big\{ -\frac{1}{2e_i^2} (y_i - \hat{y}(x_i|\theta))^2 \Big\}\]

The traditional linear regression gives:

fit <- lm(y~x, data=data.frame(x=x,y=y))
plot.errors(x,y,e)
abline(fit,col="red",lwd=2)                        # showing the linear fit
points(x[c(2,6,19)],y[c(2,6,19)],col="red",pch=19) # showing the outliers

which does not seem right at all! This is because of the three obvious outliers (above in red) that influence the result.

We can hack a bit and use the Huber loss function, which is useful for dealing with outliers in a classical statistics setting, providing a robust linear regression:

# a: residuals, ie, y - hat.y
huber <- function(a, delta) {
  ifelse(abs(a)<delta, a^2/2, delta*(abs(a)-delta/2))      # ifelse is a vectorized conditional
}

huber.loss <- function(theta, x=x, y=y, e=e, delta=3) {
  sum( huber((y - theta[1] - theta[2]*x)/e, delta) )
}

fit.huber <- optim(par=c(0,0), fn=huber.loss, x=x, y=y, e=e) # find best values using optimization

plot.errors(x,y,e)
abline(fit, col="lightgrey")                               # showing the linear fit (in grey)
abline(fit.huber$par[1],fit.huber$par[2], col="red",lwd=2) # showing the robust fit

which is way better. However, the Huber loss function and the choice of its parameter value (set here to \(3\)) are somewhat ad hoc tools to attack the outlier problem.
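For reference (this is not part of the original analysis), R's classical toolbox offers the same ideas off the shelf: weighted least squares, which maximizes the Gaussian likelihood above with the known errors \(e_i\) but is still sensitive to outliers, and MASS::rlm, a robust M-estimator that uses a Huber-type loss by default. A brief sketch:

# Classical alternatives, for comparison only (a sketch)
library(MASS)

fit.wls <- lm(y ~ x, weights=1/e^2)    # weighted least squares: still dragged by the outliers
fit.rlm <- rlm(y ~ x, weights=1/e^2)   # robust regression with Huber's psi function

plot.errors(x, y, e)
abline(fit.wls, col="lightgrey", lwd=2)
abline(fit.rlm, col="red", lwd=2)      # close to the Huber fit obtained with optim above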

In Bayesian terms, acknowledging that outliers exist means we should modify the model to account for them.

Now the likelihood will be the following:

\[p(x_i,y_i,e_i | \theta, g_i,\sigma_B) = \frac{g_i}{\sqrt{2\pi e_i^2}} \exp \Big\{ -\frac{1}{2e_i^2} (y_i - \hat{y}(x_i|\theta))^2 \Big\} + \frac{1-g_i}{\sqrt{2\pi \sigma_B^2}} \exp \Big\{ -\frac{1}{2 \sigma_B^2} (y_i - \hat{y}(x_i|\theta))^2 \Big\}\]

Here, \(g_i=0\) means that data point \(x_i\) is an outlier, while \(g_i=1\) means it is not. If the i-th point is an outlier, the likelihood uses a Gaussian with standard deviation \(\sigma_B\), which could be treated as an extra nuisance parameter or simply set to a large value (we'll use \(50\)). Our model now has 22 parameters instead of the initial two (\(\theta_0\) and \(\theta_1\)).
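To make the formula concrete, here is the same per-point likelihood written as a plain R function (for illustration only; BUGS computes it internally via the model below):

# Per-point likelihood of the outlier mixture model (illustration only)
point.lik <- function(y.i, y.hat.i, e.i, g.i, sigmaB=50) {
  g.i     * dnorm(y.i, mean=y.hat.i, sd=e.i) +      # "good point" component, known error e_i
  (1-g.i) * dnorm(y.i, mean=y.hat.i, sd=sigmaB)     # "outlier" component, wide sd sigmaB
}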

Since BUGS cannot sample from an arbitrary user-defined distribution, we use the zeros trick to plug in the likelihood directly: an observed zero modelled as Poisson(\(\phi_i\)) contributes \(e^{-\phi_i}\) to the likelihood, so setting \(\phi_i = -\log L_i + C\) (with \(C\) large enough to keep \(\phi_i\) positive) injects an arbitrary likelihood term \(L_i\):

modelString = "
  model {
      for (i in 1:n) {

        phi[i] <- -log( (g[i]/sqrt(2*pi*pow(e[i],2))) * exp(-0.5*pow(y[i]-mu[i],2)/pow(e[i],2)) + ((1-g[i])/sqrt(2*pi*pow(sigmaB,2))) * exp(-0.5*pow(y[i]-mu[i],2)/pow(sigmaB,2)) ) + C
        
        dummy[i] <- 0
        dummy[i] ~ dpois( phi[i] )

        mu[i] <- theta0 + theta1 * x[i]
        g[i] ~ dunif(0,1)
        
      }
 
      theta0 ~ dflat()
      theta1 ~ dflat()
      
      C <- 10000   # for the zero's trick
      pi <- 3.14159
  }
"

# data
data.list = list(
    x = x, 
    y = y,
    e = e,
    n = length(x),
    sigmaB = 50
)

# initializations
n.chains <- 1
theta0 <- c(0)
theta1 <- c(0)
g <- rep(0.01,length(x))

genInitFactory <- function()  {
  i <- 0
  function() {
    i <<- i + 1
    list( 
      theta0 = theta0[i],
      theta1 = theta1[i],
      g = g
    ) 
  }
}

run.model(modelString, samples=c("theta0", "theta1", "g"), data=data.list, 
          chainLength=25000, init.func=genInitFactory(), n.chains=n.chains)
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## model is initialized
## 2500 updates took 0 s
## monitor set for variable 'theta0'
## monitor set for variable 'theta1'
## monitor set for variable 'g'
## 25000 updates took 2 s
samplesStats(c("theta0", "theta1", "g"))
##           mean      sd  MC_error val2.5pc  median val97.5pc start sample
## theta0 31.3000 1.63900 0.0295200 28.11000 31.3000   34.5300  2501  25000
## theta1  0.4691 0.03808 0.0006846  0.39430  0.4690    0.5443  2501  25000
## g[1]    0.6410 0.25290 0.0019210  0.10000  0.6821    0.9862  2501  25000
## g[2]    0.3323 0.23560 0.0019900  0.01169  0.2920    0.8410  2501  25000
## g[3]    0.6454 0.25100 0.0021120  0.10510  0.6887    0.9873  2501  25000
## g[4]    0.6259 0.26080 0.0019830  0.07859  0.6691    0.9861  2501  25000
## g[5]    0.6428 0.25060 0.0019230  0.10660  0.6839    0.9868  2501  25000
## g[6]    0.3309 0.23590 0.0018550  0.01264  0.2898    0.8412  2501  25000
## g[7]    0.6030 0.26970 0.0021080  0.06083  0.6436    0.9857  2501  25000
## g[8]    0.6251 0.25990 0.0020680  0.08289  0.6666    0.9860  2501  25000
## g[9]    0.6355 0.25490 0.0018990  0.09566  0.6808    0.9864  2501  25000
## g[10]   0.6408 0.25150 0.0020230  0.10290  0.6826    0.9851  2501  25000
## g[11]   0.6269 0.25850 0.0020520  0.08543  0.6678    0.9859  2501  25000
## g[12]   0.6447 0.25020 0.0020680  0.11000  0.6864    0.9879  2501  25000
## g[13]   0.6407 0.25150 0.0020320  0.10280  0.6832    0.9852  2501  25000
## g[14]   0.6280 0.25930 0.0020200  0.08421  0.6697    0.9853  2501  25000
## g[15]   0.6410 0.24980 0.0018970  0.10880  0.6814    0.9864  2501  25000
## g[16]   0.6396 0.25300 0.0021610  0.10210  0.6811    0.9871  2501  25000
## g[17]   0.6395 0.25200 0.0021170  0.10650  0.6781    0.9869  2501  25000
## g[18]   0.6387 0.25220 0.0022050  0.10650  0.6804    0.9860  2501  25000
## g[19]   0.3346 0.23480 0.0019490  0.01284  0.2962    0.8390  2501  25000
## g[20]   0.6368 0.25400 0.0021190  0.09704  0.6780    0.9856  2501  25000
theta0.hat <- mean(samplesSample("theta0"))
theta1.hat <- mean(samplesSample("theta1"))

plot.errors(x,y,e)
abline(fit, col="lightgrey")                                # showing the linear fit (in grey)
abline(fit.huber$par[1],fit.huber$par[2], col="grey",lwd=2) # showing the robust fit (in solid grey)
abline(theta0.hat, theta1.hat, col="red",lwd=2)
# define outliers as those which g[i] is less than 0.5
posterior.g <- rep(NA,length(g))
for(i in 1:length(g)) {
  posterior.g[i] <- mean(samplesSample(paste0("g[",i,"]")))
}
outliers <- which(posterior.g<0.5)
# plot outliers
points(x[outliers], y[outliers], col="red", pch=12)

If we do not care about flagging which points are outliers, we can simply use a heavier-tailed likelihood to reduce their influence. The next model uses a t distribution for that purpose:

modelString = "
  model {
      for (i in 1:n) {
        tau[i] <- 1/pow(e[i],2)
        y[i] ~ dt(mu[i], tau[i], 4)
    
        mu[i] <- theta0 + theta1 * x[i]
      }
  
      theta0 ~ dflat()
      theta1 ~ dflat()
  }
"

data.list = list(
    x = x, 
    y = y,
    e = e,
    n = length(x)
)

# initializations
n.chains <- 1
theta0 <- c(0)
theta1 <- c(0)

genInitFactory <- function()  {
  i <- 0
  function() {
    i <<- i + 1
    list( 
      theta0 = theta0[i],
      theta1 = theta1[i]
    ) 
  }
}

run.model(modelString, samples=c("theta0", "theta1"), data=data.list, chainLength=15000,
          init.func=genInitFactory(), n.chains=n.chains)
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## model is initialized
## 1500 updates took 0 s
## monitor set for variable 'theta0'
## monitor set for variable 'theta1'
## 15000 updates took 0 s
samplesStats(c("theta0", "theta1"))
##           mean      sd MC_error val2.5pc  median val97.5pc start sample
## theta0 32.1600 1.70100 0.045930  28.9400 32.1400    35.580  1501  15000
## theta1  0.4451 0.03785 0.001012   0.3689  0.4454     0.518  1501  15000
theta0.hat <- mean(samplesSample("theta0"))
theta1.hat <- mean(samplesSample("theta1"))

plot.errors(x,y,e)
abline(fit, col="lightgrey")                                # showing the linear fit (in grey)
abline(fit.huber$par[1],fit.huber$par[2], col="grey",lwd=2) # showing the robust fit (in solid grey)
abline(theta0.hat, theta1.hat, col="red",lwd=2)

Non-linear Regression

For this example we'll use Gelfand's dataset of dugongs (sea cows).

The goal is to estimate the size \(y_i\) given the age \(x_i\).

The model is the following:

\[Y_i = \alpha - \beta \gamma^{x_i} + \epsilon_i, i = 1\ldots,n\]

where \(\alpha,\beta>0\), \(0 \leq \gamma \leq 1\), and the noise terms \(\epsilon_i\) are iid \(\mathcal{N}(0,\sigma^2)\).

The interpretation is the following: \(\alpha\) is the asymptotic adult size (the expected size as \(x_i \to \infty\)), \(\alpha - \beta\) is the expected size at age zero, and \(\gamma\) controls how quickly the curve approaches that asymptote (the closer \(\gamma\) is to \(1\), the slower the growth).
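A quick way to see these roles is to draw the curve for a couple of parameter settings (illustrative values only):

# The growth curve for two illustrative parameter settings
curve(2.7 - 1.0 * 0.85^x, from=0, to=35, ylim=c(1.5,3), lwd=2,
      xlab="age", ylab="size", main="alpha - beta * gamma^x")
curve(2.7 - 1.0 * 0.95^x, from=0, to=35, lwd=2, lty=2, add=T)  # gamma closer to 1: slower growth
abline(h=2.7, col="grey", lty=3)                               # the asymptote alpha
legend("bottomright", c("gamma = 0.85", "gamma = 0.95"), lty=1:2)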

The BUGS model, with the priors, is:

modelString = "
  model {
    for( i in 1:N ) {
            y[i] ~ dnorm(mu[i], tau)
            mu[i] <- alpha - beta * pow(gamma, x[i])            
        }

        alpha ~ dflat()
        beta  ~ dflat()
        gamma ~ dunif(0.5, 1.0)
    sigma ~ dunif(0.01, 100)
    tau   <- 1/(sigma*sigma)  
  }
"

x = c( 1.0,  1.5,  1.5,  1.5, 2.5,   4.0,  5.0,  5.0,   7.0,
       8.0,  8.5,  9.0,  9.5, 9.5,  10.0, 12.0, 12.0,  13.0,
       13.0, 14.5, 15.5, 15.5, 16.5, 17.0, 22.5, 29.0, 31.5)

y = c(1.80, 1.85, 1.87, 1.77, 2.02, 2.27, 2.15, 2.26, 2.47,
      2.19, 2.26, 2.40, 2.39, 2.41, 2.50, 2.32, 2.32, 2.43,
      2.47, 2.56, 2.65, 2.47, 2.64, 2.56, 2.70, 2.72, 2.57)

data.list = list(
  x = x,
    y = y, 
  N = length(x))

# initializations
n.chains <- 3
alpha <- c(1,10,100)
beta  <- c(1,10,100)
gamma <- c(.9,.7,.5)
sigma <- c(1,10,100)

genInitFactory <- function()  {
  i <- 0
  function() {
    i <<- i + 1
    list( 
      alpha = alpha[i],
      beta  = beta[i],
      gamma = gamma[i],
      sigma = sigma[i]
    ) 
  }
}

run.model(modelString, samples=c("alpha", "beta", "gamma", "sigma"), data=data.list, 
          chainLength=50000, init.func=genInitFactory(), n.chains=n.chains)
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## initial values loaded and chain initialized but another chain contain uninitialized variables
## Initializing chain 2:
## initial values loaded and chain initialized but another chain contain uninitialized variables
## Initializing chain 3:
## model is initialized
## 5000 updates took 3 s
## monitor set for variable 'alpha'
## monitor set for variable 'beta'
## monitor set for variable 'gamma'
## monitor set for variable 'sigma'
## 50000 updates took 31 s
samplesStats(c("alpha", "beta", "gamma", "sigma"))
##         mean      sd  MC_error val2.5pc  median val97.5pc start sample
## alpha 2.6520 0.07366 9.429e-04   2.5280 2.64600    2.8160  5001 150000
## beta  0.9739 0.07879 5.028e-04   0.8230 0.97240    1.1330  5001 150000
## gamma 0.8619 0.03379 3.910e-04   0.7839 0.86570    0.9163  5001 150000
## sigma 0.1009 0.01572 6.209e-05   0.0758 0.09904    0.1370  5001 150000

So, let’s plot the results:

alpha.hat <- mean(samplesSample("alpha"))
beta.hat  <- mean(samplesSample("beta"))
gamma.hat <- mean(samplesSample("gamma"))

plot(x,y,pch=19,xlab="Age (in years)", ylab="Size (in meters)",main="Dugong dataset")
curve(alpha.hat - beta.hat * gamma.hat^x, from=0, to=35, col="red", lwd=2, add=T) # plot estimation

# draw 2.5% and 97.5% growth curves
alpha2.5 <- samplesStats(c("alpha"))$val2.5pc
beta2.5  <- samplesStats(c("beta"))$val2.5pc
gamma2.5 <- samplesStats(c("gamma"))$val2.5pc

alpha97.5 <- samplesStats(c("alpha"))$val97.5pc
beta97.5  <- samplesStats(c("beta"))$val97.5pc
gamma97.5 <- samplesStats(c("gamma"))$val97.5pc

curve(alpha97.5 - beta97.5 * gamma97.5^x, from=0, to=35, col="grey", lwd=2, lty=2, add=T) 
curve(alpha2.5  - beta2.5  * gamma2.5^x,  from=0, to=35, col="grey", lwd=2, lty=2, add=T)
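As a quick classical check (a sketch, not part of the Bayesian analysis), the same curve can be fitted by non-linear least squares with nls; its estimates should be close to the posterior means above:

# Non-linear least squares fit of the same growth curve (classical check)
fit.nls <- nls(y ~ alpha - beta * gamma^x,
               start=list(alpha=2.5, beta=1, gamma=0.9))
coef(fit.nls)   # compare with the posterior means of alpha, beta and gamma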

Generalized Linear Modelling

Linear regression predicts that the expected value of the dependent/response variable \(y\) is a linear combination of the independent/observed variables \(x_i\). This is appropriate when the response variable can be modelled by a normal distribution; when that is not the case, linear regression is not suitable.

If, for example, the response \(y\) depends exponentially on \(x\), then it is its logarithm that varies linearly with \(x\). This is called a log-linear model.

Another example is a boolean response \(y\). If an increase in \(x\) increases the probability of \(y=1\), then what varies linearly with \(x\) might be the odds, or the logarithm of the odds. This last case is called a log-odds (aka logit) model.

Generalized Linear Modelling (GLM) generalizes linear regression in the sense that it allows \(y\) to have an arbitrary distribution and lets a certain function \(g\) of the expected value of \(y\) vary linearly with \(x\). The function \(g\) is called the link function.

For the exponential example above, \(y\) can be modelled by a Poisson distribution with \(g(x)=\log(x)\); for the boolean example, \(y\) might be modelled by a Bernoulli with link function \(g(x)=\text{logit}(x)=\log(x/(1-x))\).

The GLM assumes that \(Y\) has a distribution from the exponential family, such as the Bernoulli, Binomial, Poisson, Exponential, Laplace, Normal, Gamma, …

\[E[Y_i] = g^{-1}(\beta_0 + \sum_k \beta_k x_{ki})\]

To fit a GLM we need a distribution for \(Y\), a link function \(g\), and a way to estimate the coefficients of the linear predictor; the latter can be done with several techniques, and here we'll use BUGS (surprise!).
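As a small aside, R already has the logit link and its inverse built in as qlogis and plogis, which can be handy for interpreting the coefficients below (a tiny illustration):

# The logit link and its inverse (just to fix ideas)
p <- 0.73
qlogis(p)                           # logit(p) = log(p/(1-p)), the log-odds
plogis(qlogis(p))                   # the inverse link maps the log-odds back to p
all.equal(qlogis(p), log(p/(1-p)))  # same formula written by hand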

GLM with a Bernoulli: Logistic Regression

Assume we have a binary version of the previous dugong dataset, keeping only a criterion that decides whether a dugong is full grown (imagine we no longer have access to \(y_i\) itself). The new response \(Z_i\) is defined as \(1\) if \(y_i>2.4\), and \(0\) otherwise:

z <- as.numeric(y>2.4)
z
##  [1] 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1

To apply GLM (cf. section above) we choose a Bernoulli distribution for \(Z_i\) and the logit link function.

A logistic model for \(p_i = P(Z_i=1|X_i)\) is

\[logit(p_i) = \log(\frac{p_i}{1-p_i}) = \beta_0 + \beta_1 \log(x_i)\]

modelString = "
  model {
    for( i in 1:N) {    
      z[i] ~ dbern( p[i] )

      # necessary to subtract the mean (ie, center x) to prevent MCMC convergence issues
      logit(p[i]) <- beta0 + beta1 * (logAge[i] - mean(logAge[]))  # logistic model
      logAge[i]  <- log(x[i])    
    }

    beta0 ~ dflat()
    beta1 ~ dflat()
  }
"

data.list = list(
  x = x,
  z = z, 
  N = length(x))

# initializations
beta0  <- c(0)
beta1  <- c(0)

genInitFactory <- function()  {
  i <- 0
  function() {
    i <<- i + 1
    list( 
      beta0 = beta0[i],
      beta1 = beta1[i]
    ) 
  }
}

run.model(modelString, samples=c("beta0", "beta1"), data=data.list, 
          chainLength=50000, init.func=genInitFactory(), n.chains=1)
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## model is initialized
## 5000 updates took 0 s
## monitor set for variable 'beta0'
## monitor set for variable 'beta1'
## 50000 updates took 1 s
samplesStats(c("beta0", "beta1"))
##         mean    sd MC_error val2.5pc median val97.5pc start sample
## beta0 -1.505 0.961  0.01582   -3.704 -1.420    0.1096  5001  50000
## beta1  6.154 2.482  0.04349    2.358  5.811   12.1300  5001  50000

Plotting the results:

beta0.hat  <- mean(samplesSample("beta0"))
beta1.hat  <- mean(samplesSample("beta1"))

plot(x,z,pch=19,xlab="Age (in years)", ylab="Adult", main="Dugong dataset")
# draw the estimated values for E[Y_i] (it is a logistic curve)
xs <- seq(0.1,32,.1)
exp.logit <- exp( beta0.hat + beta1.hat*(log(xs) - mean(log(x))) )
lines(xs,exp.logit/(1+exp.logit), col="red", lwd=2) # plot estimation 

For instance, what is the probability that an 8-year-old dugong is full grown?

x.new <- 8
exp.logit.new <- exp( beta0.hat + beta1.hat*(log(x.new) - mean(log(x))) )
exp.logit.new/(1+exp.logit.new)
## [1] 0.1962679
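Since the priors on \(\beta_0,\beta_1\) are flat, the posterior should be roughly centred on the classical maximum-likelihood fit; a quick check with glm (a sketch, using a pre-centred covariate so that prediction on new data works correctly):

# Classical logistic regression on the same centred log-age (sanity check)
logAge.c <- log(x) - mean(log(x))
fit.glm  <- glm(z ~ logAge.c, family=binomial)
coef(fit.glm)    # roughly comparable to beta0.hat and beta1.hat

# probability that a dugong aged 8 is full grown, according to the classical fit
predict(fit.glm, newdata=data.frame(logAge.c=log(8) - mean(log(x))), type="response")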

GLM: Poisson Regression

In Poisson regression, usually used to model counts of events within a fixed time interval, we choose a Poisson distribution for \(Y_i\) and the log link.

A log-linear model for \(\lambda_i\) is

\[\log(\lambda_i) = \beta_0 + \beta_1 x_i\]

set.seed(101)
n <- 20
x <- runif(n,0,30)
beta.0 <- -4
beta.1 <- 0.3
y <- exp( beta.0 + beta.1*x + rpois(n,1.0) )
plot(x,y,pch=19)
xs <- seq(.1,30,len=101)
lines(xs,exp(beta.0 + beta.1*xs),lwd=2,col="blue")

modelString = "
  model {
    for( i in 1:N) {    
      y[i] ~ dpois( lambda[i] )

      log(lambda[i]) <- beta0 + beta1 * x[i]
    }

    beta0 ~ dnorm(0,0.0001)
    beta1 ~ dnorm(0,0.0001)
  }
"

data.list = list(
  x = x,
  y = y, 
  N = n)

# initializations
beta0  <- c(0)
beta1  <- c(1)

genInitFactory <- function()  {
  i <- 0
  function() {
    i <<- i + 1
    list( 
      beta0 = beta0[i],
      beta1 = beta1[i]
    ) 
  }
}

run.model(modelString, samples=c("beta0", "beta1"), data=data.list, 
          chainLength=250000, init.func=genInitFactory(), n.chains=1)
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## model is initialized
## 25000 updates took 0 s
## monitor set for variable 'beta0'
## monitor set for variable 'beta1'
## 250000 updates took 5 s
samplesStats(c("beta0", "beta1"))
##          mean      sd  MC_error val2.5pc  median val97.5pc start sample
## beta0 -2.2020 0.30540 0.0025220  -2.8150 -2.1960   -1.6150 25001 250000
## beta1  0.2465 0.01261 0.0001021   0.2222  0.2463    0.2718 25001 250000
beta0.hat  <- mean(samplesSample("beta0"))
beta1.hat  <- mean(samplesSample("beta1"))

plot(x,y,pch=19)
xs <- seq(.1,30,len=101)
lines(xs,exp(beta.0 + beta.1*xs),lwd=2,col="blue")      # target function
lines(xs,exp(beta0.hat + beta1.hat*xs),lwd=2,col="red") # proposed function
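For comparison (a sketch only): a classical GLM fit of the same log-linear model. The quasipoisson family gives the same point estimates as the Poisson family while accepting the non-integer \(y\) values generated above without warnings.

# Classical log-linear fit for comparison (quasipoisson: same estimates, no warnings for non-integer y)
fit.glm <- glm(y ~ x, family=quasipoisson)
coef(fit.glm)   # compare with beta0.hat and beta1.hat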

Mixture of Gaussians

Let’s say we have this dataset (taken from Bayesian Methods for Hackers):

samples <- read.csv("mixture_data.csv", header=F)[,1]
hist(samples, breaks=40)

It appears the data is bimodal, with one peak around 120 and another around 200.

Our model proposes that the dataset comes from two clusters, each generated by a normal distribution. The generative construction is as follows (a small simulation of this process is sketched right after the list):

For each data point \(y_i\):

  1. choose distribution 1 with probability \(p\), or distribution 2 with probability \(1-p\)

  2. draw one random sample from the chosen distribution \(\mathcal{N}(\mu_k, \sigma_k)\), \(k=1,2\)
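This generative story is easy to mimic directly in R (the parameter values below are purely illustrative):

# Simulating the generative process above (illustrative parameter values)
set.seed(42)
n.sim <- 300
p.sim <- 0.4
G.sim <- sample(1:2, n.sim, replace=TRUE, prob=c(p.sim, 1-p.sim))  # step 1: pick a cluster
mus    <- c(120, 200)
sigmas <- c(30, 20)
y.sim  <- rnorm(n.sim, mean=mus[G.sim], sd=sigmas[G.sim])          # step 2: draw from it
hist(y.sim, breaks=40)   # bimodal, similar in shape to the real dataset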

So, our model is

\[G_i \sim \text{Bernoulli}(p)\]

\[y_i \sim \mathcal{N}(\mu_{G_i}, \sigma_{G_i})\]

\[\mu_k \sim \mathcal{N}(0,1000)\]

\[\sigma_k \sim \text{Gamma}(0.01, 0.01)\]

\(G_i\) denotes the cluster of the i-th datapoint; in this problem \(G_i \in \{1,2\}\). In the following model, dcat is used instead of a Bernoulli, just because this way it also works with more than two clusters.

modelString = "
# BUGS model specification begins ...

model {

  for( i in 1 : N ) {
     y[i]   ~  dnorm(mu[i], tau[i])  # likelihood
     mu[i]  <- lambda[G[i]]          # prior for mean
     tau[i] <- lambdaTau[G[i]]       # prior for precision
     G[i]   ~  dcat(P[])             # the cluster attributions for each y_i
  }   

  P[1:2] ~ ddirch(alpha[])           # dirichlet distribution (in this case just for 2 clusters)
  alpha[1] <- 0.5                    # It generalizes the beta (with K=2 we could have used the beta), and
  alpha[2] <- 0.5                    # is the conjugate for the categorical distribution

  lambda[1] ~ dnorm(0.0, 1.0E-6)     # hyperparameters for mean
  lambda[2] <- lambda[1] + theta
  theta ~ dnorm(0.0, 1.0E-6)I(0.0, )

  lambdaTau[1] ~ dgamma(0.01,0.01)   # hyperparameters for precision/standard deviation
  lambdaTau[2] ~ dgamma(0.01,0.01)

  sigma[1] <- 1 / sqrt(lambdaTau[1])
  sigma[2] <- 1 / sqrt(lambdaTau[2])
}

# ... BUGS model specification ends.
" # close quote to end modelString

There is a trick here. To make the two components identifiable and prevent degenerate fits (e.g. all datapoints assigned to a single cluster, or the cluster labels switching during sampling), the second mean hyperparameter \(\lambda_2\) is defined as \(\lambda_2 = \lambda_1 + \theta\), with \(\theta > 0\) (check the BUGS Book, pages 280-3).

data.list = list(
  y = samples, 
  N = length(samples),
  G = c(1, rep(NA,length(samples)-2), 2)  # TODO: do not understand these 1 and 2
)

# let's apply some thinning to the mcmc:
run.model(modelString, samples=c("sigma", "lambda", "P", "G"), data=data.list, chainLength=3e4, thin=4)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 3000 updates took 1 s
## monitor set for variable 'sigma'
## monitor set for variable 'lambda'
## monitor set for variable 'P'
## monitor set for variable 'G'
## 30000 updates took 19 s
samplesStats("sigma")  
##           mean    sd MC_error val2.5pc median val97.5pc start sample
## sigma[1] 31.53 5.310  0.25370    23.59  30.61     44.25  3001   7500
## sigma[2] 22.18 2.204  0.08703    17.61  22.23     26.41  3001   7500
samplesStats("lambda")  
##            mean    sd MC_error val2.5pc median val97.5pc start sample
## lambda[1] 122.8 8.485   0.4445    110.4  121.1     144.5  3001   7500
## lambda[2] 200.7 3.047   0.1237    195.0  200.6     206.8  3001   7500
samplesStats("P")  
##        mean      sd MC_error val2.5pc median val97.5pc start sample
## P[1] 0.3975 0.07069 0.003455   0.2898  0.386    0.5789  3001   7500
## P[2] 0.6025 0.07069 0.003455   0.4211  0.614    0.7102  3001   7500
samplesCorrel("sigma[1]", "sigma[2]") # the correlation is negative: one sd larger means the other is smaller
##          sigma.2.
## sigma[1]  -0.6214

Let's visualize the results:

p <- samplesStats("P")$mean[1] 

mu1.hat    <- samplesStats("lambda")$mean[1]
mu2.hat    <- samplesStats("lambda")$mean[2]
sigma1.hat <- samplesStats("sigma")$mean[1]
sigma2.hat <- samplesStats("sigma")$mean[2]

hist(samples, breaks=40, prob=T)
xs <- 0:300
lines(xs,  p  *dnorm(xs,mu1.hat,sigma1.hat), col="red",  lwd=2)
lines(xs,(1-p)*dnorm(xs,mu2.hat,sigma2.hat), col="blue", lwd=2)

prob.cluster.1 <- 2 - samplesStats("G")[,1]
# need to remove the first and last observations until the previous TODO is solved
plot(samples[c(-1,-300)], prob.cluster.1, col = c("blue","red")[1+round(prob.cluster.1)], xlim=c(0,300), 
     xlab="data point", ylab="probability", main="probability of belonging to first cluster", pch=19)

Let’s also use BRugs graphical tools:

# cannot use samplesHistory("*") because of nodes G[]
samplesHistory("lambda[1]", mfrow = c(1, 1))

samplesDensity("lambda[1]", mfrow = c(1, 1)) 

samplesAutoC("lambda[1]", mfrow = c(1, 1), 1)