This webpage reproduces the Bayesian solution for the Lincoln index given by Allen Downey.

The problem is:

Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn’t very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There’s no way to know with one tester. But if you have two testers, you can get a good idea, even if you don’t know how skilled the testers are.

I follow Downey’s notation:

Downey provides the likelihood function for the data \(\mathcal{D} = (k_1, k_2, c)\) given the parameters \(\theta = (n, p_1, p_2)\):

\[p(\mathcal{D} | \theta) = p(k_1, k_2, c | n, p_1, p_2) = p(k_1 | n, p_1, p_2) \times p(k_2, c | k_1, n, p_1, p_2) = {n \choose k_1} p_1^{k_1} (1-p_1)^{n-k_1} \times {k_1 \choose c} {n-k_1 \choose k_2-c} p_2^{k_2} (1-p_2)^{n-k_2}\]

The prior for the parameters \(p_1\) and \(p_2\) is the uniform distribution between \(0\) and \(1\), while the prior for the total number of bugs \(n\) is a uniform distribution that starts at \(k_1+k_2-c\) (the minimum possible number of bugs) and ends at a large enough value, say 350.
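To make the likelihood concrete, here it is written directly in R (just a sketch for reference; the samplers below do not use it, and `log.lik` is my own name):

log.lik <- function(n, p1, p2, k1, k2, c) {
  # log p(k1 | n, p1): tester 1 finds k1 of the n bugs
  lchoose(n, k1) + k1*log(p1) + (n-k1)*log(1-p1) +
  # log p(k2, c | k1, n, p2): tester 2 finds c of tester 1's k1 bugs and k2-c of the remaining n-k1
  lchoose(k1, c) + lchoose(n-k1, k2-c) + k2*log(p2) + (n-k2)*log(1-p2)
}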

Herein I’ll use BUGS to find a solution.

First, my BRugs ‘boilerplate’ code:

library(BRugs)
## Welcome to BRugs connected to OpenBUGS version 3.2.3
run.model <- function(model, samples, data=list(), chainLength=10000, burnin=0.10, init.func, n.chains=1) {
  
  writeLines(model, con="model.txt")  # Write the modelString to a file
  modelCheck( "model.txt" )           # Send the model to BUGS, which checks the model syntax
  if (length(data)>0)                 # If there's any data available...
    modelData(bugsData(data))         # ... BRugs puts it into a file and ships it to BUGS
  modelCompile(n.chains)              # BRugs command tells BUGS to compile the model
  
  if (missing(init.func)) {
    modelGenInits()                   # BRugs command tells BUGS to randomly initialize a chain
  } else {
    for (chain in 1:n.chains) {       # otherwise use user's init data
      modelInits(bugsInits(init.func))
    }
  }
    
  modelUpdate(chainLength*burnin)     # Burn-in period to be discarded
  samplesSet(samples)                 # BRugs tells BUGS to keep a record of the sampled values
  modelUpdate(chainLength)            # BRugs command tells BUGS to draw the samples
}

So let’s use BUGS to describe this model. Since the likelihood is not a standard distribution, I used the ‘zeros trick’: an observed zero modeled as Poisson(\(\phi\)) contributes \(e^{-\phi}\) to the joint density, so setting \(\phi = -\log L + C\), with \(C\) large enough to keep \(\phi\) positive, injects an arbitrary likelihood \(L\):

modelString = "
  model {

      # Likelihood function 

      phi <- -log(choose1 * binom1 * choose2 * choose3 * binom2) + CZERO
      dummy <- 0
      dummy ~ dpois( phi )
      CZERO <- 1000000    # large constant that keeps phi positive (zeros trick)

      # compute binomial coefficients
      # cf. http://stats.stackexchange.com/questions/62418/binomial-coefficient-in-jags
      choose1 <- exp( loggam(n+1) - (loggam(k1+1) + loggam(n-k1+1)) )           # choose(n,k1)
      binom1  <- pow(p1, k1) * pow(1-p1, n-k1)

      choose2 <- exp( loggam(k1+1) - (loggam(c+1) + loggam(k1-c+1)) )           # choose(k1,c)
      choose3 <- exp( loggam(n-k1+1) - (loggam(k2-c+1) + loggam(n-k1-k2+c+1)) ) # choose(n-k1,k2-c)
      binom2  <- pow(p2, k2) * pow(1-p2, n-k2)

      # Priors 

      p1 ~ dunif(0,1)
      p2 ~ dunif(0,1)
      n  ~ dunif(m1,m2)         # uniform prior
      #n ~ dexp(0.001)I(m1,m2)  # truncated exponential prior (another option for the prior of n)

      # Some needed constants

      m1 <- k1+k2-c    # the minimum possible number of errors
      m2 <- 350        # we don't know the max, but let's allow some breathing space
  }
"

Let’s say that tester 1 found 20 bugs, tester 2 found 15 bugs, and 3 bugs are common to both, i.e., \(k_1 = 20, k_2 = 15, c = 3\).
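For reference, the classical Lincoln index point estimate for these numbers is \(\hat{n} = k_1 k_2 / c\) (it will also be used below to initialize the chain):

20 * 15 / 3  # classical Lincoln index estimate of n
## [1] 100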

We need to include this data and define some other values for BUGS to run:

# data
k1 <- 20 # the input from the problem
k2 <- 15
c  <- 3

# initializations (with some more boilerplate)
genInitFactory <- function()  {
  i <- 0
  function() {
    i <<- i + 1                     # chain counter (could be used to vary inits per chain)
    list( 
      p1 = 0.5,
      p2 = 0.5,
      n  = k1*k2/c # let's start with the Lincoln index estimate
    ) 
  }
}

# Everything is ready. Run the model!
run.model(modelString, samples=c("n","p1","p2"), data=list(k1=k1,k2=k2,c=c), chainLength=2e5, init.func=genInitFactory())
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## model is initialized
## 20000 updates took 0 s
## monitor set for variable 'n'
## monitor set for variable 'p1'
## monitor set for variable 'p2'
## 200000 updates took 5 s

Let’s visualize some stats:

# Get stats from the MCMC run
stats <- samplesStats(c("n", "p1", "p2"))
stats
##        mean       sd  MC_error val2.5pc  median val97.5pc start sample
## n  105.2000 49.14000 0.4483000 49.14000 91.7300  241.2000 20001 200000
## p1   0.2302  0.09645 0.0007866  0.07860  0.2189    0.4457 20001 200000
## p2   0.1756  0.07695 0.0006006  0.05776  0.1654    0.3516 20001 200000
n.chain  <- samplesSample( "n" )    # Extract chain values for number of bugs
# Show the posterior distribution for n
hist(n.chain, breaks=80, prob=TRUE, xlab="Number of bugs", main="Posterior p(n|D)") 
dst <- density(n.chain)
map.n.bugs <- dst$x[which.max(dst$y)]    # get the MAP from the estimated density
map.n.bugs
## [1] 72.69129
lines(dst, col="red", lwd=2)
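# The 95% credible interval can also be read straight off the chain; this
# should reproduce the val2.5pc/val97.5pc columns above (up to MC error)
quantile(n.chain, probs=c(0.025, 0.5, 0.975))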

# Now for the testers' abilities
p1.chain <- samplesSample( "p1" )  
dst.p1 <- density(p1.chain)
p2.chain <- samplesSample( "p2" )   
dst.p2 <- density(p2.chain)

par(mfrow=c(1,2))
hist(p1.chain, breaks=80, prob=TRUE, xlab="p1", main="Posterior p(p1|D)") 
lines(dst.p1, col="red", lwd=2)
hist(p2.chain, breaks=80, prob=TRUE, xlab="p2", main="Posterior p(p2|D)") 
lines(dst.p2, col="red", lwd=2)

map.p1 <- dst.p1$x[which.max(dst.p1$y)]
map.p1
## [1] 0.1974846
map.p2 <- dst.p2$x[which.max(dst.p2$y)]
map.p2
## [1] 0.1408324
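The marginal histograms hide the strong posterior dependence between \(n\) and the testers’ abilities: a larger \(n\) needs smaller \(p_1\) and \(p_2\) to explain the same counts. A quick check from the chains (the correlations should come out clearly negative):

cor(n.chain, p1.chain)
cor(n.chain, p2.chain)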

Stan Solution

Next is the same problem coded in Stan:

library(rstan)

model <- '
  functions {
     real lincoln_log(vector D, real n, real p1, real p2) {

        real k1; real k2; real c; 
        real choose1; real binom1; real binom2; real choose2; real choose3;

        k1 <- D[1]; k2 <- D[2]; c <- D[3];
        choose1 <- exp( lgamma(n+1)    - (lgamma(k1+1)   + lgamma(n-k1+1)) );
        binom1  <- pow(p1, k1) * pow(1-p1, n-k1);
        choose2 <- exp( lgamma(k1+1)   - (lgamma(c+1)    + lgamma(k1-c+1)) );           
        choose3 <- exp( lgamma(n-k1+1) - (lgamma(k2-c+1) + lgamma(n-k1-k2+c+1)) );
        binom2  <- pow(p2, k2) * pow(1-p2, n-k2);

        return log( choose1 * binom1 * choose2 * choose3 * binom2 );
     }
  }

  data {
    int<lower=0>  k1;   // bugs found by tester 1
    int<lower=0>  k2;   // bugs found by tester 2
    int<lower=0>  c;    // number of common bugs 
  }

  transformed data {
    int<lower=0>  m1;   // lower bound for n
    int<lower=0>  m2;   // upper bound for n
    vector[3] D;

    m1 <- k1+k2-c; 
    m2 <- 350;
    D[1] <- k1; D[2] <- k2; D[3] <- c;
  }

  parameters {
    real<lower=m1, upper=m2> n;   // the number of total bugs
    real<lower=0, upper=1>  p1;   // performance of tester 1
    real<lower=0, upper=1>  p2;   // performance of tester 2
  }

  model {
    n  ~ uniform(m1,m2);   // priors
    p1 ~ uniform(0,1);
    p2 ~ uniform(0,1);

    D ~ lincoln(n,p1,p2);  // likelihood, equivalent to: increment_log_prob(lincoln_log(D,n,p1,p2));  
  }
'

fit  <- stan(model_code = model, data = list(k1=20, k2=15, c=3), iter = 1000,  chains = 4, verbose = FALSE)
fit2 <- stan(fit = fit,          data = list(k1=20, k2=15, c=3), iter = 50000, chains = 4, verbose = FALSE, seed=101, warmup=5000)
print(fit2)
## Inference for Stan model: 45069a74b8fc8985e8d40ac503d4d878.
## 4 chains, each with iter=50000; warmup=5000; thin=1; 
## post-warmup draws per chain=45000, total post-warmup draws=180000.
## 
##        mean se_mean    sd   2.5%   25%   50%    75%  97.5% n_eff Rhat
## n    105.36    0.32 49.98  48.98 70.68 91.67 124.39 245.61 24898    1
## p1     0.23    0.00  0.10   0.08  0.16  0.22   0.29   0.45 30916    1
## p2     0.18    0.00  0.08   0.06  0.12  0.17   0.22   0.35 31480    1
## lp__  -7.27    0.01  1.32 -10.73 -7.87 -6.93  -6.31  -5.77 36432    1
## 
## Samples were drawn using NUTS(diag_e) at Tue May 24 15:05:14 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
la <- extract(fit2, permuted = TRUE) 
hist(as.vector(la$n), breaks=100, prob=TRUE, xlab="Number of bugs", main="Posterior p(n|D)")
dst <- density(la$n)
lines(dst, col="red", lwd=2)

stan.n.bugs <- dst$x[which.max(dst$y)]    # get the MAP from the estimated density
stan.n.bugs
## [1] 71.41219

Stan’s MAP estimate matches the BUGS estimate! Let’s check both against the analytic solution:

Analytic Solution

Assuming uniform priors, the marginal posterior distribution of \(n\) is proportional to

\[p(n | \mathcal{D}) \propto \int_0^1 \int_0^1 {n \choose k_1} p_1^{k_1} (1-p_1)^{n-k_1} \times {k_1 \choose c} {n-k_1 \choose k_2-c} p_2^{k_2} (1-p_2)^{n-k_2} ~ dp_1 dp_2\]

Using Mathematica to evaluate this integral:

Integrate[Binomial[n, k1]*p1^k1*(1 - p1)^(n - k1)*Binomial[k1, c]*Binomial[n - k1, k2 - c]*p2^k2*(1 - p2)^(n - k2), {p1,0,1}, {p2,0,1}]

we get:

\[p(n | \mathcal{D}) \propto \frac{ {k_1 \choose c} {n \choose k_1} {n-k_1 \choose k_2-c} \Gamma(k_1+1) \Gamma(k_2+1) \Gamma(n+1-k_1) \Gamma(n+1-k_2)}{\Gamma(n+2)^2}\]

Replacing the data with our specific values, ie, \(k_1=20, k_2=15, c=3\):

\[p(n | \mathcal{D}) \propto \frac{ {n \choose 20} {n-20 \choose 12} \Gamma(n-19) \Gamma(n-14)}{\Gamma(n+2)^2}\]

# pre: n > 19 (the posterior support effectively starts at k1+k2-c = 32)
# computed on the log scale, exponentiating at the end, to prevent overflow
f <- function (n) {
  exp(log(choose(n,20)) + log(choose(n-20,12)) + lgamma(n-19) + lgamma(n-14) - 2*lgamma(n+2))
}

mode.f <- optimize(f, lower=32, upper=350, maximum=TRUE)$maximum
mode.f
## [1] 72.18495
n <- 32:350; plot(n,f(n),type="l", xlim=c(min(n),max(n)), yaxt="n", ylab="p(n|D)")
segments(mode.f, 0, mode.f, f(mode.f), col="red", lty=2)
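To wrap up, here is a quick side-by-side of the three MAP estimates computed above, plus the analytic posterior mean obtained by normalizing \(f\) over the prior support (it should land close to the MCMC means of about 105 reported earlier):

# the three MAP estimates of n obtained above
c(BUGS=map.n.bugs, Stan=stan.n.bugs, analytic=mode.f)

# normalizing f over the prior support yields the analytic posterior pmf and its mean
ns   <- 32:350
post <- f(ns) / sum(f(ns))
sum(ns * post)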