We will do some of the resampling with the two buckets model:

# multiplier == 0 represents infinite population (ie, no replacement)
make.bucket1 <- function(universe, multiplier=1) {
  if (multiplier>0)
    universe <- c(replicate(multiplier, universe))
  function(n.sample) {
    sample(universe, n.sample, rep=ifelse(multiplier==0, TRUE, FALSE))

# uses the bucket1 urn to generate a sample of size 'size.sample'
# and applies the given statistic function 
make.bucket2 <- function(bucket1, size.sample, statistic) {
  function(n) {
    replicate(n, bucket1(size.sample) %>% statistic) %>% 

compute.p.value <- function(results, observed.effect, precision=3) {
  # n = #experiences
  n <- length(results)
  # r = #replications at least as extreme as observed effect
  r <- sum(abs(results) >= observed.effect)  
  # compute Monte Carlo p-value with correction (Davison & Hinkley, 1997)
  list(mc.p.value=round((r+1)/(n+1), precision), r=r, n=n)

present_results <- function(results, observed.effect, label="", breaks=50, precision=3) {
  lst <- compute.p.value(results, observed.effect, precision=precision)
  hist(results, breaks=breaks, prob=T, main=label,
       sub=paste0("MC p-value for H0: ", lst$mc.p.value),
       xlab=paste("found", lst$r, "as extreme effects for", lst$n, "replications"))
  abline(v=observed.effect, lty=2, col="red")

Permutation Tests

From Cobb:

How many of us devote much time in our class to the difference between model and reality here? One of the most important things our students should take away from an introductory course is the habit of always asking,was the randomization, and what inferences does it support" How many of us ask our students, with each new data set, to distinguish between random sampling and random assignment, and between the corresponding differences in the kinds of inferences supported – from sample to population if samples have been randomized, and from association to causation if assignment to treatment or control has been randomized, as illustrated

From Ramsey and Shafter, 2002

The null hypothesis is the status quo, where we assume no effect. The alternative hypothesis is the statement that there is a real effect. Usually they are denoted \(H_0, H_A\).

Hypotheses involve a statement about a population parameter, usually denoted \(\theta\). Then we’ll have \(H_0: \theta = \theta_0\) and \(H_A: \theta \text{ op } \theta_0\), where op usually is \(<,>,\neq\).

In order to do this, it is chosen a test statistic, T, which is a numerical function of the data whose value determines the result of the test. When we apply T to the original data \(x\), we get the observed effect, \(t=T(x)\).

A p-value is the probability that chance alone (\(H_0\) being true) could produce a test statistic as extreme as the observed effect, \(P(T\geq t)\).

It is possible to apply lots of different statistical tests to check if \(H_0\) is credible or not (ie, if chance alone produced the effect). An alternative to those tests (like the t-test) is the permutation test.

Consider having a dataset with treatment and control data. For a permutation test, this is all data. It randomly decides where to place each datum (a resampling of the original sample). For that resample it then applies the test statistic and save it. Then it repeats the previous steps as many as we want. As Cobb says: _Reshuffe, redeal, and recompute__.

There is no need for the usual assumptions, no normality is needed.

There must be some care if there are two groups that we sampled from different distributions. Usually permutation tests are robust even in this situation. The major exception is when populations have different variance and are quite dissimilar in size.

We must also assume that observations are independent and exchangeable.

One-sided permutation test

A study was conducted at a bar to check how many drinks and hotwings males and females had. The sample is in dataset Beerwings:

##   ID Hotwings Beer Gender
## 1  1        4   24      F
## 2  2        5    0      F
## 3  3        5   12      F
## 4  4        6   12      F
## 5  5        7   12      F
## 6  6        7   12      F

Let’s consider we want to check if males eat more hotwings than females. We are using the mean as the statistic here. The null hypothesis, then, is the status quo, \(\mu_M = \mu_F\), while the alternative hypothesis is that males eat more than females:

\[H_0: \mu_M = \mu_F; ~ H_A: \mu_M > \mu_F\] This is an eg of a one-sided permutation test.

group.M  <- Beerwings$Hotwings[Beerwings$Gender=="M"] # group 1
group.F  <- Beerwings$Hotwings[Beerwings$Gender=="F"] # group 2

observed.effect <- mean(group.M) - mean(group.F)

The next code executes a permutation test with these hypotheses: <- c(group.M, group.F)

diff.means <- function(indexes) {
  group1 <- 1:length(group.M)
  indexes.group1 <- indexes[ group1]
  indexes.group2 <- indexes[-group1]
  mean([indexes.group1]) - mean([indexes.group2])

bucket1 <- make.bucket1(1:length(, 1)
bucket2 <- make.bucket2(bucket1, length(, diff.means)

results <- bucket2(1e4)
present_results(results, observed.effect, "Difference of Means", breaks=40)

Since the p-value is 0.001 it is evidence that supports the alternative hypothesis, that there is a significant difference of means between the genres in this study.

If these people were selected inside a bar, then we cannot say anything about the population, since the group is not a random sample.

We can easily use any other statistic. The next eg uses just the mean of male drinking: <- function(indexes) {
  group1 <- 1:length(group.M)  # males were placed at the first indexes
  indexes.group1 <- indexes[group1]

bucket2 <- make.bucket2(bucket1, length(,

results <- bucket2(1e4)
present_results(results, mean(group.M), "Difference of Means", breaks=30)

We could choose statistics, like the median or the trimmed mean, to achieve more robust results, if large outliers are in the sample.

Two-sided permutation tests

Above we start assuming that males drink more than females. However this assumption can be wrong. We can just try to find evidence that there is a drink gender difference (females could drink more). So, our hypothesis becomes a two-sided test:

\[H_0: \mu_M = \mu_F; ~ H_A: \mu_M \neq \mu_F\]

observed.effect <- abs(mean(group.M) - mean(group.F))

diff.means <- function(indexes) {
  group1 <- 1:length(group.M)
  indexes.group1 <- indexes[ group1]
  indexes.group2 <- indexes[-group1]
  abs(mean([indexes.group1]) - mean([indexes.group2]))

bucket1 <- make.bucket1(1:length(, 1)
bucket2 <- make.bucket2(bucket1, length(, diff.means)

results <- bucket2(1e4)
present_results(results, observed.effect, "Difference of Means", breaks=20)

Non independent observations

Consider the following results of a diving competition:

##               Name     Country Semifinal  Final
## 1 CHEONG Jun Hoong    Malaysia    325.50 397.50
## 2         SI Yajie       China    382.80 396.00
## 3         REN Qian       China    367.50 391.95
## 4       KIM Mi Rae North Korea    346.00 385.55
## 5       WU Melissa   Australia    318.70 370.20
## 6    KIM Kuk Hyang North Korea    360.85 360.00

We would like to check if the result differences between semifinal and final are due to chance, or is there some effect happening.

This can be handled as a difference of means. However, the data is not independent. There are paired values from the same athlete!

To solve this, we can simply swap the colunms randomly, for each athlete, of each resampling:

group.semi  <- Diving2017$Semifinal <- Diving2017$Final

observed.effect <- ( - group.semi) %>% mean %>% abs

The swap is done by multiplying the row differences with +1 or -1.

n.rows <- length(

diff.means.matched <- function(signals) {
  (( - group.semi) * signals) %>% mean %>% abs

bucket1 <- make.bucket1(c(-1,1), 0)  # generate random +1s/-1s
bucket2 <- make.bucket2(bucket1, n.rows, diff.means.matched)

results <- bucket2(1e4)
present_results(results, observed.effect, "Difference of Means", breaks=40)

So, the p-value is around 26% so chance alone could account for the differences.

Sampling Distributions

The sampling distribution is the probability distribution of the values taken from a given statistic. Since the statistic value depends of the values of a random sample, it can be seen as a random variable that follows a certain distribution.

Many examples are known analytically. For eg, given \(n\) iid normal random vars, the mean statistic also follows a normal sampling distribution:

\[X_i \sim \mathcal{N}(\mu,\sigma^2) \implies \bar{X} \sim \mathcal{N}(\mu,\frac{\sigma^2}{n})\] Let’s check it:

n <- 30
results <- replicate(1e4, rnorm(n) %>% mean) # get 1e3 means

hist(results, breaks=50, main="sampling distribution", prob=T)
curve(dnorm(x,0,1/sqrt(n)), from=-.6, to=.6, col="red", lwd=2, add=T)

Generating RVs from a given distribution

Let \(X_i, i=1\ldots 10\) be a random sample from distribution with pdf \(f(x)=2/x^3, x \geq 1\). Let \(X_{min}\) be the minimum of the sample (itself a random var). What is \(p(X_{min}\leq 1.2)\)?

Analytically, we need the cdf:

\[F(x) = \int_1^x \frac{2}{t^3} dt = 1 - \frac{1}{x^2}\]

Using a theorem that states that the minimum of \(n\) iid random vars (\(X_i \sim f\)) follows a distribution with pdf:

\[f_{min}(x) = n(1-F(x))^{n-1}f(x)\] we compute

\[f_{min} = \frac{20}{x^{21}}, x\geq 1\]

and so,

\[p(X_{min}\leq 1.2) = \int_1^{1.2} \frac{20}{x^{21}} dx = 1-\frac{1}{1.2^{20}} = 0.974\]

To use a sampling strategy, we need to generate random values from \(f\). For that, we use the Probability Integral Transformation that states

\[X \sim f_x \implies F_X(X) \sim \mathcal{U}(0,1)\]

F._1 <- function(y) { 1/sqrt(1-y) } # solving y=1-1/x^2 for x

results <- replicate(5e4, F._1(runif(10,0,1)) %>% min)
(results<=1.2) %>% mean
## [1] 0.97364

Using the Central Limit Theorem

The CLT states that given iid \(X_i, i=1,\ldots n\) with mean \(\mu\) and variance \(\sigma^2\), then for any constant \(z\)

\[lim_{n \rightarrow \infty} P \Big( \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \leq z \Big) = \Phi(z)\] where \(\Phi\) is the cdf of \(\mathcal{N}(0,1)\).

Let’s have iid \(X_1, \ldots, X_{30} \sim \Gamma(5,2)\), a Gamma with shape=5 and rate=2. What’s \(P(\bar{X}>3)\)?

The mean for this gamma is \(5/2\) and its variance is \(5/2^2\). Then using the CLT,

\[P(\bar{X}>3) = P \Big( \frac{\bar{X}-5/2}{\sqrt{5/2^2}/\sqrt{30}} > \frac{3-5/2}{\sqrt{5/2^2}/\sqrt{30}} \Big) \approx P(Z > 2.4495)\] this is:

## [1] 0.007152735

Using simulation (R already has a random generator for gammas):

results <- replicate(5e4, rgamma(30,5,2) %>%  mean)
(results>3) %>%  mean
## [1] 0.00984

Bootstrap and Confidence Intervals

If we do not know the parameters of the population, we need to move from probability to statistics. Herein, we only have the data and a statistic estimated from the data. How to estimate the sampling distribution of that statistic?

The bootstrap says that since the sample approximates the population, resampling from the sample approximates taking many samples from the population. So, the bootstrap distribution of a statistic approximates the sampling distribution of that statistic.

bootstrap <- function(n, original.sample, statistic) {
  size <- length(original.sample)

  replicate(n, sample(original.sample, size, rep=T) %>% statistic)

Let’s check an eg: <- 3    # unknown parameters <- 1.5

original.sample <- rnorm(30,,

results <- bootstrap(1e4, original.sample, mean)

hist(results, breaks=50, prob=T)
#red curve shows the sampling distribution
curve(dnorm(x,,, col="red", lwd=2, add=T)

# get bootstrap confidence interval
quantile(results, probs=c(0.025,0.975))
##     2.5%    97.5% 
## 2.303473 3.332275

THe bootstrap distribution for statistic \(\hat{\theta}\) approximates de spread and shape of the sampling distribution for \(\hat{\theta}\). But the distribution center is the center of the original sample. If the center of the distribution differs from the observed statistic, it indicates bias.

## [1] 2.815898
## [1] 2.817397

Let’s try an eg with a skewed distribution:

n <- 30
real.shape <- 1
real.rate <- .5
original.sample <- rgamma(n, real.shape, real.rate)

results <- bootstrap(1e4, original.sample, mean)

hist(results, breaks=50, prob=T)
#red curve shows the sampling distribution
curve(dgamma(x,n*real.shape,n*real.rate), col="red", lwd=2, add=T)

If we have two samples and wish to, say, compare the difference of means, we need to bootstrap twice, once per sample.

Here’s a dataset with testosterone levels per gender in the practice of skateboarding:

##   Age Experimenter Testosterone
## 1  18       Female        206.0
## 2  18       Female        197.0
## 3  18       Female        135.8
## 4  18       Female        170.2
## 5  19       Female        107.3
## 6  19       Female        351.6

Is the difference of means of testosterone per gender significant?

group.F <- Skateboard[Skateboard$Experimenter=="Female",]$Testosterone
group.M <- Skateboard[Skateboard$Experimenter=="Male",  ]$Testosterone

results.F <- bootstrap(1e4, group.F, mean)
results.M <- bootstrap(1e4, group.M, mean)

results <- results.F - results.M

hist(results, breaks=50, prob=T)
abline(v=mean(group.F)-mean(group.M), col="red", lwd=2)

quantile(results, c(.025,.975))
##      2.5%     97.5% 
##  23.97626 139.10690

Replacing Friedman Rank Sum Test

The Friedman Test is a frequencist test used to detect differences in several different treatments. Example: n wine judges each rate k different wines. Are any of the k wines ranked consistently higher or lower than the others?. To do this, this procedure ranks the same values across treatments, and has connections with ANOVA.

The used statistic for the permutation test will be the F value.

# Computing the F statistic
#  = variance of group means / 
#    mean of within group variances
# ref:
f.stat <- function(df) {
  means <- apply(df, 2, mean)
  vars  <- apply(df, 2, var)
  var(means) / mean(vars)

So, what consists of the resampling procedure? We will compute the ranks for the observed data and find its F value. Then, we will resample these ranks by shuffling, for each row (each datum), the ranks of the different treatments and compute the resampling F values.

rank.test <- function(, size=2500) {
  # function for column-wise shuffling of data frame (row by row)
  permute.rows <- function(df) {
    apply(df, 1, function(row) sample(row))

  observed.ranks   <- apply(, 1, rank) %>% t    <- replicate(size, observed.ranks %>% permute.rows %>% t)
  permuted.effects <- apply(, 3, f.stat)

  list(permuted.effects = permuted.effects,
       observed.effect  = observed.ranks %>% f.stat)

Let’s see it in action:

# create fake data
n <- 100 <- data.frame( gp1 = rnorm(n, 1.00, 0.5)
                           , gp2 = rnorm(n, 1.00, 0.5)
                           , gp3 = rnorm(n, 1.15, 0.5)
                           , gp4 = rnorm(n, 1.00, 0.5)

results <- rank.test( 

present_results(results$permuted.effects, results$observed.effect, "Difference of Ranks")

Here’s the classic Friedman Rank Sum Test to compare its p-value:

f.test <- %>% as.matrix %>% friedman.test
## [1] 0.0181958

Replacing One-Way ANOVA

One-way ANOVA estimates how a quantitative dependent variable changes according to the levels of one of its categorical independent variables.

# data set from <- read.csv("", header = TRUE, 
                      colClasses = c("factor", "factor", "factor", "numeric"))

##  density block  fertilizer     yield      
##  1:48    1:24   1:32       Min.   :175.4  
##  2:48    2:24   2:32       1st Qu.:176.5  
##          3:24   3:32       Median :177.1  
##          4:24              Mean   :177.0  
##                            3rd Qu.:177.4  
##                            Max.   :179.1

In this data set we have three categorical independent variables that might or not be significant to the value of its yield.

The next code will sample the categorical variable while keeping the original values of yield to compute the F statistic (check previous section). We do it lots of time and keep all permuted effects in order to compare with the observed effect.

f.stat <- function(means, vars) {
  var(means) / mean(vars)

# assume data frame has two columns, the first is the independent variable
# the second is the dependent variable
anova.test <- function(df, size=2500) {

  vals <- sort(unique(df[,1]))
  permuted.effects <- replicate(size, 
  { <- data.frame( x = sample(df[,1]),
                                 y = df[,2])  
    means <- sapply(vals, function(i)[$x==i,2] %>% mean)
    vars  <- sapply(vals, function(i)[$x==i,2] %>% var)
    f.stat(means, vars)
  # observed effect
  means <- sapply(vals, function(i) df[df[,1]==i,2] %>% mean)
  vars  <- sapply(vals, function(i) df[df[,1]==i,2] %>% var)
  observed.effect <- f.stat(means, vars)
  list(permuted.effects = permuted.effects,
       observed.effect  = observed.effect)

Let’s check the significance of fertilizer:

df <-[,c(3,4)] # select fertilizer and yield
results <- anova.test(df, 15000)
present_results(results$permuted.effects, results$observed.effect, precision=5)