Refs:
- Statistical Consequences of Fat Tails, Nassim Taleb (all unnamed citations are Taleb’s)
- Dilettante Data Science, David Salazar
- Fooled by Correlation: Common Misinterpretations in Social “Science”, Nassim Taleb
library(magrittr)
library(latex2exp) # use LaTeX expressions in plot labels
library(actuar)    # Pareto distribution

# use case: sampler <- sampler.factory(rnorm, mean=1, sd=1); sampler(100)
sampler.factory <- function(rng=runif, ...) {
  function(n) rng(n, ...)
}

# thin tails: Gaussian with a large standard deviation
thin_rng <- function(n) {
  sampler <- sampler.factory(rnorm, mean=0, sd=20)
  sampler(n)
}

# thick tails: logNormal with a large log-scale standard deviation
thick_rng <- function(n) {
  sampler <- sampler.factory(rlnorm, meanlog=0, sdlog=5)
  sampler(n)
}
Taleb introduces the two imaginary domains of Mediocristan (thin tails) and Extremistan (thick tails).
In Mediocristan, when a sample under consideration gets large, no single observation can really modify the statistical properties. In Extremistan, the tails (the rare events) play a disproportionately large role in determining the properties.
Assume a large deviation X:
- In Mediocristan, the probability of sampling higher than X twice in a row is greater than the probability of sampling higher than 2X once.
- In Extremistan, the probability of sampling higher than 2X once is greater than the probability of sampling higher than X twice in a row.
par(mfrow=c(1,2))
set.seed(131)
xs <- seq(0, 3.45, 0.05)
two.xs <- 2*xs

# thin tails (Gaussian): the ratio of tail probabilities eventually explodes
ratio.thin <- ((1 - pnorm(xs))^2) / (1 - pnorm(two.xs))
plot(xs, ratio.thin, type="l", col="dodgerblue", lwd=2,
     main=TeX('Gaussian: $p(X)^2 / p(2X)$'))

# thick tails (logNormal): the ratio stays below 1
ratio.thick <- ((1 - plnorm(xs, 0, 5))^2) / (1 - plnorm(two.xs, 0, 5))
plot(xs, ratio.thick, type="l", col="dodgerblue", lwd=2,
     main=TeX('logNormal: $p(X)^2 / p(2X)$'))
In Mediocristan, if the sum of two events is a \(4\sigma\) event, the most probable hypothesis is that it is the sum of two \(2\sigma\) events. In Extremistan, the most probable hypothesis is that it is the sum of one small event plus one event near \(4\sigma\).
An example: if the sum of two persons’ heights is 4 meters, the most probable combination is two persons of 2 meters each. If the sum of two persons’ wealth is 40 million, the most probable combination is not two persons with 20 million each, but something like one person with 39.999 million and another with one thousand.
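A quick simulation makes this concrete. The following is a sketch using the thin_rng/thick_rng samplers defined above; the absolute value of the Gaussian sample stands in for a positive quantity, and the thresholds are arbitrary illustrative choices. Conditional on a large sum, we look at the share taken by the larger of the two terms:
# Sketch: conditional on a large sum, how is it split between the two terms?
set.seed(101)
n.pairs <- 1e6
share.of.larger <- function(xs, ys, threshold) {
  big <- (xs + ys) > threshold          # keep only pairs with a large sum
  pmax(xs, ys)[big] / (xs + ys)[big]    # share of the sum taken by the larger term
}
thin.share  <- share.of.larger(abs(thin_rng(n.pairs)), abs(thin_rng(n.pairs)), 60)
thick.share <- share.of.larger(thick_rng(n.pairs), thick_rng(n.pairs), 1e4)
summary(thin.share)   # shares stay moderate: both terms contribute to the sum
summary(thick.share)  # shares close to 1: one term accounts for almost everything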
According to the Law of Large Numbers (LLN), the average of the results obtained from a large number of trials should be close to the expected value, and tends to become closer to the expected value as more trials are performed. [ref]
\[X_i \;\text{iid}, \; E[X_i]=\mu \implies \lim_{n \to +\infty} \frac{1}{n} \sum_{i=1}^n X_i = \mu\]
Distributions from Mediocristan, and many from Extremistan, obey the LLN; the problem is how many samples are needed to converge. The LLN guarantees results with an infinite amount of data, which is never what we have.
The next plots show how, for a Gaussian, even one with large variance, the cumulative mean quickly approaches the expected mean, while for the logNormal it takes much longer. This means the sample from the Gaussian is informative, but the logNormal sample not so much…
set.seed(141)
thins <- thin_rng(2000)
thicks <- thick_rng(1e6)
cum_mean <- function(numbers) {
  x <- seq(1, length(numbers))
  cumsum(numbers) / x
}
par(mfrow=c(1,2))
plot(cum_mean(thins), type="l", ylab="mean", xlab="", col="dodgerblue")
abline(h=0, col="red", lty=2)          # true mean of the Gaussian
plot(cum_mean(thicks), type="l", ylab="mean", xlab="", col="dodgerblue")
abline(h=exp(12.5), col="red", lty=2)  # true mean of logNormal(0,5) = exp(sdlog^2/2)
As Taleb says:
The mean of a distribution will rarely correspond to the sample mean; it will have a persistent small sample effect (downward or upward)
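We can check this persistent small-sample effect directly. The sketch below (the sample size of 50 and the number of runs are arbitrary choices) repeatedly draws small samples from the thick (logNormal) generator and compares the typical sample mean with the true mean:
# Sketch: the sample mean of a thick-tailed sample persistently underestimates the true mean.
set.seed(101)
runs <- 1e4
sample.means <- replicate(runs, mean(thick_rng(50)))
true.mean <- exp(5^2/2)           # mean of logNormal(0,5) = exp(sdlog^2/2)
median(sample.means) / true.mean  # the typical sample mean is a tiny fraction of the true mean
mean(sample.means < true.mean)    # most runs underestimate the true mean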
The central limit theorem (CLT) establishes that, in some situations [finite mean and variance], when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a bell curve) even if the original variables themselves are not normally distributed. [ref]
The next plot shows, for the Gaussian, the density of a single random sample (blue) and the density of a sample where each point is the mean of 30 random samples (yellow). We see how fast the distribution of means approaches its limiting Gaussian.
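A sketch of such a plot (the number of means and the colours are illustrative choices):
# Sketch: density of a single Gaussian sample vs density of means of 30 samples.
set.seed(101)
n <- 1e4
single   <- thin_rng(n)                          # one raw sample
means.30 <- replicate(n, mean(thin_rng(30)))     # each point is the mean of 30 samples
d.single <- density(single)
d.means  <- density(means.30)
plot(d.single, col="dodgerblue", lwd=2, ylim=c(0, max(d.single$y, d.means$y)),
     main="Gaussian: sample vs means of 30", xlab="")
lines(d.means, col="goldenrod", lwd=2)
legend("topright", legend=c("single sample", "means of 30"), lty=1,
       col=c("dodgerblue", "goldenrod"))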
A thick-tailed distribution, by contrast, takes much longer to approach its limiting Gaussian. The next plot shows densities for samples where each point is the mean of 10, 100, 500 and 10000 random samples:
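Again a sketch (the number of runs is an arbitrary choice, and the means are shown on a log scale because they span many orders of magnitude):
# Sketch: densities of sample means of the thick (logNormal) distribution for growing sample sizes.
set.seed(101)
runs  <- 2000
sizes <- c(10, 100, 500, 10000)
cols  <- c("dodgerblue", "goldenrod", "red", "purple")
densities <- lapply(sizes, function(sz) density(log(replicate(runs, mean(thick_rng(sz))))))
plot(NA, xlim=range(sapply(densities, function(d) range(d$x))),
     ylim=c(0, max(sapply(densities, function(d) max(d$y)))),
     main="logNormal: densities of sample means", xlab="log of sample mean", ylab="Density")
for(i in seq_along(sizes))
  lines(densities[[i]], col=cols[i], lwd=2)
legend("topright", legend=paste("means of", sizes), lty=1, col=cols)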
A thick tailed distribution is determined by rare events. A sample from such a distribution will almost never be enough to infer the true parameters of the distribution. The intuition from studying the Gaussian will do us a disservice in this context.
the thicker the tails of the distribution, the more the tail wags the dog, that is, the information resides in the tails and less so in the “body” (the central part) of the distribution. Effectively, for very fat tailed phenomena, all deviations become informationally sterile except for the large ones. The center becomes just noise. This property also explains the slow functioning of the law of large numbers in certain domains as tail deviations, where the information resides, are – by definition - rare. […] It also explains why one should never compare random variables driven by the tails (say, pandemics) to ones driven by the body (say, number of people who drown in their swimming pool).
PCA projects the data onto a lower-dimensional hyperplane such that most of the variance is preserved. These hyperplanes should reflect some structure in the data.
This works well in Mediocristan. With random data that has no structure, given enough observations PCA will show all component variances with roughly the same magnitude:
library(MASS)
library(stats)
library(Matrix)
dim <- 20
small.sample <- mvrnorm(1e2, rep(0,dim), Diagonal(dim, 1))
large.sample <- mvrnorm(1e5, rep(0,dim), Diagonal(dim, 1))
pca.small <- prcomp(small.sample)
pca.large <- prcomp(large.sample)
par(mfrow=c(1,2))
barplot(pca.small$sdev, col="dodgerblue", main="small sample")
barplot(pca.large$sdev, col="dodgerblue", main="large sample")
But what happens in Extremistan?
set.seed(101)
dim <- 20
small.sample <- matrix(rcauchy(1e2*dim, 0, 5), ncol=dim)
large.sample <- matrix(rcauchy(1e5*dim, 0, 5), ncol=dim)
pca.small <- prcomp(small.sample)
pca.large <- prcomp(large.sample)
par(mfrow=c(1,2))
barplot(pca.small$sdev, col="dodgerblue", main="small sample")
barplot(pca.large$sdev, col="dodgerblue", main="large sample")
Now PCA is fooled by the thick-tailed distribution: dimension reduction does not work.
The method of moments (MoM) also fails: higher moments are uninformative or do not exist.
Let’s try MoM with a Normal and a t-Student.
set.seed(121)
n <- 1e3
thin.sample  <- rnorm(n, 3, 2)   # Normal with mean 3 and sd 2
thick.sample <- rt(n, 15, 3)     # non-central t-Student (df=15, ncp=3)
For the Normal:
\[\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i = \overline{X}\] \[\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{X})^2\]
mu.hat <- mean(thin.sample)
sigma.hat <- sd(thin.sample)

n <- 50
xs <- seq(-3, 9, len=n)
plot(xs, dnorm(xs, 3, 2), type='l', lwd=2, col='dodgerblue', ylab='', xlab='')  # true density
curve(dnorm(x, mu.hat, sigma.hat), type='l', lwd=2, col='red', add=T)           # MoM fit
For the t-Student, we need the sample variance and the sample excess kurtosis \(\kappa\):
\[\hat{\nu} = \frac{6}{\kappa} + 4\]
\[\hat{\beta} = \frac{\sigma^2}{\hat{\nu}/(\hat{\nu}-2)}\]
library(e1071)

hat.df <- 4 + 6/kurtosis(thick.sample)               # degrees of freedom (kurtosis() returns excess kurtosis)
hat.beta <- var(thick.sample) / (hat.df/(hat.df-2))  # non-centrality parameter
n <- 50
xs <- seq(-3, 9, len=n)
plot(xs, dt(xs, 15, 3), type='l', lwd=2, col='dodgerblue', ylab='', xlab='')  # true density
curve(dt(x, hat.df, hat.beta), type='l', lwd=2, col='red', add=T)             # MoM fit
Let’s assume we would like to fatten the tails of a given distribution. To do that we need to assign more probabilistic mass to the tails. But since the pdf must integrate to 1, this mass must come from somewhere.
Taleb presents a method to increase the higher moments of a distribution while keeping the lower moments invariant.
We will create a fattened version of a Normal using the following algorithm:
- with probability \(p=1/2\), \(X \sim \mathcal{N}(0,\sigma\sqrt{1-a})\)
- with probability \(1-p=1/2\), \(X \sim \mathcal{N}(0,\sigma\sqrt{1+a})\)
set.seed(101)
n <- 1e4

# mixture of two Normals with variances 1-a and 1+a (here sigma = 1)
fatten <- function(a) {
  p <- runif(n) < 1/2
  xs <- rep(NA, n)
  xs[ p] <- rnorm(sum( p), sd=sqrt(1-a))
  xs[!p] <- rnorm(sum(!p), sd=sqrt(1+a))
  xs
}
d1 <- density(rnorm(n) , n=1e4)
d2 <- density(fatten(0.7), n=1e4)
d3 <- density(fatten(0.9), n=1e4)
plot(NA, xlim=c(-4,4), ylim=c(0,max(d1$y,d2$y,d3$y)),
main="Fattening the Normal", xlab="")
lines(d1, lwd=2, col='red')
lines(d2, lwd=2, col='blue')
lines(d3, lwd=2, col='purple')
legend(3, .8, legend=c('a=0.9', 'a=0.7', 'normal'), pch=15,
col=c('purple', 'blue', 'red'))
By making the variance of the distribution stochastic we find that:
- the tails grow fatter and the decay to zero slows,
- the peak becomes higher,
- intermediate values become less likely.
And now we know: the increased probabilistic mass for the tails comes from the intermediate events.
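We can also check the moment claim numerically. Below is a quick sketch using the fatten() function above (e1071’s kurtosis() reports excess kurtosis): the variance stays at 1 while the kurtosis grows with \(a\).
# Sketch: the variance stays (approximately) invariant, the excess kurtosis grows as 3*a^2.
set.seed(101)
sapply(c(0, 0.7, 0.9), function(a) {
  xs <- fatten(a)
  c(a=a, variance=var(xs), excess.kurtosis=e1071::kurtosis(xs))
})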
There are (at least) two known ways to quantify the dispersion of a random sample:
- Standard Deviation (STD): \(\sqrt{\frac{1}{n} \sum (x_i-\overline{x})^2}\)
- Mean Absolute Deviation (MAD): \(\frac{1}{n} \sum | x_i - \overline{x}|\)
Many people’s intuitions are closer to MAD than to STD despite the almost universal use of STD.
When deciding between estimators, statistics textbooks talk about two asymptotic properties of the estimator: consistency and efficiency. Consistency means that, given lots of observations, the estimator’s distribution is tightly centered around the parameter. Efficiency means that its asymptotic variance is the lowest possible. Here we are concerned with efficiency.
The coefficient of variation is a measure of relative variability, computed as the ratio of the standard deviation to the mean. The Asymptotic Relative Efficiency (ARE) compares two measures of dispersion through their squared coefficients of variation:
\[\text{ARE}(m_1, m_2) = \lim_{n \rightarrow +\infty} \frac{V(m_1)/E[m_1]^2}{V(m_2)/E[m_2]^2}\]
Let’s check \(\text{ARE(STD,MAD)}\) in Mediocristan with a Gaussian:
# Mean Absolute Deviation
MAD <- function(xs) {
  n <- length(xs)
  x_bar <- mean(xs)
  sum(abs(xs - x_bar)) / n
}

# Standard Deviation
STD <- function(xs) {
  sd(xs)
}

# Monte Carlo estimate of ARE(m1, m2): ratio of the squared coefficients of variation
ARE <- function(size, m1, m2, runs, rng, ...) {
  m1s <- rep(NA, runs)
  m2s <- rep(NA, runs)
  for(i in 1:runs) {
    xs <- rng(size, ...)
    m1s[i] <- m1(xs)
    m2s[i] <- m2(xs)
  }
  (var(m1s)/mean(m1s)^2) / (var(m2s)/mean(m2s)^2)
}
set.seed(101)
sample.size <- c(10, 50, 100, 500, 1000, 5000)
results <- rep(NA, length(sample.size))
for(i in 1:length(results))
results[i] <- ARE(sample.size[i], STD, MAD, 1e4, rnorm)
plot(sample.size, results, type='l', ylim=c(0.5,1.5), log='x',
ylab="Asymptotic Relative Efficiency")
points(sample.size, results, pch=19)
abline(h=1, col='black', lty=2)
abline(h=0.875, col='red', lty=2)
The red line is the theoretical value of \(0.875\), which means that STD is about 12% more efficient than MAD under Gaussian observations. This result was found by Fisher in a 1920s dispute with Eddington (who favored MAD), and it settled the matter at the time.
But not everything is Gaussian…
Let’s check Extremistan with a t-Student:
set.seed(101)
sample.size <- c(10, 50, 100, 500, 1000, 5000)
results <- rep(NA, length(sample.size))
for(i in 1:length(results))
results[i] <- ARE(sample.size[i], STD, MAD, 1e4, rt, df=2)
plot(sample.size, results, type='l', log='x', ylab="Asymptotic Relative Efficiency")
points(sample.size, results, pch=19)
abline(h=1, col='black', lty=2)
STD is completely destroyed: for large samples we get an ARE around 250 (!) in favor of MAD, which makes MAD a much more efficient measure of dispersion.
With fatter tails STD is worse than useless: it reports levels of efficiency that simply aren’t there. This happens because every deviation from the mean is squared, so a single large ‘outlier’ carries enormous weight, making STD far too sensitive to such observations.
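A small illustration of that sensitivity (a sketch; the sample size and the value of the outlier are arbitrary choices): add a single large observation to a Gaussian sample and compare how much STD and MAD move.
# Sketch: a single large observation inflates STD far more than MAD.
set.seed(101)
xs <- rnorm(1e3)
outlier <- 50   # arbitrary large value
c(STD.before=STD(xs), STD.after=STD(c(xs, outlier)),
  MAD.before=MAD(xs), MAD.after=MAD(c(xs, outlier)))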
Let’s check the cumulative results for a sample from a t-Student:
library(cumstats)

set.seed(101)
n <- 50000
df <- 2.1                                 # degrees of freedom (the variance barely exists)
samples <- rt(n, df)
cumstd <- sqrt(cumvar(samples))           # cumulative standard deviation
means  <- cummean(samples)
cummad <- cummean(abs(samples - means))   # cumulative mean absolute deviation
plot(1:n, cumstd, type='l', col='red', ylab="value of dispersion", ylim=c(0,8))
points(1:n, cummad, type='l', col='blue')
legend(4e4, 8, legend=c('STD', 'MAD'), lty=1, col=c('red','blue'), cex=.8)
abline(h=sqrt(df/(df-2)), col='red', lty=2)   # theoretical std of the t-Student
In conclusion, in Extremistan MAD is a much better estimator of \(E[|X-\mu|]\) than STD is an estimator of \(\sigma\).
Many statistical phenomena and processes have “infinite variance” (such as the popular Pareto 80/20 rule) but have finite, and sometimes very well behaved, mean deviations. Whenever the mean exists, MAD exists. The reverse (infinite MAD and finite STD) is never true.
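We can see this with a Pareto directly. Below is a sketch using actuar’s rpareto (loaded at the start); a shape of 1.5 gives a finite mean but infinite variance, and the sample size is an arbitrary choice:
# Sketch: Pareto with shape 1.5: finite mean (and MAD), infinite variance.
set.seed(101)
n <- 5e4
samples <- rpareto(n, shape=1.5, scale=1)   # mean = scale/(shape-1) = 2; variance is infinite
cumstd <- sqrt(cumvar(samples))             # never settles
means  <- cummean(samples)
cummad <- cummean(abs(samples - means))     # stabilizes
plot(1:n, cumstd, type='l', col='red', ylab="value of dispersion", xlab="")
points(1:n, cummad, type='l', col='blue')
legend('topleft', legend=c('STD', 'MAD'), lty=1, col=c('red','blue'), cex=.8)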
Correlation has several pitfalls.
Correlation is easily misused when people take non-random sub-samples of the data and expect them to show the same correlation. This mistake is the source of Berkson’s and Simpson’s paradoxes.
A simple example:
set.seed(121)
n <- 1e3
xs <- mvrnorm(n, rep(0,2), matrix(c(1,.75,.75,1), ncol=2))   # bivariate Normal, rho = 0.75
correlation <- round(cor(xs[,1], xs[,2]), 3)
plot(xs[,1], xs[,2], pch=20, col=rgb(0,1,0,0.2),
     main=paste0('Correlation ', correlation))
abline(v=0, h=0, lty=2)
The sample correlation is around \(0.75\), as expected for this jointly Gaussian sample. However, if we take non-random sub-samples, say per quadrant, we find that correlation is subadditive in Mediocristan:
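A sketch of that sub-sampling (the quadrant.cor helper below is not from the original text; it is just one way to split the sample at the axes):
# Sketch: correlation computed on non-random sub-samples (one per quadrant).
quadrant.cor <- function(xs) {
  q1 <- xs[,1] >= 0 & xs[,2] >= 0
  q2 <- xs[,1] <  0 & xs[,2] >= 0
  q3 <- xs[,1] <  0 & xs[,2] <  0
  q4 <- xs[,1] >= 0 & xs[,2] <  0
  sapply(list(q1=q1, q2=q2, q3=q3, q4=q4),
         function(q) round(cor(xs[q,1], xs[q,2]), 3))
}
quadrant.cor(xs)   # every quadrant shows a correlation well below the global 0.75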
In Extremistan for a bivariate t-Student:
library(fMultivar)

set.seed(121)
n <- 1e3
xs <- rt2d(n, rho=0.75, nu=2/3)   # bivariate t-Student with very thick tails
correlation <- round(cor(xs[,1], xs[,2]), 3)
plot(xs[,1], xs[,2], pch=20, col=rgb(0,1,0,0.2),
     main=paste0('Correlation ', correlation), log='xy')   # log-log scale shows only the positive quadrant
abline(v=1, h=1, lty=2)   # log(1)=0
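Applying the same quadrant sub-sampling to the t-Student sample (again just a sketch, reusing the quadrant.cor helper defined above) lets us compare the sub-sample correlations with the Gaussian case:
# Same non-random quadrant sub-sampling, now on the thick-tailed sample.
quadrant.cor(xs)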