Refs:

Introduction

The typical way to fit a distribution is to use function MASS::fitdistr:

library(MASS)
set.seed(101)
my_data <- rnorm(250, mean=1, sd=0.45)      # unkonwn distribution parameters

fit <- fitdistr(my_data, densfun="normal")  # we assume my_data ~ Normal(?,?)
fit
##       mean          sd    
##   0.99613836   0.42405694 
##  (0.02681972) (0.01896440)
hist(my_data, pch=20, breaks=25, prob=TRUE, main="")
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red", lwd=2, add=T)

fitdistr uses optim to estimate the parameter values by maximizing the likelihood function. So it works like this:

log_likelihood <- function(params) { -sum(dnorm(my_data, params[1], params[2], log=TRUE)) }
fit2 <- optim(c(0,1), log_likelihood)    # c(0,1) are just initial guesses
fit2
## $par
## [1] 0.9962070 0.4240523
## 
## $value
## [1] 140.2628
## 
## $counts
## function gradient 
##       65       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
hist(my_data, pch=20, breaks=25, prob=TRUE)
curve(dnorm(x, fit2$par[1],     fit2$par[2]),     col="blue", lwd=6, add=T) # optim fit
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red",  lwd=2, add=T) # fitdistr fit

Using fitdistrplus

This tutorial uses the fitdistrplus package for fitting distributions.

library(fitdistrplus)

Choosing which distribution to fit

Let’s use one of the datasets provided by the packge:

data("groundbeef", package = "fitdistrplus")
my_data <- groundbeef$serving
plot(my_data, pch=20)

We can first plot the empirical density and the histogram to gain insight of the data:

plotdist(my_data, histo = TRUE, demp = TRUE)

Another tool is to show some descriptive statistics, like the moments, to help us in making a decision:

descdist(my_data, discrete=FALSE, boot=500)

## summary statistics
## ------
## min:  10   max:  200 
## median:  79 
## mean:  73.64567 
## estimated sd:  35.88487 
## estimated skewness:  0.7352745 
## estimated kurtosis:  3.551384

The plot also includes a nonparametric bootstrap procedure for the values of kurtosis and skewness.

Fitting a distribution

Say, in the previous eg, we chose the weibull, gamma and log-normal to fit:

fit_w  <- fitdist(my_data, "weibull")
fit_g  <- fitdist(my_data, "gamma")
fit_ln <- fitdist(my_data, "lnorm")
summary(fit_ln)
## Fitting of the distribution ' lnorm ' by maximum likelihood 
## Parameters : 
##          estimate Std. Error
## meanlog 4.1693701 0.03366988
## sdlog   0.5366095 0.02380783
## Loglikelihood:  -1261.319   AIC:  2526.639   BIC:  2533.713 
## Correlation matrix:
##         meanlog sdlog
## meanlog       1     0
## sdlog         0     1

we can plot the results:

par(mfrow=c(2,2))
plot.legend <- c("Weibull", "lognormal", "gamma")
denscomp(list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
cdfcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
qqcomp  (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
ppcomp  (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)

The fitting can work with other non-base distribution. It only needs that the correspodent, d, p, q functions are implemented.

In the next eg, the endosulfan dataset cannot be properly fit by the basic distributions like the log-normal:

data("endosulfan", package = "fitdistrplus")
my_data <- endosulfan$ATV

fit_ln <- fitdist(my_data, "lnorm")
cdfcomp(fit_ln, xlogscale = TRUE, ylogscale = TRUE)

To solve this it is used the Burr and Pareto distributions available at package actuar

library(actuar)

fit_ll <- fitdist(my_data, "llogis", start = list(shape = 1, scale = 500))
fit_P  <- fitdist(my_data, "pareto", start = list(shape = 1, scale = 500))
fit_B  <- fitdist(my_data, "burr",   start = list(shape1 = 0.3, shape2 = 1, rate = 1))
cdfcomp(list(fit_ln, fit_ll, fit_P, fit_B), xlogscale = TRUE, ylogscale = TRUE,
        legendtext = c("lognormal", "loglogistic", "Pareto", "Burr"), lwd=2)

The package also provides some goodness-of-it statistics:

gofstat(list(fit_ln, fit_ll, fit_P, fit_B), fitnames = c("lnorm", "llogis", "Pareto", "Burr"))
## Goodness-of-fit statistics
##                                  lnorm    llogis     Pareto       Burr
## Kolmogorov-Smirnov statistic 0.1672498 0.1195888 0.08488002 0.06154925
## Cramer-von Mises statistic   0.6373593 0.3827449 0.13926498 0.06803071
## Anderson-Darling statistic   3.4721179 2.8315975 0.89206283 0.52393018
## 
## Goodness-of-fit criteria
##                                   lnorm   llogis   Pareto     Burr
## Aikake's Information Criterion 1068.810 1069.246 1048.112 1045.731
## Bayesian Information Criterion 1074.099 1074.535 1053.400 1053.664

Visually and using the previous statistics, it seems that the Burr distribution seems the preferred one among the candidates we chose to explore.

Parameter estimates

We can apply a bootstrap to estimate the uncertainty in the parameters:

ests <- bootdist(fit_B, niter = 1e3)
summary(ests)
## Parametric bootstrap medians and 95% percentile CI 
##           Median       2.5%    97.5%
## shape1 0.2024304 0.09504934 0.377547
## shape2 1.5802731 1.01204175 2.959759
## rate   1.4655764 0.65204235 2.702179
## 
## The estimation method converged only for 995 among 1000 iterations
plot(ests)

quantile(ests, probs=.05) # 95% percentile bootstrap confidence interval
## (original) estimated quantiles for each specified probability (non-censored data)
##             p=0.05
## estimate 0.2939259
## Median of bootstrap estimates
##             p=0.05
## estimate 0.3077854
## 
## two-sided 95 % CI of each quantile
##           p=0.05
## 2.5 %  0.1707940
## 97.5 % 0.4960287
## 
## The estimation method converged only for 995 among 1000 bootstrap iterations.