Refs:

# Introduction

The typical way to fit a distribution is to use function MASS::fitdistr:

library(MASS)
set.seed(101)
my_data <- rnorm(250, mean=1, sd=0.45)      # unkonwn distribution parameters

fit <- fitdistr(my_data, densfun="normal")  # we assume my_data ~ Normal(?,?)
fit
##       mean          sd
##   0.99613836   0.42405694
##  (0.02681972) (0.01896440)
hist(my_data, pch=20, breaks=25, prob=TRUE, main="")
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red", lwd=2, add=T)

fitdistr uses optim to estimate the parameter values by maximizing the likelihood function. So it works like this:

log_likelihood <- function(params) { -sum(dnorm(my_data, params[1], params[2], log=TRUE)) }
fit2 <- optim(c(0,1), log_likelihood)    # c(0,1) are just initial guesses
fit2
## $par ## [1] 0.9962070 0.4240523 ## ##$value
## [1] 140.2628
##
## $counts ## function gradient ## 65 NA ## ##$convergence
## [1] 0
##
## $message ## NULL hist(my_data, pch=20, breaks=25, prob=TRUE) curve(dnorm(x, fit2$par[1],     fit2$par[2]), col="blue", lwd=6, add=T) # optim fit curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red", lwd=2, add=T) # fitdistr fit # Using fitdistrplus This tutorial uses the fitdistrplus package for fitting distributions. library(fitdistrplus) ## Choosing which distribution to fit Let’s use one of the datasets provided by the packge: data("groundbeef", package = "fitdistrplus") my_data <- groundbeef$serving
plot(my_data, pch=20)

We can first plot the empirical density and the histogram to gain insight of the data:

plotdist(my_data, histo = TRUE, demp = TRUE)

Another tool is to show some descriptive statistics, like the moments, to help us in making a decision:

descdist(my_data, discrete=FALSE, boot=500)

## summary statistics
## ------
## min:  10   max:  200
## median:  79
## mean:  73.64567
## estimated sd:  35.88487
## estimated skewness:  0.7352745
## estimated kurtosis:  3.551384

The plot also includes a nonparametric bootstrap procedure for the values of kurtosis and skewness.

## Fitting a distribution

Say, in the previous eg, we chose the weibull, gamma and log-normal to fit:

fit_w  <- fitdist(my_data, "weibull")
fit_g  <- fitdist(my_data, "gamma")
fit_ln <- fitdist(my_data, "lnorm")
summary(fit_ln)
## Fitting of the distribution ' lnorm ' by maximum likelihood
## Parameters :
##          estimate Std. Error
## meanlog 4.1693701 0.03366988
## sdlog   0.5366095 0.02380783
## Loglikelihood:  -1261.319   AIC:  2526.639   BIC:  2533.713
## Correlation matrix:
##         meanlog sdlog
## meanlog       1     0
## sdlog         0     1

we can plot the results:

par(mfrow=c(2,2))
plot.legend <- c("Weibull", "lognormal", "gamma")
denscomp(list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
cdfcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
qqcomp  (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
ppcomp  (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)

The fitting can work with other non-base distribution. It only needs that the correspodent, d, p, q functions are implemented.

In the next eg, the endosulfan dataset cannot be properly fit by the basic distributions like the log-normal:

data("endosulfan", package = "fitdistrplus")
my_data <- endosulfan\$ATV

fit_ln <- fitdist(my_data, "lnorm")
cdfcomp(fit_ln, xlogscale = TRUE, ylogscale = TRUE)

To solve this it is used the Burr and Pareto distributions available at package actuar

library(actuar)

fit_ll <- fitdist(my_data, "llogis", start = list(shape = 1, scale = 500))
fit_P  <- fitdist(my_data, "pareto", start = list(shape = 1, scale = 500))
fit_B  <- fitdist(my_data, "burr",   start = list(shape1 = 0.3, shape2 = 1, rate = 1))
cdfcomp(list(fit_ln, fit_ll, fit_P, fit_B), xlogscale = TRUE, ylogscale = TRUE,
legendtext = c("lognormal", "loglogistic", "Pareto", "Burr"), lwd=2)

The package also provides some goodness-of-it statistics:

gofstat(list(fit_ln, fit_ll, fit_P, fit_B), fitnames = c("lnorm", "llogis", "Pareto", "Burr"))
## Goodness-of-fit statistics
##                                  lnorm    llogis     Pareto       Burr
## Kolmogorov-Smirnov statistic 0.1672498 0.1195888 0.08488002 0.06154925
## Cramer-von Mises statistic   0.6373593 0.3827449 0.13926498 0.06803071
## Anderson-Darling statistic   3.4721179 2.8315975 0.89206283 0.52393018
##
## Goodness-of-fit criteria
##                                   lnorm   llogis   Pareto     Burr
## Aikake's Information Criterion 1068.810 1069.246 1048.112 1045.731
## Bayesian Information Criterion 1074.099 1074.535 1053.400 1053.664

Visually and using the previous statistics, it seems that the Burr distribution seems the preferred one among the candidates we chose to explore.

## Parameter estimates

We can apply a bootstrap to estimate the uncertainty in the parameters:

ests <- bootdist(fit_B, niter = 1e3)
summary(ests)
## Parametric bootstrap medians and 95% percentile CI
##           Median       2.5%    97.5%
## shape1 0.2024304 0.09504934 0.377547
## shape2 1.5802731 1.01204175 2.959759
## rate   1.4655764 0.65204235 2.702179
##
## The estimation method converged only for 995 among 1000 iterations
plot(ests)

quantile(ests, probs=.05) # 95% percentile bootstrap confidence interval
## (original) estimated quantiles for each specified probability (non-censored data)
##             p=0.05
## estimate 0.2939259
## Median of bootstrap estimates
##             p=0.05
## estimate 0.3077854
##
## two-sided 95 % CI of each quantile
##           p=0.05
## 2.5 %  0.1707940
## 97.5 % 0.4960287
##
## The estimation method converged only for 995 among 1000 bootstrap iterations.