Fitting Probability Distributions in R

In [200]:
library(fitdistrplus)
library(ggplot2)

1. Introduction

In this notebook we are going to look at how to fit distributions to data in R. Here we examine continuous distributions, but analogous methods work in the discrete setting. Our goal is to show how to do these tasks manually first, and then show how the process can be done quickly and easily using the fitdistrplus package.

Before we begin, let's describe the problem of fitting data to a distribution. Suppose we are given a sample collection of numbers $X = \{ x_1, x_2, ..., x_n\}$, and we want to generate more values from the same population. We may not have access to more data, but if we knew the underlying distribution we could generate pseudorandom values from it. Thus the problem is to find the distribution that most likely produced the sample $X$.

Our first step is to show how data can be fit to a distribution. Next, we show common methods for visualizing how well a distribution fits a sample of data. After this, we look at numerical values that can help decide which distribution fits best. Finally, we show how the fitdistrplus package lets us do all of this work quickly and painlessly. We end with an appendix on common continuous distributions.

2. Fitting Data to a Distribution

In this section, we will examine methods to fit a dataset to a distribution. Our sample dataset will be 60 values drawn from a logistic distribution with location parameter 20 and scale parameter 2. We will try to fit this data to a normal distribution.

In [201]:
sample <- rlogis(60, location = 20, scale = 2)
hist(sample)

Fitting Distributions with Maximum Likelihood Estimation

One method for fitting a distribution is to find the parameters that maximize the likelihood function. Recall that the likelihood of the parameters $\theta = (\theta_1, ..., \theta_k)$ is defined to be the probability of drawing the values $\{x_1, \ldots, x_n\}$ given that the parameters have value $\theta$:

${\displaystyle {\mathcal {L}}(\theta \, ; x_1, \ldots, x_n)=P(x_1, \ldots, x_n \mid \theta)}$

If this function is differentiable, we can find the maximum by setting the derivative equal to zero. The log-likelihood function, ${\displaystyle \ell (\theta \,;x)=\ln {\mathcal {L}}(\theta \,;x)}$, is often more amenable to calculation. Since the log function is strictly increasing, maximizing the likelihood and the log-likelihood are equivalent.

In the present case, we are estimating $\theta = (\mu, \sigma^2)$, the mean and variance of our normal distribution. In this instance we have

$\displaystyle{P(x_1, \ldots, x_n \mid \mu, \sigma^2) = \prod _{i=1}^{n}f(x_{i}\mid \mu ,\sigma ^{2})=\left({\frac {1}{2\pi \sigma ^{2}}}\right)^{n/2}\exp \left({\frac {\displaystyle{-\sum _{i=1}^{n}}(x_{i}-\mu )^{2}}{2\sigma ^{2}}}\right)}$

If we take the log of this function, we find the log-likelihood to be

$\displaystyle{\ell(\mu, \sigma^2 \, ; x) = -{\frac {\,n\,}{2}}\log(2\pi \sigma ^{2})-{\frac {1}{2\sigma ^{2}}}\sum _{i=1}^{n}(\,x_{i}-\mu \,)^{2}}$

Taking partial derivatives with respect to $\mu$ and $\sigma^2$ and setting the resulting equations equal to zero, we find that the sample mean $\hat{\mu} = \overline{x}$ and the "unadjusted sample variance" $\hat\sigma^2 = \frac{1}{n}\sum_1^n(x_i - \overline{x})^2$ are the values that maximize the likelihood function.

In [202]:
n <- length(sample)
MLEmean <- sum(sample)/n
MLEsigma <- sqrt(sum((sample - MLEmean)^2)/n)
print(c(MLEmean, MLEsigma))
[1] 19.696458  3.381263

In this case the maximum likelihood estimates had a simple closed form, but that is often not the case. Other methods then have to be employed to maximize the function, including numerical techniques like Newton's method or gradient descent.
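For illustration, here is a minimal sketch of such a numerical fit (the names negloglik and numfit are just illustrative): we minimize the negative log-likelihood of the normal model with R's general-purpose optimizer optim().

negloglik <- function(par) {
  mu <- par[1]
  sigma <- par[2]
  if (sigma <= 0) return(Inf)  # keep the search away from invalid standard deviations
  -sum(dnorm(sample, mean = mu, sd = sigma, log = TRUE))
}

# start from rough guesses and let the default Nelder-Mead search do the work
numfit <- optim(par = c(mean(sample), sd(sample)), fn = negloglik)
numfit$par  # should be very close to MLEmean and MLEsigma above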

Fitting Distributions with Method of Moments

Another way to fit a distribution to a set of data is the method of moments. Recall the $k^\text{th}$ moment of a random variable $X$ is defined to be $E(X^k)$. The idea behind the method of moments is to determine the parameters by forcing the moments of the proposed distribution to equal the moments of the sample. So $E(X)$ must equal $\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i$, $E(X^2)$ must equal $\frac{1}{n}\sum_{i=1}^n x_i^2$, and so on.

Thus, if we are fitting the parameters $\mu$ and $\sigma^2$ of a normal distribution $X$, we must have $E(X) = \mu = \overline{x}$, and $E(X^2) = \sigma^2 + E(X)^2$. Equating this with the second moment of the sample, $\frac{1}{n}\sum_{i=1}^nx_i^2$, we calculate $\sigma^2 = \frac{1}{n}\sum_{i=1}^nx_i^2 - \mu^2$.

In [203]:
MOMmean <- sum(sample)/n
MOMsigma <- sqrt(sum((sample)^2)/n - MOMmean^2)
print(c(MOMmean, MOMsigma))
[1] 19.696458  3.381263

We got the same values! It's a fun exercise to show that the MLE and MOM estimators for variance are the same for normal distributions. In general these fits can be different, though.
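(Hint for that exercise: expand the square in the MLE expression for the variance,

$\displaystyle{\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n(x_i - \overline{x})^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 - 2\,\overline{x}\cdot\frac{1}{n}\sum_{i=1}^n x_i + \overline{x}^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 - \overline{x}^2,}$

which is exactly the method of moments expression for $\sigma^2$.)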

Fitting Distributions with Quantile Matching

This method is very similar to the method of moments, but instead of setting the moments of the proposed distribution equal to the moments of the sample, we demand that chosen quantiles of the proposed distribution equal the corresponding quantiles of the sample.

Again, we illustrate with the normal distribution. We want to write $\mu$ and $\sigma$ in terms of the quantiles. The median of a normal distribution is equal to its mean, so we can set $q_2 = \mu$, where $q_2$ is the sample median. Using a Z-score table, we find the $25^\text{th}$ percentile of a standard normal distribution is approximately -.67. Thus $\frac{(q_1- \mu)}{\sigma} = -.67$, where $q_1$ is the first quartile of our sample. Using our value for $\mu$, we can then solve for $\sigma$.

In [204]:
q1 <- as.numeric(quantile(sample)[2])
q2 <- as.numeric(quantile(sample)[3])

QMmean <- q2
QMsigma <- (q1 - QMmean)/-.67

print(c(QMmean, QMsigma))
[1] 20.15311  2.72167

There are a couple of things to notice here. First, the values are not the same as those found using MLE and MOM. Second, the values depend on which quantiles we choose to match. For instance, suppose we had matched the third quartile instead. Note that the third quartile of a standard normal distribution is approximately .67 (positive this time).

In [205]:
q3 <- as.numeric(quantile(sample)[4])
q2 <- as.numeric(quantile(sample)[3])

QMmean <- q2
QMsigma <- (q3 - QMmean)/.67

print(c(QMmean, QMsigma))
[1] 20.153114  2.822198

Our estimate for the standard deviation of our fit changed pretty drastically. In general, this method can be good if you want to make sure the fit is very good in certain quantiles, but the tradeoff is that it may perform much worse in other places.

3. Visualizing Fits

In this section we will look at four different plots that can be useful in assessing how well a distribution fits a dataset: the Probability-Probability plot, the Quantile-Quantile plot, the CDF comparison plot, and the density comparison plot. We will use our example above to illustrate each of these graphs.

Probability-Probability Plot

This plot (also called the P-P plot) shows how the probabilities of our empirical distribution compare with the probabilities of the distribution that we fit. It is drawn with the theoretical probabilities along the x-axis and the empirical probabilities along the y-axis.

In [206]:
F_60 <- ecdf(sample)

y <- F_60(sample)

mean <- MLEmean
sd <- MLEsigma
n <- 60

x <- pnorm(sample, mean = mean, sd = sd)

plot(x, y, main = 'P-P plot', xlab = 'Theoretical Probabilities', ylab = 'Empirical Probabilities')
lines(x,x)

Quantile-Quantile Plot

This plot (also called the Q-Q plot) is very similar to the P-P plot. We graph the theoretical quantiles of our proposed distribution on the x-axis, and the empirical quantiles on the y-axis.

In [207]:
probs <- seq(.01, .99, .01)
y <- quantile(sample, probs)

mean <- MLEmean
sd <- MLEsigma

x <- qnorm(probs, mean = mean, sd = sd)

plot(x, y, main = 'Q-Q Plot', xlab = 'Theoretical Quantiles', ylab = 'Empirical Quantiles')
lines(x,x)

CDF Comparison Plot

This plot simply superimposes the CDF of our proposed distribution onto the CDF of our empirical distribution.

In [208]:
F_60 <- ecdf(sample)

mean = MLEmean
sd = MLEsigma
n = 60

F <- function(x){
    pnorm(x, mean = mean, sd = sd)
}

plot(F_60, main = "CDF Comparison Plot", ylab = 'CDF', xlab = 'Data', xlim = c(10,30))
par(new = T)
curve(pnorm(x, mean = MLEmean, sd = MLEsigma), 10, 30, main = '', ylab ='', yaxt = 'n', xlab = '', xlim = c(10,30), col = 'red')

Density Comparison Plot

This is the last plot we talk about. It superimposes the density function of our proposed distribution on top of the histogram of our sample dataset.

In [209]:
hist(sample, xlim = c(10,35), yaxt = 'n', main = 'Density Comparison Plot')
par(new = T)
curve(dnorm(x, mean = MLEmean, sd = MLEsigma), 10, 30, main = '', ylab ='', yaxt = 'n', xlab = '', xlim = c(10,35), col = 'red')

4. Goodness of Fit Tests

In this section, we are going to review a variety of goodness-of-fit statistics. Above we fit our sample data to a normal distribution using MLE, and it is natural to ask how good this fit is. We will present five statistics that can be used to gauge this. The first three - the Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling statistics - are test statistics that can rule out whether a dataset came from a specified distribution (note, though, that they can never confirm that the data actually does follow that distribution). The last two, the Akaike Information Criterion and the Bayesian Information Criterion, cannot be interpreted on their own, but can be used to compare the fits of several distributions. In general, the best use of all of these statistics is to compare the fits of various candidate distributions, alongside graphical evidence and expert judgement, in order to choose the best distribution for the task at hand.

Again, for the first three statistics, we are considering a hypothesis test with the following null and alternative hypotheses:

$H_0$: The data follows the specified distribution.

$H_a$: The data doesn't follow the specified distribution.

Kolmogorov-Smirnov Statistic

Let $F_n$ be the empirical CDF for a dataset with $n$ elements. The Kolmogorov-Smirnov statistic measures how far away $F_n$ is from the cdf of the specified distribution, which we denote $F$. It is defined as:

$\text{KS} = \sup_{x \in \mathbb{R}}|F_n(x) - F(x)|$

If $\text{KS}$ is larger than a value which we find in this table, then we reject the null hypothesis.

$\textbf{Mathematical Note:}$ The symbol $\sup_{x \in \mathbb{R}}$, read "the supremum over the reals", may look intimidating, but it's not so scary. It is a way of specifying the "maximum" of a collection that doesn't have an honest maximum. For instance, 1 is the supremum of the set of numbers strictly between 0 and 1, but it is not the maximum value (because it is not in that set!). In fact that set has no maximum value (what would it be? .9? .99? .999?). Thankfully, even if that doesn't quite make sense, this value can easily be calculated.

To calculate it, first order the points in our dataset as $x_1 \leq x_2 \leq \cdots \leq x_n$ and set $F_n(x_0) = 0$. The Kolmogorov-Smirnov statistic can then be equivalently computed as:

$\text{KS} = \max_{1\leq i \leq n}\left(|F_n(x_i) - F(x_i)|,\, |F_n(x_{i-1}) - F(x_i)|\right)$

For our sample data, let $F_{60}$ be the empirical cumulative distribution function, and let $F$ be the CDF of the normal distribution with the mean and standard deviation found using MLE above.

In [210]:
sorted <- sort(sample)
vals_F_60 <- sapply(sorted, F_60)
lag_vals_F_60 <- c(0, vals_F_60[1:(n - 1)])  # F_n(x_{i-1}), with F_n(x_0) = 0
vals_F <- sapply(sorted, pnorm, mean = MLEmean, sd = MLEsigma)
KS <- max(c(abs(vals_F_60 - vals_F), abs(lag_vals_F_60 - vals_F)))
KS
0.1096182

According to the table linked above, we will reject the hypothesis that the data comes from our specified normal distribution if $\text{KS}$ is larger than $1.358/\sqrt{60}$.

In [211]:
1.3581/sqrt(60)
0.17532995608281

Thus we fail to reject the null hypothesis, so our data... probably... doesn't... not... come from our specified distribution. As we said above, the conclusions of these tests aren't the most useful, because they will never tell you that your proposed distribution is correct; these statistics are better used as sanity checks and for comparison purposes.
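If you would rather not compute the statistic by hand, base R's ks.test() function computes the same quantity. One caveat: its reported p-value assumes the parameters of $F$ were fixed in advance rather than estimated from the data, so treat it only as a rough cross-check here.

ks.test(sample, "pnorm", mean = MLEmean, sd = MLEsigma)  # the reported statistic D is the KS value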

Cramer-von Mises and Anderson-Darling Statistics

The Cramer-von Mises and Anderson-Darling statistics are two more test statistics for the hypotheses above. Again, they measure the "distance" between the empirical cdf and the cdf of our proposed distribution. They are part of a more general family of statistics called "quadratic edf statistics", defined as:

$\displaystyle{Q = n\int _{-\infty }^{\infty }(F_{n}(x)-F(x))^{2}\,w(x)\,dF(x)}$

Here the function $w(x)$ is called the weighting function. For Cramer-von Mises we take $w(x) = 1$, and for Anderson-Darling we take $w(x) = \frac{1}{F(x)(1-F(x))}$. Thus

$\displaystyle{CM = n\int _{{-\infty }}^{\infty }(F_{n}(x)-F(x))^{2}\,dF(x)}$

$\displaystyle{AD = n\int _{{-\infty }}^{\infty }\frac{(F_{n}(x)-F(x))^{2}}{F(x)(1-F(x))}\,dF(x)}$

We can calculate these by splitting the real line at our data points and using u-substitution. We illustrate this with the Cramer-von Mises statistic and leave the Anderson-Darling as an exercise.

To start with, order our dataset $x_1 \leq x_2 \leq \cdots \leq x_n$ and rewrite the Cramer-von Mises statistic as

$\displaystyle{\text{CM} = n \left(\int_{-\infty}^{x_1}(F_n(x)-F(x))^2\,dF(x) + \int_{x_1}^{x_2}(F_n(x)-F(x))^2\,dF(x) + \cdots + \int_{x_n}^{\infty}(F_n(x)-F(x))^2\,dF(x)\right)}$

On the interval $\displaystyle{(-\infty, x_1\mathclose]}$, the function $F_n(x)$ is zero. On the interval $\displaystyle{\mathopen[x_1,x_2\mathclose]}$ it is $\frac{1}{n}$, on $\displaystyle{\mathopen[x_2,x_3\mathclose]}$ it is $\frac{2}{n}$, and so on, until it equals 1 on $\displaystyle{\mathopen[x_n,\infty)}$. So we can rewrite this integral:

$\displaystyle{\text{CM} = n \left(\int_{-\infty}^{x_1}(F(x))^2\,dF(x) + \int_{x_1}^{x_2}\left(\tfrac{1}{n}-F(x)\right)^2\,dF(x) + \cdots + \int_{x_n}^{\infty}(1-F(x))^2\,dF(x)\right)}$

Letting $u = F(x)$, so $du = dF(x)$, and doing some symbol manipulation, we can solve this as:

$\text{CM} = \displaystyle{\frac {1}{12n}}+\sum _{i=1}^{n}\left[{\frac {2i-1}{2n}}-F(x_{i})\right]^{2}$

We apply this to our case, where n=60 and $F$ is the cdf for the normal distribution we fit with the MLE method.

In [212]:
1/(12*n) + sum(sapply(c(1:n), function(i) ((2*i-1)/(2*n) - F(sorted[i]))^2))
0.134849377028096

As we said, the integral for Anderson-Darling can be evaluated in a similar fashion, and leads to the formula:

$\displaystyle{\text{AD} = -n -\sum _{{i=1}}^{n}{\frac {2i-1}{n}}\left[\ln(F (x_{i}))+\ln \left(1-F (x_{{n+1-i}})\right)\right]}$

In [213]:
-n - 1/n*sum(sapply(c(1:n), function(i) (2*i-1)*(log(F(sorted[i])) + log(1- F(sorted[n-i+1])))))
0.906498562261191

Akaike and Bayesian Information Criteria

The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are two more goodness-of-fit statistics. It doesn't make sense to say a fit has a good or bad AIC or BIC on its own; rather, they are used to compare several fits. The lowest AIC or BIC (including the most negative, if the values are negative) is considered the best. Both work by comparing the maximized log-likelihoods of several fits while penalizing the number of parameters used; BIC additionally takes into account the number of datapoints in our sample dataset. They are defined as:

${\displaystyle \mathrm {AIC} =2k-2\ln({\hat {L}})}$

${\displaystyle \mathrm {BIC} =\ln(n)\,k-2\ln({\hat {L}})}$

where $\hat{L}$ is the maximized value of the likelihood function, $k$ is the number of parameters, and $n$ is the number of datapoints in the sample. In our example case, the normal distribution has $k=2$ parameters, and $n = 60$.

In [214]:
2*2 - 2*sum(sapply(sample, function(x) log(dnorm(x, mean = mean, sd = sd))))
320.462556869629
In [215]:
log(n)*2 - 2*sum(sapply(sample, function(x) log(dnorm(x, mean = mean, sd = sd))))
324.651245994073

5. Fitdistrplus

Now we show how all of the visualizations and goodness-of-fit statistics above can be calculated quickly and painlessly using the fitdistrplus package in R. First, we use the fitdist function to fit our sample to a normal distribution.

In [216]:
normfit <- fitdist(sample, 'norm')
normfit
Fitting of the distribution ' norm ' by maximum likelihood 
Parameters:
      estimate Std. Error
mean 19.696458  0.4365192
sd    3.381263  0.3086656

By default, this function uses MLE. We can also fit using MOM or QM.

In [217]:
normfitMOM <- fitdist(sample, 'norm', method = 'mme')
normfitMOM
Fitting of the distribution ' norm ' by matching moments 
Parameters:
      estimate
mean 19.696458
sd    3.381263
In [218]:
normfitQM <- fitdist(sample, 'norm', method = 'qme', probs = c(.5, .75))
normfitQM
Fitting of the distribution ' norm ' by matching quantiles 
Parameters:
      estimate
mean 20.153072
sd    2.803398

Notice that for quantile matching estimation we had to specify the probabilities at which to match (here the median and the third quartile). Also notice the standard deviation is a little different from the one we calculated above; this is simply because we rounded the Z value to .67 in our manual calculation.
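We can confirm this by redoing the manual calculation with the exact quantile of the standard normal rather than the rounded value (a quick sketch, reusing q3 and QMmean from the cells above):

qnorm(0.75)                   # about 0.6745, the exact 75th percentile of the standard normal
(q3 - QMmean) / qnorm(0.75)   # about 2.80, in line with the qme estimate above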

Now, let's look at the plots that we discussed for visualizing the fits. This is as simple as using the plot function.

In [219]:
plot(normfit)

Finally, we can view the goodness of fit statistics that we talked about above using the gofstat() function.

In [221]:
gofstat(normfit)
Goodness-of-fit statistics
                             1-mle-norm
Kolmogorov-Smirnov statistic  0.1096182
Cramer-von Mises statistic    0.1348494
Anderson-Darling statistic    0.9064986

Goodness-of-fit criteria
                               1-mle-norm
Akaike's Information Criterion   320.4626
Bayesian Information Criterion   324.6512

This fit actually seems pretty good, but we know the data actually came from a logistic distribution. Let's see how well the fitting process does at showing that the logistic distribution is a better fit.

In [222]:
logisfit <- fitdist(sample, 'logis')
logisfit
Fitting of the distribution ' logis ' by maximum likelihood 
Parameters:
          estimate Std. Error
location 19.991608  0.4069712
scale     1.829695  0.1991876
In [223]:
plot(logisfit)
In [224]:
gofstat(list(normfit, logisfit),fitnames = list('normal', 'logistic'))
Goodness-of-fit statistics
                                normal   logistic
Kolmogorov-Smirnov statistic 0.1096182 0.05831532
Cramer-von Mises statistic   0.1348494 0.03947354
Anderson-Darling statistic   0.9064986 0.47112031

Goodness-of-fit criteria
                                 normal logistic
Akaike's Information Criterion 320.4626 317.5921
Bayesian Information Criterion 324.6512 321.7808

A. Fantastic Distributions and Where to Find Them
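The plots below show the probability density functions of a few common continuous distributions, varying one parameter at a time: the normal distribution with varying mean and then varying standard deviation, the gamma distribution with varying shape and then varying scale, the logistic distribution with varying location and then varying scale, and the log-normal distribution with varying meanlog and then varying sdlog.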

In [225]:
ggplot(data = data.frame(x = c(-3,9)), aes(x)) +
  stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1), col = 'red') + 
  stat_function(fun = dnorm, n = 101, args = list(mean = 3, sd = 1), col = 'blue') + 
  stat_function(fun = dnorm, n= 101, args = list(mean = 6, sd = 1), col = 'green') +
  ylab("") +
  scale_y_continuous(breaks = NULL)
In [226]:
ggplot(data = data.frame(x = c(-9,9)), aes(x)) +
  stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1), col = 'red') + 
  stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 2), col = 'blue') + 
  stat_function(fun = dnorm, n= 101, args = list(mean = 0, sd = 3), col = 'green') +
  ylab("") +
  scale_y_continuous(breaks = NULL)
In [227]:
ggplot(data = data.frame(x = c(-3,9)), aes(x)) +
  stat_function(fun = dgamma, n = 101, args = list(shape = 1, scale = 1), col = 'red') + 
  stat_function(fun = dgamma, n = 101, args = list(shape = 3, scale = 1), col = 'blue') + 
  stat_function(fun = dgamma, n= 101, args = list(shape = 6, scale = 1), col = 'green') +
  ylab("") +
  scale_y_continuous(breaks = NULL)
In [228]:
ggplot(data = data.frame(x = c(-3,20)), aes(x)) +
  stat_function(fun = dgamma, n = 101, args = list(shape = 3, scale = 1), col = 'red') + 
  stat_function(fun = dgamma, n = 101, args = list(shape = 3, scale = 2), col = 'blue') + 
  stat_function(fun = dgamma, n= 101, args = list(shape = 3, scale = 3), col = 'green') +
  ylab("") +
  scale_y_continuous(breaks = NULL)
In [229]:
ggplot(data = data.frame(x = c(-3,9)), aes(x)) +
  stat_function(fun = dlogis, n = 101, args = list(location = 1, scale = 1), col = 'red') + 
  stat_function(fun = dlogis, n = 101, args = list(location = 3, scale = 1), col = 'blue') + 
  stat_function(fun = dlogis, n= 101, args = list(location = 6, scale = 1), col = 'green') +
  ylab("") +
  scale_y_continuous(breaks = NULL)
In [230]:
ggplot(data = data.frame(x = c(-3,9)), aes(x)) +
  stat_function(fun = dlogis, n = 101, args = list(location = 3, scale = 1), col = 'red') + 
  stat_function(fun = dlogis, n = 101, args = list(location = 3, scale = 2), col = 'blue') + 
  stat_function(fun = dlogis, n= 101, args = list(location = 3, scale = 3), col = 'green') +
  ylab("") +
  scale_y_continuous(breaks = NULL)
In [231]:
ggplot(data = data.frame(x = c(-3,20)), aes(x)) +
  stat_function(fun = dlnorm, n = 101, args = list(meanlog = 1, sdlog = 1), col = 'red') + 
  stat_function(fun = dlnorm, n = 101, args = list(meanlog = 1.3, sdlog = 1), col = 'blue') + 
  stat_function(fun = dlnorm, n= 101, args = list(meanlog = 1.6, sdlog = 1), col = 'green') +
  ylab("") +
  scale_y_continuous(breaks = NULL)
In [232]:
ggplot(data = data.frame(x = c(-3,15)), aes(x)) +
  stat_function(fun = dlnorm, n = 101, args = list(meanlog = 1, sdlog = 1), col = 'red') + 
  stat_function(fun = dlnorm, n = 101, args = list(meanlog = 1, sdlog = 1.5), col = 'blue') + 
  stat_function(fun = dlnorm, n= 101, args = list(meanlog = 1, sdlog = 2), col = 'green') +
  ylab("") +
  scale_y_continuous(breaks = NULL)