Fit a Nonparametric Distribution with Pareto Tails

This example shows how to fit a nonparametric probability distribution to sample data using Pareto tails to smooth the distribution in the tails.

Step 1. Generate sample data.

Generate sample data that contains more outliers than expected from a standard normal distribution.

rng('default')  % For reproducibility
left_tail = -exprnd(1,10,1);
right_tail = exprnd(5,10,1);
center = randn(80,1);
data = [left_tail;center;right_tail];

The data contains 80% values from a standard normal distribution, 10% from an exponential distribution with a mean of 5, and 10% from a negated exponential distribution with a mean of 1 (so those values have a mean of -1). Compared to a standard normal distribution, the exponential values are more likely to be outliers, especially in the upper tail.
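As a quick, informal check of the claim about outliers, the following sketch recreates the sample and computes its skewness and kurtosis. The variable names s and k are introduced here for illustration only. Values of kurtosis above 3 and skewness above 0 would be consistent with heavier tails than a normal distribution and a longer upper tail, but these summary statistics are only a rough indication.

```matlab
rng('default')  % Same seed and call order as Step 1, so the data is identical
left_tail = -exprnd(1,10,1);
right_tail = exprnd(5,10,1);
center = randn(80,1);
data = [left_tail;center;right_tail];

s = skewness(data)  % A normal distribution has skewness 0
k = kurtosis(data)  % A normal distribution has kurtosis 3
```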

Step 2. Fit probability distributions to the data.

Fit a normal distribution and a t location-scale distribution to the data, and plot for a visual comparison.

probplot(data);                              % Normal probability plot of the data
hold on
p = fitdist(data,'tlocationscale');          % Fit a t location-scale distribution
h = plot(gca,p,'PlotType','probability');    % Overlay the fit on the same axes
set(h,'color','r','linestyle','-');
title('Probability Plot')
legend('Normal','Data','t location-scale','Location','SE')
hold off

[Figure: Probability Plot. x-axis: Data; y-axis: Probability. Series: Normal, Data, t location-scale.]

Both distributions appear to fit reasonably well in the center, but neither the normal distribution nor the t location-scale distribution fits the tails very well.
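The visual comparison can be supplemented with a numeric one. The following sketch, which assumes the same sample as Step 1, tabulates sample quantiles next to the quantiles of the two fitted distributions at a few probabilities. The probability levels in q and the table variable names are arbitrary choices made for this illustration.

```matlab
rng('default')  % Recreate the sample from Step 1
left_tail = -exprnd(1,10,1);
right_tail = exprnd(5,10,1);
center = randn(80,1);
data = [left_tail;center;right_tail];

pn = fitdist(data,'normal');            % Fitted normal distribution
pt = fitdist(data,'tlocationscale');    % Fitted t location-scale distribution

q = [0.01 0.05 0.5 0.95 0.99];          % Probabilities to compare (illustrative)
qs = quantile(data,q);                  % Sample quantiles
qn = icdf(pn,q);                        % Quantiles of the normal fit
qt = icdf(pt,q);                        % Quantiles of the t location-scale fit

T = table(q(:),qs(:),qn(:),qt(:), ...
    'VariableNames',{'Probability','Sample','Normal','tLocationScale'})
```

Large differences between the Sample column and the fitted columns in the first and last rows indicate a poor fit in the tails.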

Step 3. Generate an empirical distribution.

To obtain a better fit, use ecdf to generate an empirical cdf based on the sample data.

figure
ecdf(data)

[Figure: Empirical cdf, shown as a stairs plot. x-axis: x; y-axis: F(x).]

The empirical distribution provides a perfect fit, but the outliers make the tails very discrete. Random samples generated from this distribution using the inversion method might include, for example, values near 4.33 and 9.25, but no values in between.
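The discreteness can be illustrated with a minimal sketch of the inversion method applied to the ecdf. This is one simple way to implement inversion, not necessarily how any toolbox function does it: for each uniform random number, the code returns the smallest observed value whose empirical cdf is at least that number. The names u, idx, and samples are introduced here for illustration.

```matlab
rng('default')  % Recreate the sample from Step 1
left_tail = -exprnd(1,10,1);
right_tail = exprnd(5,10,1);
center = randn(80,1);
data = [left_tail;center;right_tail];

[f,xv] = ecdf(data);                 % Empirical cdf values f at points xv
u = rand(1000,1);                    % Uniform random numbers in (0,1)

% Inversion: smallest xv whose cdf value is at least u
idx = arrayfun(@(ui) find(f >= ui,1),u);
samples = xv(idx);

allObserved = all(ismember(samples,data))   % Samples repeat observed values only
numDistinct = numel(unique(samples))        % Cannot exceed the sample size
```

Because every generated value is one of the original observations, no value can fall in the gaps between outliers.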

Step 4. Fit a distribution using Pareto tails.

Use paretotails to generate an empirical cdf for the middle 80% of the data and fit generalized Pareto distributions to the lower and upper 10%.

pfit = paretotails(data,0.1,0.9)

pfit = 
Piecewise distribution with 3 segments
      -Inf < x < -1.24623    (0 < p < 0.1): lower tail, GPD(-0.334156,0.798745)
   -1.24623 < x < 1.48551  (0.1 < p < 0.9): interpolated empirical cdf
        1.48551 < x < Inf    (0.9 < p < 1): upper tail, GPD(1.23681,0.581868)

paretotails fits a distribution by piecing together an ecdf or kernel distribution in the center of the sample and smooth generalized Pareto distributions (GPDs) in the tails. The function returns a paretotails probability distribution object. You can access information about the fit and perform further calculations by using the object functions of the paretotails object. For example, you can evaluate the cdf or generate random numbers from the distribution.
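As a brief sketch of working with the object, the following code queries the segment boundaries and the tail parameters, evaluates a few quantiles, and generates random numbers. It assumes the same sample as Step 1. The probabilities passed to icdf and the number of random values are arbitrary choices for illustration.

```matlab
rng('default')  % Recreate the sample from Step 1
left_tail = -exprnd(1,10,1);
right_tail = exprnd(5,10,1);
center = randn(80,1);
data = [left_tail;center;right_tail];

pfit = paretotails(data,0.1,0.9);

[p,q] = boundary(pfit)            % Cumulative probabilities and quantiles at the segment boundaries
lowerGPD = lowerparams(pfit)      % GPD parameters of the lower tail
upperGPD = upperparams(pfit)      % GPD parameters of the upper tail

xq = icdf(pfit,[0.05 0.5 0.95])   % Quantiles in the lower tail, center, and upper tail
r = random(pfit,1000,1);          % Random numbers from the fitted distribution
```

Unlike samples drawn from the ecdf alone, the values in r that fall in the tails come from the continuous GPD segments, so they are not restricted to the observed outliers.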

Step 5. Compute and plot the cdf.

Compute and plot the cdf of the fitted paretotails distribution.

x = -4:0.01:10;
plot(x,cdf(pfit,x))

[Figure: cdf of the fitted paretotails distribution, shown as a line plot.]

The paretotails cdf closely fits the data but is smoother in the tails than the ecdf generated in Step 3.
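To see the comparison directly, the two cdfs can be drawn on the same axes. The following is a minimal sketch that assumes the same sample as Step 1. The line colors and legend labels are arbitrary choices for illustration.

```matlab
rng('default')  % Recreate the sample from Step 1
left_tail = -exprnd(1,10,1);
right_tail = exprnd(5,10,1);
center = randn(80,1);
data = [left_tail;center;right_tail];

pfit = paretotails(data,0.1,0.9);
[f,xv] = ecdf(data);              % Empirical cdf for comparison

x = -4:0.01:10;
y = cdf(pfit,x);                  % cdf of the paretotails fit

figure
stairs(xv,f,'b-')                 % Empirical cdf as a step function
hold on
plot(x,y,'r-')                    % Paretotails cdf
legend('Empirical cdf','Pareto tails cdf','Location','SE')
xlabel('x')
ylabel('F(x)')
hold off
```

In the tails, the step function jumps at each outlier, while the paretotails curve should pass smoothly through the same region.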