Transforming a right-skewed data set to normal

I am attempting to fit an ARIMA model to a set of data. The issue is that I cannot get a good fit because the data set follows a Weibull distribution, and when I transform the data so that it follows a normal distribution, a second peak emerges. So far I have tried a square root, cube root, natural log, log10, log2, and log(x/(1-x)). Figure 1 is the raw data before any transform.

Answers (2)

Adam Danz on 19 Mar 2019
Edited: Adam Danz on 19 Mar 2019
Have you tried fitting the data to a Weibull distribution? MATLAB's wblfit() returns the maximum likelihood estimates of the parameters that best fit the underlying Weibull distribution of your data.
You could then use wblpdf() to plot the results and compare them to your data's distribution.
[Updated] Here's a demo
% Create data
data = wblrnd(8,2,1000,1);
% Fit the Weibull parameters
[parmhat, parmci] = wblfit(data);
% Plot the fit
figure
h = histogram(data);
hold on
% Calculate pdf and scale it to your data
Y = wblpdf(sort(data),parmhat(1), parmhat(2));
yScaled = Y * (1/max(Y)) * max(h.Values);
% Plot scaled pdf (the pdf should overlap with the hist)
plot(sort(data), yScaled, 'r-', 'LineWidth', 3)
legend('Your data', 'scaled pdf')
190319 104753-Figure 1.jpg

19 comments

I had not. I just tried it and received an error, "Second argument must be finite." The line I used was Test = wblfit(Data);
Adam Danz on 19 Mar 2019
The line you shared doesn't have a second argument so I'm not sure what's causing the error. Could you share the entire copy-pasted error message and the line that is causing the error? Perhaps a sample of the variables would be helpful, too.
Error using fzero (line 230)
Second argument must be finite.
Error in evfit (line 186)
[sigmahat, lkeqnval, err] = fzero(@lkeqn, bnds, options, x0, freq, wgtmeanUnc);
Error in wblfit (line 76)
parmhatEV = evfit(log(x),alpha,censoring,freq,options);
Unfortunately, I cannot share any of the data.
Could the issue be caused by NaNs in the data?
Yes. I just created fake data, inserted NaNs, and then ran wblfit(), and the same error message appeared. Remove your NaN values and try again.
data(isnan(data)) = [];
[parmhat,parmci] = wblfit(data);
The issue was NaNs in the data. I used the following code:
test = wblfit(data);
Wdist = wblpdf(data,test(1),test(2));
histogram(Wdist);
This was the result:
Weibel test.PNG
Am I not using wblpdf correctly?
Adam Danz on 19 Mar 2019
Edited: Adam Danz on 19 Mar 2019
I think it makes more sense to plot a histogram of the raw data you're fitting and then to plot the probability density function (pdf) as a line using sorted data. The pdf needs to be scaled to the range of your data. I'll update my solution to include a demo.
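The scaling mentioned above can be done in more than one way. The demo in the answer matches the peak of the pdf to the tallest histogram bar; an alternative is to multiply the pdf by the number of samples times the bin width, which puts the curve in units of expected counts per bin. A minimal sketch, assuming synthetic data from wblrnd in place of the real data set (the parameters 8 and 2 are illustrative only):

```matlab
% Synthetic stand-in for the real data (illustrative parameters only)
rng(2);
data = wblrnd(8, 2, 1000, 1);

% Fit the Weibull scale and shape by maximum likelihood
parmhat = wblfit(data);

figure
h = histogram(data);
hold on

% Evaluate the fitted pdf on a regular grid over the data range
x = linspace(0, max(data), 200);
pdfVals = wblpdf(x, parmhat(1), parmhat(2));

% Convert density to expected counts per bin: N * bin width * pdf
expectedCounts = pdfVals * numel(data) * h.BinWidth;
plot(x, expectedCounts, 'r-', 'LineWidth', 2)
legend('Data', 'Fitted pdf (expected counts)')
```

With this scaling a mismatch in overall height, not only in shape, shows up in the plot, which peak matching hides by construction.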
Thank you for the demo, it helped a lot. I applied the fit and this was the result. Is my best bet to apply transforms so that the data better fit the Weibull, or should I keep trying to fit the normal?
Weibel fit.PNG
Adam Danz on 19 Mar 2019
Are you sure the red line was produced using the parameters from the fit of your data? It looks like there could be a much better fit. If the data were available in a mat file, I could tinker with it.
What are the values of your two fit parameters?
To verify the fit, I cleared my workspace and re-ran the code, obtaining the same result.
paramhat = [8.3637, 1.7130]
parmci = [8.3192, 1.7017; 8.4085, 1.7243]
I do have a set of test data that I pulled from another source. It follows a similar distribution and exhibits the same issue when run with the same code. I have attached it, and the code I am using is as follows:
[paramhat,parmci] = wblfit(data);
figure
h = histogram(data);
hold on
Y = wblpdf(sort(data),paramhat(1),paramhat(2));
yScaled = Y*(1/max(Y))*max(h.Values);
plot(sort(data),yScaled,'r-','LineWidth',2)
legend('Data','Scaled PDF');
The resulting plot is:
Test data plot.PNG
Could you share all of the relevant code?
When I plot the pdf and the histogram using your parameters, they agree and they match your pdf (below) but not your distribution. That suggests that the paramhat values you're using either aren't from your distribution or your distribution isn't Weibullian.
paramhat = [8.3637, 1.7130];
x = 0:.1:25;
ymax = 3400;
Y = wblpdf(x,paramhat(1), paramhat(2));
yScaled = Y * (1/max(Y)) * ymax;
figure
plot(x,yScaled, 'LineWidth', 3)
hold on
data = wblrnd(paramhat(1), paramhat(2),120000,1);
histogram(data)
190319 112050-Figure 1.jpg
That is all the code. The only line not included was the one loading the data, which used "load" followed by the directory.
Adam Danz on 19 Mar 2019
I just saw your response, posted while I was replying with my previous message. The distribution in your test data isn't Weibullian, or there aren't enough samples to see the full distribution. Perhaps with more samples, the error between the histogram and the red pdf line would decrease. This solution will only work if your data are from a Weibull distribution, as was indicated in the question. Looking at the original distribution in your question, it looks like this solution is still a good one. What does the pdf look like for the data in your question?
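Whether a sample is plausibly Weibull can also be checked with a probability plot instead of by eye. For a Weibull CDF F(x) = 1 - exp(-(x/a)^b), log(-log(1 - F)) is a straight line in log(x) with slope b and intercept -b*log(a), so strong curvature in that plot suggests another distribution. A minimal sketch using only base functions, with a synthetic sample generated from the parameters reported in this thread (the sample size, seed, and plotting positions are assumptions for illustration; wblplot in the Statistics and Machine Learning Toolbox draws a similar plot):

```matlab
% Synthetic Weibull sample via the inverse CDF (stand-in for real data)
rng(1);
a = 8.3637;  b = 1.7130;              % scale and shape quoted in the thread
x = a * (-log(rand(5000,1))).^(1/b);

% Empirical CDF at the sorted values, using mid-point plotting positions
xs = sort(x);
n  = numel(xs);
F  = ((1:n)' - 0.5) / n;

% Linearized coordinates: a Weibull sample falls on a straight line
u = log(xs);
v = log(-log(1 - F));

% Least-squares line; slope estimates the shape, intercept gives the scale
c = polyfit(u, v, 1);
shapeEst = c(1);
scaleEst = exp(-c(2) / c(1));

figure
plot(u, v, '.', u, polyval(c, u), 'r-')
xlabel('log(x)'); ylabel('log(-log(1 - F))')
legend('Data', 'Least-squares line', 'Location', 'northwest')
```

The line fit here is only a visual aid; wblfit's maximum likelihood estimates remain the better parameter estimates.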
The pdf I sent originally (the parameters you used when you compared the histogram to the line) was fitted using the original data. That distribution again is:
Weibel fit.PNG
As I mentioned earlier, unfortunately I cannot share the raw data set.
Adam Danz on 19 Mar 2019
Hmmm, the original distribution from your question is different (below). This looks Weibullian but the one above in your previous comment doesn't. Is it supposed to be a Weibullian distribution or was that just an initial guess? Perhaps a different approach is needed, such as a transformation as you originally suggested.
Raw Data.png
That is because I accidentally posted the wrong image in the question. The original plot (the one you posted in your last reply) is after a transform was done. I do not recall which transform was applied. I will update the original question to show the correct plot.
Adam Danz on 19 Mar 2019
Edited: Adam Danz on 19 Mar 2019
The updated distribution doesn't look as much like a Weibull distribution as the mistaken one did. If your data should come from a Weibull distribution because of the principles behind your data collection, then you can use these methods to do the fitting. But the updated plot doesn't look like a Weibull distribution. It doesn't look normal either, due to the rightward tail. The skew doesn't look strong enough to be fixed by a log transform either, but you could at least try it (with low expectations).
Your original question asks how to make a bimodal distribution more like a normal distribution and that question made sense with the original example distribution which appeared to be Weibullian with 2 peaks. But the updated example isn't Weibullian nor does it have two peaks. So I've lost track of the goal. If you want your data to be more normal and less skewed, I'm sure there's a complicated transformation that could be created but what's the goal? Any distribution can be transformed into another but that usually results in uninterpretable data.
The goal is to take the current data set and make it normal. Applying any form of transform (log, sqrt, cube root, etc.) has created a bimodal distribution with different degrees of skewness. The issue is that whatever I do to the data to make it normal, I need to be able to undo on the predicted values produced by an ARIMA model.


Jeff Miller on 20 Mar 2019


One very general two-step approach is to
  1. convert the original scores to percentiles within the original distribution
  2. replace each original score with the standard normal (z) score having the same percentile.
The ARIMA model will then predict z scores, and you can convert back to the original scores by reversing the steps (i.e., find the percentile of the predicted z score and then find the original score at that percentile).
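The forward and inverse steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the data are synthetic and contain no ties or NaNs, ranks are divided by n+1 so no percentile equals 0 or 1, and the inverse interpolates linearly within the observed range (predictions beyond that range are clamped to it). The variable names are hypothetical.

```matlab
% Synthetic right-skewed data (stand-in for the real series)
rng(0);
x = 8 * (-log(rand(1000,1))).^(1/1.7);
n = numel(x);

% Step 1: percentile of each score within the original distribution
[xs, order] = sort(x);
r = zeros(n,1);
r(order) = (1:n)';              % rank of each observation (no ties assumed)
p = r / (n + 1);                % strictly between 0 and 1

% Step 2: standard normal score with the same percentile
z = -sqrt(2) * erfcinv(2 * p);  % equivalent to norminv(p)

% Inverse: predicted z -> percentile -> original scale
zPred = [-1; 0; 1.5];                       % hypothetical model predictions
pPred = 0.5 * erfc(-zPred / sqrt(2));       % equivalent to normcdf(zPred)
pGrid = (1:n)' / (n + 1);
pPred = min(max(pPred, pGrid(1)), pGrid(end));   % clamp to the observed range
xPred = interp1(pGrid, xs, pPred);

% Round-trip check: mapping z back should recover the original data
pBack = min(max(0.5 * erfc(-z / sqrt(2)), pGrid(1)), pGrid(end));
xBack = interp1(pGrid, xs, pBack);
```

The sorted original values act as the lookup table for the inverse, so they have to be kept alongside the fitted model.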

2 comments

Michael Mueller on 20 Mar 2019
Edited: Michael Mueller on 20 Mar 2019
I attempted this earlier this morning. I obtained the percentile values as well as the z-scores; however, when I go to create my ARIMA model I get a warning message:
Warning: Error in calculation of parameter covariance matrix. Matrix of NaN's returned.
> In arima/estimate (line 1137)
The current code used to generate the transform is as follows:
Test = percentile(Final_test,Final_test);
z = @(Test) -sqrt(2) * erfcinv(Test*2);
Zs = z(Test);
Where percentile is a user-defined function:
function x = percentile(datas,value)
perc = prctile(datas,1:100);
x = zeros(length(value),1);
for ii = 1:length(value)
[c index] = min(abs(perc'-value(ii)));
x(ii) = (index+1)./100;
end
end
Final_test is a 52561 x 1 double containing 52477 real values and 84 NaNs. Zs is also 52561 x 1, with 52477 real values and 84 NaNs. Of the 52477 real values, 26281 are negative and 632 are Inf. The values of Final_test corresponding to the 632 Inf values vary without repeating and are all above a certain minimum specific to the data set, which I'll call "X".
The Inf values in Zs correspond to a value of 1 in Test.
Using:
check = Final_test(Test == 1);
verify = find(Final_test >= min(check));
Where check is the values of Final_test corresponding to Inf in Zs, which also equals the values of Final_test corresponding to values of 1 in Test.
The output of verify is a 715x1 double, which makes me think there are values greater than X that result in values less than 1 in Test, which would give a z-score not equal to Inf in Zs.
I am using a temporary workaround by replacing all values of 1 in Test with 0.9999, but is there a better, more accurate workaround?
Also, for the inverse, I am using normcdf to get the percentile and multiplying by 100. My thought to finish the transform was to use prctile(X,p), where p is the percentile and X is the data set. Should I be using the original data set for X, or should I be using a different function altogether?
Jeff Miller on 20 Mar 2019
I don't really understand this very well, but here are some more comments that might help.
First, you have to get rid of the NaNs before you even start, don't you? I'm not too familiar with ARIMA models, but I wouldn't think they would allow NaNs. And even if they did, I'm not sure how they would fit into a normal distribution.
Second, once you get Final_testWithoutNans, I think you should get the percentile scores for each data value more precisely. Use this technique at Stack Overflow to rank the values in Final_testWithoutNans (use the method that allows for ties if you have them). Then divide the ranks by numel(Final_testWithoutNans)+1 to get the percentile values of each point. The +1 avoids values of 1, which give you those pesky infinite z values, and it's the right thing to do anyway.
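The ranking step with ties can be sketched in base MATLAB as below. This is only one possible way to do it and not necessarily the technique linked above: tied values receive the average of the ranks they span (tiedrank in the Statistics and Machine Learning Toolbox does the same job). The input vector is made up for illustration, and the variable names are hypothetical.

```matlab
% Made-up data with NaNs and ties (illustrative only)
x = [3.2; NaN; 1.5; 3.2; 7.8; 1.5; 9.1; NaN; 4.4];

% Drop NaNs before ranking
xClean = x(~isnan(x));
n = numel(xClean);

% Ordinal ranks from a sort, then average them within groups of tied values
[xs, order] = sort(xClean);
[~, ~, grp] = unique(xs);                        % group index per sorted value
avgRank = accumarray(grp, (1:n)', [], @mean);    % mean rank of each distinct value
r = zeros(n,1);
r(order) = avgRank(grp);                         % tie-aware rank per observation

% Percentiles strictly inside (0,1), then z scores
p = r / (n + 1);
z = -sqrt(2) * erfcinv(2 * p);                   % equivalent to norminv(p)
```

Because the largest rank is n, the largest percentile is n/(n+1), so no z score is infinite and no ad hoc replacement of 1 with 0.9999 is needed.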
After you've got the percentile values this way you can convert those back to z scores with norminv or erfinv. At this point you might take a look at the histogram of those z scores and make sure it looks normal. It must if you have no ties, but if there are lots of ties it might not. Anyway, if this plot of z's doesn't look normal (e.g., you might have a whole bunch of scores tied at the maximum value, which would never happen in a normal distribution), then you can be sure that you will never find any other transformation of your original data that does look normal.


Release

R2018b
