Transforming a right skewed data set to normal
I am attempting to fit an ARIMA model to a set of data. I cannot get a good fit because the data follow a Weibull distribution, and when I transform the data toward a normal distribution, a second peak emerges. So far I have tried a square root, cube root, natural log, log10, log2, and log(x/(1-x)). Figure 1 is the raw data before any transform.

Answers (2)
Have you tried fitting the data to a Weibull distribution? MATLAB's wblfit() returns the maximum likelihood estimates of the parameters that best fit the underlying Weibull distribution of your data.
[Updated] Here's a demo
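% Create example data (scale 8, shape 2)
data = wblrnd(8,2,1000,1);
% Fit a Weibull distribution; parmhat = [scale, shape], parmci = confidence intervals
[parmhat, parmci] = wblfit(data);
% Plot the histogram of the data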
figure
h = histogram(data);
hold on
% Calculate pdf and scale it to your data
Y = wblpdf(sort(data),parmhat(1), parmhat(2));
yScaled = Y * (1/max(Y)) * max(h.Values);
% Plot scaled pdf (the pdf should overlap with the hist)
plot(sort(data), yScaled, 'r-', 'LineWidth', 3)
legend('Your data', 'scaled pdf')
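As an alternative to matching the peak heights, here is a minimal sketch that scales the pdf by the expected bin counts (number of samples times bin width). The synthetic data and the seed are assumptions that stand in for the real data set.

```matlab
% Sketch: overlay a fitted Weibull pdf on a count histogram.
rng(0);                                   % reproducible synthetic data (assumption)
data = wblrnd(8, 2, 1000, 1);             % stand-in for the real data
parmhat = wblfit(data);                   % [scale, shape] maximum likelihood estimates

figure
h = histogram(data);                      % default normalization is 'count'
hold on
x = linspace(0, max(data), 200);
% Expected count in a bin is approximately N * binWidth * pdf(x)
expectedCounts = numel(data) * h.BinWidth * wblpdf(x, parmhat(1), parmhat(2));
plot(x, expectedCounts, 'r-', 'LineWidth', 2)
legend('Data', 'Fitted Weibull pdf (expected counts)')
```

With this scaling, a poor fit shows up as a mismatch in shape rather than being hidden by forcing the two peaks to the same height. Calling histogram(data, 'Normalization', 'pdf') and plotting the unscaled pdf should give an equivalent picture.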

19 comments
Michael Mueller
on 19 Mar 2019
Adam Danz
on 19 Mar 2019
The line you shared doesn't have a second argument so I'm not sure what's causing the error. Could you share the entire copy-pasted error message and the line that is causing the error? Perhaps a sample of the variables would be helpful, too.
Michael Mueller
on 19 Mar 2019
Michael Mueller
on 19 Mar 2019
Adam Danz
on 19 Mar 2019
Yes. I just created fake data, inserted NaNs, and then ran wblfit() and the same error message appeared. Remove your NaN values and try again.
data(isnan(data)) = [];
[parmhat,parmci] = wblfit(data);
Michael Mueller
on 19 Mar 2019
I think it makes more sense to plot a histogram of the raw data you're fitting and then plot the probability density function (pdf) as a line using the sorted data. The pdf needs to be scaled to the range of your data. I'll update my solution to include a demo.
Michael Mueller
on 19 Mar 2019
Adam Danz
on 19 Mar 2019
Are you sure the red line was produced using the parameters from the fit of your data? It looks like there could be a much better fit. If the data were available in a mat file, I could tinker with it.
What are the values of your two fit parameters?
Michael Mueller
on 19 Mar 2019
Michael Mueller
on 19 Mar 2019
Adam Danz
on 19 Mar 2019
Could you share all of the relevant code?
When I plot the pdf and the histogram using your parameters, they agree and they match your pdf (below) but not your distribution. That suggests that the paramhat values you're using either aren't from your distribution or your distribution isn't Weibullian.
paramhat = [8.3637, 1.7130];
x = 0:.1:25;
ymax = 3400;
Y = wblpdf(x,paramhat(1), paramhat(2));
yScaled = Y * (1/max(Y)) * ymax;
figure
plot(x,yScaled, 'LineWidth', 3)
hold on
data = wblrnd(paramhat(1), paramhat(2),120000,1);
histogram(data)

Michael Mueller
on 19 Mar 2019
Adam Danz
on 19 Mar 2019
I just saw your response, which was posted while I was writing my previous message. Either the distribution in your test data isn't Weibullian or there aren't enough samples to see the full distribution. Perhaps with more samples, the error between the histogram and the red pdf line would decrease. This solution will only work if your data come from a Weibull distribution, as indicated in the question. Looking at the original distribution in your question, this solution still looks like a good one. What does the pdf look like for the data in your question?
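One way to check whether a sample is plausibly Weibull, beyond eyeballing the histogram, is a Weibull probability plot plus a Kolmogorov-Smirnov test against the fitted distribution. This is a rough sketch on synthetic data (an assumption); because the parameters are estimated from the same sample, the p-value tends to be optimistic and should be read as a rough guide only.

```matlab
% Sketch: informal check of the Weibull assumption.
rng(1);                                   % reproducible synthetic data (assumption)
data = wblrnd(8.3637, 1.7130, 5000, 1);   % stand-in generated from the parameters quoted above
data(isnan(data)) = [];                   % the fit does not accept NaNs

pd = fitdist(data, 'Weibull');            % pd.A = scale, pd.B = shape

figure
wblplot(data)                             % points close to the reference line suggest a Weibull shape

% KS test of the sample against the fitted distribution (h = 1 rejects at the 5% level)
[h, p] = kstest(data, 'CDF', pd);
fprintf('KS test: h = %d, p = %.3f\n', h, p);
```

Systematic curvature in the probability plot, especially in the tails, is usually a clearer sign of a non-Weibull shape than the test result alone.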
Michael Mueller
on 19 Mar 2019
Adam Danz
on 19 Mar 2019
Hmmm, the original distribution from your question is different (below). This looks Weibullian but the one above in your previous comment doesn't. Is it supposed to be a Weibullian distribution or was that just an initial guess? Perhaps a different approach is needed, such as a transformation as you originally suggested.

Michael Mueller
on 19 Mar 2019
The updated distribution doesn't look as much like a Weibull distribution as the mistaken one did. If your data should come from a Weibull distribution because of the principles behind your data collection, then you can use these methods to do the fitting. But the updated plot doesn't look like a Weibull distribution. It doesn't look normal either, due to the rightward tail. The skew doesn't look strong enough to be fixed by a log transform either, but you could at least try it (with low expectations).
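To try the log transform with a concrete yardstick, one option is to compare the sample skewness before and after. The sketch below uses synthetic Weibull data as a stand-in (an assumption); the real series would replace it.

```matlab
% Sketch: does a log transform reduce the skew?
rng(2);                                   % reproducible synthetic data (assumption)
data = wblrnd(8, 2, 5000, 1);             % stand-in for the real right-skewed data
data = data(~isnan(data) & data > 0);     % log requires strictly positive values

skewRaw = skewness(data);                 % > 0 means a right tail
skewLog = skewness(log(data));            % ideally close to 0 after the transform
fprintf('Skewness raw: %.2f, after log: %.2f\n', skewRaw, skewLog);

figure
subplot(1, 2, 1); histogram(data);      title('Raw')
subplot(1, 2, 2); histogram(log(data)); title('Log-transformed')
```

A skewness near zero after the transform is the goal. If the sign flips to negative, the log has overcorrected and produced a left tail, which is what happens for Weibull-distributed data.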
Your original question asks how to make a bimodal distribution more like a normal distribution and that question made sense with the original example distribution which appeared to be Weibullian with 2 peaks. But the updated example isn't Weibullian nor does it have two peaks. So I've lost track of the goal. If you want your data to be more normal and less skewed, I'm sure there's a complicated transformation that could be created but what's the goal? Any distribution can be transformed into another but that usually results in uninterpretable data.
Michael Mueller
on 19 Mar 2019
Jeff Miller
on 20 Mar 2019
One very general two-step approach is to
- convert the original scores to percentiles within the original distribution
- replace each original score with the standard normal (z) score having the same percentile.
The ARIMA model will then predict z scores, and you can convert back to the original scores by reversing the steps (i.e., find the percentile of the predicted z score and then find the original score at that percentile).
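The two steps and their reversal could be sketched as follows. The synthetic series, the variable names, and the placeholder zPred values (standing in for ARIMA forecasts) are all assumptions for illustration.

```matlab
% Sketch: percentile-based transform to z scores and back.
rng(3);                                   % reproducible synthetic data (assumption)
x = wblrnd(8, 2, 500, 1);                 % stand-in for the original series
n = numel(x);

% Step 1: percentile of each score within its own distribution
p = tiedrank(x) / (n + 1);                % strictly between 0 and 1
% Step 2: standard normal score with the same percentile
z = norminv(p);

% Reverse: map predicted z scores back to the original scale
zPred = [-1; 0; 1.5];                     % hypothetical forecasts from the model
pPred = normcdf(zPred);                   % percentile of each predicted z score
xPred = quantile(x, pPred);               % original score at that percentile
disp([zPred, xPred])
```

The back-transform can only return values inside the observed range of x, so forecasts far out in the tails are clamped to the smallest or largest observed score.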
2 comments
Michael Mueller
on 20 Mar 2019
Edited: Michael Mueller
on 20 Mar 2019
Jeff Miller
on 20 Mar 2019
I don't really understand this very well, but here are some more comments that might help.
First, you have to get rid of the NaNs before you even start, don't you? I'm not too familiar with ARIMA models, but I wouldn't think they would allow NaNs. And even if they did, I'm not sure how they would fit into a normal distribution.
Second, once you get Final_testWithoutNans, I think you should get the percentile scores for each data value more precisely. Use this technique at Stack Overflow to rank the values in Final_testWithoutNans (use the method that allows for ties if you have them). Then divide the ranks by numel(Final_testWithoutNans)+1 to get the percentile value of each point. The +1 avoids values of 1, which give you those pesky infinite z values, and it's the right thing to do anyway.
After you've got the percentile values this way, you can convert them to z scores with norminv or erfinv. At this point you might take a look at the histogram of those z scores and make sure it looks normal. It must if you have no ties, but if there are lots of ties it might not. Anyway, if this plot of z's doesn't look normal (e.g., you might have a whole bunch of scores tied at the maximum value, which would never happen in a normal distribution), then you can be sure that you will never find any other transformation of your original data that does look normal.
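A minimal sketch of these steps is below. It uses tiedrank from the Statistics and Machine Learning Toolbox in place of the linked ranking technique (a substitution), and the synthetic Final_test vector, rounded to force ties, is an assumption standing in for the real data.

```matlab
% Sketch: rank-based z scores with ties handled by average ranks.
rng(4);                                   % reproducible synthetic data (assumption)
Final_test = [round(wblrnd(8, 2, 300, 1)); NaN; NaN];   % hypothetical data with ties and NaNs

Final_testWithoutNans = Final_test(~isnan(Final_test)); % drop NaNs first
n = numel(Final_testWithoutNans);

r   = tiedrank(Final_testWithoutNans);    % tied values share their average rank
pct = r / (n + 1);                        % the +1 keeps percentiles below 1
z   = norminv(pct);                       % same as sqrt(2) * erfinv(2*pct - 1)

% Fraction of values that duplicate another value; large fractions distort z
fracTied = 1 - numel(unique(Final_testWithoutNans)) / n;
fprintf('Fraction of tied values: %.2f\n', fracTied);

figure
histogram(z)                              % should look roughly normal if ties are few
title('Rank-based z scores')
```

If the histogram shows spikes, those correspond to groups of tied values, and no monotone transformation of the original data will remove them.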



