Automatically select the right number of bins (or combine the bins) for the expected frequencies in crosstab, in order to guarantee at least 5 elements per bin

I have two observed datasets, "x" and "y", representing "future stock prices", and I want to compare the observed frequencies in the bins of "x" and "y" against each other using crosstab. To do so, I first place the elements of "x" and "y" into bins with the histcounts function. The resulting binned arrays, "cx" and "cy", are then compared with a chi-square test, performed by crosstab. The chi-square test of independence determines whether there is a significant association between the frequencies of "x" and "y" across the bins.
However, the chi-square test "is not valid for small samples, and if some of the counts (in the expected frequency) are less than five, you may need to combine some bins in the tails." In the following example, several bins of the observed frequencies "cx" and "cy" have zero elements, and I do not know whether they affect the expected frequencies calculated by crosstab.
Therefore, is there a way in crosstab to automatically select the right number of bins for the expected frequencies, or to combine them if some are empty, in order to guarantee at least 5 elements per bin?
rng default; % for reproducibility
a = 0;
b = 100;
nb = 50;
% Create two log-normally distributed random datasets, "x" and "y"
% (but we can use any randomly distributed data)
x = (b-a).*round(lognrnd(1,1,1000,1)) + a;
y = (b-a).*round(lognrnd(0.88,1.1,1000,1)) + a;
% Counts/frequency of "x" and "y"
cx = histcounts(x,'NumBins',nb);
cy = histcounts(y,'NumBins',nb);
[~,chi2,p] = crosstab(cx,cy)
chi2 = 476.6926
p = 2.9412e-28
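A note on the snippet above: histcounts with 'NumBins' chooses the bin edges separately for each input, and crosstab(cx,cy) treats the two vectors as category labels rather than as counts. One way to inspect the expected frequencies directly is to bin both datasets on shared edges and compute them by hand. This is a minimal sketch, not what crosstab does internally; the shared `edges` vector and the names `O` and `E` are assumptions introduced here for illustration.

```matlab
rng default; % for reproducibility
x = 100.*round(lognrnd(1,1,1000,1));
y = 100.*round(lognrnd(0.88,1.1,1000,1));

% Shared bin edges (assumed choice: 50 equal-width bins over the pooled
% range), so that bin k covers the same price range for both datasets
edges = linspace(min([x; y]), max([x; y]), 51);
cx = histcounts(x, edges);
cy = histcounts(y, edges);

% 2-by-K contingency table: rows = dataset, columns = bin
O = [cx; cy];

% Expected counts: (row total * column total) / grand total
E = sum(O, 2) * sum(O, 1) / sum(O(:));

fprintf('%d of %d cells have an expected count below 5\n', nnz(E < 5), numel(E));
```

Any cell counted by the last line would break the "at least 5" rule of thumb quoted in the question, so those are the bins to combine.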

Answers (1)

One option for small samples is to use the fishertest function.

6 Comments

Thanks @Star Strider :-)
Do you mean placing all my data into only 2 bins, like this?
As far as I remember, fishertest in MATLAB only supports 2 × 2 contingency tables:
rng default; % for reproducibility
a = 0;
b = 100;
nb = 2;
% Create two log-normally distributed random datasets, "x" and "y"
% (but we can use any randomly distributed data)
x = (b-a).*round(lognrnd(1,1,1000,1)) + a;
y = (b-a).*round(lognrnd(0.88,1.1,1000,1)) + a;
% Counts/frequency of "x" and "y"
cx = histcounts(x,'NumBins',nb);
cy = histcounts(y,'NumBins',nb);
t = table(cx',cy')
t = 2x2 table
    Var1    Var2
    ____    ____
    996     997
      4       3
[h,p,stats] = fishertest(t)
h = logical
0
p = 1.0000
stats = struct with fields:
             OddsRatio: 0.7492
    ConfidenceInterval: [0.1673 3.3563]
Otherwise, I could use Fisher's exact test with an R×C contingency table, or the
However, does "small samples" refer to the size of "x" and "y", or to the size of the binned arrays "cx" and "cy"?
(in the specific case of my question "cx" and "cy" are composed of 50 bins)
My pleasure!
I have no idea what you are actually doing or what your data are. The only approach I can think of is to limit the bin number to guarantee that all the bins have whatever size you need them to have.
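The idea of limiting the bins until each one is large enough could be sketched as follows. This is only an illustration under stated assumptions: both datasets are binned on shared edges, adjacent bins are merged greedily from left to right, and the chi-square statistic is computed by hand on the merged 2-by-K table. The names `needTot`, `M` and `acc` are hypothetical, and this is not a built-in option of crosstab.

```matlab
rng default; % for reproducibility
x = 100.*round(lognrnd(1,1,1000,1));
y = 100.*round(lognrnd(0.88,1.1,1000,1));
edges = linspace(min([x; y]), max([x; y]), 51);      % shared edges (assumed)
O = [histcounts(x, edges); histcounts(y, edges)];    % 2-by-50 observed counts

minExpected = 5;              % rule-of-thumb threshold from the question
rowTot = sum(O, 2);
N = sum(rowTot);
% The smallest expected count in a column is min(rowTot)*colTot/N, so a
% column passes once its total reaches this value
needTot = minExpected * N / min(rowTot);

% Greedy left-to-right merge of adjacent bins
M = zeros(2, 0);              % merged table, built column by column
acc = zeros(2, 1);            % running sum of the bins being merged
for k = 1:size(O, 2)
    acc = acc + O(:, k);
    if sum(acc) >= needTot
        M(:, end+1) = acc;    %#ok<SAGROW>
        acc = zeros(2, 1);
    end
end
if any(acc)                   % leftover tail bins: fold into the last group
    if isempty(M)
        M = acc;
    else
        M(:, end) = M(:, end) + acc;
    end
end

% Pearson chi-square test on the merged table
E = sum(M, 2) * sum(M, 1) / N;
chi2 = sum((M(:) - E(:)).^2 ./ E(:));
df = (size(M, 1) - 1) * (size(M, 2) - 1);
p = chi2cdf(chi2, df, 'upper');
```

With data this skewed, equal-width bins collapse into only a few merged groups; edges based on quantiles of the pooled data would be another reasonable choice.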
Sim on 23 Aug 2024
Edited: Sim on 23 Aug 2024
Thanks a lot for your extra comment :-)
My "x" and "y" represent two datasets of "future stock prices", which follow a log-normal distribution... I updated the question with this information :-)
I am still not following what you are doing. (Also consider using histcounts2 since that will produce a matrix.)
Another option might be to use the friedman function. I am not familiar with the statistics of what you are doing, so I cannot make any specific recommendations.
My pleasure!
Since your data are not normally distributed, friedman may be the most appropriate, since like other nonparametric tests (explore them, such as ranksum as well), it only requires that the values to be compared share the same distribution, regardless of what that particular distribution is. I usually use it or other nonparametric analysis functions to compare lognormally distributed data, since most of what I deal with (physiological data) are lognormally distributed.
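For reference, a nonparametric comparison on the raw values avoids the binning question altogether. The sketch below uses ranksum on the example data from the question; it assumes "x" and "y" are independent samples, and whether it answers the original question depends on what exactly is to be compared.

```matlab
rng default; % for reproducibility
x = 100.*round(lognrnd(1,1,1000,1));
y = 100.*round(lognrnd(0.88,1.1,1000,1));

% Wilcoxon rank-sum test on the raw values: no bins, so no minimum
% expected count to satisfy
[p, h] = ranksum(x, y);
fprintf('ranksum: p = %.4g, reject H0 at the 5%% level: %d\n', p, h);
```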

Asked: Sim on 23 Aug 2024

Commented: on 23 Aug 2024
