How to best determine the probability of a distribution given an outlying observation?

Hi,
I have a classification problem. I have a set of data from a reference process (let's call that "known") and a set of data from a second process (let's call that "test").
Hypothesis 0 is that the test sample came from the same process as the "known" one and will therefore have the same distribution.
Hypothesis 1 is that the test sample came from a different process. However, here is the catch: this process has a distribution identical to the "known" one for all but one sample; just one sample will be "suspiciously" low.
I will add a picture to better explain:
In this case, the red histogram is the reference "known" distribution and the blue histogram is the questioned "test" distribution. Here I already know that the test came from a different process. It might not be completely clear due to the overlap, but the distributions match quite well, except for a single blue sample which is suspiciously low.
What I need now is to take each distribution and work out some method of returning the probability that the extremely low blue value would be observed if the distribution really were the "known" distribution. I know how to calculate the probability of a particular single observation, but how do I properly balance this with the number of observations? Would a KS test alone be appropriate? It strikes me as stats 101, but it's been a while, and I don't want to get this wrong.
Thanks in advance.

Accepted Answer

Ilya on 12 Sep 2012
Edited: Ilya on 12 Sep 2012
If you know the reference distribution analytically, you can compute its cdf at the smallest observed value. Suppose this cdf value is p. The p-value for your test would then be one minus the binomial probability of observing no successes in N trials, where N is the sample size and p is the success probability. That is, it would be 1-(1-p)^N.
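For example, if the reference distribution is fitted as a normal, a minimal sketch might look like this (the data and variable names are only illustrative, and normcdf requires the Statistics Toolbox):
KnownSet = randn(1000,1);                   % illustrative reference ("known") sample
TestSet  = randn(100,1);                    % illustrative questioned ("test") sample
mu    = mean(KnownSet);                     % fit the reference distribution
sigma = std(KnownSet);                      % (assumed normal here)
p     = normcdf(min(TestSet), mu, sigma);   % cdf value at the smallest test observation
N     = numel(TestSet);                     % test sample size
pval  = 1 - (1 - p)^N                       % probability of at least one value this low in N draws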
  1 comment
Tim on 19 Sep 2012
Oh, so obvious now! Thank you. I was over-thinking it with the variance of the variance and all that jazz. My only excuses are lack of sleep and rusty stats - honestly, I avoid them when I can.


More Answers (1)

per isakson on 12 Sep 2012
See: FBD - "Find the Best Distribution" tool in the File Exchange
  1 comment
Tim on 12 Sep 2012
Thanks for your answer, per, but I'm not sure that this is what I'm looking for. I'll try to clarify with a simple code example.
KnownSet = randn(1000,1);        % reference sample
TestSet1 = randn(100,1);         % test sample from the same distribution
TestSet2 = [randn(99,1); -4];    % test sample with one artificial outlier at -4
In this case, I know all three sets of data are mostly drawn from the same Gaussian distribution. However, TestSet2 has an outlier. The value -4 is very unlikely, and I'm hoping to use that single outlying value to provide a probability that each TestSet is purely from the same distribution as KnownSet. In this case, TestSet1 should have a high 'p-value', while TestSet2 should have a low 'p-value' and be rejected. I use the term p-value, but there may be a more appropriate measure.
FBD would help me determine the distribution of KnownSet (which I can assume is at least for the most part the same as that of the TestSets), but that is only the first step. How do I go from there to determining how likely/unlikely the set of observations is, given the distribution, and given the outlier?
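For illustration, here is a minimal sketch of how the binomial correction from the accepted answer could be applied to these sets, assuming the reference distribution is fitted as a normal from KnownSet (normcdf requires the Statistics Toolbox):
mu    = mean(KnownSet);                     % fit the reference distribution
sigma = std(KnownSet);                      % (assumed normal)
p1 = normcdf(min(TestSet1), mu, sigma);     % cdf at the smallest value of each test set
p2 = normcdf(min(TestSet2), mu, sigma);
pval1 = 1 - (1 - p1)^numel(TestSet1)        % typically not small: TestSet1 is not rejected
pval2 = 1 - (1 - p2)^numel(TestSet2)        % small: the -4 outlier flags TestSet2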
