
How can I remove outliers using the Mahalanobis distance?

Mooklada Chaisorn on 1 Sep 2020
I have a normalized data table of 3568 rows and 24 columns. I calculate the Mahalanobis distance for each row of data using the code below. But how can I use the Mahalanobis distances I found to remove outliers? Is there a rule of thumb, such as removing rows whose distance lies above a certain percentile? Please advise me, as I am trying to create several scenarios for my dataset.
For example,
  • scenario 0: only clean missing data, with no outlier removal
  • scenario 1: remove outliers using the mean method
  • scenario 2: remove outliers using the Mahalanobis distance
Thank you for all your help.
% DATA = 3568 x 24 table
% mahal needs a numeric matrix, so convert if DATA is a table
if istable(DATA), X = table2array(DATA); else, X = DATA; end
[n, m] = size(X);            % n rows (observations), m columns (variables)
d_mahal_DATA = zeros(n,1);   % leave-one-out squared Mahalanobis distances
format short e
for i = 1:n
    c = X(i,:);                      % row i only
    b = X([1:i-1, i+1:n], :);        % all rows except row i
    d_mahal_DATA(i) = mahal(c, b);   % squared distance of row i from the rest
end
d_mahal_DATA % size 3568 x 1
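For reference, here is a minimal sketch of what mahal computes, on synthetic data. The matrix X, the query point y, and the seed are assumptions made up for illustration, not the data from the question. mahal(Y,X) returns the squared Mahalanobis distance of each row of Y from the sample in X, which is why its output can be compared against a chi-squared distribution.

```matlab
rng(0);                          % fixed seed, illustration only
X = randn(200, 3);               % hypothetical reference sample: 200 rows, 3 variables
y = [1 2 3];                     % hypothetical query point

d2_builtin = mahal(y, X);        % squared Mahalanobis distance from mahal

mu = mean(X, 1);                 % sample mean (1 x 3)
S  = cov(X);                     % sample covariance (3 x 3)
d2_manual = (y - mu) / S * (y - mu)';   % same as (y-mu) * inv(S) * (y-mu)'

fprintf('mahal: %.6f, manual: %.6f\n', d2_builtin, d2_manual);
```

The two values should agree up to rounding. Note that X must have more rows than columns for the covariance to be invertible.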

Accepted Answer

Pratyush Roy on 4 Sep 2020
Outliers can be removed with the Mahalanobis distance by using p-values from a chi-squared distribution: for approximately multivariate normal data, the squared Mahalanobis distance (which is what mahal returns) approximately follows a chi-squared distribution whose degrees of freedom equal the number of variables.
The p-values for the Mahalanobis distance array ‘d_mahal_DATA’ can be computed using the function chi2cdf, available in Statistics and Machine Learning Toolbox.
P_val = chi2cdf(d_mahal_DATA,n,'upper') % upper-tail p-value; n denotes the degrees of freedom of the chi-squared distribution (the number of variables, not the number of rows).
Threshold this P_val array at a chosen significance level α.
If P_val(i) is less than α for some i, the ith observation is considered an outlier. The value of α can be varied to make the rejection rule stricter or more lenient.
A typical value of α is 0.05. n should match the number of columns of the data: 2 for two-dimensional data, and 24 for the table in the question.
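A minimal sketch of this chi-squared thresholding, assuming synthetic data with a few planted outliers. The matrix X, the shift of 8, and the seed are made up for illustration. For brevity it uses mahal(X,X), which measures each row against the full sample rather than the leave-one-out distances computed in the question.

```matlab
rng(1);                                   % fixed seed, illustration only
X = randn(500, 4);                        % hypothetical data: 500 rows, 4 variables
X(1:5, :) = X(1:5, :) + 8;                % plant a few obvious outliers

p     = size(X, 2);                       % degrees of freedom = number of variables
alpha = 0.05;                             % significance level (assumed)

d2    = mahal(X, X);                      % squared distance of every row to the sample
pval  = chi2cdf(d2, p, 'upper');          % upper-tail probability
isOut = pval < alpha;                     % same rule as d2 > chi2inv(1 - alpha, p)

Xclean = X(~isOut, :);                    % keep only rows that were not flagged
fprintf('Flagged %d of %d rows\n', nnz(isOut), size(X, 1));
```

Even on clean normal data, roughly a fraction alpha of rows is expected to be flagged, so a smaller alpha (for example 0.01 or 0.001) gives a more conservative rule.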
You can go through the following documentation link for further details:
  1 Comment
Mooklada Chaisorn on 6 Sep 2020
Thank you so much! I'll try the approach you suggest. First, though, I tried using isoutlier with percentiles [0, 95]:
dmh = array2table(d_mahal_DATA);
lowPercent = 0;
highPercent = 95;
[outlierInd,pLow,pHigh,~] = isoutlier(dmh.d_mahal_DATA,"percentiles",[lowPercent, highPercent]);
T_mahal = DATA(~outlierInd,:); % size 3390 x 24 table
I'm not sure whether this removes too many rows from DATA.
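One way to look at this: a [0, 95] percentile rule flags about 5% of the rows by construction, whether or not the data contain outliers (3568 to 3390 is 178 rows, about 5%). The sketch below compares it with a chi-squared cutoff on synthetic data that has no planted outliers. The matrix X, the seed, and the choice alpha = 0.001 are assumptions for illustration only.

```matlab
rng(2);                                   % fixed seed, illustration only
X  = randn(2000, 5);                      % hypothetical clean data, no planted outliers
d2 = mahal(X, X);                         % squared Mahalanobis distances

% Percentile rule: everything above the 95th percentile is flagged
outPct = isoutlier(d2, 'percentiles', [0 95]);

% Chi-squared rule with a stricter, assumed significance level
alpha  = 0.001;
outChi = d2 > chi2inv(1 - alpha, size(X, 2));

fprintf('percentile rule:  %.1f%% flagged\n', 100 * mean(outPct));
fprintf('chi-squared rule: %.1f%% flagged\n', 100 * mean(outChi));
```

If the two rules flag very different fractions on the real data, that difference is a rough indication of how many of the rows removed by the percentile rule are ordinary points rather than outliers.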
