How to remove outliers?
Daniel Shub on 9 Jan 2012
If you haven't decided how you are going to deal with outliers before inspecting your data, then don't remove them. If you do, you are going down the road of: "We looked at our entire data set and didn't see the effect we wanted, so we only analyzed the subset of the data that showed what we wanted."
Walter Roberson on 9 Jan 2012
There is no specific function that I know of. Although there are some common algorithms for removing outliers, there is substantial disagreement about which algorithms should be used, and what constitutes an outlier tends to change from situation to situation and with interpretation of the situation.
Richard Willey on 9 Jan 2012
MATLAB doesn't provide a specific function to remove outliers. In general you have a couple of different options to deal with outliers.
1. You can create an index that flags potential outliers and either delete them from your data set or substitute more plausible values.
2. You can use robust techniques like robust regression which are less sensitive to the presence of outliers.
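As a rough illustration of option 2, here is a minimal sketch using robustfit from the Statistics Toolbox. The data, the seed, the size of the contamination and the 0.5 weight cutoff are all made up for the example and are not recommendations.

```matlab
% Hypothetical data: a linear trend with five gross outliers
rng(1966);                        % arbitrary seed, only for repeatability
x = (1:100)';
y = 3*x + 2 + randn(100,1);
y(11:20:91) = y(11:20:91) + 50;   % contaminate five observations

% Ordinary least squares for comparison: b_ols = [intercept; slope]
b_ols = [ones(size(x)) x] \ y;

% Robust fit; robustfit adds the intercept itself and uses
% bisquare weighting by default
[b_rob, rstats] = robustfit(x, y);

% Observations with small final weights were effectively discounted.
% The 0.5 cutoff is an arbitrary choice for this illustration.
downweighted = find(rstats.w < 0.5);
```

Comparing b_ols with b_rob shows how much the contaminated points pull on the ordinary fit, and downweighted lists the observations the robust fit largely ignored. Nothing is deleted from the data set.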
Your choice of strategy will depend a lot on what you know about the data set. For example, if you have a lot of data points that are coded with a value like -9999, these are probably error codes of some kind rather than actual numeric information.
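For the error-code case, the flag-and-delete or flag-and-substitute approach might look like the sketch below. The measurements and the variable names are hypothetical, and -9999 is assumed to be the only error code.

```matlab
% Hypothetical measurements where -9999 marks a missing reading
data = [12.1; 11.8; -9999; 12.4; -9999; 11.9];
sentinel = -9999;                 % assumed error code

% Logical index that flags the coded entries
is_bad = (data == sentinel);

% Option A: delete the flagged observations
cleaned = data(~is_bad);

% Option B: keep the positions but mark them as missing
marked = data;
marked(is_bad) = NaN;

% Both routes give the same mean over the valid readings
mean_cleaned = mean(cleaned);
mean_marked  = mean(marked(~isnan(marked)));
```

Option B is usually the safer choice when the position of each observation matters, for example when the rows line up with a time vector.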
I'm including some simple example code which shows a standard technique to detect outliers.
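% Create a random stream with a fixed seed and make it the global stream
% so that the randn calls below are reproducible
s = RandStream('mt19937ar','Seed',1966);
RandStream.setGlobalStream(s);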
% Create a vector of X values
X = 1:100;
X = X';
% Create a noise vector
noise = randn(100,1);
% Create a second noise value where sigma is much larger
noise2 = 10*randn(100,1);
% Replace the entries of noise at obs# (11, 31, 51, 71, 91) with noise2
% Many of these points will have an undue influence on the model
noise(11:20:91) = noise2(11:20:91);
% Specify Y = F(X)
Y = 3*X + 2 + noise;
% Cook's Distance for a given data point measures the extent to
% which a regression model would change if this data point
% were excluded from the regression. Cook's Distance is
% sometimes used to suggest whether a given data point might be an outlier.
% Use regstats to calculate Cook's Distance
stats = regstats(Y,X,'linear');
% Cook's Distance > 4/n is a typical threshold that is used to suggest
% the presence of an outlier
potential_outlier = stats.cookd > 4/length(X);
% Display the index of potential outliers and graph the results
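One way the display-and-plot step could be written is sketched below. It rebuilds the same kind of data so that it runs by itself. The variable outlier_idx and the plot styling are my own choices, not part of the original code.

```matlab
% Rebuild the example data (same recipe as above, condensed)
rng(1966);                                  % arbitrary seed
X = (1:100)';
noise = randn(100,1);
noise2 = 10*randn(100,1);
noise(11:20:91) = noise2(11:20:91);
Y = 3*X + 2 + noise;

% Cook's Distance from regstats (Statistics Toolbox)
stats = regstats(Y,X,'linear');
potential_outlier = stats.cookd > 4/length(X);

% Display the indices of the flagged observations
outlier_idx = find(potential_outlier);
disp(outlier_idx')

% Plot all observations, circle the flagged ones, and overlay
% the least-squares line (stats.beta = [intercept; slope])
figure
plot(X, Y, 'b.')
hold on
plot(X(potential_outlier), Y(potential_outlier), 'ro')
plot(X, stats.beta(1) + stats.beta(2)*X, 'k-')
hold off
xlabel('X')
ylabel('Y')
legend('Data', 'Cook''s D > 4/n', 'Linear fit', 'Location', 'NorthWest')
```

Points circled in red are only candidates. Whether to delete them, replace them or leave them alone is still a judgment call about the data.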