Training and group matrices for classifying data
I am getting this error when trying to classify matrix data:
Error using classify (line 220) TRAINING must have more observations than the number of groups.
My classification data matrix is 10x5, my training matrix is 2x5, and my group vector is of length 2:
classificationFeatureValues =
1.0e+004 *
0.0006 0.0761 0.0065 3.7003 0.0113
0.0005 0.0683 0.0063 3.3502 0.0114
0.0006 0.0761 0.0065 3.7003 0.0113
0.0005 0.0683 0.0063 3.3502 0.0114
0.0006 0.0761 0.0065 3.7003 0.0113
0.0005 0.0683 0.0063 3.3502 0.0114
0.0006 0.0761 0.0065 3.7003 0.0113
0.0005 0.0683 0.0063 3.3502 0.0114
0.0006 0.0761 0.0065 3.7003 0.0113
0.0005 0.0683 0.0063 3.3502 0.0114
training =
1.0e+004 *
0.0005 0.0683 0.0063 3.3502 0.0113
0.0006 0.0761 0.0065 3.7003 0.0114
group =
1 2
I can't seem to find the error here...
Steve
Accepted Answer
Greg Heath
on 27 Jul 2012
Edited: Greg Heath on 27 Jul 2012
In order to use CLASSIFY, each class should have enough training points, Ntrni (i=1:c), to obtain accurate estimates of the mean and covariance matrix. The typical rule of thumb is that the number of vector measurements is much greater than the number of estimated parameters. For each class of p-dimensional vectors the Bayesian quadratic classifier requires
Ntrni >> numel(mean) + numel(cov) = p + p*(p+1)/2 = p*(p+3)/2
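As a quick arithmetic check, the parameter count above can be evaluated directly (a Python sketch of the formula in the answer; the function name is illustrative, not toolbox code):

```python
def quad_params(p):
    """Free parameters of one class-conditional Gaussian in p dimensions:
    p mean components plus p*(p+1)/2 distinct covariance entries,
    which simplifies to p*(p+3)/2."""
    return p + p * (p + 1) // 2
```

For the 4-dimensional iris features this gives 14 parameters per class.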
In addition, each class should have enough testing points Ntsti, to obtain accurate performance estimates on nontraining data (generalization). For classification, the errors are assumed to be Binomially distributed with approximate standard deviation
stdvei = sqrt(ei*(1-ei)/Ntsti ), ~0.05 <= ei <= ~0.95
It is desirable that stdvei << ei. Since max(ei*(1-ei)) = 0.25, the typical rule of thumb for ei > ~0.05 is
Ntsti >> 19 >= (1-ei)/ei
For smaller errors Ntsti should be larger. Check a stats handbook for a more accurate estimate of stdvei for ei < 0.05.
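The binomial standard deviation above is easy to evaluate numerically (a Python sketch of the formula, valid roughly for 0.05 <= e <= 0.95 as stated):

```python
import math

def stdve(e, n_test):
    """Approximate standard deviation of a binomial test-error estimate e
    measured on n_test held-out points: sqrt(e*(1-e)/n_test)."""
    return math.sqrt(e * (1 - e) / n_test)
```

For example, an estimated 10% error rate measured on 100 test points carries a standard deviation of about 0.03, i.e. 30% of the estimate itself.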
If N is not large enough to obtain an adequate Ntrni/Ntsti division, cross-validation or bootstrapping should be used.
For 10-fold cross-validation of the 3-class Fisher iris data, with 50 4-dimensional inputs per class, Ntrni = 45 and the ratio per class is
r = 2*Ntrni/(p*(p+3)) = 90/(4*7) = 3.2
For the Bayesian linear classifier, the pooled covariance matrix is estimated yielding
3*Ni >> 3*p + p*(p+1)/2 = p*(p+7)/2
r = 6*Ni/(p*(p+7)) = 270/(4*11) = 6.1
For an LMSE (e.g., backslash) linear classifier
Ni >> p + 1
r = 45/5 = 9
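The three ratios quoted above for the iris example (p = 4, Ntrni = 45, 3 classes) can be reproduced with a small helper (a Python sketch; the function name and defaults are illustrative only):

```python
def ratios(p=4, Ntr=45):
    """Training-points-to-parameters ratios for the iris example:
    quadratic (per-class covariance), pooled linear (3 classes sharing
    one covariance), and LMSE (backslash) classifiers."""
    r_quad   = 2 * Ntr / (p * (p + 3))    # Ntrni vs p*(p+3)/2 parameters
    r_pooled = 6 * Ntr / (p * (p + 7))    # 3*Ni vs p*(p+7)/2 parameters
    r_lmse   = Ntr / (p + 1)              # Ni vs p+1 weights
    return r_quad, r_pooled, r_lmse
```

The returned values are approximately 3.2, 6.1, and 9, matching the figures in the answer.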
Therefore I suggest that you use
1. Raw data (i.e., NOT medians or means)...increases Ni
2. A backslash LMSE classifier... decreases no. of estimated parameters
W*[ones(1,Ntrn);traininput] = target % columns of eye(c) for c classes
W = target/[ones(1,Ntrn);traininput] % Ntrn = sum(Ntrni)
output = W*[ones(1,size(input,2));input]
3. Bootstrapping or crossvalidation
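A NumPy sketch of the backslash LMSE classifier in suggestion 2 might look like the following (the helper names and toy data are made up; np.linalg.lstsq plays the role of MATLAB's / here):

```python
import numpy as np

def lmse_train(X, labels, c):
    """Fit W minimizing ||W*[ones; X] - target||^2, the NumPy analogue of
    MATLAB's W = target/[ones(1,Ntrn); traininput]."""
    N = X.shape[1]                               # columns are observations
    A = np.vstack([np.ones((1, N)), X])          # augmented input, (p+1, N)
    T = np.eye(c)[:, labels]                     # one-hot targets, (c, N)
    # W @ A ~ T in least squares  <=>  A.T @ W.T ~ T.T
    return np.linalg.lstsq(A.T, T.T, rcond=None)[0].T

def lmse_classify(W, X):
    N = X.shape[1]
    out = W @ np.vstack([np.ones((1, N)), X])    # linear outputs, (c, N)
    return np.argmax(out, axis=0)                # winning class per column

# Toy two-class data (made up): three points per class, well separated
Xtr = np.array([[0., 0., 1., 5., 5., 6.],
                [0., 1., 0., 5., 6., 5.]])
labels = np.array([0, 0, 0, 1, 1, 1])
W = lmse_train(Xtr, labels, 2)
pred = lmse_classify(W, Xtr)
```

Because the weight matrix has only c*(p+1) entries, this needs far fewer training points per class than estimating covariance matrices.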
Hope this helps.
Greg
More Answers (4)
Ilya
on 24 Jul 2012
classify needs to estimate either the pooled covariance matrix (for a linear discriminant) or a covariance matrix for each class (for a quadratic discriminant). You can't estimate covariance if you have one observation per class; what is the observed variance of a single observation?
With so little data, you should use all of it for training and estimate classification error by cross-validation. If you have R2011b or later, I would recommend ClassificationDiscriminant for an easier workflow.
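As a rough illustration of leave-one-out cross-validation on a small sample, here is a NumPy sketch using a nearest-class-mean rule as a stand-in for a discriminant (the helper and the toy data are hypothetical, not toolbox code):

```python
import numpy as np

def loo_nearest_mean(X, y):
    """Leave-one-out error of a nearest-class-mean classifier.
    X: (N, p), rows are observations; y: length-N labels in 0..c-1.
    Each class needs at least 2 observations so that holding one out
    still leaves something to compute a class mean from."""
    N = X.shape[0]
    errors = 0
    for i in range(N):
        keep = np.arange(N) != i                 # hold out observation i
        Xt, yt = X[keep], y[keep]
        means = np.array([Xt[yt == g].mean(axis=0) for g in np.unique(yt)])
        pred = np.argmin(((means - X[i]) ** 2).sum(axis=1))
        errors += pred != y[i]
    return errors / N

# Made-up data: four observations per class, clearly separated
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.],
              [5., 5.], [5., 6.], [6., 5.], [6., 6.]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
err = loo_nearest_mean(X, y)
```

Every observation serves as test data exactly once, so no separate test set has to be carved out of an already small sample.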
2 comments
Ilya
on 25 Jul 2012
Sorry, I couldn't understand what you are saying about your acquisition.
Given the signature
classify(SAMPLE,TRAINING,GROUP)
You cannot perform discriminant analysis when your TRAINING matrix has only one observation (row) per class (distinct value in GROUP). The more observations you have for training, the more accurate your model is going to be. Take a look at examples in classify help or doc to see how GROUP and TRAINING are formed.
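For illustration, a correctly shaped TRAINING/GROUP pair, with several observations per class, might look like this (a NumPy sketch with made-up numbers; classify itself is a MATLAB function, this only shows the shapes):

```python
import numpy as np

# A correctly shaped TRAINING/GROUP pair: several rows per class,
# one group label per training row (the values here are invented).
training = np.array([[1.0, 2.0],
                     [1.1, 1.9],
                     [0.9, 2.1],    # three observations of class 1
                     [5.0, 6.0],
                     [5.2, 5.8],
                     [4.8, 6.1]])   # three observations of class 2
group = np.array([1, 1, 1, 2, 2, 2])

assert training.shape[0] == group.shape[0]   # one label per row
n_per_class = {int(g): int((group == g).sum()) for g in np.unique(group)}
```

The point is that GROUP has one entry per row of TRAINING, and each distinct value in GROUP must label more than one row.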
Greg Heath
on 26 Jul 2012
For the quadratic classifier, CLASSIFY requires full rank covariance matrices for each group. However, for the linear classifier, it only requires the pooled covariance matrix to have full rank.
Neither of these conditions holds. If you combine the training and test data and use format short g, you will get
close all, clear all, clc
ClassificationFeatureValues = 1.0e+004 *[...
0.0006 0.0761 0.0065 3.7003 0.0113
0.0005 0.0683 0.0063 3.3502 0.0114
0.0006 0.0761 0.0065 3.7003 0.0113
0.0005 0.0683 0.0063 3.3502 0.0114
0.0006 0.0761 0.0065 3.7003 0.0113
0.0005 0.0683 0.0063 3.3502 0.0114
0.0006 0.0761 0.0065 3.7003 0.0113
0.0005 0.0683 0.0063 3.3502 0.0114
0.0006 0.0761 0.0065 3.7003 0.0113
0.0005 0.0683 0.0063 3.3502 0.0114 ]
Training = 1.0e+004 *[...
0.0005 0.0683 0.0063 3.3502 0.0113
0.0006 0.0761 0.0065 3.7003 0.0114]
group = [ 1 2 ]
format short g
x = [ClassificationFeatureValues; Training]
x =
6 761 65 37003 113
5 683 63 33502 114
6 761 65 37003 113
5 683 63 33502 114
6 761 65 37003 113
5 683 63 33502 114
6 761 65 37003 113
5 683 63 33502 114
6 761 65 37003 113
5 683 63 33502 114
5 683 63 33502 113
6 761 65 37003 114
If you look closely at the 12 5-dimensional data points, you will see that they collapse onto two points. The data is therefore essentially one-dimensional, and no formal classification is needed.
I usually recommend that, before classifier design, you should get a "feel" for the data via
1. plots and outlier checks
2. SVD condition and rank checks
3. Clustering
For example
>> svdx = svd(x)
svdx =
1.223e+005
24.571
0.69704
1.4925e-013
1.2842e-016
>> condx = cond(x)
condx = 9.5233e+020
>> tol = max(size(x)) * eps(norm(x))
tol = 1.7462e-010
>> rankx = rank(x,tol)
rankx = 3 % Too conservative
>> svdx/max(svdx)
ans =
1
0.00020092 % Essentially one-dimensional
5.6997e-006
1.2204e-018
1.0501e-021
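The same effective-rank check can be reproduced outside MATLAB; the NumPy sketch below rebuilds the 12 x 5 matrix from the two distinct measurement vectors in the thread and normalizes the singular values (the 1e-6 tolerance is a judgment call, not a standard):

```python
import numpy as np

# The two distinct measurement vectors from the thread, each repeated
# six times: the 12 x 5 combined matrix collapses onto these two points.
r1 = np.array([6., 761., 65., 37003., 113.])
r2 = np.array([5., 683., 63., 33502., 114.])
x = np.vstack([r1, r2] * 6)                      # 12 x 5

s = np.linalg.svd(x, compute_uv=False)           # singular values, descending
ratios = s / s.max()                             # normalized singular values
effective_rank = int((ratios > 1e-6).sum())      # count non-negligible directions
```

The second normalized singular value is on the order of 1e-4, confirming that the data is very nearly one-dimensional even though its formal rank is 2.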
Perhaps using your raw data will make the analysis more interesting.
Hope this helps.
Greg
steve
on 26 Jul 2012
2 comments
Ilya
on 26 Jul 2012
Again:
You cannot perform discriminant analysis when your TRAINING matrix has only one observation (row) per class (distinct value in GROUP).
If you type 'help classify', the very first example gives you:
load fisheriris
x = meas(51:end,1:2); % for illustrations use 2 species, 2 columns
y = species(51:end);
Could you please look at the content of y. There are two distinct values there, 'versicolor' and 'virginica'. These are classes. Rows 1:50 in x are for class 'versicolor', and so you have 50 observations for this class. Rows 51:100 are for class 'virginica', and you have 50 observations for that class too.
steve
on 31 Jul 2012
1 comment
Oleg Komarov
on 31 Jul 2012
Please use comments. Who are you addressing with this question? If it is a standalone question, open a new thread. However, this doesn't sound like a MATLAB question, and you may have better luck asking in math/stats forums.