Classification by logistic regression
I am a new learner in the field of classification, and I am stuck on a problem while implementing logistic regression:
My data set consists of about 300 measurements with 20 features. I implemented a logistic regression model using glmfit and obtained the predicted probabilities (Y). Next, I used the model output (Y) to generate an ROC curve, which gives me the sensitivity and specificity of the model.
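The workflow described above can be sketched as follows. This is a minimal, hypothetical sketch assuming X is a 300-by-20 predictor matrix and y is a 0/1 response vector (neither is given in the post):

```matlab
% Fit a logistic regression model (binomial family, logit link)
b = glmfit(X, y, 'binomial', 'link', 'logit');

% Predicted probabilities on the same data used for fitting
scores = glmval(b, X, 'logit');

% ROC curve: perfcurve returns 1-specificity (fpr) vs sensitivity (tpr)
[fpr, tpr, ~, auc] = perfcurve(y, scores, 1);
plot(fpr, tpr);
xlabel('1 - specificity'); ylabel('sensitivity');
```

Note that because the scores here come from the training data, the resulting ROC curve is optimistic, which is exactly the concern raised in question (1) below.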
(1) I am using the entire data set for both training and testing. Is that correct? If not, how can I validate my model? Is there a way to know whether I am overfitting by using all the features?
(2) I have tried to implement k-fold cross-validation (k = 10) by running logistic regression and computing the sensitivity/specificity on the test set 10 times. But my concern is that I am creating a new model for each of the 10 training sets, so in the end I do not have a single classifier.
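One way to sketch this 10-fold procedure while still obtaining a single cross-validated ROC curve is to pool the held-out predictions across folds. A hedged sketch, again assuming X and y as above:

```matlab
% 10-fold cross-validation, pooling held-out predictions so that one
% ROC curve can be computed from out-of-sample scores only.
c = cvpartition(y, 'KFold', 10);   % stratified when y is a class label
scores = zeros(size(y));
for k = 1:c.NumTestSets
    tr = training(c, k);
    te = test(c, k);
    b = glmfit(X(tr,:), y(tr), 'binomial', 'link', 'logit');
    scores(te) = glmval(b, X(te,:), 'logit');   % out-of-fold predictions
end
[fpr, tpr, ~, auc] = perfcurve(y, scores, 1);   % cross-validated ROC
```

The 10 per-fold models are only used to estimate generalization performance; the single classifier you deploy can then be refit once on all 300 observations.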
Thanks,
Vikrant
Accepted Answer
Ilya
on 28 Dec 2011
Because logistic regression is a simple linear model and because you have 10 times as many observations as predictors, the classification error measured on the training set should not be far off the true value. Even so, it is best to validate your model on data not used for training. 300 observations are not a lot, so you would likely be better off cross-validating the classification error and ROC curve.
10-fold stratified cross-validation is a good rule of thumb. This is what you get from function CROSSVAL by default. Several runs of 10-fold cross-validation would be even better.
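As a minimal sketch of what CROSSVAL provides (the predictor matrix X and 0/1 response y are assumptions, not given in the thread), the 'mcr' option returns a cross-validated misclassification rate when you supply a prediction function:

```matlab
% Cross-validated misclassification rate with crossval (10-fold by default).
% predfun fits logistic regression on the training fold and classifies
% the test fold at a 0.5 probability threshold.
predfun = @(Xtr, ytr, Xte) ...
    double(glmval(glmfit(Xtr, ytr, 'binomial', 'link', 'logit'), ...
                  Xte, 'logit') > 0.5);
cvErr = crossval('mcr', X, y, 'Predfun', predfun);
```

Passing a cvpartition object built from the class labels via the 'partition' option gives stratified folds.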
The Hosmer-Lemeshow goodness-of-fit test is often used for logistic regression models. It is described in many places.
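There is no built-in Hosmer-Lemeshow function in the toolbox, but the test is short to hand-roll. A hypothetical sketch, assuming scores holds the fitted probabilities and y the 0/1 outcomes: sort observations by predicted probability, split them into g = 10 groups, and compare observed and expected event counts with a chi-square statistic.

```matlab
% Hand-rolled Hosmer-Lemeshow test (sketch, not a toolbox function).
g = 10;
[p, idx] = sort(scores);                  % sort by fitted probability
ys = y(idx);
edges = round(linspace(0, numel(p), g + 1));
H = 0;
for k = 1:g
    in = edges(k)+1 : edges(k+1);         % observations in group k
    O  = sum(ys(in));                     % observed events
    E  = sum(p(in));                      % expected events
    n  = numel(in);
    H  = H + (O - E)^2 / (E * (1 - E/n));
end
pval = 1 - chi2cdf(H, g - 2);             % chi-square with g-2 df
```

A large p-value indicates no evidence of lack of fit; a small one suggests the fitted probabilities are poorly calibrated.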
You can use SEQUENTIALFS (with cross-validation) to see if you need all predictors.
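A sketch of sequential feature selection with cross-validation, under the same assumptions about X and y (here with backward elimination, which the follow-up comment below recommends for small data sets):

```matlab
% Backward sequential feature selection, scored by 10-fold
% cross-validated misclassification count.
c = cvpartition(y, 'KFold', 10);
critfun = @(Xtr, ytr, Xte, yte) ...
    sum(yte ~= (glmval(glmfit(Xtr, ytr, 'binomial'), Xte, 'logit') > 0.5));
[keep, history] = sequentialfs(critfun, X, y, ...
                               'cv', c, 'direction', 'backward');
% keep is a logical mask over the 20 columns of X
```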
Logistic regression and cross-validation are described in many textbooks, by the way.
2 comments
Ilya
on 29 Dec 2011
It is best to gain some understanding of the theory and then look at demos and documentation examples in the Statistics Toolbox.
The doc page for glmfit has a few references at the bottom. Cross-validation is discussed, for example, in Elements of Statistical Learning by Hastie, Tibshirani & Friedman. I don't have a good reference for sequential feature selection, but examples on the doc page for sequentialfs should suffice. For small data with not too many predictors, I would recommend backward elimination.
More Answers (0)