Classification by logistic regression

Vikrant on 28 Dec 2011
I am new to classification and am stuck on a problem while implementing logistic regression:
My data set consists of about 300 measurements with 20 features each. I fit a logistic regression model using glmfit and obtained the predicted probabilities (Y). I then use the model output (Y) to generate a ROC curve, which gives me the sensitivity and specificity of the model.
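For reference, the pipeline described above might look like the following sketch (the variable names X and y are illustrative, not from the thread; X is assumed to be a 300-by-20 feature matrix and y a 300-by-1 vector of 0/1 labels):

```matlab
% Fit a logistic regression model and plot its ROC curve.
b = glmfit(X, y, 'binomial', 'link', 'logit');  % fit coefficients
scores = glmval(b, X, 'logit');                 % predicted probabilities
[fpr, tpr, ~, auc] = perfcurve(y, scores, 1);   % ROC curve and AUC
plot(fpr, tpr);
xlabel('1 - specificity'); ylabel('sensitivity');
```

Note that here the ROC curve is computed on the same data used for fitting, which is exactly the validation concern raised in the questions below.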
(1) I am using the entire data set for both training and testing. Is that correct? If not, how can I validate my model? Is there a way to know whether I am overfitting by using all the features?
(2) I have tried to implement k-fold cross-validation (k = 10) by running logistic regression and computing the sensitivity/specificity on the test set 10 times. My concern is that I am creating a new model for each of the 10 training sets, so in the end I do not have a single classifier.
Thanks,
Vikrant

Accepted Answer

Ilya on 28 Dec 2011
Because logistic regression is a simple linear model and you have 15 times as many observations as predictors, the classification error measured on the training set should not be far off the true value. Even so, it is best to validate your model on data not used for training. 300 observations is not a lot, so you would likely be better off cross-validating the classification error and the ROC curve.
10-fold stratified cross-validation is a good rule of thumb. This is what you get from function CROSSVAL by default. Several runs of 10-fold cross-validation would be even better.
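One possible way to set this up, assuming the same X and y as above (the function handle and variable names are illustrative; CVPARTITION stratifies by class when given the label vector):

```matlab
% 10-fold stratified cross-validation of the misclassification count.
cp = cvpartition(y, 'KFold', 10);   % stratified by the class labels in y
classf = @(Xtr, ytr, Xte, yte) ...
    sum(yte ~= (glmval(glmfit(Xtr, ytr, 'binomial'), Xte, 'logit') > 0.5));
cvErr = crossval(classf, X, y, 'partition', cp);  % errors per fold
cvErrRate = sum(cvErr) / sum(cp.TestSize);        % overall CV error rate
```

Averaging the per-fold errors this way estimates the generalization error of the modeling procedure; the 10 fold models are only a means to that estimate, which addresses question (2): for the final classifier you refit once on all the data.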
The Hosmer-Lemeshow goodness-of-fit test is often used for logistic regression models. It is described in many places.
You can use SEQUENTIALFS (with cross-validation) to see whether you need all the predictors.
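A minimal sketch of that idea, reusing the cross-validated misclassification count as the selection criterion (again, names are assumed, not from the thread):

```matlab
% Sequential feature selection driven by cross-validated error.
critfun = @(Xtr, ytr, Xte, yte) ...
    sum(yte ~= (glmval(glmfit(Xtr, ytr, 'binomial'), Xte, 'logit') > 0.5));
opts = statset('Display', 'iter');
[inmodel, history] = sequentialfs(critfun, X, y, ...
    'cv', cvpartition(y, 'KFold', 10), 'options', opts);
% inmodel is a logical mask of the selected columns of X
```

Passing `'direction', 'backward'` would give the backward elimination mentioned in the comments below.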
Logistic regression and cross-validation are described in many textbooks, by the way.
2 comments
Vikrant on 28 Dec 2011
Thanks Ilya. I'll explore some of the functions you suggested.
If you can point out some references where I can find implementation details, that would be great!
Once again, thank you for the reply!
Ilya on 29 Dec 2011
It is best to gain some understanding of the theory first and then look at the demos and documentation examples in the Statistics Toolbox.
The doc page for glmfit has a few references at the bottom. Cross-validation is discussed, for example, in The Elements of Statistical Learning by Hastie, Tibshirani & Friedman. I don't have a good reference for sequential feature selection, but the examples on the doc page for sequentialfs should suffice. For small data with not too many predictors, I would recommend backward elimination.
