Create k-fold Cross Validation with Undersampling for highly imbalanced Dataset

Dario Walter on 4 Aug 2020
Dear Community,
I am not sure how to implement the following requirement. When I use undersampling for my supervised machine learning algorithm, how can I ensure that each of the k folds reflects the class distribution of the original dataset? The performance metric (e.g. PR AUC) should refer to the original distribution and not to the distribution of the undersampled set.
It does not make sense to simply perform k-fold cross-validation on the undersampled dataset alone.
Your help is highly appreciated!

Answers (1)

Anshika Chaurasia on 14 Aug 2020
Hi Dario,
It is my understanding that you want the k folds (cross-validation) to preserve the class distribution of the original, imbalanced dataset. The solution is stratified k-fold cross-validation.
  • Use the cvpartition function with the 'Stratify' option and refer to the cvpartition documentation for more information; a usage sketch follows this list.
c = cvpartition(group,'KFold',k,'Stratify',stratifyOption)
  • You can also try the following File Exchange submissions as drop-in replacements for cvpartition:
  1. Distribution-balanced stratified cross-validation
  2. Stratified cross-validation for multi-label datasets
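If you also want to undersample only the training portion of each fold while scoring on the untouched test fold (so the PR AUC refers to the original distribution, as you described), a minimal sketch along those lines is below. It assumes a binary problem with features X, categorical labels y whose minority class is named "pos", and fitcensemble chosen only as an example classifier; adapt the names and the model to your data.
% Sketch: stratified k-fold CV; only the TRAINING part of each fold is
% undersampled, the test fold keeps the original class balance.
rng(1);                                          % reproducibility
k = 5;
c = cvpartition(y,'KFold',k,'Stratify',true);    % stratified folds
aucPR = zeros(k,1);
for i = 1:k
    trIdx = training(c,i);
    teIdx = test(c,i);
    % Undersample the majority class in the training fold only
    posIdx  = find(trIdx & y == "pos");          % minority class
    negIdx  = find(trIdx & y ~= "pos");          % majority class
    negKeep = negIdx(randperm(numel(negIdx), numel(posIdx)));
    keep    = [posIdx; negKeep];
    mdl = fitcensemble(X(keep,:), y(keep));      % any classifier works here
    % Evaluate on the untouched, original-distribution test fold
    [~,score] = predict(mdl, X(teIdx,:));
    posCol = (mdl.ClassNames == "pos");
    [~,~,~,aucPR(i)] = perfcurve(y(teIdx), score(:,posCol), "pos", ...
        'XCrit','reca', 'YCrit','prec');         % area under PR curve
end
fprintf('Mean PR AUC over %d folds: %.3f\n', k, mean(aucPR));
Because the held-out fold is never undersampled, the reported PR AUC reflects the original class distribution rather than the balanced training distribution.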
