How to divide data into train/valid/test sets so that at least one sample from every class is selected?

Hello all,
I am trying to partition a dataset into training and test sets so that at least one sample from every class appears in both the training set and the test set.
To do this, I call cvpartition inside a loop and, on each iteration, compare the classes found in each test fold against the full list of classes to check whether every class has been selected. This is what I have done so far:
s2 = data(:,1); % target vector in data
s2_1 = unique(data(:,1)); % unique class labels
for m = 1 : 1000
    cv = cvpartition(data(:,1),'KFold',5,'Stratify',false);
    for i = 1:cv.NumTestSets
        testClasses = s2(cv.test(i));
        [~,~,idx] = unique(testClasses);
        nCount{i} = accumarray(idx(:),1); % per-class sample counts in test fold i
    end
    for n = 1 : 5
        if length(nCount{1,n})==length(s2_1)
            break
        end
    end
end
There's a problem with the break statement here, but I can work that out. The major problems are that I don't get a proper result, and that I am uncertain about the maximum number of iterations (e.g. 1000) to run.
I hope I have explained the problem clearly.
Thanks in advance.

Accepted Answer

Set the 'Stratify' option to true.
cv = cvpartition(data(:,1),'KFold',5,'Stratify',true);

6 Comments

I have tried that too, but it didn't work. I was also getting this warning message:
Warning: One or more folds do not contain points from all the groups.
So I didn't continue.
I can try to debug more if you share data(:,1) or s2.
Also, despite the warning, 'Stratify',true attempts to ensure that all groups are represented in all folds.
Sure, I have attached it.
Thanks a lot.
Hi, I ran the code and couldn't find any issue with your data, except that the class distribution is not uniform. So when you divide it into stratified folds, the per-class counts are not always the same across folds. Stratification still tries to represent as many of the classes as possible in each fold. You can run this code to compare the class distributions across folds:
s2_1 = unique(s2); % unique class labels
figure; hist(s2,s2_1); title('Overall Histogram');
cv = cvpartition(s2,'KFold',5,'Stratify',true);
for i = 1:cv.NumTestSets
    disp(['Fold ',num2str(i)]);
    trainClasses = s2(cv.training(i));
    testClasses = s2(cv.test(i));
    figure;
    % class histogram of the training part of fold i
    subplot(2,1,1);hist(trainClasses,s2_1);
    ylabel('training');
    title(['fold: ' num2str(i)]);
    % class histogram of the test part of fold i
    subplot(2,1,2);hist(testClasses,s2_1);
    ylabel('test');
    [C,~,idx] = unique(testClasses);
    nCount{i} = accumarray(idx(:),1); % per-class sample counts in test fold i
end
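One possible reason for the warning, offered as a guess rather than a diagnosis of the attached data: a class with fewer samples than the number of folds cannot appear in every test fold, no matter how the partition is drawn. The sketch below uses made-up toy labels (the vector s2, the value K and the variable names missing and tooSmall are assumptions for illustration, not taken from the attached file) to list which classes are absent from each test fold and which classes are too small for K folds.

```matlab
% Toy labels (hypothetical): class 3 has only 2 samples, fewer than K = 5 folds
s2 = [ones(20,1); 2*ones(15,1); 3*ones(2,1)];
K = 5;
classes = unique(s2);

rng(0); % fix the random partition so the run is repeatable
cv = cvpartition(s2,'KFold',K,'Stratify',true);

% missing(c,i) is true when class classes(c) has no sample in test fold i
missing = false(numel(classes), cv.NumTestSets);
for i = 1:cv.NumTestSets
    testClasses = s2(cv.test(i));
    missing(:,i) = ~ismember(classes, testClasses);
end

% Classes with fewer than K samples can never be present in all K test folds
[~,~,idx] = unique(s2);
classCounts = accumarray(idx,1);
tooSmall = classes(classCounts < K);

disp('Classes with fewer samples than folds:');
disp(tooSmall');
disp('Missing-class matrix (rows: classes, columns: test folds):');
disp(missing);
```

If tooSmall is non-empty for the real data, repeating the partition in a loop will not help; fewer folds, or merging or dropping the rare classes, would be the options to consider.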
