Training a Neural Net on the entire dataset after model selection with K-fold Cross Validation: how can I overcome overfitting if I don't have a validation and test set?
Hi everyone, 
I am working on artificial neural networks for application in Movement Analysis. I started using Neural Networks this year and, following courses and posts on MATLAB Answers and the MATLAB community, I tried to implement a K-fold CV procedure to develop a model for movement classification.
SOME CONSIDERATIONS: My dataset is composed of 19 subjects repeating a movement pattern 20 times. This movement pattern is composed of 5 sequential phases, which are divided into 100 ms observations from 6 sensors. In order to divide the data into 3 independent TRAINING, VALIDATION and TEST SETS, I have to include all observations from a subject inside a single group.
I implemented the overall procedure, which I include at the end of this post. But now I have 2 questions:
1 - Looking at the useful examples from Prof. Greg Heath, I saw that R^2 is often used as a performance measure to evaluate models. However, I also read that it is typically recommended for regression problems. Is it possible to use it also in classification (see the sketch after this list)?
2 - After I get the results from my 10x10 iteration over initial weights and candidate hidden-neuron models, should I use the collected information to train the 'optimal' model found on the entire dataset? Or should I simply take the best model found, even though it never saw the N°val+N°tst samples? I ask this because I already tried to train the optimal model found on all my data, but of course if I don't specify a validation set, early stopping does not work and I fall into overfitting.
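To make question 1 concrete, this is roughly how the two kinds of measures compare (a minimal sketch; y stands for the network outputs, and pred/truth are illustrative names):
% y are the network outputs and targets the one-hot labels, both O x N:
R2 = 1 - mse(targets - y)/mean(var(targets',1));%  regression-style R^2
[~,pred]  = max(y,[],1);%                          predicted class per observation
[~,truth] = max(targets,[],1);%                    true class per observation
errRate = mean(pred ~= truth);%                    classification error rate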
Thanks in advance for any help.
Mirko
%% MEMORY CLEAN
clear all; close all; clc
%% LOAD DATASET
datasetfolder='DATA/CLASSIFIER/Table_Classifier';
load(fullfile(cd,datasetfolder,'Table_Classifier.mat'));% ------------------- Load Dataset
x=table2array(DataSET(:,1:end-1))';% ---------------------------------------- Input [IxN°obs.] 252x42563
tc=table2array(DataSET(:,end));% -------------------------------------------- Label Cell Array [1xN°obs.]
targets=double([strcmp(tc,'Phase1'),...% ------------------------------------ Targets [OxN°obs.] 5x42563 (double, since var/train expect numeric)
    strcmp(tc,'Phase2'),...
    strcmp(tc,'Phase3'),...
    strcmp(tc,'Phase4'),...
    strcmp(tc,'Phase5')]');
%% DIMENSIONALITY OF THE DATASET
[I,N]=size(x);
O=size(targets,1);
%% DEFINITION OF FOLDS FOR XVALIDATION
% In my case each fold should include all observations from all exercises of a specific subject; DIVISOR is a
% label that indicates the specific subject of each observation.
Sbj=unique(DIVISOR);
loop=0;
% Choice of the validation scheme
while loop==0
    flag=input(['Which validation scheme would you like to implement?\n',...
        '   1 - 5 folds\n   2 - 10 folds\n   3 - LOSOCV\n\n']);
    switch flag
        case 1
            folds = 6;% ------------------------------------------------ 5 CV folds + 1 test fold
            loop = 1;
        case 2
            folds = 11;% ----------------------------------------------- 10 CV folds + 1 test fold
            loop = 1;
        case 3
            folds = length(Sbj);% -------------------------------------- leave-one-subject-out
            loop = 1;
        otherwise
            loop = 0;
    end
end
Based on the number of folds defined above, I created a cell array 'subgroup' (1,folds) containing the subject labels randomized into folds different groups. It is important to note that if I choose to implement 5-fold cross validation, subgroup will have 5+1 elements (one element is held out as the test set), e.g.:
- Subgroup {1}: Sbj1, Sbj7, Sbj5
- Subgroup {2}: Sbj2, Sbj4
- Subgroup {3}: Sbj3, Sbj6
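A minimal sketch of how subgroup could be built (an assumed implementation, not necessarily my exact code): the subject labels in Sbj are shuffled with randperm and dealt into folds groups of near-equal size.
% Assumed sketch: randomly assign the subject labels in Sbj to 'folds' groups
idx = randperm(length(Sbj));%             shuffle the subject order
subgroup = cell(1,folds);
for f = 1:folds
    subgroup{f} = Sbj(idx(f:folds:end));% every folds-th subject goes to group f
end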
At this point, starting from the double-loop approach by Prof. Greg Heath, I implemented an expanded approach in which:
- each element of subgroup (i.e., each fold) is in turn used as the test set;
- the remaining elements are used for k-fold cross validation;
- a validation loop is iterated over 10 random initializations of the weights and 10 candidate numbers of hidden neurons.
%% IDENTIFICATION OF THE AVERAGE NTRN 
% Changing which folds are used for test and validation implicitly changes the number of training samples
% used to calculate the N° of hidden neurons, so I evaluate the average N° of training samples over all possible selections.
Ntr_av=0;%------------------------------------------------------------------- Average N°trn
for t=1:folds%--------------------------------------------------------------- For each test choice
    logicalindext=cellfun(@(s)contains(DIVISOR,s),...
    subgroup{t},'un',0);
    for v=1:folds%----------------------------------------------------------- For each validation choice
        if t~=v
            logicalindexv=cellfun(@(s)contains(DIVISOR,s),subgroup{v},'un',0);
            TrainSET=find(~any([any(...%------------------------------------- Train indices
                horzcat(logicalindext{:}),2),any(...
                horzcat(logicalindexv{:}),2)],2)==1);
            Ntr_av=Ntr_av+length(TrainSET);
        end
    end
end
Ntr_av=Ntr_av/((folds-1)*folds);%-------------------------------------------- Average N°trn
Hmin=10;%-------------------------------------------------------------------- Minimum N° of hidden nodes
Hub_av=(Ntr_av*O-O)/(I+O+1);%------------------------------------------------ Upper bound for the N° of hidden neurons
Hmax_av=round(Hub_av/10);%--------------------------------------------------- Max N° of hidden neurons (<<Hub_av for robust training)
dn=floor((Hmax_av-Hmin)/9);%------------------------------------------------- Step dn
Neurons=(0:9).*dn+Hmin;%----------------------------------------------------- 10 candidate hidden-layer sizes, spaced by dn
MSE00=mean(var(targets',1));%------------------------------------------------ Naive constant model reference on the whole dataset
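As a sanity check on the Hub_av formula (this is Prof. Heath's rule of thumb as I understand it; the Ntrn value below is only an assumed example): a training set of Ntrn samples with O outputs provides Neq = Ntrn*O training equations, while a single-hidden-layer net with H neurons has Nw = (I+1)*H + (H+1)*O weights. Requiring Neq >= Nw and solving for H gives the upper bound used above.
% Illustrative check of the upper bound (I, O as in my dataset; Ntrn assumed)
I_ex = 252; O_ex = 5; Ntrn_ex = 30000;
Neq = Ntrn_ex*O_ex;%                         N° of training equations
Hub = (Neq - O_ex)/(I_ex + O_ex + 1);%       largest H with Nw <= Neq
Nw  = (I_ex + 1)*Hub + (Hub + 1)*O_ex;%      N° of weights at H = Hub (= Neq)
fprintf('Hub = %.0f, Nw = %.0f, Neq = %.0f\n',Hub,Nw,Neq);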
%% NEURAL NETWORK MODEL
for t=1:folds%--------------------------------------------------------------- For each fold t
    logicalindext=cellfun(@(s)contains(DIVISOR,s),...%----------------------- I define the current fold as TEST SET, finding all the indices
        subgroup{t},'un',0);                                                % corresponding to the labels in subgroup{t}
    ITST=find(any(horzcat(logicalindext{:}),2)==1);
    MSE00tst = mean(var(targets(:,ITST)',1));%------------------------------- Naive Constant model reference on the Test SET
    IVAL=cell(1,folds-1);%--------------------------------------------------- Declaration of folds-1 pairs of possible training
    ITRN=cell(1,folds-1);%--------------------------------------------------- and validation indices and the respective MSE00
    MSE00val=zeros(1,folds-1);
    MSE00trn=zeros(1,folds-1);
    count=1;
    for v=1:folds%----------------------------------------------------------- For each fold
        if t~=v%------------------------------------------------------------- different from Test SET t
            logicalindexv=cellfun(@(s)contains(DIVISOR,s),subgroup{v},'un',0);
            IVAL{1,count}=find(any(...%-------------------------------------- I identify the indices of validation and training
                horzcat(logicalindexv{:}),2)==1);
            ITRN{1,count}=find(~any([any(...
                horzcat(logicalindext{:}),2),any(...
                horzcat(logicalindexv{:}),2)],2)==1);
            MSE00val(1,count)=mean(var(targets(:,IVAL{1,count})',1));%------- And I calculate the MSE00 references (validation reference
            MSE00trn(1,count)=mean(var(targets(:,ITRN{1,count})',1));       % from IVAL, training reference from ITRN)
            count=count+1;
        end
    end
    S=cell(1,10);%----------------------------------------------------------- Across the validation loops I have to reuse the same initial weights
    rng(0);%----------------------------------------------------------------- Default random state
    for s=1:10
        S{s}=rng;%----------------------------------------------------------- I save 10 different random states to be restored across the 10
        rand;                                                               % validation loops (one per initial-weight iteration)
    end
    rng(0);%----------------------------------------------------------------- Default random state
    % Performance measures 
    perf_xentrval=zeros(10,10);
    perf_xentrtrn=zeros(10,10);
    perf_xentrtst=zeros(10,10);
    perf_mseval=zeros(10,10);
    perf_msetrn=zeros(10,10);
    perf_msetst=zeros(10,10);
    perf_R2=zeros(10,10);
    perf_R2trn=zeros(10,10);
    perf_R2tst=zeros(10,10);
    perf_R2val=zeros(10,10);
    for n=1:10%-------------------------------------------------------------- For each model of hidden neurons
        H=Neurons(n);%------------------------------------------------------- I use the model defined previously
        parfor i=1:10%------------------------------------------------------- For each iteration of initial random weights
            fprintf(['Validation for model with: ',num2str(H),' neurons and randomization ',num2str(i),'\n']);
            tic
            [val_xentrval,val_xentrtrn,val_xentrtst,val_mseval,val_msetrn,val_msetst,val_R2,val_R2trn,val_R2val,val_R2tst]=ValidationLoops...
                (S{i},MSE00,MSE00trn,MSE00tst,MSE00val,folds,x,targets,H,ITRN,IVAL,ITST);
            toc
The function ValidationLoops was created to work around parfor problems and errors with multiprocessing commands:
function [val_xentrval,val_xentrtrn,val_xentrtst,val_mseval,val_msetrn,val_msetst,val_R2,val_R2trn,val_R2val,val_R2tst]...
    =ValidationLoops(S,MSE00,MSE00trn,MSE00tst,MSE00val,folds,x,targets,H,ITRN,IVAL,ITST)
% Validation performance Variables
val_xentrval = zeros(1,folds-1);                     
val_xentrtrn = zeros(1,folds-1);                    
val_xentrtst = zeros(1,folds-1);                    
val_mseval = zeros(1,folds-1);                      
val_msetrn = zeros(1,folds-1);                      
val_msetst = zeros(1,folds-1);
val_R2 = zeros(1,folds-1);
val_R2trn = zeros(1,folds-1);
val_R2val = zeros(1,folds-1);
val_R2tst = zeros(1,folds-1);
for v=1:folds-1%---------------------------------------------- For each validation fold
    net=patternnet(H,'trainlm');%----------------------------- Define the net
    net.performFcn = 'mse';%---------------------------------- Loss function
    net.divideFcn='divideind';%------------------------------- Setting TRAINING TEST AND VALIDATION
    net.divideParam.trainInd=ITRN{v};                          %  TrainingSET
    net.divideParam.valInd=IVAL{v};                            %  ValidationSET
    net.divideParam.testInd=ITST;                              %  TestSET
    rng(S);                                                  % Restore the saved random state: across the validation loops I evaluate the SAME
                                                             % model in terms of neurons and initial weights
    net=configure(net,x,targets);
    [net,tr,y,e]=train(net,x,targets);
    % Save Performance variables
    val_xentrval(v) = crossentropy(net,targets(:,IVAL{v}),...%------- Crossentropy
        y(:,IVAL{v}));
    val_xentrtrn(v) = crossentropy(net,targets(:,ITRN{v}),...
        y(:,ITRN{v}));
    val_xentrtst(v) = crossentropy(net,targets(:,ITST),...
        y(:,ITST));
    val_mseval(v) = tr.best_vperf;%---------------------------------- MSE
    val_msetrn(v) = tr.best_perf;
    val_msetst(v) = tr.best_tperf;
    val_R2(v) = 1 - mse(e)/MSE00;%----------------------------------- R2
    val_R2trn(v) = 1 - tr.best_perf/MSE00trn(v);
    val_R2val(v) = 1 - tr.best_vperf/MSE00val(v);
    val_R2tst(v) = 1 - tr.best_tperf/MSE00tst;
end
end % closes ValidationLoops
After the validation, I save the results of the model with Neurons(n) hidden neurons and the i-th random initialization of the weights as the mean of the results obtained in the validation loops:
            perf_xentrval(n,i)=...                       
                mean(val_xentrval);
            perf_xentrtrn(n,i)=...
                mean(val_xentrtrn);
            perf_xentrtst(n,i)=...
                mean(val_xentrtst);
            perf_mseval(n,i)=...
                mean(val_mseval);
            perf_msetrn(n,i)=...
                mean(val_msetrn);
            perf_msetst(n,i)=...
                mean(val_msetst);
            perf_R2(n,i)=...
                mean(val_R2);
            perf_R2trn(n,i)=...
                mean(val_R2trn);
            perf_R2val(n,i)=...
                mean(val_R2val);
            perf_R2tst(n,i)=...
                mean(val_R2tst);
        end
    end
    % This process is repeated for each choice of test set
    Test_model(t).data.xentrval = perf_xentrval;
    Test_model(t).data.xentrtrn = perf_xentrtrn;
    Test_model(t).data.xentrtst = perf_xentrtst;
    Test_model(t).data.mseval   = perf_mseval;
    Test_model(t).data.msetrn   = perf_msetrn;
    Test_model(t).data.msetst   = perf_msetst;
    Test_model(t).data.R2       = perf_R2;
    Test_model(t).data.R2val    = perf_R2val;
    Test_model(t).data.R2trn    = perf_R2trn;
    Test_model(t).data.R2tst    = perf_R2tst;
    Test_model(t).HiddenNeurons = Neurons;
    Test_model(t).SET.Sbj       = subgroup{t};
    Test_model(t).SET.Ind       = ITST;
end
delete(gcp('nocreate'))
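For reference, the retraining attempt mentioned in question 2 looked roughly like this (a minimal sketch; Hbest stands for the selected hidden-layer size and its value here is only an assumed example). With divideFcn set to 'dividetrain' every sample goes to training, so there is no validation set and early stopping never triggers, which is where the overfitting appears.
% Sketch of retraining the selected model on the whole dataset (question 2)
Hbest = 20;%                               assumed example value
net = patternnet(Hbest,'trainlm');
net.performFcn = 'mse';
net.divideFcn = 'dividetrain';%            all samples used for training:
net = configure(net,x,targets);          % no validation set, so early
[net,tr] = train(net,x,targets);         % stopping cannot trigger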