Would kfold loss values vary if cross validation is performed after model training?
Charles Bergen on 9 May 2025
Edited: the cyclist on 10 May 2025
I am concerned about a possible difference in cross-validated (CV) predictions (kfoldPredict) for regression bagged ensembles (fitrensemble) when CV is performed after a model has been trained. If I understand this correctly, a fitrensemble model trained without CV has access to all observations in the data set, so its trees will have node split values different from those found in trees generated by fitrensemble with CV on. Differences in these split values would then lead to an overall difference in possible outcomes for the constructed trees in the two models.
I guess this boils down to: do crossval and the subsequent kfoldLoss or kfoldPredict (really any CV predict function) account for these differences when supplied a model that did not perform initial cross-validation?
If there is an error in my thinking, please let me know.
I have supplied an example of my question below.
% No initial CV
Mdl = fitrensemble(looperValues(:,1:cherrios), allratios2, ...
    'Learners',t,'Weights',W1,'Method','Bag', ...
    'NumLearningCycles',numblearningcyc,'Options',statset('UseParallel',true));
Mdl_CV_After_Training = crossval(Mdl, 'KFold', 10);
Mdl_CV_After_Training_kfold_predictions = kfoldPredict(Mdl_CV_After_Training);
vs.
% Yes initial CV
Mdl = fitrensemble(looperValues(:,1:cherrios), allratios2, ...
    'Learners',t,'CrossVal','on','Weights',W1,'Method','Bag', ...
    'NumLearningCycles',numblearningcyc,'Options',statset('UseParallel',true));
Mdl_Yes_CV_kfold_predictions = kfoldPredict(Mdl);
% Would Mdl_CV_After_Training_kfold_predictions == Mdl_Yes_CV_kfold_predictions?
0 comments
Accepted Answer
the cyclist on 9 May 2025
        The predictions will be identical, as long as you use the same fold assignments:
% Set seed, for reproducibility
rng default
% Simulate some data
N = 100;
X = randn(N,3);
y = sum(X+0.5*randn(N,1),2);
% Define a partition (which will be used for both models)
p = cvpartition(N,'KFold',10);
% Train one model using cross-validation during training
mdl_1 = fitrensemble(X,y,'CrossVal','on','CVPartition',p);
% Train a second model without using cross-validation during training, but apply it afterward
mdl_2 = fitrensemble(X,y);        
mdl2_cv = crossval(mdl_2,'CVPartition',p);
% Make the k-fold predictions
y1 = kfoldPredict(mdl_1);
y2 = kfoldPredict(mdl2_cv);
% See if they are equal -- THEY ARE!
isequal(y1,y2)
If you do not make sure the two models use exactly the same fold assignments, the predictions will not be identical, but they will be statistically equivalent.
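To illustrate that last point, here is a minimal sketch reusing X, y, and N from the code above (the names p1, p2, mdl_a, and mdl_b are just for illustration): two independent, unseeded partitions give element-wise different predictions but nearly identical k-fold losses.
% Two different fold assignments (no shared partition, no reseeding)
p1 = cvpartition(N,'KFold',10);
p2 = cvpartition(N,'KFold',10);
mdl_a = crossval(fitrensemble(X,y),'CVPartition',p1);
mdl_b = crossval(fitrensemble(X,y),'CVPartition',p2);
% The element-wise predictions differ ...
isequal(kfoldPredict(mdl_a),kfoldPredict(mdl_b))   % almost surely false
% ... but the overall k-fold losses are close
[kfoldLoss(mdl_a) kfoldLoss(mdl_b)]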
3 comments
the cyclist on 9 May 2025
Edited: the cyclist on 10 May 2025
To make an analogy ...
If you used
N = 1000;
x1 = randn(N,1);
x2 = randn(N,1);
to draw two samples of (pseudo)randomly generated values from a normal distribution, you would not expect those samples to be identical unless you set the seed each time to get the same sequence. However, you would expect the two samples to have the same statistical properties (to within sampling error): the same mean, standard deviation, etc.
Similarly, I would not expect your predictions to be identical, but I would expect all their properties to be the same to within sampling error.
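For example, continuing that snippet, a quick check makes this concrete:
% The two samples are not identical ...
isequal(x1,x2)                          % false
% ... but their summary statistics agree to within sampling error
[mean(x1) mean(x2); std(x1) std(x2)]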
More Answers (0)