cvshrink

Cross-validate pruning and regularization of regression ensemble

Description

vals = cvshrink(ens) returns an L-by-T matrix with cross-validated values of the mean squared error. L is the number of Lambda values in the ens.Regularization structure. T is the number of Threshold values on weak learner weights. If the Regularization property of ens has not been filled in by the regularize function, you must specify the Lambda name-value argument.

[vals,nlearn] = cvshrink(ens) additionally returns an L-by-T matrix of the mean number of learners in the cross-validated ensemble.

[___] = cvshrink(ens,Name=Value) specifies additional options using one or more name-value arguments. For example, you can specify the number of folds to use, the fraction of data to use for holdout validation, and lower cutoffs on weights for weak learners.

Examples

Create a regression ensemble for predicting mileage from the carsmall data. Cross-validate the ensemble.

Load the carsmall data set and select displacement, horsepower, and vehicle weight as predictors.

load carsmall
X = [Displacement Horsepower Weight];

You can train an ensemble of bagged regression trees.

ens = fitrensemble(X,MPG,Method="Bag")

fitrensemble uses a default template tree object templateTree() as a weak learner when Method is "Bag". In this example, for reproducibility, specify Reproducible=true when you create a tree template object, and then use the object as a weak learner.

rng('default') % For reproducibility
t = templateTree(Reproducible=true); % For reproducibility of random predictor selections
ens = fitrensemble(X,MPG,Method="Bag",Learners=t);

Specify values for Lambda and Threshold. Use these values to cross-validate the ensemble.

[vals,nlearn] = cvshrink(ens,Lambda=[.01 .1 1],Threshold=[0 .01 .1])
vals = 3×3

   18.9150   19.0092  128.5935
   18.9099   18.9504  128.8449
   19.0328   18.9636  116.8500

nlearn = 3×3

   13.7000   11.6000    4.1000
   13.7000   11.7000    4.1000
   13.9000   11.6000    4.1000

Clearly, setting a threshold of 0.1 leads to unacceptable errors, while a threshold of 0.01 gives similar errors to a threshold of 0. The mean number of learners with a threshold of 0.01 is about 11.6, whereas the mean number is about 13.8 when the threshold is 0.
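To act on these results, one approach is to pick the (Lambda, Threshold) pair with the lowest cross-validated error and then shrink the ensemble at that setting using the related shrink function. The following sketch assumes the vals and ens variables from the example above, and that shrink accepts lambda and threshold name-value arguments as documented for regression ensembles.

```matlab
% Sketch: select the best (Lambda, Threshold) pair from the
% cross-validation grid, then shrink the ensemble accordingly.
lambda = [.01 .1 1];
threshold = [0 .01 .1];
[~,idx] = min(vals(:));               % linear index of the smallest MSE
[iL,iT] = ind2sub(size(vals),idx);    % row -> Lambda, column -> Threshold
cmp = shrink(ens,lambda=lambda(iL),threshold=threshold(iT));
fprintf("Best Lambda = %g, Threshold = %g, learners kept = %d\n", ...
    lambda(iL),threshold(iT),cmp.NumTrained)
```

The shrunken ensemble cmp is a compact model containing only the learners whose weights survive the chosen threshold.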

Input Arguments


ens — Regression ensemble model

Regression ensemble model, specified as a RegressionEnsemble or RegressionBaggedEnsemble model object trained with fitrensemble.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: cvshrink(ens,Holdout=0.1,Threshold=[0 .01 .1]) specifies to reserve 10% of the data for holdout validation, and to use lower weight cutoffs of 0, 0.01, and 0.1 on the weights for weak learners.

CVPartition — Cross-validation partition

Cross-validation partition, specified as a cvpartition object that specifies the type of cross-validation and the indexing for the training and validation sets.

To create a cross-validated model, you can specify only one of these four name-value arguments: CVPartition, Holdout, KFold, or Leaveout.

Example: Suppose you create a random partition for 5-fold cross-validation on 500 observations by using cvp = cvpartition(500,KFold=5). Then, you can specify the cross-validation partition by setting CVPartition=cvp.
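As a sketch of this workflow, you can build a partition sized to the ensemble's training data and pass it to cvshrink. This assumes ens is a trained regression ensemble whose NumObservations property gives the number of training observations, as in the carsmall example above.

```matlab
% Sketch: cross-validate shrinkage with an explicit 5-fold partition.
cvp = cvpartition(ens.NumObservations,KFold=5);
vals = cvshrink(ens,CVPartition=cvp, ...
    Lambda=[.01 .1 1],Threshold=[0 .01 .1]);
```

Reusing the same cvpartition object across calls keeps the folds fixed, which makes comparisons between different Lambda and Threshold grids consistent.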

Holdout — Fraction of data for holdout validation

Fraction of the data used for holdout validation, specified as a scalar value in the range (0,1). If you specify Holdout=p, then the software completes these steps:

  1. Randomly select and reserve p*100% of the data as validation data, and train the model using the rest of the data.

  2. Store the compact trained model in the Trained property of the cross-validated model.

To create a cross-validated model, you can specify only one of these four name-value arguments: CVPartition, Holdout, KFold, or Leaveout.

Example: Holdout=0.1

Data Types: double | single

KFold — Number of folds

Number of folds to use in the cross-validated model, specified as a positive integer value greater than 1. If you specify KFold=k, then the software completes these steps:

  1. Randomly partition the data into k sets.

  2. For each set, reserve the set as validation data, and train the model using the other k – 1 sets.

  3. Store the k compact trained models in a k-by-1 cell vector in the Trained property of the cross-validated model.

To create a cross-validated model, you can specify only one of these four name-value arguments: CVPartition, Holdout, KFold, or Leaveout.

Example: KFold=5

Data Types: single | double

Lambda — Regularization parameter values for lasso

Regularization parameter values for lasso, specified as a vector of nonnegative scalar values. If the value of this argument is empty, cvshrink does not perform cross-validation.

Example: Lambda=[.01 .1 1]

Data Types: single | double

Leaveout — Leave-one-out cross-validation flag

Leave-one-out cross-validation flag, specified as "on" or "off". If you specify Leaveout="on", then for each of the n observations (where n is the number of observations, excluding missing observations, specified in the NumObservations property of the model), the software completes these steps:

  1. Reserve the one observation as validation data, and train the model using the other n – 1 observations.

  2. Store the n compact trained models in an n-by-1 cell vector in the Trained property of the cross-validated model.

To create a cross-validated model, you can specify only one of these four name-value arguments: CVPartition, Holdout, KFold, or Leaveout.

Example: Leaveout="on"

Data Types: char | string

Threshold — Weights threshold

Weights threshold, specified as a numeric vector with lower cutoffs on weights for weak learners. cvshrink discards learners with weights below Threshold in its cross-validation calculation.

Example: Threshold=[0 .01 .1]

Data Types: single | double
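To see how a candidate threshold would prune the ensemble before committing to a cross-validation run, you can inspect the regularized learner weights directly. This sketch assumes the ensemble has been passed through the regularize function and that the resulting Regularization structure stores the fitted weights in a TrainedWeights field, with one column per Lambda value.

```matlab
% Sketch: count how many learner weights survive a given cutoff.
ens = regularize(ens,lambda=[.01 .1 1]);
w = ens.Regularization.TrainedWeights;   % one column per lambda value
survivors = sum(w > .01);                % learners kept at threshold 0.01
```

Columns of survivors correspond to the Lambda values, so this gives a quick preview of how aggressively each (Lambda, Threshold) combination prunes the ensemble.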

Output Arguments


vals — Cross-validated values of mean squared error

Cross-validated values of the mean squared error, returned as an L-by-T numeric matrix. L is the number of values of the regularization parameter Lambda, and T is the number of Threshold values on weak learner weights.

nlearn — Mean number of learners in cross-validated ensemble

Mean number of learners in the cross-validated ensemble, returned as an L-by-T numeric matrix. L is the number of values of the regularization parameter Lambda, and T is the number of Threshold values on weak learner weights.

Version History

Introduced in R2011a