
RegressionPartitionedLinear

Cross-validated linear regression model for high-dimensional data

Description

RegressionPartitionedLinear is a set of linear regression models trained on cross-validated folds. You can estimate the predictive quality of the model, or how well the linear regression model generalizes, by using the kfold object functions kfoldPredict and kfoldLoss.

Every kfold object function uses models trained on training-fold (in-fold) observations to predict the response for validation-fold (out-of-fold) observations. For example, suppose that you cross-validate using five folds. The software randomly assigns each observation into five groups of equal size (roughly). The training fold contains four of the groups (roughly 4/5 of the data), and the validation fold contains the other group (roughly 1/5 of the data). In this case, cross-validation proceeds as follows:

  1. The software trains the first model (stored in CVMdl.Trained{1}) by using the observations in the last four groups, and reserves the observations in the first group for validation.

  2. The software trains the second model (stored in CVMdl.Trained{2}) by using the observations in the first group and the last three groups. The software reserves the observations in the second group for validation.

  3. The software proceeds in a similar manner for the third, fourth, and fifth models.

If you validate by using kfoldPredict, the software computes predictions for the observations in group i by using model i. In short, the software estimates a response for every observation using the model trained without that observation.

Note

Unlike other cross-validated regression models, RegressionPartitionedLinear model objects do not store the predictor data set.
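Consequently, reproducing the kfold scheme by hand requires the predictor data you trained with. The following is a conceptual sketch of the scheme, not how kfoldPredict is implemented; it assumes you kept the predictor matrix X with observations in rows and trained with a single regularization strength:

% Conceptual sketch: model i predicts the observations held out of fold i.
% Assumes X (n-by-p, observations in rows) was retained by the user,
% because the cross-validated object does not store it.
yHat = nan(CVMdl.NumObservations,1);
for i = 1:CVMdl.KFold
    idx = test(CVMdl.Partition,i);                  % validation rows for fold i
    yHat(idx) = predict(CVMdl.Trained{i},X(idx,:)); % out-of-fold predictions
end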

Creation

You can create a RegressionPartitionedLinear object by using the fitrlinear function and specifying one of the name-value arguments CrossVal, CVPartition, Holdout, KFold, or Leaveout.
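For example, each of the following calls returns a RegressionPartitionedLinear object. This is a usage sketch; X (an n-by-p predictor matrix) and Y (an n-by-1 response vector) are placeholder data.

CVMdl = fitrlinear(X,Y,'CrossVal','on');   % 10-fold cross-validation by default
CVMdl = fitrlinear(X,Y,'KFold',5);         % 5-fold cross-validation
CVMdl = fitrlinear(X,Y,'Holdout',0.2);     % hold out 20% of the data for validation
CVMdl = fitrlinear(X,Y,'Leaveout','on');   % leave-one-out cross-validation
c = cvpartition(numel(Y),'KFold',5);       % custom partition object
CVMdl = fitrlinear(X,Y,'CVPartition',c);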

Properties


Cross-Validation Properties

CrossValidatedModel

Cross-validated model name, specified as a character vector.

For example, 'Linear' specifies a cross-validated linear regression model.

Data Types: char

KFold

Number of cross-validated folds, specified as a positive integer.

Data Types: double

ModelParameters

Cross-validation parameter values, specified as an object. The parameter values correspond to the name-value argument values used to cross-validate the linear model. ModelParameters does not contain estimated parameters.

Access properties of ModelParameters using dot notation.
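For example, display the object or drill into one of its fields with a second dot. This is a usage sketch; the available fields depend on how you trained the model, and the field name below is an assumption.

CVMdl.ModelParameters          % display the cross-validation parameter values
CVMdl.ModelParameters.Lambda   % one field, accessed with dot notation (name assumed)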

NumObservations

Number of observations in the training data, specified as a positive numeric scalar.

Data Types: double

Partition

Data partition indicating how the software splits the data into cross-validation folds, specified as a cvpartition model.

Trained

Linear regression models trained on cross-validation folds, specified as a cell array of RegressionLinear models. Trained has k cells, where k is the number of folds.

Data Types: cell

W

Observation weights used to cross-validate the model, specified as a numeric vector. W has NumObservations elements.

The software normalizes the weights used for training so that sum(W,'omitnan') is 1.

Data Types: single | double
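A quick check of this normalization, as a sketch assuming CVMdl is any cross-validated linear regression model:

sum(CVMdl.W,'omitnan')   % returns 1, up to floating-point error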

Y

Observed responses used to cross-validate the model, specified as a numeric vector containing NumObservations elements.

Each row of Y represents the observed response of the corresponding observation in the predictor data.

Data Types: single | double

Other Regression Properties

CategoricalPredictors

Categorical predictor indices, specified as a vector of positive integers. CategoricalPredictors contains index values indicating that the corresponding predictors are categorical. The index values are between 1 and p, where p is the number of predictors used to train the model. If none of the predictors are categorical, then this property is empty ([]).

Data Types: single | double

PredictorNames

Predictor names in order of their appearance in the predictor data, specified as a cell array of character vectors. The length of PredictorNames is equal to the number of variables in the training data X or Tbl used as predictor variables.

Data Types: cell

ResponseName

Response variable name, specified as a character vector.

Data Types: char

ResponseTransform

Response transformation function, specified as 'none' or a function handle. ResponseTransform describes how the software transforms raw response values.

For a MATLAB® function or a function that you define, enter its function handle. For example, you can enter Mdl.ResponseTransform = @function, where function accepts a numeric vector of the original responses and returns a numeric vector of the same size containing the transformed responses.

Data Types: char | function_handle
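For instance, if you trained on log-transformed responses, you could transform predictions back to the original scale. This is a hypothetical example:

CVMdl.ResponseTransform = @exp;   % apply exp to every raw predicted response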

Object Functions

kfoldLoss - Regression loss for cross-validated linear regression model
kfoldPredict - Predict responses for observations in cross-validated linear regression model

Examples


Simulate 10000 observations from this model:

y = x_{100} + 2x_{200} + e

  • X = {x_1,...,x_{1000}} is a 10000-by-1000 sparse matrix with 10% nonzero standard normal elements.

  • e is random normal error with mean 0 and standard deviation 0.3.

rng(1) % For reproducibility
n = 1e4;
d = 1e3;
nz = 0.1;
X = sprandn(n,d,nz);
Y = X(:,100) + 2*X(:,200) + 0.3*randn(n,1);

Cross-validate a linear regression model. To increase execution speed, transpose the predictor data and specify that the observations are in columns.

X = X';
CVMdl = fitrlinear(X,Y,'CrossVal','on','ObservationsIn','columns');

CVMdl is a RegressionPartitionedLinear cross-validated model. Because fitrlinear implements 10-fold cross-validation by default, CVMdl.Trained contains a cell vector of ten RegressionLinear models. Each cell contains a linear regression model trained on nine folds, and then tested on the remaining fold.

Predict responses for out-of-fold observations and estimate the generalization error by passing CVMdl to kfoldPredict and kfoldLoss, respectively.

oofYHat = kfoldPredict(CVMdl);
ge = kfoldLoss(CVMdl)
ge = 
0.1748

The estimated generalization error, measured as the mean squared error, is 0.1748.
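kfoldLoss can also report the loss for each fold separately, which shows how stable the estimate is across folds. This is a usage sketch; the variable name is arbitrary.

foldMSE = kfoldLoss(CVMdl,'Mode','individual');   % one MSE per fold (10-by-1)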

To determine a good lasso-penalty strength for a linear regression model that uses least squares, implement 5-fold cross-validation.

Simulate 10000 observations from this model:

y = x_{100} + 2x_{200} + e

  • X = {x_1,...,x_{1000}} is a 10000-by-1000 sparse matrix with 10% nonzero standard normal elements.

  • e is random normal error with mean 0 and standard deviation 0.3.

rng(1) % For reproducibility
n = 1e4;
d = 1e3;
nz = 0.1;
X = sprandn(n,d,nz);
Y = X(:,100) + 2*X(:,200) + 0.3*randn(n,1);

Create a set of 15 logarithmically spaced regularization strengths from 10^{-5} through 10^{-1}.

Lambda = logspace(-5,-1,15);

Cross-validate the models. To increase execution speed, transpose the predictor data and specify that the observations are in columns. Optimize the objective function using SpaRSA.

X = X'; 
CVMdl = fitrlinear(X,Y,'ObservationsIn','columns','KFold',5,'Lambda',Lambda,...
    'Learner','leastsquares','Solver','sparsa','Regularization','lasso');

numCLModels = numel(CVMdl.Trained)
numCLModels = 
5

CVMdl is a RegressionPartitionedLinear model. Because fitrlinear implements 5-fold cross-validation, CVMdl contains five RegressionLinear models, one trained on each set of training folds.

Display the first trained linear regression model.

Mdl1 = CVMdl.Trained{1}
Mdl1 = 
  RegressionLinear
         ResponseName: 'Y'
    ResponseTransform: 'none'
                 Beta: [1000x15 double]
                 Bias: [-0.0049 -0.0049 -0.0049 -0.0049 -0.0049 -0.0048 -0.0044 -0.0037 -0.0030 -0.0031 -0.0033 -0.0036 -0.0041 -0.0051 -0.0071]
               Lambda: [1.0000e-05 1.9307e-05 3.7276e-05 7.1969e-05 1.3895e-04 2.6827e-04 5.1795e-04 1.0000e-03 0.0019 0.0037 0.0072 0.0139 0.0268 0.0518 0.1000]
              Learner: 'leastsquares'


Mdl1 is a RegressionLinear model object. fitrlinear constructed Mdl1 by training on all folds except the first. Because Lambda is a sequence of regularization strengths, you can think of Mdl1 as 15 models, one for each regularization strength in Lambda.
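For example, the coefficients that this fold's model learned at the 10th regularization strength occupy the 10th column of Beta. This is a usage sketch:

beta10 = Mdl1.Beta(:,10);   % 1000-by-1 coefficients for Lambda(10)
bias10 = Mdl1.Bias(10);     % matching intercept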

Estimate the cross-validated MSE.

mse = kfoldLoss(CVMdl);

Higher values of Lambda lead to predictor variable sparsity, which is a good quality of a regression model. For each regularization strength, train a linear regression model using the entire data set and the same options as when you cross-validated the models. Determine the number of nonzero coefficients per model.

Mdl = fitrlinear(X,Y,'ObservationsIn','columns','Lambda',Lambda,...
    'Learner','leastsquares','Solver','sparsa','Regularization','lasso');
numNZCoeff = sum(Mdl.Beta~=0);

In the same figure, plot the cross-validated MSE and frequency of nonzero coefficients for each regularization strength. Plot all variables on the log scale.

figure
[h,hL1,hL2] = plotyy(log10(Lambda),log10(mse),...
    log10(Lambda),log10(numNZCoeff)); 
hL1.Marker = 'o';
hL2.Marker = 'o';
ylabel(h(1),'log_{10} MSE')
ylabel(h(2),'log_{10} nonzero-coefficient frequency')
xlabel('log_{10} Lambda')
hold off

The figure shows log_{10} MSE on the left y-axis and log_{10} nonzero-coefficient frequency on the right y-axis, both plotted against log_{10} Lambda.

Choose the index of the regularization strength that balances predictor variable sparsity and low MSE (for example, Lambda(10)).

idxFinal = 10;
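If you prefer an automated rule, one possibility (an assumption, not part of the original example) is to pick the sparsest model whose cross-validated MSE is within 5% of the minimum:

candidates = find(mse <= 1.05*min(mse));   % indices with near-optimal MSE
[~,k] = min(numNZCoeff(candidates));       % sparsest model among them
idxAuto = candidates(k);                   % alternative to the hand-picked index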

Extract the model corresponding to the chosen regularization strength.

MdlFinal = selectModels(Mdl,idxFinal)
MdlFinal = 
  RegressionLinear
         ResponseName: 'Y'
    ResponseTransform: 'none'
                 Beta: [1000x1 double]
                 Bias: -0.0050
               Lambda: 0.0037
              Learner: 'leastsquares'


idxNZCoeff = find(MdlFinal.Beta~=0)
idxNZCoeff = 2×1

   100
   200

EstCoeff = MdlFinal.Beta(idxNZCoeff)
EstCoeff = 2×1

    1.0051
    1.9965

MdlFinal is a RegressionLinear model with one regularization strength. The nonzero coefficients EstCoeff are close to the coefficients that simulated the data.
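To use the selected model on new data, call predict. This sketch assumes Xnew is a hypothetical 1000-by-m matrix laid out like the training data, with observations in columns:

% Score new observations with the selected model (Xnew is hypothetical).
yNew = predict(MdlFinal,Xnew,'ObservationsIn','columns');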


Version History

Introduced in R2016a

