# LinearModel class

Linear regression model class

## Description

An object comprising training data, model description, diagnostic information, and fitted coefficients for a linear regression. Predict model responses with the predict or feval methods.

## Construction

mdl = fitlm(tbl) or mdl = fitlm(X,y) create a linear model of a table or dataset array tbl, or of the responses y to a data matrix X. For details, see fitlm.

mdl = stepwiselm(tbl) or mdl = stepwiselm(X,y) create a linear model of a table or dataset array tbl, or of the responses y to a data matrix X, with unimportant predictors excluded. For details, see stepwiselm.

### tbl — Input data: table | dataset array

Input data, specified as a table or dataset array. When modelspec is a formula, it specifies the variables to be used as the predictors and response. Otherwise, if you do not specify the predictor and response variables, the last variable is the response variable and the others are the predictor variables by default.

Predictor variables can be numeric, or any grouping variable type, such as logical or categorical (see Grouping Variables). The response must be numeric or logical.

To set a different column as the response variable, use the ResponseVar name-value pair argument. To use a subset of the columns as predictors, use the PredictorVars name-value pair argument.

Data Types: single | double | logical

### X — Predictor variables: matrix

Predictor variables, specified as an n-by-p matrix, where n is the number of observations and p is the number of predictor variables. Each column of X represents one variable, and each row represents one observation.

By default, there is a constant term in the model, unless you explicitly remove it, so do not include a column of 1s in X.
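A minimal sketch of the default intercept behavior, using synthetic data (the seed, sizes, and coefficient values are assumptions made for illustration, not part of this page). It fits X without a column of 1s and then removes the constant term with the 'Intercept' name-value pair.

```matlab
rng(0);                                   % reproducible synthetic data (assumed)
X = randn(20,2);                          % two predictors, no column of ones
y = 3 + X*[1; -2] + 0.1*randn(20,1);      % true intercept 3 (illustrative)

mdl = fitlm(X,y);                         % constant term is added automatically
names = mdl.CoefficientNames;             % {'(Intercept)','x1','x2'}

mdlNoInt = fitlm(X,y,'Intercept',false);  % explicitly remove the constant term
```

With the default settings the model has one more coefficient than X has columns, which is why adding a column of 1s to X would duplicate the constant term.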

Data Types: single | double | logical

### y — Response variable: vector

Response variable, specified as an n-by-1 vector, where n is the number of observations. Each entry in y is the response for the corresponding row of X.

Data Types: single | double

## Properties

CoefficientCovariance

Covariance matrix of coefficient estimates.

CoefficientNames

Cell array of strings containing a label for each coefficient.

Coefficients

Coefficient values stored as a table. Coefficients has one row for each coefficient and these columns:

• Estimate — Estimated coefficient value

• SE — Standard error of the estimate

• tStat — t statistic for a test that the coefficient is zero

• pValue — p-value for the t statistic

To obtain any of these columns as a vector, index into the property using dot notation. For example, in mdl the estimated coefficient vector is

`beta = mdl.Coefficients.Estimate`

Use coefTest to perform other tests on the coefficients.
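The relationship between the columns of Coefficients can be sketched as follows. The data are synthetic and assumed for illustration; the two-sided p-value computation with tcdf is the standard t-test formula and is shown as a check, not as a statement of how fitlm computes it internally.

```matlab
rng(0);                                       % synthetic data (assumed)
X = randn(50,3);
y = 1 + X*[2; -1; 0.5] + 0.3*randn(50,1);
mdl = fitlm(X,y);

C = mdl.Coefficients;                         % table with Estimate, SE, tStat, pValue
beta = C.Estimate;                            % coefficient vector via dot notation
t = C.Estimate ./ C.SE;                       % t statistic for H0: coefficient = 0
p = 2*tcdf(-abs(t), mdl.DFE);                 % two-sided p-value on DFE degrees of freedom
```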

DFE

Degrees of freedom for error (residuals), equal to the number of observations minus the number of estimated coefficients.

Diagnostics

Table with the same number of rows as the input data (tbl or X). Diagnostics contains diagnostics helpful in finding outliers and influential observations. Many diagnostics describe the effect on the fit of deleting single observations. Diagnostics contains the following fields.

| Field | Meaning | Utility |
| --- | --- | --- |
| Leverage | Diagonal elements of HatMatrix | Leverage indicates to what extent the predicted value for an observation is determined by the observed value for that observation. A value close to 1 indicates that the prediction is largely determined by that observation, with little contribution from the other observations. A value close to 0 indicates the fit is largely determined by the other observations. For a model with P coefficients and N observations, the average value of Leverage is `P/N`. An observation with Leverage larger than `2*P/N` can be regarded as having high leverage. |
| CooksDistance | Cook's measure of scaled change in fitted values | CooksDistance is a measure of scaled change in fitted values. An observation with CooksDistance larger than three times the mean Cook's distance can be an outlier. |
| Dffits | Delete-1 scaled differences in fitted values vs. observation number | Dffits is the scaled change in the fitted values for each observation that would result from excluding that observation from the fit. Values with an absolute value larger than `2*sqrt(P/N)` may be considered influential. |
| S2_i | Delete-1 variance vs. observation number | S2_i is a set of residual variance estimates obtained by deleting each observation in turn. These can be compared with the value of the MSE property. |
| CovRatio | Delete-1 ratio of determinant of covariance vs. observation number | CovRatio is the ratio of the determinant of the coefficient covariance matrix with each observation deleted in turn to the determinant of the covariance matrix for the full model. Values larger than `1+3*P/N` or smaller than `1-3*P/N` indicate influential points. |
| Dfbetas | Delete-1 scaled differences in coefficient estimates vs. observation number | Dfbetas is an N-by-P matrix of the scaled change in the coefficient estimates that would result from excluding each observation in turn. Values larger than `3/sqrt(N)` in absolute value indicate that the observation has a large influence on the corresponding coefficient. |
| HatMatrix | Projection matrix to compute fitted from observed responses | HatMatrix is an N-by-N matrix such that `Fitted = HatMatrix*Y`, where Y is the response vector and Fitted is the vector of fitted response values. |

Rows not used in the fit because of missing values (in ObservationInfo.Missing) contain NaN values.

Rows not used in the fit because of excluded values (in ObservationInfo.Excluded) contain NaN values, with the following exception: Delete-1 diagnostics refer to the statistic with and without that observation (row) included in the fit. These diagnostics help identify important observations.
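As a rough sketch of how the rule-of-thumb cutoffs for Diagnostics might be applied, the following flags observations on synthetic data (the data and the variable names highLev, highCook, and highDffits are assumptions for illustration). The cutoffs are the ones quoted in the property description.

```matlab
rng(0);                                       % synthetic data (assumed)
X = randn(50,3);
y = 1 + X*[2; -1; 0.5] + 0.3*randn(50,1);
mdl = fitlm(X,y);

P = mdl.NumCoefficients;                      % number of coefficients
N = mdl.NumObservations;                      % number of observations used in the fit
D = mdl.Diagnostics;

highLev    = find(D.Leverage > 2*P/N);                        % high-leverage rows
highCook   = find(D.CooksDistance > 3*mean(D.CooksDistance)); % possible outliers
highDffits = find(abs(D.Dffits) > 2*sqrt(P/N));               % influential rows
avgLev     = mean(D.Leverage);                                % should equal P/N
```

The flagged indices are candidates for closer inspection, for example with plotDiagnostics, rather than automatic deletions.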

Fitted

Predicted response to the input data by using the model. Use predict to compute predictions for other predictor values, or to compute confidence bounds on Fitted.
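A short sketch of predicting at new predictor values, assuming synthetic data and a hypothetical matrix Xnew of new observations. predict also returns confidence bounds as a second output, and feval returns the same point predictions.

```matlab
rng(0);                                       % synthetic data (assumed)
X = randn(50,3);
y = 1 + X*[2; -1; 0.5] + 0.3*randn(50,1);
mdl = fitlm(X,y);

Xnew = [0 0 0; 1 1 1];                        % hypothetical new predictor rows
[ypred, yci] = predict(mdl, Xnew);            % predictions and confidence bounds
yfe = feval(mdl, Xnew);                       % same predictions, without bounds
refit = predict(mdl, X);                      % reproduces mdl.Fitted on the training data
```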

Formula

Object containing information about the model.

LogLikelihood

Log likelihood of the model distribution at the response values, with mean fitted from the model, and other parameters estimated as part of the model fit.

ModelCriterion

AIC and other information criteria for comparing models. A structure with fields:

• AIC — Akaike information criterion

• AICc — Akaike information criterion corrected for sample size

• BIC — Bayesian information criterion

• CAIC — Consistent Akaike information criterion

To obtain any of these values as a scalar, index into the property using dot notation. For example, in a model mdl, the AIC value aic is:

`aic = mdl.ModelCriterion.AIC`

MSE

Mean squared error (residuals), SSE/DFE.
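The definitions of DFE, MSE, and RMSE given in the property descriptions can be checked numerically. This is a sketch on synthetic data (assumed for illustration).

```matlab
rng(0);                                       % synthetic data (assumed)
X = randn(50,3);
y = 1 + X*[2; -1; 0.5] + 0.3*randn(50,1);
mdl = fitlm(X,y);

dfe  = mdl.NumObservations - mdl.NumEstimatedCoefficients;  % error degrees of freedom
sse  = sum(mdl.Residuals.Raw.^2);                           % sum of squared residuals
mse  = sse / dfe;                                           % MSE = SSE/DFE
rmse = sqrt(mse);                                           % RMSE = sqrt(MSE)
```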

NumCoefficients

Number of coefficients in the model, a positive integer. NumCoefficients includes coefficients that are set to zero when the model terms are rank deficient.

NumEstimatedCoefficients

Number of estimated coefficients in the model, a positive integer. NumEstimatedCoefficients does not include coefficients that are set to zero when the model terms are rank deficient. NumEstimatedCoefficients is the degrees of freedom for regression.

NumObservations

Number of observations the fitting function used in fitting. This is the number of observations supplied in the original table, dataset, or matrix, minus any excluded rows (set with the Exclude name-value pair) or rows with missing values.

NumPredictors

Number of variables fitlm used as predictors for fitting.

NumVariables

Number of variables in the data. NumVariables is the number of variables in the original table or dataset, or the total number of columns in the predictor matrix and response vector when the fit is based on those arrays. It includes variables, if any, that are not used as predictors or as the response.

ObservationInfo

Table with the same number of rows as the input data (tbl or X).

| Field | Description |
| --- | --- |
| Weights | Observation weights. Default is all 1. |
| Excluded | Logical value, 1 indicates an observation that you excluded from the fit with the Exclude name-value pair. |
| Missing | Logical value, 1 indicates a missing value in the input. Missing values are not used in the fit. |
| Subset | Logical value, 1 indicates the observation is not excluded or missing, so is used in the fit. |
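To illustrate how ObservationInfo and NumObservations respond to excluded and missing rows, here is a sketch on synthetic data. The data, the excluded indices, and the injected NaN are assumptions chosen for illustration.

```matlab
rng(0);                                       % synthetic data (assumed)
X = randn(50,3);
y = 1 + X*[2; -1; 0.5] + 0.3*randn(50,1);
X(5,1) = NaN;                                 % make row 5 a missing-value row

mdl  = fitlm(X,y,'Exclude',[1 2]);            % exclude rows 1 and 2 from the fit
info = mdl.ObservationInfo;

nUsed = mdl.NumObservations;                  % 50 - 2 excluded - 1 missing = 47
used  = find(info.Subset);                    % rows actually used in the fit
```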

ObservationNames

Cell array of strings containing the names of the observations used in the fit.

• If the fit is based on a table or dataset containing observation names, ObservationNames uses those names.

• Otherwise, ObservationNames is an empty cell array.

PredictorNames

Cell array of strings, the names of the predictors used in fitting the model.

Residuals

Table of residuals, with one row for each observation and these variables.

| Field | Description |
| --- | --- |
| Raw | Observed minus fitted values. |
| Pearson | Raw residuals divided by RMSE. |
| Standardized | Raw residuals divided by their estimated standard deviation. |
| Studentized | Residual divided by an independent estimate of the residual standard deviation. The residual for observation i is divided by an estimate of the error standard deviation based on all observations except for observation i. |

To obtain any of these columns as a vector, index into the property using dot notation. For example, in a model mdl, the ordinary raw residual vector r is:

`r = mdl.Residuals.Raw`

Rows not used in the fit because of missing values (in ObservationInfo.Missing) contain NaN values.

Rows not used in the fit because of excluded values (in ObservationInfo.Excluded) contain NaN values, with the following exceptions:

• raw contains the difference between the observed and predicted values.

• standardized is the residual, standardized in the usual way.

• studentized matches the standardized values because this residual is not used in the estimate of the residual standard deviation.
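The residual types can be reconstructed from the raw residuals, the leverage values, and the delete-1 variances. The sketch below uses synthetic data (assumed) and the usual textbook formulas: the estimated standard deviation of residual i is RMSE*sqrt(1 - h_ii), and the studentized version replaces MSE with S2_i. It is a consistency check under those assumptions, not a description of the internal implementation.

```matlab
rng(0);                                       % synthetic data (assumed)
X = randn(50,3);
y = 1 + X*[2; -1; 0.5] + 0.3*randn(50,1);
mdl = fitlm(X,y);

R   = mdl.Residuals;
h   = mdl.Diagnostics.Leverage;               % leverage h_ii
s2i = mdl.Diagnostics.S2_i;                   % delete-1 variance estimates

pearson      = R.Raw ./ mdl.RMSE;                     % raw / RMSE
standardized = R.Raw ./ (mdl.RMSE .* sqrt(1 - h));    % raw / estimated std of residual
studentized  = R.Raw ./ sqrt(s2i .* (1 - h));         % uses delete-1 variance
```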

ResponseName

String naming the response variable.

RMSE

Root mean squared error (residuals), sqrt(MSE).

Robust

Structure that is empty unless fitlm constructed the model using robust regression.

| Field | Description |
| --- | --- |
| WgtFun | Robust weighting function, such as 'bisquare' (see robustfit) |
| Tune | Value specified for tuning parameter (can be []) |
| Weights | Vector of weights used in final iteration of robust fit |

Rsquared

Proportion of total sum of squares explained by the model. The ordinary R-squared value relates to the SSR and SST properties:

Rsquared = SSR/SST = 1 - SSE/SST.

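For a linear or nonlinear model, Rsquared is a structure with two fields:

• Ordinary — Ordinary (unadjusted) R-squared

• Adjusted — R-squared adjusted for the number of coefficients

For a generalized linear model, Rsquared is a structure with five fields:

• Ordinary — Ordinary (unadjusted) R-squared

• Adjusted — R-squared adjusted for the number of coefficients

• LLR — Log-likelihood ratio

• Deviance — Deviance

• AdjGeneralized — Adjusted generalized R-squared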

To obtain any of these values as a scalar, index into the property using dot notation. For example, the adjusted R-squared value in mdl is

`r2 = mdl.Rsquared.Adjusted`
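The identities relating Rsquared to the sums of squares can be verified on a small fit. The sketch below assumes synthetic data and a model with an intercept; the adjusted value uses the standard degrees-of-freedom correction, 1 - (SSE/DFE)/(SST/(n-1)).

```matlab
rng(0);                                       % synthetic data (assumed)
X = randn(50,3);
y = 1 + X*[2; -1; 0.5] + 0.3*randn(50,1);
mdl = fitlm(X,y);

n     = mdl.NumObservations;
r2    = 1 - mdl.SSE/mdl.SST;                          % ordinary R-squared
r2alt = mdl.SSR/mdl.SST;                              % same value, via SSR
r2adj = 1 - (mdl.SSE/mdl.DFE) / (mdl.SST/(n - 1));    % adjusted R-squared
gap   = mdl.SST - (mdl.SSE + mdl.SSR);                % should be numerically zero
```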

SSE

Sum of squared errors (residuals).

The Pythagorean theorem implies

SST = SSE + SSR.

SSR

Regression sum of squares, the sum of squared deviations of the fitted values from their mean.

The Pythagorean theorem implies

SST = SSE + SSR.

SST

Total sum of squares, the sum of squared deviations of y from mean(y).

The Pythagorean theorem implies

SST = SSE + SSR.

Steps

Structure that is empty unless stepwiselm constructed the model.

| Field | Description |
| --- | --- |
| Start | Formula representing the starting model |
| Lower | Formula representing the lower bound model; these terms must remain in the model |
| Upper | Formula representing the upper bound model; the model cannot contain more terms than Upper |
| Criterion | Criterion used for the stepwise algorithm, such as 'sse' |
| PEnter | Value of the parameter, such as 0.05 |
| PRemove | Value of the parameter, such as 0.10 |
| History | Table representing the steps taken in the fit |

The History table has one row for each step including the initial fit, and the following variables (columns).

| Field | Description |
| --- | --- |
| Action | Action taken during this step, one of: 'Start' (first step), 'Add' (a term is added), 'Remove' (a term is removed) |
| TermName | For the 'Start' step, the starting model specification. For 'Add' or 'Remove' steps, the term moved in that step. |
| Terms | Terms matrix (see modelspec of fitlm) |
| DF | Regression degrees of freedom after this step |
| delDF | Change in regression degrees of freedom from previous step (negative for steps that remove a term) |
| Deviance | Deviance (residual sum of squares) at that step |
| FStat | F statistic that led to this step |
| PValue | p-value of the F statistic |
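A sketch of inspecting Steps after a stepwise fit. The synthetic data are an assumption in which only x1 and x3 affect the response, and the starting model ('constant'), upper bound ('linear'), and 'Verbose',0 setting are illustrative choices rather than recommendations.

```matlab
rng(0);                                           % synthetic data (assumed)
X = randn(100,4);
y = 2 + 3*X(:,1) - 2*X(:,3) + 0.5*randn(100,1);   % only x1 and x3 matter

% Start from a constant model and allow at most linear terms.
mdl = stepwiselm(X,y,'constant','Upper','linear','Verbose',0);

steps   = mdl.Steps;                              % empty for models built by fitlm
history = steps.History;                          % one row per step, including the start
kept    = mdl.CoefficientNames;                   % terms remaining in the final model
```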

VariableInfo

Table containing metadata about Variables. There is one row for each variable, and the following columns.

| Field | Description |
| --- | --- |
| Class | String giving variable class, such as 'double' |
| Range | Cell array giving variable range. For a continuous variable, a two-element vector [min,max] of the minimum and maximum values. For a categorical variable, a cell array of distinct variable values. |
| InModel | Logical vector, where true indicates the variable is in the model |
| IsCategorical | Logical vector, where true indicates a categorical variable |

VariableNames

Cell array of strings containing names of the variables in the fit.

• If the fit is based on a table or dataset, this property provides the names of the variables in that table or dataset.

• If the fit is based on a predictor matrix and response vector, VariableNames contains the values in the VarNames name-value pair of the fitting method.

• Otherwise the variables have the default fitting names.

Variables

Table containing the data, both predictor and response values, that the fitting function used to construct the fit. If the fit is based on a table or dataset array, Variables contains all of the data from that table or dataset array. Otherwise, Variables is a table created from the input data matrix X and response vector y.

## Methods

| Method | Description |
| --- | --- |
| addTerms | Add terms to linear regression model |
| anova | Analysis of variance for linear model |
| coefCI | Confidence intervals of coefficient estimates of linear model |
| coefTest | Linear hypothesis test on linear regression model coefficients |
| disp | Display linear regression model |
| dwtest | Durbin-Watson test of linear model |
| feval | Evaluate linear regression model prediction |
| fit | Create linear regression model |
| plot | Scatter plot or added variable plot of linear model |
| plotAdded | Added variable plot or leverage plot for linear model |
| plotAdjustedResponse | Adjusted response plot for linear regression model |
| plotDiagnostics | Plot diagnostics of linear regression model |
| plotEffects | Plot main effects of each predictor in linear regression model |
| plotInteraction | Plot interaction effects of two predictors in linear regression model |
| plotResiduals | Plot residuals of linear regression model |
| plotSlice | Plot of slices through fitted linear regression surface |
| predict | Predict response of linear regression model |
| random | Simulate responses for linear regression model |
| removeTerms | Remove terms from linear model |
| step | Improve linear regression model by adding or removing terms |
| stepwise | Create linear regression model by stepwise regression |

## Copy Semantics

Value. To learn how value classes affect copy operations, see Copying Objects in the MATLAB® documentation.

## Definitions

### Hat Matrix

The hat matrix H is defined in terms of the data matrix X:

$H=X{\left({X}^{T}X\right)}^{-1}{X}^{T}.$

The diagonal elements $h_{ii}$ satisfy

$\begin{array}{l}0\le {h}_{ii}\le 1\\ \sum _{i=1}^{n}{h}_{ii}=p,\end{array}$

where n is the number of observations (rows of X), and p is the number of coefficients in the regression model.

### Leverage

The leverage of observation i is the value of the ith diagonal term, $h_{ii}$, of the hat matrix H. Because the sum of the leverage values is p (the number of coefficients in the regression model), an observation i can be considered to be an outlier if its leverage substantially exceeds p/n, where n is the number of observations.
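The hat matrix and leverage definitions can be sketched numerically. The example assumes synthetic data and assumes that X in the formula is the full design matrix, that is, the predictor matrix with a leading column of ones for the constant term (named Xd here for illustration).

```matlab
rng(0);                                       % synthetic data (assumed)
X = randn(50,3);
y = 1 + X*[2; -1; 0.5] + 0.3*randn(50,1);
mdl = fitlm(X,y);

n  = size(X,1);
Xd = [ones(n,1) X];                           % design matrix with intercept column (assumed)
H  = Xd * ((Xd'*Xd) \ Xd');                   % H = X (X'X)^(-1) X'
h  = diag(H);                                 % leverage values h_ii

p       = trace(H);                           % equals the number of coefficients
yhat    = H*y;                                % fitted values via the hat matrix
highLev = find(h > 2*mdl.NumCoefficients/n);  % rule-of-thumb high-leverage rows
```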

### Cook's Distance

Cook's distance is the scaled change in fitted values. Each element in CooksDistance is the normalized change in the vector of coefficients due to the deletion of an observation. The Cook's distance, $D_i$, of observation i is

${D}_{i}=\frac{\sum _{j=1}^{n}{\left({\hat{y}}_{j}-{\hat{y}}_{j\left(i\right)}\right)}^{2}}{p\,MSE},$

where

• ${\hat{y}}_{j}$ is the jth fitted response value.

• ${\hat{y}}_{j\left(i\right)}$ is the jth fitted response value, where the fit does not include observation i.

• MSE is the mean squared error.

• p is the number of coefficients in the regression model.

Cook's distance is algebraically equivalent to the following expression:

${D}_{i}=\frac{{r}_{i}^{2}}{p\,MSE}\left(\frac{{h}_{ii}}{{\left(1-{h}_{ii}\right)}^{2}}\right),$

where $r_i$ is the ith residual, and $h_{ii}$ is the ith leverage value.

CooksDistance is an n-by-1 column vector in the Diagnostics table of the LinearModel object.
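Both expressions for Cook's distance can be compared on a small fit. This sketch assumes synthetic data; the closed form is evaluated for every observation, and the deletion-based definition is evaluated for a single, arbitrarily chosen observation i = 1 by refitting with that row excluded.

```matlab
rng(0);                                       % synthetic data (assumed)
X = randn(50,3);
y = 1 + X*[2; -1; 0.5] + 0.3*randn(50,1);
mdl = fitlm(X,y);

r = mdl.Residuals.Raw;                        % residuals r_i
h = mdl.Diagnostics.Leverage;                 % leverage h_ii
p = mdl.NumCoefficients;

% Closed form: D_i = r_i^2/(p*MSE) * h_ii/(1 - h_ii)^2
D = (r.^2 ./ (p*mdl.MSE)) .* (h ./ (1 - h).^2);

% Deletion form for observation i: refit without row i, compare fitted values.
i      = 1;
mdl_i  = fitlm(X,y,'Exclude',i);
yhat_i = predict(mdl_i, X);                   % fitted values from the delete-1 fit
D1     = sum((mdl.Fitted - yhat_i).^2) / (p*mdl.MSE);
```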

## Examples

### Linear Regression Model of Matrix Data

Fit a linear model of the Hald data.

```
load hald
X = ingredients; % predictor variables
y = heat; % response
```

Fit a default linear model to the data.

`mdl = fitlm(X,y)`
```
mdl =

Linear regression model:
    y ~ 1 + x1 + x2 + x3 + x4

Estimated Coefficients:
                   Estimate      SE         tStat       pValue
    (Intercept)      62.405     70.071      0.8906     0.39913
    x1               1.5511    0.74477      2.0827    0.070822
    x2              0.51017    0.72379     0.70486      0.5009
    x3              0.10191    0.75471     0.13503     0.89592
    x4             -0.14406    0.70905    -0.20317     0.84407

Number of observations: 13, Error degrees of freedom: 8
Root Mean Squared Error: 2.45
F-statistic vs. constant model: 111, p-value = 4.76e-07
```

### Linear Regression with Categorical Predictor and Nonlinear Model

Fit a model of a table that contains a categorical predictor. Use a formula that is quadratic in the Weight predictor; the model remains linear in its coefficients.

`load carsmall`

Construct a table containing continuous predictor variable Weight, nominal predictor variable Year, and response variable MPG.

```
tbl = table(MPG,Weight);
tbl.Year = nominal(Model_Year);
```

Create a fitted model of MPG as a function of Year, Weight, and Weight^2. (You don't have to include Weight explicitly in your formula because it is a lower-order term of Weight^2.)

`mdl = fitlm(tbl,'MPG ~ Year + Weight^2')`
```
mdl =

Linear regression model:
    MPG ~ 1 + Weight + Year + Weight^2

Estimated Coefficients:
                    Estimate         SE           tStat       pValue
    (Intercept)        54.206        4.7117     11.505    2.6648e-19
    Weight          -0.016404     0.0031249    -5.2493    1.0283e-06
    Year_76            2.0887       0.71491     2.9215     0.0044137
    Year_82            8.1864       0.81531     10.041    2.6364e-16
    Weight^2       1.5573e-06    4.9454e-07      3.149     0.0022303

Number of observations: 94, Error degrees of freedom: 89
Root Mean Squared Error: 2.78
F-statistic vs. constant model: 172, p-value = 5.52e-41
```

fitlm creates two dummy (indicator) variables for the nominal variable Year. The dummy variable Year_76 takes the value 1 if the model year is 1976 and 0 otherwise. The dummy variable Year_82 takes the value 1 if the model year is 1982 and 0 otherwise. The year 1970 is the reference year. The corresponding model is

$\widehat{MPG}=54.206-0.0164\left(\text{Weight}\right)+2.0887\left(\text{Year\_76}\right)+8.1864\left(\text{Year\_82}\right)+\left(1.557\times {10}^{-6}\right){\left(\text{Weight}\right)}^{2}$
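To see how the dummy variables enter a prediction, the fitted equation can be evaluated by hand. The sketch below copies the coefficient estimates from the displayed table and uses a hypothetical car weight of 3000 (an assumption chosen for illustration); because the coefficients are rounded as displayed, the results are approximate.

```matlab
% Coefficients copied from the displayed table:
% (Intercept), Weight, Year_76, Year_82, Weight^2
b = [54.206; -0.016404; 2.0887; 8.1864; 1.5573e-06];
w = 3000;                                             % hypothetical weight

mpg70 = b(1) + b(2)*w + b(3)*0 + b(4)*0 + b(5)*w^2;   % reference year 1970
mpg76 = b(1) + b(2)*w + b(3)*1 + b(4)*0 + b(5)*w^2;   % Year_76 = 1
mpg82 = b(1) + b(2)*w + b(3)*0 + b(4)*1 + b(5)*w^2;   % Year_82 = 1
```

At equal weight, the three predictions differ only by the dummy-variable coefficients, so the 1976 and 1982 values exceed the 1970 value by 2.0887 and 8.1864 respectively.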

### Robust Linear Regression Model

Fit a linear regression model of the Hald data using robust fitting.

```
load hald
X = ingredients; % predictor variables
y = heat; % response
```

Fit a robust linear model to the data.

`mdl = fitlm(X,y,'linear','RobustOpts','on')`
```
mdl =

Linear regression model (robust fit):
    y ~ 1 + x1 + x2 + x3 + x4

Estimated Coefficients:
                   Estimate      SE         tStat       pValue
    (Intercept)       60.09     75.818     0.79256      0.4509
    x1               1.5753    0.80585      1.9548    0.086346
    x2               0.5322    0.78315     0.67957     0.51596
    x3              0.13346     0.8166     0.16343     0.87424
    x4             -0.12052     0.7672    -0.15709     0.87906

Number of observations: 13, Error degrees of freedom: 8
Root Mean Squared Error: 2.65
F-statistic vs. constant model: 94.6, p-value = 9.03e-07
```

## Algorithms

The main fitting algorithm is QR decomposition. For robust fitting, the algorithm is robustfit.
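As an illustration of least squares by QR decomposition, the following sketch solves the normal problem with an economy-size QR factorization and compares the result with the fitted coefficients. It is a textbook version on synthetic data (assumed), with a hand-built design matrix Xd; it is not claimed to reproduce the internal steps of fitlm, such as its handling of rank deficiency.

```matlab
rng(0);                                       % synthetic data (assumed)
X = randn(50,3);
y = 1 + X*[2; -1; 0.5] + 0.3*randn(50,1);
mdl = fitlm(X,y);

n  = size(X,1);
Xd = [ones(n,1) X];                           % design matrix with intercept column
[Q,R] = qr(Xd,0);                             % economy-size QR: Xd = Q*R
b = R \ (Q'*y);                               % solve R*b = Q'*y by back substitution
```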

## Alternatives

To remove redundant predictors in linear regression using lasso or elastic net, use the lasso function.

To regularize a regression with correlated terms using ridge regression, use the ridge or lasso functions.

To regularize a regression with correlated terms using partial least squares, use the plsregress function.
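The alternatives can be sketched side by side on data with two strongly correlated predictors. The synthetic data, the 5-fold cross-validation, the ridge parameter 1, and the choice of 2 PLS components are all assumptions made for illustration, not tuned recommendations.

```matlab
rng(0);                                       % synthetic data (assumed)
X = randn(60,4);
X(:,4) = X(:,3) + 0.05*randn(60,1);           % x4 nearly duplicates x3
y = 1 + 2*X(:,1) - X(:,3) + 0.2*randn(60,1);

% Lasso with 5-fold cross-validation; pick the sparsest fit within one
% standard error of the minimum cross-validated error.
[B,FitInfo] = lasso(X,y,'CV',5);
bLasso = B(:,FitInfo.Index1SE);               % 4-by-1, intercept stored separately

% Ridge regression with parameter 1; scaled = 0 returns coefficients on the
% original data scale, with the constant term first.
bRidge = ridge(y,X,1,0);                      % 5-by-1

% Partial least squares with 2 components; bPLS includes the constant term.
[~,~,~,~,bPLS] = plsregress(X,y,2);           % 5-by-1
```

For elastic net rather than pure lasso, the lasso function accepts an 'Alpha' name-value pair between 0 and 1.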