TreeBagger
Class: TreeBagger
Create bag of decision trees
Individual decision trees tend to overfit. Bootstrap-aggregated (bagged) decision trees combine the results of many decision trees, which reduces the effects of overfitting and improves generalization. TreeBagger grows the decision trees in the ensemble using bootstrap samples of the data. Also, TreeBagger selects a random subset of predictors to use at each decision split, as in the random forest algorithm [1].
By default, TreeBagger bags classification trees. To bag regression trees instead, specify 'Method','regression'.
For regression problems, TreeBagger supports mean and quantile regression (that is, quantile regression forest [5]).
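For example, here is a minimal sketch of a quantile regression forest, assuming the carsmall example data set that ships with the toolbox; the predictor choice and query point are illustrative only:
% Grow a bagged regression ensemble and query mean and quantile predictions.
load carsmall                                % provides Horsepower, Weight, MPG
X = [Horsepower Weight];
idx = ~isnan(MPG) & all(~isnan(X),2);        % keep complete observations
Mdl = TreeBagger(100,X(idx,:),MPG(idx),'Method','regression');
yMean = predict(Mdl,[150 3000]);                                     % mean prediction
yQ    = quantilePredict(Mdl,[150 3000],'Quantile',[0.25 0.5 0.75]);  % quantile regression forest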
Syntax
Mdl = TreeBagger(NumTrees,Tbl,ResponseVarName)
Mdl = TreeBagger(NumTrees,Tbl,formula)
Mdl = TreeBagger(NumTrees,Tbl,Y)
B = TreeBagger(NumTrees,X,Y)
B = TreeBagger(NumTrees,X,Y,Name,Value)
Description
Mdl = TreeBagger(NumTrees,Tbl,ResponseVarName) returns an ensemble of NumTrees bagged classification trees trained using the sample data in the table Tbl. ResponseVarName is the name of the response variable in Tbl.
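For example, a minimal sketch using the fisheriris example data set; the table variable names are chosen for illustration:
load fisheriris                                        % meas (150-by-4), species (150-by-1 cell)
Tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'});
Tbl.Species = species;                                 % response variable stored in the table
Mdl = TreeBagger(50,Tbl,'Species');                    % response identified by its name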
Mdl = TreeBagger(NumTrees,Tbl,formula) returns an ensemble of bagged classification trees trained using the sample data in the table Tbl. formula is an explanatory model of the response and a subset of predictor variables in Tbl used to fit Mdl. Specify formula using Wilkinson notation. For more information, see Wilkinson Notation.
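For example, a minimal sketch of the formula form using the carsmall example data set; the particular formula is illustrative, and the numeric response requires 'Method','regression':
load carsmall
Tbl = rmmissing(table(Weight,Horsepower,Cylinders,MPG));    % drop rows with missing values
Mdl = TreeBagger(100,Tbl,'MPG ~ Weight + Horsepower', ...
    'Method','regression');                                 % model MPG using two of the table variables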
Mdl = TreeBagger(NumTrees,Tbl,Y) returns an ensemble of classification trees using the predictor variables in table Tbl and class labels in vector Y.
Y is an array of response data. Elements of Y correspond to the rows of Tbl. For classification, Y is the set of true class labels. Labels can be any grouping variable, that is, a numeric or logical vector, character matrix, string array, cell array of character vectors, or categorical vector. TreeBagger converts labels to a cell array of character vectors. For regression, Y is a numeric vector. To grow regression trees, you must specify the name-value pair 'Method','regression'.
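For example, a minimal sketch with the predictors in a table and the labels supplied as a separate vector (fisheriris data, illustrative variable names):
load fisheriris
Tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'});   % predictors only
Mdl = TreeBagger(50,Tbl,species);                                % labels supplied separately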
B = TreeBagger(NumTrees,X,Y) creates an ensemble B of NumTrees decision trees for predicting response Y as a function of predictors in the numeric matrix of training data, X. Each row in X represents an observation, and each column represents a predictor or feature.
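For example, a minimal sketch with a numeric predictor matrix and a label vector (fisheriris data):
load fisheriris
B = TreeBagger(100,meas,species);     % 100 bagged classification trees
label = predict(B,meas(1,:));         % predicted class, returned as a cell array of character vectors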
B = TreeBagger(NumTrees,X,Y,Name,Value) specifies optional parameter name-value pairs:
'InBagFraction' | Fraction of the input data to sample with replacement for growing each new tree. Default value is 1. |
'Cost' | Square matrix C, where C(i,j) is the cost of classifying a point into class j if its true class is i (that is, the rows correspond to the true class and the columns correspond to the predicted class). Alternatively, Cost can be a structure S with two fields: S.ClassNames, containing the group names, and S.ClassificationCosts, containing the cost matrix. The default value is C(i,j) = 1 if i ~= j, and C(i,j) = 0 if i = j. If Cost is highly skewed, then, for in-bag samples, the software oversamples unique observations from the class that has a large penalty. |
'SampleWithReplacement' | 'on' to sample with replacement or 'off' to sample without replacement. If you sample without replacement, you need to set 'InBagFraction' to a value less than one. Default is 'on' . |
'OOBPrediction' | 'on' to store information on which observations are out of bag for each tree. oobPredict can use this information to compute the predicted class probabilities for each tree in the ensemble. Default is 'off'. |
'OOBPredictorImportance' | 'on' to store out-of-bag estimates of feature importance in the ensemble. Default is 'off' . Specifying 'on' also sets the 'OOBPrediction' value to 'on' . If an analysis of predictor importance is your goal, then also specify 'PredictorSelection','curvature' or 'PredictorSelection','interaction-curvature' . For more details, see fitctree or fitrtree . |
'Method' | Either 'classification' or 'regression' . Regression requires a numeric Y . |
'NumPredictorsToSample' | Number of variables to select at random for each decision split. Default is the square root of the number of variables for classification and one third of the number of variables for regression. Valid values are 'all' or a positive integer. Setting this argument to any valid value but 'all' invokes Breiman's random forest algorithm [1]. |
'NumPrint' | Number of training cycles (grown trees) after which TreeBagger displays a diagnostic message showing training progress. Default is no diagnostic messages. |
'MinLeafSize' | Minimum number of observations per tree leaf. Default is 1 for classification and 5 for regression. |
'Options' | A structure that specifies options that govern the computation when growing the ensemble of decision trees. One option requests that the computation of decision trees on multiple bootstrap replicates uses multiple processors, if Parallel Computing Toolbox™ is available. Two options specify the random number streams to use in selecting bootstrap replicates. You can create this argument with a call to statset. |
'Prior' | Prior probabilities for each class. Specify as one of: 'Empirical' (the default), which determines class probabilities from the class frequencies in Y; 'Uniform', which sets all class probabilities equal; a numeric vector with one element for each distinct class in Y; or a structure S with two fields, S.ClassNames containing the class names and S.ClassProbs containing the corresponding probabilities. If you set values for both Weights and Prior, the weights are renormalized to add up to the value of the prior probability in the respective class. If Prior is highly skewed, then, for in-bag samples, the software oversamples unique observations from the class that has a large prior probability. |
'PredictorNames' | Predictor variable names, specified as the comma-separated pair consisting of 'PredictorNames' and a string array or cell array of unique character vectors. If you supply the predictor data as a matrix X, you can use 'PredictorNames' to assign names to the predictor columns. If you supply the data as a table Tbl, you can use 'PredictorNames' to choose which predictor variables to use in training. |
'CategoricalPredictors' | Categorical predictors list, specified as the comma-separated pair consisting of 'CategoricalPredictors' and one of the following: a numeric vector of indices into the predictor list, a logical vector whose true entries mark the categorical predictors, a string array or cell array of character vectors containing predictor names (matching the names in 'PredictorNames' or the variable names of Tbl), or 'all', meaning all predictors are categorical. |
'ChunkSize' | Chunk size, specified as the comma-separated pair consisting of 'ChunkSize' and a positive integer that sets the number of observations in each chunk. Note: This option applies only when you use TreeBagger on tall arrays. |
In addition to the optional arguments above, TreeBagger accepts several optional arguments of fitctree and fitrtree.
Examples
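Grow a bagged regression ensemble and estimate its out-of-bag error. This is a minimal sketch using the carsmall example data set; the parameter values and query point are illustrative only, not recommendations:
load carsmall
X = [Weight Horsepower Acceleration];
idx = ~isnan(MPG) & all(~isnan(X),2);          % keep complete observations
Mdl = TreeBagger(200,X(idx,:),MPG(idx), ...
    'Method','regression', ...
    'OOBPrediction','on', ...                  % track out-of-bag observations
    'MinLeafSize',5, ...
    'NumPredictorsToSample',2, ...
    'PredictorNames',{'Weight','Horsepower','Acceleration'});
oobMSE = oobError(Mdl);                        % out-of-bag mean squared error vs. number of trees
yHat   = predict(Mdl,[3000 130 15]);           % predicted MPG for a new observation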
Tips
Avoid large estimated out-of-bag error variances by setting a more balanced misclassification cost matrix or a less skewed prior probability vector.
The Trees property of B stores a cell array of B.NumTrees CompactClassificationTree or CompactRegressionTree model objects. For a textual or graphical display of tree t in the cell array, enter view(B.Trees{t}).
Standard CART tends to select split predictors containing many distinct values, for example, continuous variables, over those containing few distinct values, for example, categorical variables [4]. Consider specifying the curvature or interaction test if any of the following are true:
There are predictors that have relatively fewer distinct values than other predictors, for example, if the predictor data set is heterogeneous.
An analysis of predictor importance is your goal. TreeBagger stores predictor importance estimates in the OOBPermutedPredictorDeltaError property of Mdl (see the sketch after these tips).
For more information on predictor selection, see PredictorSelection for classification trees or PredictorSelection for regression trees.
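The following minimal sketch illustrates the predictor-importance workflow described in the tips above, using the fisheriris example data set; the number of trees is illustrative only:
% Grow a classification forest with the curvature test for split-predictor
% selection, then read the out-of-bag permutation importance estimates.
load fisheriris
Mdl = TreeBagger(100,meas,species, ...
    'OOBPredictorImportance','on', ...          % also turns on 'OOBPrediction'
    'PredictorSelection','curvature');          % passed through to fitctree
imp = Mdl.OOBPermutedPredictorDeltaError;       % one importance estimate per predictor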
Algorithms
If you specify the Cost, Prior, and Weights name-value arguments, the output model object stores the specified values in the Cost, Prior, and W properties, respectively. The Cost property stores the user-specified cost matrix (C) as is. The Prior and W properties store the prior probabilities and observation weights, respectively, after normalization. For model training, the software updates the prior probabilities and observation weights to incorporate the penalties described in the cost matrix. For details, see Misclassification Cost Matrix, Prior Probabilities, and Observation Weights.
TreeBagger generates in-bag samples by oversampling classes with large misclassification costs and undersampling classes with small misclassification costs. Consequently, out-of-bag samples have fewer observations from classes with large misclassification costs and more observations from classes with small misclassification costs. If you train a classification ensemble using a small data set and a highly skewed cost matrix, then the number of out-of-bag observations per class might be very low. Therefore, the estimated out-of-bag error might have a large variance and might be difficult to interpret. The same phenomenon can occur for classes with large prior probabilities.
For details on selecting split predictors and node-splitting algorithms when growing decision trees, see Algorithms for classification trees and Algorithms for regression trees.
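The following minimal sketch shows where the specified values end up; the cost matrix is arbitrary and for illustration only:
% Specify a misclassification cost matrix and inspect how the model stores it.
load fisheriris
C = [0 1 1; 1 0 1; 2 2 0];                   % rows are the true class, columns the predicted class
Mdl = TreeBagger(50,meas,species,'Cost',C);
Mdl.Cost      % the cost matrix, stored as specified
Mdl.Prior     % prior probabilities after normalization
Mdl.W         % observation weights after normalization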
Alternative Functionality
Statistics and Machine Learning Toolbox™ offers three objects for bagging and random forest:
ClassificationBaggedEnsemble created by fitcensemble for classification
RegressionBaggedEnsemble created by fitrensemble for regression
TreeBagger created by TreeBagger for classification and regression
For details about the differences between TreeBagger and bagged ensembles (ClassificationBaggedEnsemble and RegressionBaggedEnsemble), see Comparison of TreeBagger and Bagged Ensembles.
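For comparison, a minimal sketch of the fitcensemble route to a bagged classification ensemble (fisheriris data, default tree learners):
load fisheriris
EnsMdl = fitcensemble(meas,species,'Method','Bag');   % returns a ClassificationBaggedEnsemble
label  = predict(EnsMdl,meas(1,:));                   % predicted class label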
References
[1] Breiman, L. "Random Forests." Machine Learning, Vol. 45, 2001, pp. 5–32.
[2] Breiman, L., J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Boca Raton, FL: CRC Press, 1984.
[3] Loh, W.Y. “Regression Trees with Unbiased Variable Selection and Interaction Detection.” Statistica Sinica, Vol. 12, 2002, pp. 361–386.
[4] Loh, W.Y. and Y.S. Shih. “Split Selection Methods for Classification Trees.” Statistica Sinica, Vol. 7, 1997, pp. 815–840.
[5] Meinshausen, N. “Quantile Regression Forests.” Journal of Machine Learning Research, Vol. 7, 2006, pp. 983–999.