incrementalLearner
Syntax
IncrementalForest = incrementalLearner(forest)
IncrementalForest = incrementalLearner(forest,Name=Value)
Description
IncrementalForest = incrementalLearner(forest) returns a robust random cut forest (RRCF) model IncrementalForest for anomaly detection, initialized using the parameters provided in the RRCF model forest. Because its property values reflect the knowledge gained from forest, IncrementalForest can detect anomalies given new observations, and it is warm, meaning that the incremental fit function can return scores and detect anomalies.
IncrementalForest = incrementalLearner(forest,Name=Value) specifies additional options using one or more name-value arguments. For example, ScoreWarmupPeriod=500 specifies to process 500 observations before score computation and anomaly detection.
         
Examples
Train an incremental robust random cut forest (RRCF) model and perform anomaly detection on a data set with categorical predictors.
Load Data
Load census1994.mat. The data set consists of demographic data from the US Census Bureau.
load census1994.mat
incrementalRobustRandomCutForest does not use observations with missing values. Remove missing values in the data to reduce memory consumption and speed up training. Keep only the first 1000 observations in the training data set and the first 2000 observations in the test data set.
adultdata = rmmissing(adultdata);
adulttest = rmmissing(adulttest);
Xtrain = adultdata(1:1000,:);
Xstream = adulttest(1:2000,:);
Train RRCF Model
Fit an RRCF model to the training data. Specify an anomaly contamination fraction of 0.001.
rng(0,"twister"); % For reproducibility
TTforest = rrcforest(Xtrain,ContaminationFraction=0.001);
details(TTforest)
  RobustRandomCutForest with properties:
        CollusiveDisplacement: 'maximal'
                  NumLearners: 100
    NumObservationsPerLearner: 256
                           Mu: []
                        Sigma: []
        CategoricalPredictors: [2 4 6 7 8 9 10 14 15]
        ContaminationFraction: 1.0000e-03
               ScoreThreshold: 55.5745
               PredictorNames: {'age'  'workClass'  'fnlwgt'  'education'  'education_num'  'marital_status'  'occupation'  'relationship'  'race'  'sex'  'capital_gain'  'capital_loss'  'hours_per_week'  'native_country'  'salary'}
  Methods, Superclasses
TTforest is a RobustRandomCutForest model object representing a traditionally trained RRCF model. The software identifies nine variables in the data as categorical predictors because they contain string arrays.
Convert Trained Model
Convert the traditionally trained RRCF model to an RRCF model for incremental learning.
Incrementalforest = incrementalLearner(TTforest);
Incrementalforest is an incrementalRobustRandomCutForest model object that is ready for incremental learning and anomaly detection. 
Fit Incremental Model and Detect Anomalies
Perform incremental learning on the Xstream data by using the fit function. To simulate a data stream, fit the model in chunks of 100 observations at a time. At each iteration:
- Process 100 observations. 
- Overwrite the previous incremental model with a new one fitted to the incoming observations. 
- Store medianscore, the median score value of the data chunk, to see how it evolves during incremental learning.
- Store threshold, the score threshold value for anomalies, to see how it evolves during incremental learning.
- Store numAnom, the number of detected anomalies in the chunk, to see how it evolves during incremental learning.
n = numel(Xstream(:,1));
numObsPerChunk = 100;
nchunk = floor(n/numObsPerChunk);
medianscore = zeros(nchunk,1);
numAnom = zeros(nchunk,1);
threshold = zeros(nchunk,1);

% Incremental fitting
for j = 1:nchunk
    ibegin = min(n,numObsPerChunk*(j-1) + 1);
    iend = min(n,numObsPerChunk*j);
    idx = ibegin:iend;
    [Incrementalforest,tf,scores] = fit(Incrementalforest,Xstream(idx,:));
    medianscore(j) = median(scores);
    numAnom(j) = sum(tf);
    threshold(j) = Incrementalforest.ScoreThreshold;
end
Analyze Incremental Model During Training
To see how the median score, score threshold, and number of detected anomalies per chunk evolve during training, plot them on separate tiles.
tiledlayout(3,1);
nexttile
plot(medianscore)
ylabel("Median Score")
xlabel("Iteration")
xlim([0 nchunk])
nexttile
plot(threshold)
ylabel("Score Threshold")
xlabel("Iteration")
xlim([0 nchunk])
nexttile
plot(numAnom,"+")
ylabel("Anomalies")
xlabel("Iteration")
xlim([0 nchunk])
ylim([0 max(numAnom)+0.2])

totalanomalies=sum(numAnom)
totalanomalies = 1
anomfrac= totalanomalies/n
anomfrac = 5.0000e-04
fit updates the model, returns the observation scores, and flags observations with scores above the score threshold value as anomalies. A low score value indicates a normal observation, and a high value indicates an anomaly. The median score fluctuates between approximately 230 and 270. The score threshold rises from a value of 260 after the first iteration and steadily approaches 285 after 12 iterations. The software detected 1 anomaly in the Xstream data, yielding a total contamination fraction of 0.0005.
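The relationship between scores, the score threshold, and the reported contamination fraction can be sketched with a few lines of base MATLAB. The score and threshold values below are made up for illustration and are not taken from the census example.

```matlab
% Made-up scores for one chunk and a made-up threshold (illustration only).
scores = [12.1; 48.9; 55.0; 61.3; 20.4];
scoreThreshold = 58;

% An observation is flagged when its score lies above the threshold.
tf = scores > scoreThreshold;

% Number of flagged observations and the resulting contamination fraction.
numFlagged = sum(tf);
fracFlagged = numFlagged/numel(scores);
```

Here only the fourth score exceeds the threshold, so numFlagged is 1 and fracFlagged is 0.2. The same arithmetic produces anomfrac = totalanomalies/n in the example.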
Train a robust random cut forest (RRCF) model on a simulated, noisy, periodic shingled time series containing no anomalies by using rrcforest. Convert the trained model to an incremental learner object, and then incrementally fit the time series and detect anomalies.   
Create Simulated Data Stream
Create a simulated data stream of observations representing a noisy sinusoid signal.
rng(0,"twister"); % For reproducibility
period = 100;
n = 2001+period;
sigma = 0.04;
a = linspace(1,n,n)';
b = sin(2*pi*(a-1)/period)+sigma*randn(n,1);
Introduce an anomalous region into the data stream. Plot the data stream portion that contains the anomalous region, and circle the anomalous data points.
c = 2*(sin(2*pi*(a-35)/period)+sigma*randn(n,1));
b(1150:1170) = c(1150:1170);
scatter(a,b,".")
xlim([900,1200])
xlabel("Observation")
hold on
scatter(a(1150:1170),b(1150:1170),"r")
hold off

Convert the single-featured data set b into a multi-featured data set by shingling [1] with a shingle size equal to the period of the signal. The $i$th shingled observation is a vector of $k$ features with values $b_i, b_{i+1}, \ldots, b_{i+k-1}$, where $k$ is the shingle size.
X = [];
shingleSize = period;
for i = 1:n-shingleSize
    X = [X;b(i:i+shingleSize-1)'];
end
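The shingling loop grows X one row at a time. As a minimal sketch of the same step, assuming the signal is a column vector, the shingles can also be built with a single indexing operation. The toy signal and the names bToy, k, m, Xvec, and Xloop are illustrative and do not appear in the example.

```matlab
bToy = (1:6)';            % toy column-vector signal (assumption for illustration)
k = 3;                    % shingle size
m = numel(bToy) - k;      % number of shingles, matching 1:n-shingleSize in the loop

% Row i of idx holds the indices i, i+1, ..., i+k-1.
idx = (1:m)' + (0:k-1);
Xvec = bToy(idx);

% Reference: the row-by-row construction.
Xloop = [];
for i = 1:m
    Xloop = [Xloop; bToy(i:i+k-1)'];
end

sameResult = isequal(Xvec,Xloop);
```

For this toy signal, both constructions give the rows [1 2 3], [2 3 4], and [3 4 5]. The indexed version avoids reallocating the matrix at every iteration, which can matter for long streams.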
Train Model and Perform Incremental Anomaly Detection
Fit a robust random cut forest model to the first 1000 shingled observations, specifying a contamination fraction of 0. Convert the model to an incrementalRobustRandomCutForest model object. Specify to keep the 100 most recent observations relevant for anomaly detection.
Mdl = rrcforest(X(1:1000,:),ContaminationFraction=0);
IncrementalMdl = incrementalLearner(Mdl,NumObservationsToKeep=100);
To simulate a data stream, process the full shingled data set in chunks of 100 observations at a time. At each iteration:
- Process 100 observations. 
- Calculate scores and detect anomalies using the isanomaly function.
- Store anomIdx, the indices of shingled observations marked as anomalies.
- If the chunk contains fewer than three anomalies, fit and update the previous incremental model. 
n = numel(X(:,1));
numObsPerChunk = 100;
nchunk = floor(n/numObsPerChunk);
anomIdx = [];
allscores = [];

% Incremental fitting
rng("default"); % For reproducibility
for j = 1:nchunk
    ibegin = min(n,numObsPerChunk*(j-1) + 1);
    iend = min(n,numObsPerChunk*j);
    idx = ibegin:iend;
    [isanom,scores] = isanomaly(IncrementalMdl,X(idx,:));
    allscores = [allscores;scores];
    anomIdx = [anomIdx;find(isanom)+ibegin-1];
    if (sum(isanom) < 3)
        IncrementalMdl = fit(IncrementalMdl,X(idx,:));
    end
end
Analyze Incremental Model During Training
At each iteration, the software calculates a score value for each observation in the data chunk. A small positive score value indicates a normal observation, and a large positive value indicates an anomaly. Plot the anomaly score for the observations in the vicinity of the anomaly. Circle the scores of shingles that the software returns as anomalous.
figure
scatter(a(1:2000),allscores,".")
hold on
scatter(a(anomIdx),allscores(anomIdx),20,"or")
xlim([900,1200])
xlabel("Shingle")
ylabel("Score")
hold off

Because the introduced anomalous region begins at observation 1150, and the shingle size is 100, shingle 1051 is the first to show a high anomaly score. Some shingles between 1050 and 1170 have scores lying just below the anomaly score threshold, due to the noise in the sinusoidal signal. The shingle size affects the performance of the model by defining how many subsequent consecutive data points in the original time series the software uses to calculate the anomaly score for each shingle.
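The shingle arithmetic behind this observation can be checked directly. The sketch below assumes, as in the shingling loop, that shingle i covers observations i through i+shingleSize-1. The variable names are illustrative.

```matlab
shingleSize = 100;
anomalyRegion = 1150:1170;   % observations replaced by the anomalous signal

% The first shingle that reaches observation 1150 starts shingleSize-1 earlier.
firstShingle = anomalyRegion(1) - shingleSize + 1;

% The last shingle that still touches the region starts at its final observation.
lastShingle = anomalyRegion(end);

% Observations covered by the first affected shingle.
coveredByFirst = firstShingle:firstShingle+shingleSize-1;
```

This gives firstShingle = 1051 and lastShingle = 1170, and coveredByFirst ends exactly at observation 1150.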
Plot the unshingled data and highlight the introduced anomalous region. Circle the observation number of the first element in each shingle returned by the software as anomalous.
figure
xlim([900,1200])
ylim([-1.5 2])
rectangle(Position=[1150 -1.5 20 3.5],FaceColor=[0.9 0.9 0.9], ...
    EdgeColor=[0.9 0.9 0.9])
hold on
scatter(a,b,".")
scatter(a(anomIdx),b(anomIdx),20,"or")
xlabel("Observation")
hold off

Input Arguments
forest
Traditionally trained RRCF model for anomaly detection, specified as a RobustRandomCutForest model object returned by rrcforest.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Example: 
            incrementalLearner(forest,ObservationRemoval="timedecaying",ScoreWarmupPeriod=500)
            sets the observation removal method to "timedecaying" and specifies
            to process 500 observations before the incremental fit function
            returns scores and detects anomalies.
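As a minimal sketch of such a call, the following trains a small forest on synthetic data and converts it with the two name-value arguments from the example. The data and variable names are illustrative, and reading ScoreWarmupPeriod back from the returned object assumes the incremental model exposes a property of that name.

```matlab
rng(0,"twister")                 % for reproducibility
X = randn(600,3);                % synthetic predictor data (assumption)

% Traditionally trained RRCF model.
forest = rrcforest(X);

% Convert, overriding the removal method and the score warm-up period.
IncrementalForest = incrementalLearner(forest, ...
    ObservationRemoval="timedecaying",ScoreWarmupPeriod=500);

% Assumed property name on the incremental model.
warmup = IncrementalForest.ScoreWarmupPeriod;
```

Because a warm-up period is set, the incremental fit function processes 500 observations before it returns scores and detects anomalies.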
NumObservationsToKeep
Number of the most recent observations relevant for anomaly detection, specified as a nonnegative integer.
Example: NumObservationsToKeep=250
Data Types: single | double
ObservationRemoval
Observation removal method, specified as "oldest", "timedecaying", or "random". When the robust random cut trees reach their capacity, the software removes old observations to accommodate the most recent data.
| Value | Description |
|---|---|
| "oldest" | Oldest observations are removed first. |
| "timedecaying" | Observations are removed randomly in a weighted fashion. Older observations have a higher probability of being removed first. |
| "random" | Observations are removed in random order. |
Data Types: string | char
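The difference between the three removal methods can be pictured with a toy buffer. The sketch below is purely conceptual: it is not the toolbox implementation, the ages bookkeeping is hypothetical, and weighting the removal probability in proportion to age is an assumption chosen only to show that older observations are more likely to go first.

```matlab
rng(0,"twister")          % for reproducibility
ages = [5 4 3 2 1];       % hypothetical ages; entry 1 is the oldest observation

% "oldest": always remove the entry with the largest age.
[~,idxOldest] = max(ages);

% "random": every entry is equally likely to be removed.
idxRandom = randi(numel(ages));

% "timedecaying" (assumed weighting): removal probability proportional to age.
w = ages/sum(ages);
idxDecay = find(rand <= cumsum(w),1);
```

With these weights, the oldest entry is five times as likely to be removed as the newest one, whereas under "random" all entries share the same probability.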
Options
Options for computing in parallel and setting random streams, specified as a structure. Create the Options structure using statset. This table lists the option fields and their values.
| Field Name | Value | Default |
|---|---|---|
| UseParallel | Set this value to true to run computations in parallel. | false |
| UseSubstreams | Set this value to true to run computations in a reproducible manner. To compute reproducibly, set Streams to a type that allows substreams: "mlfg6331_64" or "mrg32k3a". | false |
| Streams | Specify this value as a RandStream object or cell array of such objects. Use a single object except when the UseParallel value is true and the UseSubstreams value is false. In that case, use a cell array that has the same size as the parallel pool. | If you do not specify Streams, then incrementalLearner uses the default stream or streams. |
Note
You need Parallel Computing Toolbox™ to run computations in parallel.
Example: Options=statset(UseParallel=true,UseSubstreams=true,Streams=RandStream("mlfg6331_64"))
Data Types: struct
ScoreWarmupPeriod
Warm-up period before score computation and anomaly detection, specified as a nonnegative integer. This option specifies the number of observations used by the incremental fit function to train the model and estimate the score threshold.
Note
When processing observations during the score warm-up period, the software ignores observations that contain missing values for all predictors.
Example: ScoreWarmupPeriod=200
Data Types: single | double
ScoreWindowSize
Running window size used to estimate the score threshold (ScoreThreshold), specified as a positive integer. The default ScoreWindowSize value is 1000.
If ScoreWindowSize is greater than the number of observations in the training data, the software determines ScoreThreshold by subsampling from the training data. Otherwise, ScoreThreshold is set to forest.ScoreThreshold.
Example: ScoreWindowSize=100
Data Types: single | double
Output Arguments
IncrementalForest
RRCF model for incremental anomaly detection, returned as an incrementalRobustRandomCutForest model object.
To initialize IncrementalForest for incremental anomaly detection, incrementalLearner passes the values of the following properties of forest to the corresponding properties of IncrementalForest.
| Property | Description | 
|---|---|
| CategoricalPredictors | Categorical predictor indices, a vector of positive integers | 
| ContaminationFraction | Fraction of anomalies in the training data, a numeric scalar in the range [0,1] |
| Mu | Predictor means of the training data, a numeric vector |
| NumLearners | Number of robust random cut trees, a positive integer scalar |
| NumObservationsPerLearner | Number of observations for each robust random cut tree, a nonnegative integer |
| PredictorNames | Predictor variable names, a cell array of character vectors |
| ScoreThreshold | Threshold score for anomalies in the training data, a numeric scalar in the range [0, Inf). If ScoreWindowSize is greater than the number of observations used to train forest, then incrementalLearner approximates ScoreThreshold by subsampling from the training data. Otherwise, incrementalLearner passes forest.ScoreThreshold to IncrementalForest.ScoreThreshold. |
| Sigma | Predictor standard deviations of the training data, a numeric vector | 
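A quick way to see this carry-over is to compare a few of the listed properties before and after conversion. The sketch below uses synthetic data and illustrative variable names. ScoreThreshold is left out of the comparison because, as the table notes, it can be re-estimated by subsampling.

```matlab
rng(0,"twister")                 % for reproducibility
X = randn(500,3);                % synthetic predictor data (assumption)

forest = rrcforest(X,ContaminationFraction=0.01);
IncrementalForest = incrementalLearner(forest);

% Properties that the table lists as passed through unchanged.
props = ["NumLearners","NumObservationsPerLearner","ContaminationFraction"];
carriedOver = true(size(props));
for i = 1:numel(props)
    carriedOver(i) = isequal(forest.(props(i)),IncrementalForest.(props(i)));
end
```

If the conversion behaves as described, every element of carriedOver is true.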
More About
Incremental learning, or online learning, is a branch of machine learning concerned with processing incoming data from a data stream, possibly given little to no knowledge of the distribution of the predictor variables, aspects of the prediction or objective function (including tuning parameter values), or whether the observations contain anomalies. Incremental learning differs from traditional machine learning, where enough data is available to fit to a model, perform cross-validation to tune hyperparameters, and infer the predictor distribution.
Anomaly detection is used to identify unexpected events and departures from normal behavior. In situations where the full data set is not immediately available, or new data is arriving, you can use incremental learning for anomaly detection to incrementally train a model so it adjusts to the characteristics of the incoming data.
Given incoming observations, an incremental learning model for anomaly detection does the following:
- Computes anomaly scores 
- Updates the anomaly score threshold 
- Detects data points above the score threshold as anomalies 
- Fits the model to the incoming observations 
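These steps correspond to one call to the incremental fit function per chunk of incoming data. The following is a minimal sketch on synthetic data, with illustrative variable names and chunk size, rather than a recommended configuration.

```matlab
rng(0,"twister")                         % for reproducibility
X = randn(1000,2);                       % synthetic stream (assumption)

% Train on the first half, then convert for incremental learning.
forest = rrcforest(X(1:500,:),ContaminationFraction=0.01);
Mdl = incrementalLearner(forest);

chunkSize = 100;
numFlagged = 0;
for first = 501:chunkSize:1000
    idx = first:first+chunkSize-1;
    % fit returns the updated model, the anomaly flags, and the scores.
    [Mdl,tf,scores] = fit(Mdl,X(idx,:));
    numFlagged = numFlagged + sum(tf);
    currentThreshold = Mdl.ScoreThreshold;   % threshold after this chunk
end
```

To score a chunk without updating the model, call isanomaly instead of fit, as in the shingled time series example.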
For more information, see Incremental Anomaly Detection with MATLAB.
References
[1] Guha, Sudipto, N. Mishra, G. Roy, and O. Schrijvers. "Robust Random Cut Forest Based Anomaly Detection on Streams," Proceedings of The 33rd International Conference on Machine Learning 48 (June 2016): 2712–21.
[2] Bartos, Matthew D., A. Mullapudi, and S. C. Troutman. "rrcf: Implementation of the Robust Random Cut Forest Algorithm for Anomaly Detection on Streams." Journal of Open Source Software 4, no. 35 (2019): 1336.
Extended Capabilities
To run in parallel, specify the Options name-value argument in the call to this function and set the UseParallel field of the options structure to true using statset:
Options=statset(UseParallel=true)
For more information about parallel computing, see Run MATLAB Functions with Automatic Parallel Support (Parallel Computing Toolbox).
Version History
Introduced in R2023b