r2plus1dVideoClassifier

R(2+1)D video classifier. Requires Computer Vision Toolbox Model for R(2+1)D Video Classification.

Description

The r2plus1dVideoClassifier object returns an R(2+1)D video classifier pretrained on the Kinetics-400 dataset. You can use the pretrained video classifier to classify 400 human actions, such as running, walking, and shaking hands.

Creation

Description

rd = r2plus1dVideoClassifier returns an R(2+1)D video classifier pretrained on the Kinetics-400 dataset.

rd = r2plus1dVideoClassifier("resnet-3d-18",classes) configures the pretrained R(2+1)D video classifier for transfer learning on a new set of classes, classes. The video classifier is pretrained on the Kinetics-400 dataset with a ResNet3D convolutional neural network (CNN) with 18 spatiotemporal layers.

rd = r2plus1dVideoClassifier(___,Name=Value) sets properties using name-value arguments in addition to the input arguments from the previous syntax. For example, rd = r2plus1dVideoClassifier("resnet-3d-18",classes,InputSize=[112,112,3,32]) sets the input size of the network. You can specify multiple name-value arguments.

Note

This function requires the Computer Vision Toolbox™ Model for R(2+1)D Video Classification. You can install the Computer Vision Toolbox Model for R(2+1)D Video Classification from Add-On Explorer. For more information about installing add-ons, see Get and Manage Add-Ons. To use this object, you must have a license for the Deep Learning Toolbox™.

Properties

Configure Classifier Properties

This property is read-only.

Size of the video classifier network (InputSize), specified as a four-element row vector in the form [H,W,C,T], where H and W represent the height and width, respectively, C represents the number of channels, and T represents the number of frames for the video subnetwork.

Typical values for the number of frames are 8, 16, 32, or 64. Increase the number of frames to capture the temporal nature of activities when training the classifier.
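To make the [H,W,C,T] layout concrete, the following sketch cuts one network-sized clip out of a longer video array. The 100-frame video is a hypothetical placeholder, and splitting a video into non-overlapping T-frame clips is only one possible sampling strategy, not something the object prescribes.

```matlab
% Input size in the form [H,W,C,T], matching the value used in the
% creation syntax example.
inputSize = [112,112,3,32];

% Hypothetical 100-frame video, assumed to be already resized to H-by-W.
video = zeros(112,112,3,100,"uint8");

% Number of frames the network expects per input.
T = inputSize(4);

% Number of complete, non-overlapping T-frame clips in the video.
numClips = floor(size(video,4)/T);

% The first clip is an H-by-W-by-C-by-T array.
firstClip = video(:,:,:,1:T);
```

With these placeholder values, the video holds three complete 32-frame clips, and the last four frames are left over.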

This property is read-only.

Normalization statistics for the video data, specified as a structure with field names Min, Max, Mean, and StandardDeviation. The Min and Max field values define the minimum and maximum values for rescaling the video data. The Mean and StandardDeviation values define the mean and standard deviation for input normalization. Each field value must be a row vector with one element per channel of the video input data.

The default structure contains the fields Min, Max, Mean, and StandardDeviation with values [0,0,0], [255,255,255], [0.45,0.45,0.45], and [0.225,0.225,0.225], respectively. You must calculate the statistics values from the dataset for which you are training the video classifier. To rescale the data using minimum and maximum values precomputed from your dataset, specify both Min and Max. Otherwise, the minimum and maximum values are calculated from each input sequence when using updateSequence or classifyVideoFile.

Note

The object normalizes the data in two steps. It first rescales the data to the range 0 to 1, and then standardizes the rescaled data by subtracting the mean and dividing by the standard deviation. The standardization step is applied only if the Mean and StandardDeviation fields are non-empty. The input is normalized automatically when you use the updateSequence or classifyVideoFile object functions. You must normalize the data manually when you use the forward or predict object functions.
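Because forward and predict expect data that is already normalized, it can help to see the two steps written out. The following is a minimal sketch of the arithmetic described in the note, using the default statistics values listed above and a random clip as a stand-in for real video data. It illustrates the formula only and is not the object's internal implementation.

```matlab
% Statistics in the same layout as the default structure.
stats.Min = [0,0,0];
stats.Max = [255,255,255];
stats.Mean = [0.45,0.45,0.45];
stats.StandardDeviation = [0.225,0.225,0.225];

% Hypothetical H-by-W-by-C-by-T clip standing in for real video data.
video = single(randi([0 255],112,112,3,8));

% Reshape the per-channel statistics to 1-by-1-by-C so that implicit
% expansion applies them along the channel dimension.
minValue = reshape(single(stats.Min),1,1,[]);
maxValue = reshape(single(stats.Max),1,1,[]);
meanValue = reshape(single(stats.Mean),1,1,[]);
stdValue = reshape(single(stats.StandardDeviation),1,1,[]);

% Step 1: rescale the data to the range [0,1].
rescaled = (video - minValue) ./ (maxValue - minValue);

% Step 2: standardize the rescaled data.
normalized = (rescaled - meanValue) ./ stdValue;
```

With the default values, a pixel value of 255 maps to 1 after rescaling and to (1 - 0.45)/0.225, or about 2.44, after standardization.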

Name of the trained video classifier, specified as a string scalar.

This property is read-only.

Classes that the video classifier is configured to train or classify, specified as a vector of strings or a cell array of character vectors. For example:

classes = ["kiss","laugh","pick","pour","pushup"];

Training Properties

Learnable parameters for the R(2+1)D video classifier, specified as a table with three columns.

  • Layer — Layer name, specified as a string scalar.

  • Parameter — Parameter name, specified as a string scalar.

  • Value — Parameter value, specified as a dlarray (Deep Learning Toolbox) object.

The network learnable parameters contain the features learned by the network, for example, the weights of convolution and fully connected layers.
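To make the three-column layout concrete, the following sketch builds a small table with the same Layer, Parameter, and Value columns and then filters it by parameter name. The layer names, parameter names, and array sizes are made up for illustration and do not reflect the classifier's actual contents. The sketch assumes Deep Learning Toolbox is available for dlarray.

```matlab
% Hypothetical layer and parameter names. The real table comes from the
% video classifier object.
Layer = ["conv1";"conv1";"fc"];
Parameter = ["Weights";"Bias";"Weights"];

% Each value is a dlarray object, stored in a cell array column.
Value = {dlarray(rand(3,3,"single")); ...
         dlarray(zeros(1,1,"single")); ...
         dlarray(rand(4,2,"single"))};

learnables = table(Layer,Parameter,Value);

% Select only the rows that hold weights, for example to treat weights
% and biases differently during training.
isWeight = learnables.Parameter == "Weights";
weights = learnables(isWeight,:);

% Count the total number of scalar parameters in the table.
numParams = sum(cellfun(@numel,learnables.Value));
```

Filtering on the Parameter column in this way is a common pattern when only some parameter types should be regularized or frozen.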

State of the nonlearnable parameters for the R(2+1)D video classifier, specified as a table with three columns.

  • Layer — Layer name, specified as a string scalar.

  • Parameter — Parameter name, specified as a string scalar.

  • Value — Parameter value, specified as a dlarray (Deep Learning Toolbox) object.

The network state contains information remembered by the network between iterations, for example, the state of long short-term memory (LSTM) and batch normalization layers. During training or inference, you can update the network state using the output of the forward and predict object functions.

Streaming Video Classification Properties

This property is read-only.

Video sequence used to update and classify sequences for streaming classification, specified as a 4-D numeric array of size [H,W,C,T], where H and W represent the height and width, respectively, C represents the number of channels, and T represents the number of frames for the video subnetwork. The updateSequence and classifySequence object functions use the video sequence stored in the VideoSequence property.
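Conceptually, a streaming sequence behaves like a fixed-length window over the most recent frames. The following sketch shows one way to maintain such a window in plain MATLAB, using tiny placeholder dimensions. It is an illustration of the idea under that assumption and does not describe how updateSequence manages its buffer internally.

```matlab
% Tiny placeholder dimensions so the result is easy to inspect.
H = 2; W = 2; C = 3; T = 4;
sequence = zeros(H,W,C,T,"uint8");

% Feed six synthetic frames. Frame k is filled with the value k.
for k = 1:6
    frame = k*ones(H,W,C,"uint8");

    % Drop the oldest frame and append the newest along the fourth
    % (time) dimension.
    sequence = cat(4,sequence(:,:,:,2:end),frame);
end

% Values of one pixel across the window, oldest frame first.
windowValues = squeeze(sequence(1,1,1,:))';
```

After six frames, the four-frame window holds frames 3 through 6, so windowValues is [3 4 5 6].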

Object Functions

classifyVideoFile    Classify a video file
classifySequence     Classify video sequence
resetSequence        Reset video sequence properties for streaming video classification
updateSequence       Update video sequence for classification
forward              Compute video classifier outputs for training
predict              Compute video classifier predictions

Examples

This example shows how to classify a video stream using a pretrained R(2+1)D video classifier.

Load a pretrained R(2+1)D video classifier.

rd = r2plus1dVideoClassifier();

Create a VideoReader to read a video frame by frame.

videoFilename = "visiontraffic.avi";
reader = VideoReader(videoFilename);

Create a video player to visualize the video data and update the player position to match the size of the video.

player = vision.VideoPlayer;
player.Position(:,3:4) = [reader.Width reader.Height];

Specify how often to classify the streaming video frames. With a frequency of 10, the classifier is applied to the accumulated sequence once every 10 frames, which balances runtime performance against classification performance.

classificationFrequency = 10;

Specify the sequence length required by the classifier. This is based on the input size of the video classifier. You can begin to classify the sequence only after the number of collected frames reaches the required length.

sequenceLength = rd.InputSize(4);

Read through the video frame by frame, update the sequence with each new frame using updateSequence, and then classify the collected frames using classifySequence.

numFrames = 0;
text = "";

while hasFrame(reader)
    frame = readFrame(reader);
    numFrames = numFrames + 1;

    % Update the sequence with the next video frame.
    rd = updateSequence(rd,frame);

    % Classify the sequence based on the classificationFrequency.
    if mod(numFrames, classificationFrequency) == 0 && numFrames >= sequenceLength
        [label,score] = classifySequence(rd);
        text = string(label) + "; " + num2str(score, "%0.2f");
    end

    % Insert the predicted label into the video frame.
    frame = insertText(frame,[30,30],text,'FontSize',18);

    % Display the video and label. 
    step(player,frame);
end
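To see which frames the condition in the loop selects for classification, the check can be evaluated on its own without a video. The sketch below assumes a sequence length of 32 and a 100-frame video; both values are placeholders, since in the example the sequence length comes from rd.InputSize(4) and the frame count from the video file.

```matlab
classificationFrequency = 10;
sequenceLength = 32;   % assumed value for illustration
totalFrames = 100;     % hypothetical video length

frameNumbers = 1:totalFrames;

% Same test as in the streaming loop, applied to all frame numbers at once.
isClassified = mod(frameNumbers,classificationFrequency) == 0 & ...
    frameNumbers >= sequenceLength;

classifiedFrames = frameNumbers(isClassified);
```

With these values, classification first runs at frame 40, the first multiple of 10 at or beyond the sequence length, and then at every tenth frame up to frame 100.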

Introduced in R2021b