Speech Command Recognition in Simulink

This example shows a Simulink model that detects the presence of speech commands in audio. The model uses a pretrained convolutional neural network to recognize a given set of commands.

Speech Command Recognition Model

The model recognizes the following speech commands:

  • "yes"
  • "no"
  • "up"
  • "down"
  • "left"
  • "right"
  • "on"
  • "off"
  • "stop"
  • "go"

For details on the architecture of the pretrained network and how it is trained, refer to the example Speech Command Recognition Using Deep Learning (Audio Toolbox).

Open the model.

model = 'speechCommandRecognition';
open_system(model)

The model breaks the audio stream into one-second overlapping segments. It computes an auditory spectrogram from each segment and feeds the spectrograms to the pretrained network.
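
The segmentation step can be sketched in plain MATLAB as follows. This is a minimal illustration, not the model's actual implementation: the 0.5-second hop between segments, the placeholder signal, and the variable names are assumptions, since the text does not state how much the segments overlap.

```matlab
fs = 16000;                      % sample rate in Hz
segmentSamples = fs;             % one-second segment
hopSegment = fs/2;               % ASSUMPTION: 0.5 s hop; the model's overlap may differ

x = randn(3*fs,1);               % placeholder: three seconds of noise standing in for audio

% Number of complete one-second segments that fit in x
numSegments = floor((numel(x) - segmentSamples)/hopSegment) + 1;

segments = zeros(segmentSamples,numSegments);
for k = 1:numSegments
    first = (k-1)*hopSegment + 1;                        % first sample of segment k
    segments(:,k) = x(first:first + segmentSamples - 1); % copy one second of audio
end
```

Each column of segments holds one second of audio, and consecutive columns share half of their samples under the assumed hop.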

Use the manual switch to select either a live stream from your microphone or commands stored in audio files. For commands on file, use the rotary switch to select one of three commands (Go, Yes, and Stop).

Auditory Spectrogram Extraction

The deep learning network is trained on auditory spectrograms computed using an audioFeatureExtractor. For the model to classify commands properly, you must extract auditory spectrograms exactly as in the training stage.

Define the parameters of the feature extraction. frameDuration is the duration of each frame used for the spectrum calculation. hopDuration is the time step between successive spectra. numBands is the number of filters in the auditory spectrogram.

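fs = 16000;                               % sample rate (Hz)
frameDuration = 0.025;                    % frame length (s)
frameSamples = round(frameDuration*fs);   % 400 samples
hopDuration = 0.010;                      % hop between frames (s)
hopSamples = round(hopDuration*fs);       % 160 samples
numBands = 50;                            % number of auditory filters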
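
As a quick check on these values, the following sketch computes how many spectra one second of audio yields. The names segmentSamples and numSpectra are illustrative and do not appear in the example.

```matlab
fs = 16000;
frameSamples = round(0.025*fs);  % 400 samples per frame
hopSamples = round(0.010*fs);    % 160 samples between frame starts
segmentSamples = fs;             % one-second segment

% Count the complete frames that fit in one segment
numSpectra = floor((segmentSamples - frameSamples)/hopSamples) + 1;
disp(numSpectra)                 % displays 98
```

With numBands = 50, each one-second segment would therefore map to a 98-by-50 array, assuming the extractor returns one row per frame and one column per band.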

Define an audioFeatureExtractor object to perform the feature extraction. The object is identical to the one used in Speech Command Recognition Using Deep Learning (Audio Toolbox) to extract the training spectrograms.

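afe = audioFeatureExtractor( ...
    'SampleRate',fs, ...
    'FFTLength',512, ...
    'Window',hann(frameSamples,'periodic'), ...
    'OverlapLength',frameSamples - hopSamples, ...
    'barkSpectrum',true); % assumed completion: the source listing is cut off here
% Assumed to match the training example: numBands Bark bands, no window normalization.
setExtractorParams(afe,'barkSpectrum','NumBands',numBands,'WindowNormalization',false);
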

Call generateMATLABFunction to create a feature extraction function. This function is called from the Auditory Spectrogram MATLAB Function block in the model. This ensures that the feature extraction used in the model matches the one used in training.
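
A minimal sketch of this step is shown below. It requires Audio Toolbox. The extractor is rebuilt here so that the snippet stands alone; the choice of the Bark spectrum as the auditory spectrogram and the file name extractSpeechFeatures are assumptions made for illustration.

```matlab
fs = 16000;
frameSamples = round(0.025*fs);
hopSamples = round(0.010*fs);

% ASSUMPTION: the auditory spectrogram is the extractor's Bark spectrum output
afe = audioFeatureExtractor( ...
    'SampleRate',fs, ...
    'FFTLength',512, ...
    'Window',hann(frameSamples,'periodic'), ...
    'OverlapLength',frameSamples - hopSamples, ...
    'barkSpectrum',true);

% Write the extraction pipeline to a MATLAB function file in the current folder.
% ASSUMPTION: 'extractSpeechFeatures' is an illustrative file name.
generateMATLABFunction(afe,'extractSpeechFeatures');
```

Because the generated file is produced from the configured extractor object, regenerating it after any parameter change keeps the block and the training setup in step.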


Run the Model

Simulate the model for 20 seconds. To run the model indefinitely, set the stop time to Inf.

set_param(model,'StopTime','20');
sim(model);


The recognized command is printed in the display block. The speech spectrogram is displayed in a Spectrum Analyzer scope. The network activations, which give a level of confidence in the different supported commands, are displayed in a time scope.
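
The relationship between the activations and the displayed command can be illustrated with a short sketch. It assumes the display shows the class with the largest activation. The scores vector is made-up data, and the label list contains only the ten commands named earlier; the actual network may output additional classes.

```matlab
commands = ["yes","no","up","down","left","right","on","off","stop","go"];

% HYPOTHETICAL activations for one segment, one value per command
scores = [0.02 0.01 0.03 0.01 0.02 0.01 0.02 0.01 0.85 0.02];

% Pick the class with the largest activation
[confidence,idx] = max(scores);
recognized = commands(idx);
fprintf('%s (%.2f)\n',recognized,confidence);   % prints "stop (0.85)"
```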

Close the model.

close_system(model,0)