Main Content

Sequential Feature Selection for Audio Features

This example shows a typical workflow for feature selection applied to the task of spoken digit recognition.

In sequential feature selection, you train a network on a given feature set and then incrementally add or remove features until the highest accuracy is reached [1]. In this example, you apply sequential forward selection to the task of spoken digit recognition using the Free Spoken Digit Dataset [2].

Streaming Spoken Digit Recognition

To motivate the example, begin by loading a pretrained network, the audioFeatureExtractor (Audio Toolbox) object used to train the network, and normalization factors for the features.


Create an audioDeviceReader (Audio Toolbox) to read audio from a microphone. Create three dsp.AsyncBuffer (DSP System Toolbox) objects: one to buffer audio read from your microphone, one to buffer short-term energy of the input audio for speech detection, and one to buffer predictions.

fs = afe.SampleRate;

deviceReader = audioDeviceReader('SampleRate',fs,'SamplesPerFrame',256);

audioBuffer = dsp.AsyncBuffer(fs*3);
steBuffer = dsp.AsyncBuffer(1000);
predictionBuffer = dsp.AsyncBuffer(5);

Create a plot to display the streaming audio, the probability the network outputs during inference, and the prediction.

fig = figure;

streamAxes = subplot(3,1,1);
streamPlot = plot(zeros(fs,1));
xlabel('Time (s)')
title('Audio Stream')
streamAxes.XTick = [0,fs];
streamAxes.XTickLabel = [0,1];
streamAxes.YLim = [-1,1];

analyzedAxes = subplot(3,1,2);
analyzedPlot = plot(zeros(fs/2,1));
title('Analyzed Segment')
xlabel('Time (s)')
analyzedAxes.XTick = [0,fs/2];
analyzedAxes.XTickLabel = [0,0.5];
analyzedAxes.YLim = [-1,1];

probabilityAxes = subplot(3,1,3);
probabilityPlot = bar(0:9,0.1*ones(1,10));

Perform streaming digit recognition (digits 0 through 9) for 20 seconds. While the loop runs, speak one of the digits and test its accuracy.

First, define a short-term energy threshold under which to assume a signal contains no speech.

steThreshold = 0.015;
idxVec = 1:fs;
while toc < 20
    % Read in a frame of audio from your device.
    audioIn = deviceReader();
    % Write the audio into a the buffer.
    % While 200 ms of data is unused, continue this loop.
    while audioBuffer.NumUnreadSamples > 0.2*fs
        % Read 1 second from the audio buffer. Of that 1 second, 800 ms
        % is rereading old data and 200 ms is new data.
        audioToAnalyze = read(audioBuffer,fs,0.8*fs);
        % Update the figure to plot the current audio data.
        streamPlot.YData = audioToAnalyze;

        ste = mean(abs(audioToAnalyze));
        if steBuffer.NumUnreadSamples > 5
            abc = sort(peek(steBuffer));
            steThreshold = abc(round(0.4*numel(abc)));
        if ste > steThreshold
            % Use the detectSpeeech function to determine if a region of speech
            % is present.
            idx = detectSpeech(audioToAnalyze,fs);
            % If a region of speech is present, perform the following.
            if ~isempty(idx)
                % Zero out all parts of the signal except the speech
                % region, and trim to 0.5 seconds.
                audioToAnalyze = HelperTrimOrPad(audioToAnalyze(idx(1,1):idx(1,2)),fs/2);
                % Normalize the audio.
                audioToAnalyze = audioToAnalyze/max(abs(audioToAnalyze));
                % Update the analyzed segment plot
                analyzedPlot.YData = audioToAnalyze;

                % Extract the features and transpose them so that time is
                % across columns.
                features = (extract(afe,audioToAnalyze))';

                % Normalize the features.
                features = (features - normalizers.Mean) ./ normalizers.StandardDeviation;
                % Call classify to determine the probabilities and the
                % winning label.
                features(isnan(features)) = 0;
                [label,probs] = classify(bestNet,features);
                % Update the plot with the probabilities and the winning
                % label.
                probabilityPlot.YData = probs;

                if predictionBuffer.NumUnreadSamples == predictionBuffer.Capacity
                    lastTen = peek(predictionBuffer);
                    [~,decision] = max(mean(lastTen.*hann(size(lastTen,1)),1));
                    probabilityAxes.Title.String = num2str(decision-1);
            % If the signal energy is below the threshold, assume no speech
            % detected.
             probabilityAxes.Title.String = '';
             probabilityPlot.YData = 0.1*ones(10,1);
             analyzedPlot.YData = zeros(fs/2,1);
        drawnow limitrate

The remainder of the example illustrates how the network used in the streaming detection was trained and how the features fed into the network were chosen.

Create Train and Validation Data Sets

Download the Free Spoken Digit Dataset (FSDD) [2]. FSDD consists of short audio files with spoken digits (0-9).

url = "";
downloadFolder = tempdir;
datasetFolder = fullfile(downloadFolder,'FSDD');

if ~exist(datasetFolder,'dir')
    fprintf('Downloading Free Spoken Digit Dataset ...\n')

Create an audioDatastore (Audio Toolbox) to point to the recordings. Get the sample rate of the data set.

ads = audioDatastore(datasetFolder,'IncludeSubfolders',true);
[~,adsInfo] = read(ads);
fs = adsInfo.SampleRate;

The first element of the file names is the digit spoken in the file. Get the first element of the file names, convert them to categorical, and then set the Labels property of the audioDatastore.

[~,filenames] = cellfun(@(x)fileparts(x),ads.Files,'UniformOutput',false);
ads.Labels = categorical(string(cellfun(@(x)x(1),filenames)));

To split the datastore into a development set and a validation set, use splitEachLabel (Audio Toolbox). Allocate 80% of the data for development and the remaining 20% for validation.

[adsTrain,adsValidation] = splitEachLabel(ads,0.8);

Set Up Audio Feature Extractor

Create an audioFeatureExtractor (Audio Toolbox) object to extract audio features over 30 ms windows with an update rate of 10 ms. Set all features you would like to test in this example to true.

win = hamming(round(0.03*fs),"periodic");
overlapLength = round(0.02*fs);

afe = audioFeatureExtractor( ...
    'Window',       win, ...
    'OverlapLength',overlapLength, ...
    'SampleRate',   fs, ...
    'linearSpectrum',      false, ...
    'melSpectrum',         false, ...
    'barkSpectrum',        false, ...
    'erbSpectrum',         false, ...
    'mfcc',                true, ...
    'mfccDelta',           true, ...
    'mfccDeltaDelta',      true, ...
    'gtcc',                true, ...
    'gtccDelta',           true, ...
    'gtccDeltaDelta',      true, ...
    'spectralCentroid',    true, ...
    'spectralCrest',       true, ...
    'spectralDecrease',    true, ...
    'spectralEntropy',     true, ...
    'spectralFlatness',    true, ...
    'spectralFlux',        true, ...
    'spectralKurtosis',    true, ...
    'spectralRolloffPoint',true, ...
    'spectralSkewness',    true, ...
    'spectralSlope',       true, ...
    'spectralSpread',      true, ...
    'pitch',               false, ...
    'harmonicRatio',       false);

Define Layers and Training Options

Define the List of Deep Learning Layers and trainingOptions used in this example. The first layer, sequenceInputLayer, is just a placeholder. Depending on which features you test during sequential feature selection, the first layer is replaced with a sequenceInputLayer of the appropriate size.

numUnits = 100;
layers = [ ...

options = trainingOptions("adam", ...
    "LearnRateSchedule","piecewise", ...
    "Shuffle","every-epoch", ...
    "Verbose",false, ...

Sequential Feature Selection

In the basic form of sequential feature selection, you train a network on a given feature set and then incrementally add or remove features until the accuracy no longer improves [1].

Forward Selection

Consider a simple case of forward selection on a set of four features. In the first forward selection loop, each of the four features are tested independently by training a network and comparing their validation accuracy. The feature that resulted in the highest validation accuracy is noted. In the second forward selection loop, the best feature from the first loop is combined with each of the remaining features. Now each pair of features is used for training. If the accuracy in the second loop did not improve over the accuracy in the first loop, the selection process ends. Otherwise, a new best feature set is selected. The forward selection loop continues until the accuracy no longer improves.

Backward Selection

In backward feature selection, you begin by training on a feature set that consists of all features and test whether or not accuracy improves as you remove features.

Run Sequential Feature Selection

The helper functions (HelperSFS, HelperTrainAndValidateNetwork, and HelperTrimOrPad) implement forward or backward sequential feature selection. Specify the training datastore, validation datastore, audio feature extractor, network layers, network options, and direction. As a general rule, choose forward if you anticipate a small feature set or backward if you anticipate a large feature set.

direction = 'forward';
[logbook,bestFeatures,bestNet,normalizers] = HelperSFS(adsTrain,adsValidation,afe,layers,options,direction);
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 6).

The logbook output from HelperFeatureExtractor is a table containing all feature configurations tested and the corresponding validation accuracy.

logbook=48×2 table
                 Features                 Accuracy
    __________________________________    ________

    "mfcc, gtcc"                           97.333 
    "mfcc, mfccDelta, gtcc"                    97 
    "mfcc, gtcc, spectralEntropy"              97 
    "mfcc, gtcc, spectralFlatness"             97 
    "mfcc, gtcc, spectralFlux"                 97 
    "mfcc, gtcc, spectralSpread"               97 
    "gtcc"                                 96.667 
    "gtcc, spectralCentroid"               96.667 
    "gtcc, spectralFlux"                   96.667 
    "mfcc, gtcc, spectralRolloffPoint"     96.667 
    "mfcc, gtcc, spectralSkewness"         96.667 
    "gtcc, spectralEntropy"                96.333 
    "mfcc, gtcc, gtccDeltaDelta"           96.333 
    "mfcc, gtcc, spectralKurtosis"         96.333 
    "mfccDelta, gtcc"                          96 
    "gtcc, gtccDelta"                          96 

The bestFeatures output from HelperSFS contains a struct with the optimal features set to true.

bestFeatures = struct with fields:
                    mfcc: 1
               mfccDelta: 0
          mfccDeltaDelta: 0
                    gtcc: 1
               gtccDelta: 0
          gtccDeltaDelta: 0
        spectralCentroid: 0
           spectralCrest: 0
        spectralDecrease: 0
         spectralEntropy: 0
        spectralFlatness: 0
            spectralFlux: 0
        spectralKurtosis: 0
    spectralRolloffPoint: 0
        spectralSkewness: 0
           spectralSlope: 0
          spectralSpread: 0

You can set your audioFeatureExtractor using the struct.

afe = 
  audioFeatureExtractor with properties:

                     Window: [240×1 double]
              OverlapLength: 160
                 SampleRate: 8000
                  FFTLength: []
    SpectralDescriptorInput: 'linearSpectrum'

   Enabled Features
     mfcc, gtcc

   Disabled Features
     linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfccDelta, mfccDeltaDelta
     gtccDelta, gtccDeltaDelta, spectralCentroid, spectralCrest, spectralDecrease, spectralEntropy
     spectralFlatness, spectralFlux, spectralKurtosis, spectralRolloffPoint, spectralSkewness, spectralSlope
     spectralSpread, pitch, harmonicRatio

   To extract a feature, set the corresponding property to true.
   For example, obj.mfcc = true, adds mfcc to the list of enabled features.

HelperSFS also outputs the best performing network and the normalization factors that correspond to the chosen features. To save the network, configured audioFeatureExtractor, and normalization factors, uncomment this line:

% save('network_Audio_SequentialFeatureSelection.mat','bestNet','afe','normalizers')


This example illustrates a workflow for sequential feature selection for a Recurrent Neural Network (LSTM or BiLSTM). It could easily be adapted for CNN and RNN-CNN workflows.

Supporting Functions


function [trueLabels,predictedLabels,net,normalizers] = HelperTrainAndValidateNetwork(adsTrain,adsValidation,afe,layers,options)
% Train and validate a network.
%   adsTrain      - audioDatastore object that points to training set
%   adsValidation - audioDatastore object that points to validation set
%   afe           - audioFeatureExtractor object.
%   layers        - Layers of LSTM or BiLSTM network
%   options       - trainingOptions object
%   trueLabels      - true labels of validation set
%   predictedLabels - predicted labels of validation set
%   net             - trained network
%   normalizers     - normalization factors for features under test

% Copyright 2019 The MathWorks, Inc.

% Convert the data to tall arrays.
tallTrain      = tall(adsTrain);
tallValidation = tall(adsValidation);

% Extract features from the training set. Reorient the features so that
% time is along rows to be compatible with sequenceInputLayer.
fs = afe.SampleRate;
tallTrain         = cellfun(@(x)HelperTrimOrPad(x,fs/2),tallTrain,"UniformOutput",false);
tallTrain         = cellfun(@(x)x/max(abs(x),[],'all'),tallTrain,"UniformOutput",false);
tallFeaturesTrain = cellfun(@(x)extract(afe,x),tallTrain,"UniformOutput",false);
tallFeaturesTrain = cellfun(@(x)x',tallFeaturesTrain,"UniformOutput",false);  %#ok<NASGU>
[~,featuresTrain] = evalc('gather(tallFeaturesTrain)'); % Use evalc to suppress command-line output.

tallValidation         = cellfun(@(x)HelperTrimOrPad(x,fs/2),tallValidation,"UniformOutput",false);
tallValidation         = cellfun(@(x)x/max(abs(x),[],'all'),tallValidation,"UniformOutput",false);
tallFeaturesValidation = cellfun(@(x)extract(afe,x),tallValidation,"UniformOutput",false);
tallFeaturesValidation = cellfun(@(x)x',tallFeaturesValidation,"UniformOutput",false); %#ok<NASGU>
[~,featuresValidation] = evalc('gather(tallFeaturesValidation)'); % Use evalc to suppress command-line output.

% Use the training set to determine the mean and standard deviation of each
% feature. Normalize the training and validation sets.
allFeatures = cat(2,featuresTrain{:});
M = mean(allFeatures,2,'omitnan');
S = std(allFeatures,0,2,'omitnan');
featuresTrain = cellfun(@(x)(x-M)./S,featuresTrain,'UniformOutput',false);
for ii = 1:numel(featuresTrain)
    idx = find(isnan(featuresTrain{ii}));
    if ~isempty(idx)
        featuresTrain{ii}(idx) = 0;
featuresValidation = cellfun(@(x)(x-M)./S,featuresValidation,'UniformOutput',false);
for ii = 1:numel(featuresValidation)
    idx = find(isnan(featuresValidation{ii}));
    if ~isempty(idx)
        featuresValidation{ii}(idx) = 0;

% Replicate the labels of the train and validation sets so that they are in
% one-to-one correspondence with the sequences.
labelsTrain = adsTrain.Labels;

% Update input layer for the number of features under test.
layers(1) = sequenceInputLayer(size(featuresTrain{1},1));

% Train the network.
net = trainNetwork(featuresTrain,labelsTrain,layers,options);

% Evaluate the network. Call classify to get the predicted labels for each
% sequence.
predictedLabels = classify(net,featuresValidation);
trueLabels = adsValidation.Labels;

% Save the normalization factors as a struct.
normalizers.Mean = M;
normalizers.StandardDeviation = S;


function [logbook,bestFeatures,bestNet,bestNormalizers] = HelperSFS(adsTrain,adsValidate,afeThis,layers,options,direction)
%   adsTrain    - audioDatastore object that points to training set
%   adsValidate - audioDatastore object that points to validation set
%   afe         - audioFeatureExtractor object. Set all features to test to true
%   layers      - Layers of LSTM or BiLSTM network
%   options     - trainingOptions object
%   direction   - SFS direction, specify as 'forward' or 'backward'
%   logbook         - table containing feature configurations tested and corresponding validation accuracies
%   bestFeatures    - struct containg best feature configuration
%   bestNet         - Trained network with highest validation accuracy
%   bestNormalizers - Feature normalization factors for best features

% Copyright 2019 The MathWorks, Inc.

afe = copy(afeThis);
featuresToTest = fieldnames(info(afe));
N = numel(featuresToTest);
bestValidationAccuracy = 0;

% Set the initial feature configuration: all on for backward selection
% or all off for forward selection.
featureConfig = info(afe);
for i = 1:N
    if strcmpi(direction,"backward")
        featureConfig.(featuresToTest{i}) = true;
        featureConfig.(featuresToTest{i}) = false;

% Initialize logbook to track feature configuration and accuracy.
logbook = table(featureConfig,0,'VariableNames',["Feature Configuration","Accuracy"]);

% Perform sequential feature evaluation.
wrapperIdx = 1;
bestAccuracy = 0;
while wrapperIdx <= N
    % Create a cell array containing all feature configurations to test
    % in the current loop.
    featureConfigsToTest = cell(numel(featuresToTest),1);
    for ii = 1:numel(featuresToTest)
        if strcmpi(direction,"backward")
            featureConfig.(featuresToTest{ii}) = false;
            featureConfig.(featuresToTest{ii}) = true;
        featureConfigsToTest{ii} = featureConfig;
        if strcmpi(direction,"backward")
            featureConfig.(featuresToTest{ii}) = true;
            featureConfig.(featuresToTest{ii}) = false;

    % Loop over every feature set.
    for ii = 1:numel(featureConfigsToTest)

        % Determine the current feature configuration to test. Update
        % the feature afe.
        currentConfig = featureConfigsToTest{ii};

        % Train and get k-fold cross-validation accuracy for current
        % feature configuration.
        [trueLabels,predictedLabels,net,normalizers] = HelperTrainAndValidateNetwork(adsTrain,adsValidate,afe,layers,options);
        valAccuracy = mean(trueLabels==predictedLabels)*100;
        if valAccuracy > bestValidationAccuracy
            bestValidationAccuracy = valAccuracy;
            bestNet = net;
            bestNormalizers = normalizers;

        % Update Logbook
        result = table(currentConfig,valAccuracy,'VariableNames',["Feature Configuration","Accuracy"]);
        logbook = [logbook;result]; %#ok<AGROW> 


    % Determine and print the setting with the best accuracy. If accuracy
    % did not improve, end the run.
    [a,b] = max(logbook{:,'Accuracy'});
    if a <= bestAccuracy
        wrapperIdx = inf;
        wrapperIdx = wrapperIdx + 1;
    bestAccuracy = a;

    % Update the features-to-test based on the most recent winner.
    winner = logbook{b,'Feature Configuration'};
    fn = fieldnames(winner);
    tf = structfun(@(x)(x),winner);
    if strcmpi(direction,"backward")
        featuresToRemove = fn(~tf);
        featuresToRemove = fn(tf);
    for ii = 1:numel(featuresToRemove)
        loc =  strcmp(featuresToTest,featuresToRemove{ii});
        featuresToTest(loc) = [];
        if strcmpi(direction,"backward")
            featureConfig.(featuresToRemove{ii}) = false;
            featureConfig.(featuresToRemove{ii}) = true;


% Sort the logbook and make it more readable.
logbook(1,:) = []; % Delete placeholder first row.
logbook = sortrows(logbook,{'Accuracy'},{'descend'});
bestFeatures = logbook{1,'Feature Configuration'};
m = logbook{:,'Feature Configuration'};
fn = fieldnames(m);
myString = strings(numel(m),1);
for wrapperIdx = 1:numel(m)
    tf = structfun(@(x)(x),logbook{wrapperIdx,'Feature Configuration'});
    myString(wrapperIdx) = strjoin(fn(tf),", ");
logbook = table(myString,logbook{:,'Accuracy'},'VariableNames',["Features","Accuracy"]);


function y = HelperTrimOrPad(x,n)
% y = HelperTrimOrPad(x,n) trims or pads the input x to n samples. If x is
% trimmed, it is trimmed equally on the front and back. If x is padded, it is
% padded equally on the front and back with zeros. For odd-length trimming or
% padding, the extra sample is trimmed or padded from the back.

% Copyright 2019 The MathWorks, Inc.
a = size(x,1);
if a < n
    frontPad = floor((n-a)/2);
    backPad = n - a - frontPad;
    y = [zeros(frontPad,1);x;zeros(backPad,1)];
elseif a > n
    frontTrim = floor((a-n)/2)+1;
    backTrim = a - n - frontTrim;
    y = x(frontTrim:end-backTrim);
    y = x;


[1] Jain, A., and D. Zongker. "Feature Selection: Evaluation, Application, and Small Sample Performance." IEEE Transactions on Pattern Analysis and Machine Intelligence. Vol. 19, Issue 2, 1997, pp. 153-158.

[2] Jakobovski. “Jakobovski/Free-Spoken-Digit-Dataset.” GitHub, May 30, 2019.