Feature Selection for Audio Classification

Feature selection reduces the dimensionality of data by selecting a subset of measured features to create a model. Performing feature selection enables you to train smaller models quickly, often without sacrificing accuracy. For some tasks, properly selected features used with simple thresholding can provide adequate results, especially in situations where model size and complexity must be minimized.

In this example, you walk through a standard machine learning pipeline to develop an audio classification system. The pipeline has been abstracted so that you can apply the same steps to either speaker recognition or word recognition tasks.

Dataset Management and Labeling

Download the Free Spoken Digit Dataset (FSDD) [1]. FSDD consists of short audio files with spoken digits (0-9). The data is sampled at 8 kHz.

downloadFolder = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"FSDD");

Create an audioDatastore to manage the audio dataset.

ads = audioDatastore(dataset,IncludeSubfolders=true);

Choose a task and set the audioDatastore labels accordingly.

task = "word recognition";
[~,filenames] = fileparts(ads.Files);
switch task
    case "speaker recognition"
        ads.Labels = extractBetween(filenames,"_","_");
    case "word recognition"
        ads.Labels = extractBefore(filenames,"_");
end
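
The label parsing in the switch statement relies on the FSDD file naming pattern, digit_speaker_index. As a minimal sketch, the following applies the same two functions to a pair of hypothetical file names (made up for illustration, not read from the dataset).

```matlab
% Hypothetical file names that follow the digit_speaker_index pattern.
toyFilenames = ["0_jackson_0";"7_theo_12"];

% Word recognition: the label is the text before the first underscore.
wordLabels = extractBefore(toyFilenames,"_");

% Speaker recognition: the label is the text between the two underscores.
speakerLabels = extractBetween(toyFilenames,"_","_");
```

Here wordLabels contains "0" and "7", and speakerLabels contains "jackson" and "theo".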

Split data into train and test sets. Use 80% for training and 20% for testing.

[adsTrain,adsTest] = splitEachLabel(ads,0.8);

Listen to a sample from the training set. Plot the waveform and display the associated label.

[x,xinfo] = read(adsTrain);
sound(x,xinfo.SampleRate)

t = (0:numel(x)-1)/xinfo.SampleRate;
figure
plot(t,x)
title("Label: " + xinfo.Label)
grid on
axis tight
ylabel("Amplitude")
xlabel("Time (s)")

Feature Extraction Pipeline

Audio signals can broadly be categorized as stationary or non-stationary. Stationary signals, like pure tones, have spectra that do not change over time. Non-stationary signals, like speech, have spectra that change over time. To make machine learning-based tasks tractable, non-stationary signals can be approximated as stationary when analyzed at appropriately small time scales. Generally, speech signals are considered approximately stationary when viewed at time scales around 30 ms. Therefore, speech can be characterized by extracting features from 30 ms analysis windows over time.

Use the helper function, helperVisualizeBuffer, to visualize the analysis windows of an audio file. Specify a 30 ms analysis window with 20 ms overlap between adjacent windows. The overlap duration must be less than the window duration. The Analysis Windows of Signal plot shows the individual analysis windows from which features are extracted.

windowDuration = 0.03;
overlapDuration = 0.02;
helperVisualizeBuffer(x,xinfo.SampleRate,WindowDuration=windowDuration,OverlapDuration=overlapDuration);
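
To make the windowing arithmetic concrete, the following sketch converts the durations to samples at the 8 kHz FSDD sample rate and counts the windows that fit in a signal. The signal length numSamples is a hypothetical value, and the count assumes that only complete windows are kept.

```matlab
fs = 8000;                                  % FSDD sample rate (Hz)
windowLength = round(0.03*fs);              % 240 samples per window
overlapLength = round(0.02*fs);             % 160 samples shared by adjacent windows
hopLength = windowLength - overlapLength;   % 80 samples between window starts

numSamples = 5120;                          % hypothetical signal length (0.64 s)

% Number of complete windows (assumption: partial windows are dropped).
numFullWindows = floor((numSamples - overlapLength)/hopLength);   % 62
```

Because the hop is the window length minus the overlap, an overlap equal to or longer than the window would give a hop of zero or less, which is why the overlap must be shorter than the window.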

Create an audioFeatureExtractor to extract features from 30 ms windows with 20 ms overlap between windows.

afe = audioFeatureExtractor(SampleRate=xinfo.SampleRate, ...
    Window=hann(round(windowDuration*xinfo.SampleRate),"periodic"), ...
    OverlapLength=round(overlapDuration*xinfo.SampleRate))

afe = 
  audioFeatureExtractor with properties:

   Properties
                     Window: [240×1 double]
              OverlapLength: 160
                 SampleRate: 8000
                  FFTLength: []
    SpectralDescriptorInput: 'linearSpectrum'
        FeatureVectorLength: 0

   Enabled Features
     none

   Disabled Features
     linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfcc, mfccDelta
     mfccDeltaDelta, gtcc, gtccDelta, gtccDeltaDelta, spectralCentroid, spectralCrest
     spectralDecrease, spectralEntropy, spectralFlatness, spectralFlux, spectralKurtosis, spectralRolloffPoint
     spectralSkewness, spectralSlope, spectralSpread, pitch, harmonicRatio, zerocrossrate
     shortTimeEnergy


   To extract a feature, set the corresponding property to true.
   For example, obj.mfcc = true, adds mfcc to the list of enabled features.

Configure the audioFeatureExtractor to extract all features.

in = info(afe,"all");
featureSwitches = fields(in);
cellfun(@(x)afe.set(x,true),featureSwitches)

afe

afe = 
  audioFeatureExtractor with properties:

   Properties
                     Window: [240×1 double]
              OverlapLength: 160
                 SampleRate: 8000
                  FFTLength: []
    SpectralDescriptorInput: 'linearSpectrum'
        FeatureVectorLength: 306

   Enabled Features
     linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfcc, mfccDelta
     mfccDeltaDelta, gtcc, gtccDelta, gtccDeltaDelta, spectralCentroid, spectralCrest
     spectralDecrease, spectralEntropy, spectralFlatness, spectralFlux, spectralKurtosis, spectralRolloffPoint
     spectralSkewness, spectralSlope, spectralSpread, pitch, harmonicRatio, zerocrossrate
     shortTimeEnergy

   Disabled Features
     none


   To extract a feature, set the corresponding property to true.
   For example, obj.mfcc = true, adds mfcc to the list of enabled features.

You can use the extract object function of audioFeatureExtractor to extract all the enabled features from an audio signal. The features are concatenated into a matrix with analysis windows along the rows and features along the columns.

featureMatrix = extract(afe,x);
[numWindows,numFeatures] = size(featureMatrix)

numWindows = 
62

numFeatures = 
306

You can use info to get a mapping between the columns of the output matrix and the feature names. The term "features" is overloaded in the literature. It can refer to a feature group, such as "linearSpectrum", "mfcc", or "spectralCentroid", or to an individual feature element, such as the first element of the linear spectrum or the third element of the MFCC. The output map returned by info is a struct in which each field corresponds to a feature group and each value lists the columns of the feature matrix that the feature group occupies.

outputMap = info(afe)

outputMap = struct with fields:
          linearSpectrum: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 … ]
             melSpectrum: [122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153]
            barkSpectrum: [154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185]
             erbSpectrum: [186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213]
                    mfcc: [214 215 216 217 218 219 220 221 222 223 224 225 226]
               mfccDelta: [227 228 229 230 231 232 233 234 235 236 237 238 239]
          mfccDeltaDelta: [240 241 242 243 244 245 246 247 248 249 250 251 252]
                    gtcc: [253 254 255 256 257 258 259 260 261 262 263 264 265]
               gtccDelta: [266 267 268 269 270 271 272 273 274 275 276 277 278]
          gtccDeltaDelta: [279 280 281 282 283 284 285 286 287 288 289 290 291]
        spectralCentroid: 292
           spectralCrest: 293
        spectralDecrease: 294
         spectralEntropy: 295
        spectralFlatness: 296
            spectralFlux: 297
        spectralKurtosis: 298
    spectralRolloffPoint: 299
        spectralSkewness: 300
           spectralSlope: 301
          spectralSpread: 302
                   pitch: 303
           harmonicRatio: 304
           zerocrossrate: 305
         shortTimeEnergy: 306

This figure is intended to help you interpret the feature matrix returned from extract.
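
One way to use the output map is to index the feature matrix by feature group. The sketch below uses a small hypothetical map (toyMap) and random data (toyMatrix) as stand-ins for the real outputs of info and extract.

```matlab
% Hypothetical, reduced output map: 13 MFCC columns plus two scalar features.
toyMap.mfcc = 1:13;
toyMap.spectralCentroid = 14;
toyMap.pitch = 15;

% Stand-in for a feature matrix: 62 analysis windows by 15 columns.
toyMatrix = rand(62,15);

% Select all columns that belong to one feature group.
mfccPerWindow = toyMatrix(:,toyMap.mfcc);      % 62-by-13
pitchPerWindow = toyMatrix(:,toyMap.pitch);    % 62-by-1

% Select a single feature element, such as the third MFCC coefficient.
thirdMfcc = toyMatrix(:,toyMap.mfcc(3));       % 62-by-1
```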

Use extract to extract features from all files in the audio datastore. If you have Parallel Computing Toolbox™, spread the computation across multiple workers.

The output is a (Number of files)-by-1 cell array. Each element of the cell array is a (Number of hops)-by-(Number of features) matrix, where the number of hops depends on the length of the audio file.

features = extract(afe,adsTrain,UseParallel=canUseParallelPool);

Feature/Label Correspondence

Once you have extracted features from approximately stationary windows in time, the next question is whether to feed the window-level features to your machine learning model or to combine the features into file-level representations. The choice of window-level or file-level features depends on your application and requirements. For file-level features, you will generally create summary statistics of the window-level features to collapse the time dimension. Common summary statistics include the mean and standard deviation. This example uses window-level features.
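
Although this example uses window-level features, collapsing them to file-level features can be sketched as follows. The cell array toyFeatures is random stand-in data with hypothetical sizes that mimics the per-file matrices returned by extract.

```matlab
% Stand-in for per-file feature matrices: windows along rows, features along columns.
toyFeatures = {rand(62,4); rand(39,4)};

% Collapse the time dimension of each file into a mean and a standard
% deviation per feature, giving one 1-by-8 row per file.
fileLevel = cellfun(@(f)[mean(f,1),std(f,0,1)],toyFeatures,UniformOutput=false);
fileLevel = cat(1,fileLevel{:});   % 2-by-8: one row per file
```

With file-level features, the file-level labels can be used directly because each file contributes exactly one row.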

To train a machine learning model on window-level features, replicate the file-level labels so that they are in one-to-one correspondence with the features.

N = cellfun(@(x)size(x,1),features);
T = repelem(adsTrain.Labels,N);

Concatenate the features into a single matrix for consumption by machine-learning tools.

X = cat(1,features{:});
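
The replication and concatenation steps can be checked on a toy case. The sizes and labels below are hypothetical.

```matlab
toyFeatures = {rand(3,2); rand(2,2)};   % two files with 3 and 2 windows
toyLabels = ["0";"7"];                  % one label per file

numWindowsPerFile = cellfun(@(x)size(x,1),toyFeatures);   % [3;2]
toyT = repelem(toyLabels,numWindowsPerFile);              % ["0";"0";"0";"7";"7"]
toyX = cat(1,toyFeatures{:});                             % 5-by-2

% Each row of toyX now has a matching label in toyT.
assert(size(toyX,1) == numel(toyT))
```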

Feature Selection

Statistics and Machine Learning Toolbox™ provides several tools to aid in feature selection. The best feature selector depends on your intended model. Use fscmrmr (Statistics and Machine Learning Toolbox) to rank features for classification using the minimum-redundancy/maximum-relevance (MRMR) algorithm. MRMR is a sequential algorithm that looks for a set of features that are mutually as dissimilar as possible (minimum redundancy) while still representing the response variable effectively (maximum relevance).

rng("default") % for reproducibility
[featureSelectionIdx,featureSelectionScores] = fscmrmr(X,T);
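
To build intuition for the ranking, you can run fscmrmr on synthetic data in which the role of each column is known. The following is a minimal sketch on made-up data (all toy variables are hypothetical), not a statement about how the audio features behave.

```matlab
rng("default")                       % for reproducibility
numObservations = 500;

% Hypothetical two-class response.
toyT = categorical(randi([0 1],numObservations,1));

% Column 1: pure noise, unrelated to the response.
% Column 2: informative, shifted by the class.
% Column 3: nearly a copy of column 2, so it is relevant but redundant.
noise = randn(numObservations,1);
informative = double(toyT == "1") + 0.3*randn(numObservations,1);
redundant = informative + 0.05*randn(numObservations,1);
toyX = [noise,informative,redundant];

% toyIdx lists the columns from most to least important.
% toyScores(k) is the score of column k of toyX.
[toyIdx,toyScores] = fscmrmr(toyX,toyT);
```

With data like this, the first-ranked column is expected to be one of the two informative columns. Comparing the scores of columns 2 and 3 gives a feel for how redundancy with an already selected feature affects the score.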

The fscmrmr function considers each column of the input feature matrix as a unique feature. Plot the scores of each scalar in the feature matrix returned by audioFeatureExtractor.

figure
bar(featureSelectionScores)
ylabel("Feature Score")
xlabel("Feature Matrix Column")

The audioFeatureExtractor extracts feature groups with varying numbers of elements. For example, the default number of elements of the MFCC feature group is 13, while the spectral centroid feature always consists of 1 element. The output map returned by calling info on audioFeatureExtractor is a struct with fields equal to the feature group and values equal to the columns that feature group occupies in the matrix output by extract. Use the output map and the supporting function uniqueFeatureName to create a unique name for each scalar feature, then plot the scores of the top 25 performing features.

featurenames = uniqueFeatureName(outputMap);

featurenamesSorted = featurenames(featureSelectionIdx);
figure
bar(reordercats(categorical(featurenames),featurenamesSorted),featureSelectionScores)
xlim([featurenamesSorted(1),featurenamesSorted(25)])

Depending on your application, you can approximate grouped feature selection by averaging the scores of feature groups. Using grouped features (for example, all MFCC) may help you deploy more efficient feature extraction. In this example, you use the top-performing feature scalars, regardless of which feature group they belong to.
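
Averaging scores by feature group can be sketched as follows. The map toyMap and the scores toyScores are small hypothetical stand-ins for outputMap and featureSelectionScores. The scores are indexed by feature matrix column, which is the order that fscmrmr returns them in.

```matlab
% Hypothetical output map and per-column scores.
toyMap.mfcc = 1:3;
toyMap.spectralCentroid = 4;
toyMap.pitch = 5;
toyScores = [0.2 0.4 0.6 0.1 0.9];

% Average the scores of the columns that belong to each group.
groupNames = string(fieldnames(toyMap));
groupScores = structfun(@(cols)mean(toyScores(cols)),toyMap);

% Rank the groups from highest to lowest average score.
[~,order] = sort(groupScores,"descend");
rankedGroups = groupNames(order);   % "pitch", "mfcc", "spectralCentroid"
```

The mean is one possible summary. Other choices, such as the maximum score in each group, favor groups that contain a single strong element.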

Select a number of the top-scoring features. The number you select depends on the model you are training and the final constraints of your application.

numFeatures = 30;
selectedFeatureIndex = featureSelectionIdx(1:numFeatures);

Train Model

To train a KNN model using your selected features, use fitcknn (Statistics and Machine Learning Toolbox). If you are unsure of which machine learning model you want to use, try fitcauto (Statistics and Machine Learning Toolbox) to automatically select a classification model with optimized parameters, or try the Classification Learner (Statistics and Machine Learning Toolbox).

Mdl = fitcknn(X(:,selectedFeatureIndex),T,Standardize=true);

Evaluate Model

Spot-check the model's performance.

Read a sample from the test set. Listen to the sample and then plot its waveform and display the ground-truth label.

[x,xInfo] = read(adsTest);
sound(x,xInfo.SampleRate)

t = (0:numel(x)-1)/xInfo.SampleRate;
figure
plot(t,x)
title("Label: " + xInfo.Label)
grid on
axis tight
ylabel("Amplitude")
xlabel("Time (s)")

Extract features from analysis windows.

yPerWindow = extract(afe,x);

Predict a label for each analysis window.

t = predict(Mdl,yPerWindow(:,selectedFeatureIndex));

trueLabel = categorical(xInfo.Label)

trueLabel = categorical
     0

predictionsPerWindow = categorical(t')

predictionsPerWindow = 1×39 categorical array
    "0"    "0"    "3"    "3"    "3"    "3"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"

Create a file-level prediction by taking the mode of window-level predictions.

prediction = mode(predictionsPerWindow)

prediction = categorical
     0
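
Besides the mode itself, the fraction of windows that agree with it can serve as a rough indication of how consistent the window-level predictions are. The following sketch uses a short hypothetical prediction vector. The vote share is an illustrative addition and is not part of the original workflow.

```matlab
% Hypothetical window-level predictions for one file.
toyPredictions = categorical(["0" "0" "3" "3" "0" "0" "0"]);

% File-level prediction: the most frequent window-level label.
toyPrediction = mode(toyPredictions);

% Fraction of windows that voted for the winning label (5 of 7 here).
voteShare = mean(toyPredictions == toyPrediction);
```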

Analyze the whole-word performance over the entire test set.

Tfile = categorical(adsTest.Labels);
featuresTest = extract(afe,adsTest,UseParallel=canUseParallelPool);
Y = cellfun(@(x)mode(categorical(predict(Mdl,x(:,selectedFeatureIndex)))),featuresTest,UniformOutput=false);
Y = cat(1,Y{:});

figure
confusionchart(Tfile,Y,Title="Accuracy = " + 100*mean(Tfile==Y) + " (%)")
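
The accuracy in the chart title is the percentage of files whose predicted label matches the true label. If you also want per-class numbers, one option is confusionmat (Statistics and Machine Learning Toolbox). The sketch below uses hypothetical labels, and the per-class recall is an illustrative addition.

```matlab
% Hypothetical true and predicted file-level labels.
toyTrue = categorical(["0";"0";"1";"1";"1"]);
toyPred = categorical(["0";"1";"1";"1";"1"]);

% Overall accuracy in percent: 4 of 5 files are correct.
toyAccuracy = 100*mean(toyTrue == toyPred);   % 80

% Rows of C are true classes and columns are predicted classes.
C = confusionmat(toyTrue,toyPred);            % [1 1; 0 3]

% Recall per class: correct predictions divided by the number of true members.
recallPerClass = diag(C)./sum(C,2);           % [0.5; 1]
```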

You can apply a similar approach to select the window type, window length, window overlap, DFT length, and input to the spectral descriptors.

Supporting Functions

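function c = uniqueFeatureName(afeInfo)
%UNIQUEFEATURENAME Create unique feature names
%c = uniqueFeatureName(afeInfo) creates a unique set of feature names
%for each element of each feature described in the afeInfo struct. The
%afeInfo struct is returned by the info object function of
%audioFeatureExtractor.

% Repeat each feature group name once per column that the group occupies.
a = repelem(fields(afeInfo),structfun(@numel,afeInfo));
% Make repeated names unique by appending _1, _2, ... to the duplicates.
b = matlab.lang.makeUniqueStrings(a);
% Locate the second element of each multi-element group.
d = find(endsWith(b,"_1"));
% Remove the underscores, then append 0 to the first element of each
% multi-element group so that its elements are numbered from 0.
c = strrep(b,"_","");
c(d-1) = strcat(c(d-1),"0");
end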
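
To see what kind of names uniqueFeatureName produces, the following sketch repeats its steps on a small hypothetical map with one three-element group and one scalar feature.

```matlab
% Hypothetical output map: a 3-element group and a scalar feature.
toyInfo.mfcc = 1:3;
toyInfo.pitch = 4;

% Repeat each group name once per column: mfcc, mfcc, mfcc, pitch.
a = repelem(fieldnames(toyInfo),structfun(@numel,toyInfo));

% Make duplicates unique: mfcc, mfcc_1, mfcc_2, pitch.
b = matlab.lang.makeUniqueStrings(a);

% Drop the underscores and number the first element of each group with 0.
d = find(endsWith(b,"_1"));
toyNames = strrep(b,"_","");
toyNames(d-1) = strcat(toyNames(d-1),"0");
```

The resulting names are mfcc0, mfcc1, mfcc2, and pitch. Scalar features keep their group name without a numeric suffix.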

References

[1] Jakobovski. “Jakobovski/Free-Spoken-Digit-Dataset.” GitHub, May 30, 2019. https://github.com/Jakobovski/free-spoken-digit-dataset.