Spoken Kannada letters recognition using Machine Learning

A neural network is trained on a dataset of spoken Kannada letters using MFCC and LPC coefficients as features, and the accuracy obtained with each feature set is compared using confusion matrices.

Ravikiran Bidari

Version 1.0.2 (51.8 MB)

95 downloads

12 August 2021

Intro:

This MATLAB code shows how to train a deep learning model that recognizes spoken Kannada letters in audio. The example uses a prepared Kannada dataset to train a convolutional neural network to recognize a spoken Kannada letter.

To train a network from scratch, you must first prepare the data set.

Create Training Datastore

Create an audioDatastore (Audio Toolbox) that points to the training data set. Common practice is to divide a dataset into a training set and a validation set in a ratio of 4:1. The held-out validation set is used to estimate how accurately the trained network classifies data it has not seen during training.
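A minimal sketch of the 4:1 split, assuming the recordings are stored in one subfolder per letter. The folder name, the two labels, and the synthetic noise clips below are hypothetical placeholders so that the snippet runs without the real dataset; it requires Audio Toolbox.

```matlab
% Build a tiny synthetic dataset: one subfolder per label, five clips each.
% Folder name and labels are hypothetical, not the real Kannada dataset.
fs = 16000;
dataFolder = fullfile(tempdir, 'kannadaDemo');
labels = {'a', 'i'};
for k = 1:numel(labels)
    classFolder = fullfile(dataFolder, labels{k});
    if ~exist(classFolder, 'dir')
        mkdir(classFolder);
    end
    for n = 1:5
        fileName = fullfile(classFolder, sprintf('clip%d.wav', n));
        audiowrite(fileName, 0.01*randn(fs, 1), fs);   % 1 s of quiet noise
    end
end

% Point a datastore at the folder and use the subfolder names as labels.
ads = audioDatastore(dataFolder, ...
    'IncludeSubfolders', true, ...
    'FileExtensions', '.wav', ...
    'LabelSource', 'foldernames');

% Keep 80% of each label for training and the remaining 20% for validation.
[adsTrain, adsValidation] = splitEachLabel(ads, 0.8);
disp(countEachLabel(adsTrain));
disp(countEachLabel(adsValidation));
```

Passing 'randomized' as a third argument to splitEachLabel draws the files for each partition at random instead of taking them in order.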

Choose Words to Recognize

Specify the words that you want your model to recognize as commands. Label all words that are not commands as unknown. Labeling words that are not commands as unknown creates a group of words that approximates the distribution of all words other than the commands. The network uses this group to learn the difference between commands and all other words.

To reduce the class imbalance between the known and unknown words and speed up processing, only include a fraction (2%) of the unknown words in the training set.

Use subset (Audio Toolbox) to create a datastore that contains only the commands and the subset of unknown words. Count the number of examples belonging to each category.
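The labelling logic described above can be sketched on a plain categorical array. The label names and counts are invented for illustration; with a real audioDatastore named ads, the same logical mask could be passed to subset(ads, keep).

```matlab
rng(0);   % make the random draw repeatable

% Hypothetical labels: three "command" classes and two other classes.
commands = categorical(["a" "aa" "i"]);
allLabels = categorical([repmat("a", 1, 20) repmat("aa", 1, 20) ...
    repmat("i", 1, 20) repmat("ka", 1, 500) repmat("ga", 1, 500)])';

isCommand = ismember(allLabels, commands);
isUnknown = ~isCommand;

% Keep only a small fraction (2%) of the non-command examples.
includeFraction = 0.02;
keepUnknown = isUnknown & (rand(numel(allLabels), 1) < includeFraction);

% Relabel every non-command example as "unknown".
labels = allLabels;
labels(isUnknown) = categorical("unknown");

% Select commands plus the sampled unknowns and count each category.
keep = isCommand | keepUnknown;
trainLabels = removecats(labels(keep));
summary(trainLabels)
```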

Compute Auditory Spectrograms

To prepare the data for efficient training of a convolutional neural network, convert the speech waveforms to auditory-based spectrograms.

Define the parameters of the feature extraction. segmentDuration is the duration of each speech clip (in seconds). frameDuration is the duration of each frame for spectrum calculation. hopDuration is the time step between each spectrum. numBands is the number of filters in the auditory spectrogram.
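As an illustration of how these parameters relate, the sketch below converts them to sample counts. The sample rate and all durations are assumed values chosen for the example; the submission's actual settings may differ.

```matlab
fs = 16e3;                 % assumed sample rate in Hz
segmentDuration = 1;       % seconds per clip (assumed)
frameDuration = 0.025;     % seconds per analysis frame (assumed)
hopDuration = 0.010;       % seconds between frames (assumed)
numBands = 50;             % filters in the auditory spectrogram (assumed)

segmentSamples = round(segmentDuration*fs);     % 16000 samples per clip
frameSamples = round(frameDuration*fs);         % 400 samples per frame
hopSamples = round(hopDuration*fs);             % 160 samples per hop
overlapSamples = frameSamples - hopSamples;     % 240 samples shared by neighbours

% Number of complete frames that fit in one clip.
numHops = floor((segmentSamples - frameSamples)/hopSamples) + 1;
fprintf('Each clip yields a %d-by-%d spectrogram.\n', numHops, numBands);
```

With these assumed values every clip produces 98 frames, so the network input would be 98-by-50.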

Create an audioFeatureExtractor (Audio Toolbox) object to perform the feature extraction.

Read a file from the dataset. Training a convolutional neural network requires input to be a consistent size. Some files in the data set are less than 1 second long. Apply zero-padding to the front and back of the audio signal so that it is of length segmentSamples.
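The padding step might look like the following sketch, which uses a synthetic clip and an assumed target length of 16000 samples. Splitting the deficit with floor and ceil handles an odd number of missing samples.

```matlab
segmentSamples = 16000;          % assumed target length (1 s at 16 kHz)
x = 0.01*randn(12001, 1);        % synthetic clip shorter than the target

% Split the missing samples between the front and the back of the signal.
deficit = segmentSamples - size(x, 1);
numToPadFront = floor(deficit/2);
numToPadBack = ceil(deficit/2);

xPadded = [zeros(numToPadFront, 1, 'like', x); ...
           x; ...
           zeros(numToPadBack, 1, 'like', x)];
```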

To extract audio features, call extract. The output is a mel spectrum with time across rows.
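A hedged sketch of the extraction call, using a synthetic one-second signal and assumed frame settings (400-sample periodic Hann window, 160-sample hop, 16 kHz). The number of mel bands is left at the toolbox default here because the function used to change extractor parameters differs between releases. Requires Audio Toolbox.

```matlab
fs = 16e3;             % assumed sample rate
frameSamples = 400;    % assumed 25 ms frame
hopSamples = 160;      % assumed 10 ms hop

afe = audioFeatureExtractor( ...
    'SampleRate', fs, ...
    'Window', hann(frameSamples, 'periodic'), ...
    'OverlapLength', frameSamples - hopSamples, ...
    'FFTLength', 512, ...
    'melSpectrum', true);

x = 0.01*randn(fs, 1);            % synthetic 1 s clip, already full length
features = extract(afe, x);       % rows are time frames, columns are bands
[numHops, numFeatures] = size(features);
fprintf('%d frames, %d mel bands\n', numHops, numFeatures);
```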

In this example, you post-process the auditory spectrogram by applying a logarithm. Taking a log of small numbers can lead to roundoff error.

Scale the features by the window power and then take the log. To obtain data with a smoother distribution, take the logarithm of the spectrograms using a small offset.
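The effect of the offset can be seen on a toy spectrogram that contains an exact zero. The scale factor 2/sum(win)^2 is an assumption about what "window power" means here, and the offset 1e-6 is an illustrative choice.

```matlab
frameSamples = 400;                          % assumed frame length
n = (0:frameSamples-1)';
win = 0.5 - 0.5*cos(2*pi*n/frameSamples);    % periodic Hann window

S = [0 1e-9; 0.5 2];                 % toy spectrogram values, one exact zero

unNorm = 2/(sum(win)^2);             % assumed window-power scale factor
S = S/unNorm;

epsil = 1e-6;                        % small offset (illustrative value)
logS = log10(S + epsil);             % finite everywhere, including the zero bin

disp(log10(0));                      % -Inf without the offset
disp(logS);
```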

Isolate the training and validation labels. Remove empty categories using removecats.

Visualize Data

Plot the waveforms and auditory spectrograms of a few training samples and play the corresponding audio clips to confirm that the correct labels are assigned in the training and validation sets.

Plot the distribution of the different class labels in the training and validation sets.

Define Neural Network Architecture:

Create a simple network architecture as an array of layers. Use convolutional and batch normalization layers, and downsample the feature maps "spatially" (that is, in time and frequency) using max pooling layers. Add a final max pooling layer that pools the input feature map globally over time. This enforces (approximate) time-translation invariance in the input spectrograms, allowing the network to perform the same classification independent of the exact position of the speech in time. Global pooling also significantly reduces the number of parameters in the final fully connected layer. To reduce the possibility of the network memorizing specific features of the training data, add a small amount of dropout to the input to the last fully connected layer.

The network is small, as it has only five convolutional layers with few filters. numF controls the number of filters in the convolutional layers. To increase the accuracy of the network, try increasing the network depth by adding identical blocks of convolutional, batch normalization, and ReLU layers. You can also try increasing the number of convolutional filters by increasing numF.
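One way such a layer array could be written is sketched below. The input size, number of classes, numF, and dropout probability are assumed values, and the standard classificationLayer stands in for the weighted loss layer so that the snippet does not depend on a custom class. Requires Deep Learning Toolbox.

```matlab
numHops = 98;        % assumed number of time frames per spectrogram
numBands = 50;       % assumed number of frequency bands
numClasses = 5;      % hypothetical number of letter classes
numF = 12;           % base number of convolutional filters (assumed)
dropoutProb = 0.2;   % assumed dropout probability

% Three stride-2 poolings shrink the time axis by about 8x (98 -> 49 -> 25 -> 13).
timePoolSize = ceil(numHops/8);

layers = [
    imageInputLayer([numHops numBands 1])

    convolution2dLayer(3, numF, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3, 'Stride', 2, 'Padding', 'same')

    convolution2dLayer(3, 2*numF, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3, 'Stride', 2, 'Padding', 'same')

    convolution2dLayer(3, 4*numF, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3, 'Stride', 2, 'Padding', 'same')

    convolution2dLayer(3, 4*numF, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer

    convolution2dLayer(3, 4*numF, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer

    maxPooling2dLayer([timePoolSize 1])   % pools over all remaining time steps
    dropoutLayer(dropoutProb)
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];                 % stand-in for the weighted loss layer
```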

Use a weighted cross entropy classification loss. The custom layer weightedClassificationLayer(classWeights) computes the cross entropy loss with observations weighted by classWeights. Specify the class weights in the same order as the classes appear in categories(YTrain). To give each class equal total weight in the loss, use class weights that are inversely proportional to the number of training examples in each class. When using the Adam optimizer to train the network, the training algorithm is independent of the overall normalization of the class weights.
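Inverse-frequency class weights can be computed as in this sketch, which uses a small invented label vector in place of the real YTrain. Dividing by the mean is an optional normalization that keeps the weights near 1.

```matlab
% Hypothetical training labels with imbalanced classes (4, 2, and 1 examples).
YTrain = categorical(["a"; "a"; "a"; "a"; "i"; "i"; "u"]);

counts = countcats(YTrain);                        % same order as categories(YTrain)
classWeights = 1./counts;                          % inversely proportional to class size
classWeights = classWeights'/mean(classWeights);   % row vector with mean 1

% Each class now contributes the same total weight to the loss.
totalWeightPerClass = classWeights(:).*counts;
disp(table(categories(YTrain), counts, classWeights(:), totalWeightPerClass, ...
    'VariableNames', {'Class', 'Count', 'Weight', 'TotalWeight'}));
```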

Evaluate Trained Network

Calculate the final accuracy of the network on the training set (without data augmentation) and validation set. The network is very accurate on this data set. However, the training, validation, and test data all have similar distributions that do not necessarily reflect real-world environments. This limitation particularly applies to the unknown category, which contains utterances of only a small number of words.
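The accuracy itself is the fraction of predictions that match the true labels. The sketch below uses invented labels; in the real workflow the predictions would come from something like classify(trainedNet, XValidation), where trainedNet and XValidation are names assumed here.

```matlab
classes = ["a" "i" "u"];   % hypothetical class names

% Invented true labels and predictions sharing the same category set.
YValidation = categorical(["a"; "a"; "a"; "i"; "i"; "u"], classes);
YValPred = categorical(["a"; "a"; "i"; "i"; "i"; "u"], classes);

validationAccuracy = mean(YValPred == YValidation);
validationError = 1 - validationAccuracy;
fprintf('Validation accuracy: %.1f%%\n', 100*validationAccuracy);
```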

Prediction accuracy

Plot the confusion matrix. Display the precision and recall for each class by using column and row summaries. Sort the classes of the confusion matrix. The largest confusion is between unknown words and commands.
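To make the row and column summaries concrete, the sketch below builds a confusion matrix by hand from invented labels, with rows as true classes and columns as predicted classes, and derives per-class precision and recall from it.

```matlab
classes = ["a" "i" "u"];   % hypothetical class names
YTrue = categorical(["a"; "a"; "a"; "i"; "i"; "u"], classes);
YPred = categorical(["a"; "a"; "i"; "i"; "i"; "u"], classes);

% C(r, c) counts examples of true class r that were predicted as class c.
numClasses = numel(classes);
C = accumarray([double(YTrue) double(YPred)], 1, [numClasses numClasses]);

precision = diag(C)./sum(C, 1)';   % column-normalized diagonal
recall = diag(C)./sum(C, 2);       % row-normalized diagonal

disp(C);
disp(table(classes', precision, recall, ...
    'VariableNames', {'Class', 'Precision', 'Recall'}));
```

For plotting, confusionchart(YTrue, YPred) draws the same matrix; setting its ColumnSummary property to 'column-normalized' and RowSummary to 'row-normalized' displays the precision and recall values alongside it.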

Code execution sequence:

mfccAudioFeatureAnalysis
LPCAudioFeatureAnalysis
MFCCTrainingCode – Don’t close training progress graph
ConfusionMatrixCode – Don’t close matrix chart
MFCCSerailTestingCode – show Result_Table
MFCCRandomTestingCode – command window output
MFCCLiveDemo – Live record window
LPCTrainingCode – Don’t close training progress graph
ConfusionMatrixCode – Don’t close matrix chart
LPCSerailTestingCode – show Result_Table
LPCRandomTestingCode – command window output
LPCLiveDemo – Live record window
Finally, show the training progress comparison and the confusion matrix comparison

Cite As

Ravikiran Bidari (2026). Spoken kannada letters recognition using Machine Learning (https://fr.mathworks.com/matlabcentral/fileexchange/97492-spoken-kannada-letters-recognition-using-machine-learning), MATLAB Central File Exchange. Retrieved July 8, 2026.

Acknowledgements

Inspired by: Speech Recognition, Speech recognition using MFCC and LPC

MATLAB Release Compatibility

Compatible with R2020a and later releases

Platform Compatibility

Windows
macOS
Linux

Version	Published	Release Notes
1.0.2	12 August 2021	The execution flow is updated in the description
1.0.1	12 August 2021	Updated to ease the execution flow
1.0.0	12 August 2021	