YAMNet sound classification network
Audio Toolbox / Deep Learning
The YAMNet block uses a pretrained sound classification network, trained on the AudioSet dataset, to predict audio events from the AudioSet ontology.
features— Mel spectrograms
Mel spectrograms, specified as a 96-by-64 matrix or a 96-by-64-by-1-by-N array, where:
96 –– Represents the number of 10 ms frames in each mel spectrogram.
64 –– Represents the number of mel bands spanning 125 Hz to 7.5 kHz.
N –– Number of channels.
You can use the YAMNet Preprocess block to generate 96-by-64 mel spectrograms with these dimensions.
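You can also generate these spectrograms programmatically. A minimal sketch using the Audio Toolbox yamnetPreprocess function, the MATLAB counterpart of the YAMNet Preprocess block; the file name is a placeholder:

% Extract 96-by-64 mel spectrograms from an audio signal.
[audioIn, fs] = audioread("speech.wav");   % hypothetical file name
melSpect = yamnetPreprocess(audioIn, fs);  % 96-by-64-by-1-by-N array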
sound— Predicted sound label
Predicted sound label, returned as an enumerated scalar.
scores— Predicted activations or scores
Predicted activation or score values for each supported sound label, returned as a 1-by-521 vector, where 521 is the number of classes in YAMNet.
labels— Class labels for predicted scores
Class labels for predicted scores, returned as a 1-by-521 vector.
Mini-batch size— Size of mini-batches
128 (default) | positive integer
Size of mini-batches to use for prediction, specified as a positive integer. Larger mini-batch sizes require more memory, but can lead to faster predictions.
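At the MATLAB command line, the same trade-off is controlled by the MiniBatchSize name-value argument of classify. A minimal sketch, assuming melSpect is an array of mel spectrograms as described above:

% Classify a batch of spectrograms, 128 at a time.
net = yamnet;  % pretrained network (requires the YAMNet support package)
sound = classify(net, melSpect, "MiniBatchSize", 128);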
Classification— Select to output sound classification
Enable the output port sound, which outputs the classified sound.
Predictions— Output all scores and associated labels
Enable the output ports scores and labels, which output all predicted scores and associated class labels.
The block accepts mel spectrograms of size 96-by-64 or 96-by-64-by-1-by-N and computes up to three outputs from these spectrograms:
sound: The label of the most likely sound. The block returns one sound label for each 96-by-64 spectrogram input.
scores: 1-by-521 vectors. Each element in a vector is the score for one of the supported sound labels.
labels: 1-by-521 vectors. Each element in a vector is a sound label.
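A minimal MATLAB sketch of the same three outputs, assuming melSpect holds one or more 96-by-64 spectrograms as described above:

% Score every spectrogram against all 521 classes.
net = yamnet;                      % pretrained network
scores = predict(net, melSpect);   % one 1-by-521 score vector per spectrogram
labels = net.Layers(end).Classes;  % the 521 class labels
[~, idx] = max(scores, [], 2);
sound = labels(idx);               % most likely sound per spectrogram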
 Gemmeke, Jort F., Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 776–80. DOI.org (Crossref), doi:10.1109/ICASSP.2017.7952261.
 Hershey, Shawn, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, et al. “CNN Architectures for Large-Scale Audio Classification.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 131–35. DOI.org (Crossref), doi:10.1109/ICASSP.2017.7952132.
Usage notes and limitations:
The Language parameter in the Configuration Parameters > Code Generation general category must be set to C++.
For ERT-based targets, the Support: variable-size signals parameter in the Code Generation > Interface pane must be enabled.
For a list of networks and layers supported for code generation, see Networks and Layers Supported for Code Generation (MATLAB Coder).
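A sketch of setting these configuration parameters programmatically, assuming a model named soundClassifier (hypothetical):

% Generate C++ code and, for ERT targets, allow variable-size signals.
set_param('soundClassifier', 'TargetLang', 'C++');
set_param('soundClassifier', 'SupportVariableSizeSignals', 'on');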