Robust Classification of Emotion in Human Speech Using Spectrogram Features
The recognition of emotions such as anger, anxiety, and joy from tonal variations in human speech is an important task for research and applications in human-computer interaction. The objective of this research is to design, implement, and test a Speech Emotion Classification (SEC) engine that can extract useful features and accurately classify emotions in human speech in the presence of speaker-dependent variations and noise. Current approaches extract a handful of standard global values from the temporal sequence of power spectra, such as pitch, formants, and energy, together with values from the time signal, such as attack and decay rates. In this work, the frequency dimension of the spectrogram is quantized to approximate the Bark scale of the human auditory system, the time dimension is quantized into units starting from 50 ms, and the linear-regression coefficients of the surface of each spectrogram segment are combined into a feature vector. In this way, complete local features are extracted, yielding a much larger training sample. The accumulated feature vectors for each emotion category provide a robust training basis for a state-of-the-art classifier such as an SVM. To further improve the performance of the SEC engine and to demonstrate the flexibility and benefit of local features, a backward-context scheme is introduced. A series of experiments was designed and conducted on the EMO-DB and LDC-DB speech emotion databases to measure the performance of the SEC engine. First, accuracy and precision are measured over seven to fifteen emotion categories when the engine is trained on randomly sampled utterances. Next, generalization performance is measured through a speaker cross-validation scheme.
Third, the generalization and robustness of the SEC engine are measured by performing gender, language, and speaker classification with the same engine, thereby gauging its discriminative power with respect to speaker-dependent variations. Finally, robustness is measured as the SNR of the input speech is varied between 10 and 50 dB.
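The feature-extraction scheme described above can be sketched in code. This is a minimal illustration, not the thesis implementation: the STFT parameters, the number of Bark-like bands, the Traunmüller Bark approximation in `bark_edges`, and the plane model `z = a*t + b*f + c` fitted to each segment are all assumptions made here for concreteness.

```python
import numpy as np

def bark_edges(fmax, n_bands):
    # Hypothetical helper: split 0..fmax into bands spaced roughly
    # according to the Bark scale (Traunmüller approximation).
    bark_max = 26.81 * fmax / (1960.0 + fmax) - 0.53
    barks = np.linspace(0.0, bark_max, n_bands + 1)
    return 1960.0 * (barks + 0.53) / (26.28 - barks)

def segment_plane_features(signal, fs, n_bands=16, seg_ms=50, n_fft=512, hop=128):
    """Quantize a power spectrogram into Bark-like frequency bands and
    seg_ms time segments, then fit a plane z = a*t + b*f + c to each
    segment's log-power surface; (a, b, c) per segment is one local
    feature vector."""
    # Short-time power spectrogram via a Hann-windowed STFT.
    win = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (frames, bins)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)

    # Pool FFT bins into Bark-like bands.
    edges = bark_edges(fs / 2.0, n_bands)
    bands = np.stack([
        spec[:, (freqs >= lo) & (freqs < hi)].mean(axis=1)
        for lo, hi in zip(edges[:-1], edges[1:])
    ], axis=1)                                             # (frames, bands)
    logp = np.log(bands + 1e-10)

    # Slice time into seg_ms segments and fit a regression plane to each.
    frames_per_seg = max(1, int(seg_ms / 1000.0 * fs / hop))
    feats = []
    for s in range(0, logp.shape[0] - frames_per_seg + 1, frames_per_seg):
        patch = logp[s:s + frames_per_seg]                 # (T, B) surface
        tt, ff = np.meshgrid(np.arange(patch.shape[0]),
                             np.arange(patch.shape[1]), indexing="ij")
        A = np.column_stack([tt.ravel(), ff.ravel(), np.ones(patch.size)])
        coef, *_ = np.linalg.lstsq(A, patch.ravel(), rcond=None)
        feats.append(coef)                                 # (a, b, c)
    return np.array(feats)                                 # (n_segments, 3)
```

Because every 50 ms segment contributes its own coefficient vector, a single utterance yields many training samples, which is the "larger sample" the abstract refers to.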
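The SNR-robustness experiment implies corrupting clean utterances with noise at a controlled level. A minimal sketch of such a mixer follows; the abstract does not specify the noise type, so additive white Gaussian noise is an assumption here.

```python
import numpy as np

def add_noise_at_snr(clean, snr_db, rng=None):
    """Mix white Gaussian noise (an assumption; the noise type is not
    specified in the abstract) into a clean signal at a target SNR in dB."""
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal(len(clean))
    p_sig = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale noise so that 10*log10(p_sig / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Sweeping `snr_db` from 50 dB (nearly clean) down to 10 dB (heavily corrupted) and re-measuring classification accuracy at each level reproduces the shape of the robustness experiment.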