Computing Reviews
Emotion recognition using speech features
Krothapalli S., Koolagudi S., Springer Publishing Company, Incorporated, New York, NY, 2013. 136 pp. Type: Book (978-1-461451-42-6)
Date Reviewed: Apr 19 2013

This short book is part of a series that aims to provide “concise summaries of cutting-edge research and practical applications.” As such, it fulfills one’s expectations quite well. The book is a hybrid, combining a state-of-the-art survey and a fairly detailed report on the implementation of several emotion recognition methods based on speech features.

The book starts with a very brief introduction to the psychology of emotion and its manifestations in the speech signal. This is followed by a more detailed presentation on the speech production process and the human vocal apparatus. This sets the stage for the description of the main types of speech signal features around which the content of the book is structured: source features, vocal tract system (spectral) features, and prosody features. The authors also describe some practical aspects of emotion recognition research, such as the need for emotional speech databases and potential applications.

The second chapter presents a useful review of speech corpora used in emotion recognition, research on emotion recognition based on different types of features, and categorization models used in this task. The survey of speech corpora is fairly detailed. Speech corpora are distinguished by language, range of emotions tagged, number of speakers, nature of the emotional speech collected (simulated, elicited, or natural emotion), and purpose of the corpus. The descriptions of these databases are supported by references to the literature. However, on closer inspection, a good number of these references prove to be incorrect. Inconsistent bibliographical referencing is, in fact, a recurring problem in the book. The references appear in order of citation, and some of them are repeated several times in the reference list at the end of the book. Although the reader can, with some knowledge and patience, piece together most of the citations, a little editorial care would have avoided such unnecessary effort.

The survey of the features used in different approaches to emotion recognition in speech covers source, spectral, and prosodic features. Spectral and prosodic features account for most research in emotion recognition. The authors conclude that more research is needed on excitation source features, which have been used in other speech recognition tasks but seldom for emotion recognition. The review of features is complemented by a survey of the classifiers employed in this type of work.

In chapters 3 to 5, the authors describe their implementations of approaches to emotion classification based on each of the identified types of features. The main innovative aspect of this work is the focus on excitation source features in chapter 3. This chapter also describes the collection of a simulated emotional speech corpus in the Telugu language, one of the major languages spoken in India. The signal processing methods used to extract source features from the speech signal are explained in some detail, and references to the relevant literature are given. Source features are essentially derived from the linear prediction residual signal, so an appendix on linear prediction analysis is included as a useful reference. The classification models used with these source features are auto-associative neural networks (AANNs) and support vector machines (SVMs). This combination produces surprisingly good results on the (admittedly clean) speech of the Telugu corpus and the Berlin emotional speech corpus (Emo-DB).
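
To give a concrete sense of the kind of processing involved, the sketch below computes the linear prediction residual of a speech frame, the signal from which excitation source features are typically derived. It is a minimal illustration, not the authors' implementation; it assumes Python with the librosa and scipy libraries, and the file name is a placeholder.

# Minimal sketch, not the authors' code: inverse-filter a speech frame with
# its LP coefficients to obtain the LP residual (prediction error), the
# starting point for excitation source features.
import librosa
import scipy.signal

def lp_residual(frame, order=16):
    a = librosa.lpc(frame, order=order)            # LP coefficients, a[0] == 1
    return scipy.signal.lfilter(a, [1.0], frame)   # prediction error signal

# Example on one 25 ms frame of a 16 kHz recording ("utterance.wav" is a placeholder)
y, sr = librosa.load("utterance.wav", sr=16000)
frame = y[: int(0.025 * sr)]
residual = lp_residual(frame, order=int(sr / 1000) + 2)  # common rule of thumb for LP order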

The next two chapters follow the same template as chapter 3. The feature extraction techniques are described, covering linear prediction cepstrum coefficients (LPCCs) and mel-frequency cepstral coefficients (MFCCs) for vocal tract system features, and duration, pitch, and energy for prosodic features (both global and local). The authors discuss the classifiers used, namely Gaussian mixture models (GMMs) and SVMs for vocal tract features and SVMs for prosodic features, and present results and concise discussions. The book concludes with a chapter containing a summary, conclusions, and a somewhat unstructured list of future directions.
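
For readers unfamiliar with this type of pipeline, the following fragment illustrates a generic spectral-feature baseline of the kind surveyed here: utterance-level MFCC statistics classified with an SVM. It is a sketch under stated assumptions, not the book's exact configuration; the file names and labels are placeholders.

# Minimal sketch, not the book's setup: utterance-level MFCC statistics
# classified with an SVM, a common spectral-feature baseline for emotion recognition.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_features(path, n_mfcc=13):
    # Mean and standard deviation of MFCCs over the whole utterance
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

train_files = ["anger_01.wav", "happiness_01.wav"]   # placeholder file names
train_labels = ["anger", "happiness"]                # placeholder labels

X = np.vstack([utterance_features(f) for f in train_files])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, train_labels)
print(clf.predict([utterance_features("test_utterance.wav")]))  # placeholder test file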

Oddly, the authors make no attempt to integrate the source features with the best performing approach (GMM classification based on LPCCs and formants), even though they speculate in chapter 3 that such features may complement other features in emotion recognition. Despite this, and the fact that the editorial process let through several typographical errors, incorrect citations, and other minor inconsistencies and inaccuracies, this useful book lives up to its promise to provide a concise and practical summary of a relevant research topic.

Reviewer: Saturnino Luz
Review #: CR141154 (1307-0598)
Categories:
Speech Recognition And Synthesis (I.2.7)
Applications And Expert Systems (I.2.1)
General (I.2.0)
Other reviews under "Speech Recognition And Synthesis":

On-line recognition of spoken words from a large vocabulary
Kohonen T. (ed), Riittinen H., Reuhkala E., Haltsonen S. Information Sciences 33(1-2): 3-30, 1984. Type: Article. Date reviewed: Oct 1 1985

Connected spoken word recognition algorithms by constant time delay DP, O(n) DP and augmented continuous DP matching
Nakagawa S. Information Sciences 33(1-2): 63-85, 1984. Type: Article. Date reviewed: Jun 1 1985

The phonetic basis for computer speech processing
Ladefoged P., Prentice Hall International (UK) Ltd., Hertfordshire, UK, 1985. Type: Book (9789780131638419). Date reviewed: Dec 1 1987
