A Study On Hidden Markov Models Computer Science Essay


In this paper we describe emotional speech recognition with the use of Hidden Markov Models (HMMs). The HMM, being a dynamic model, takes into account how the prosodic contours of the short-term features extracted from speech change over time. As a result, in many cases an HMM produces better results in classifying human emotions from speech signals than techniques that use the statistics of the short-time features (k-NN, SVM, LDA, ANN, etc.). In this document we describe speech emotion HMM modeling, present several methods from the literature based on this modeling, and evaluate them by comparing them with static classifiers. Finally, we discuss their advantages and disadvantages and when they should be used.

Key-Words: Speech emotion recognition; Hidden Markov model

1 Introduction

Human emotions are subjective experiences, often associated with mood, temperament, personality, and disposition. Before talking about automatic emotion recognition, it is a good idea to define emotion by understanding its psychological and biological aspects.

Tin Lay Nwe et al. [1] wrote that “emotion can be described by a cause-and-effect procedure. There is a stimulus that excites the emotion, which is the cognition ability of the agent to appraise the nature of the stimulus and subsequently have his/her mental and physical responses to the stimulus. The mental response is in the form of an emotional state. The physical response is in the form of ‘fight or flight', or, as described by Fox (1992), approach or withdrawal.” Nwe et al. [1] also wrote that “from a biological perspective, Darwin (1965) looked at the emotional and physical responses as distinctive action patterns selected by evolution because of their survival value. Thus, emotional arousal will have an effect on the heart rate, skin resistivity, temperature, pupillary diameter, and muscle activity, as the agent prepares for ‘fight or flight'. As a result, the emotional state is also manifested in spoken words and facial expressions.”

To create an automatic emotion recognizer, we have to assign labels or classes of emotion to the recognizer, so that it can classify a spoken utterance into a certain class. The majority of the literature reports six emotions as the ‘primary emotions' or ‘archetypal emotions': Joy, Sadness, Fear, Anger, Surprise, and Disgust [1][2][6]. These are considered the most popular (although many more exist) and are used for emotion classification in speech by the majority of existing emotion recognition techniques. Some researchers consider the Neutral state, in which one does not express any emotion, as a seventh emotion label [4][5][6].

A first attempt to make a machine recognize emotions dates back to the mid-eighties, when Van Bezooijen [12] and Tolkmitt and Scherer [13] used statistical properties of certain acoustic features to create models of human emotion. Nowadays research focuses on creating hybrid methods, combining classifiers to best classify human emotion in real-life applications.

The applications of automatic speech emotion recognition include human-computer interaction, human-robot interaction [17], dialogue systems [15], call centers [16], tutoring, alerting, entertainment [14], and medicine and psychology [10]. Certain methods are also used in automatic speech recognition (ASR) to resolve linguistic ambiguities [8].

Emotional speech data collections are very useful to researchers, as they allow them to test their methods and compare them with other methods on the same datasets. Nowadays more than 64 emotional speech data collections can be found, each with a different language, number and profession of subjects, recorded emotional states, and kind of emotions (natural, simulated, elicited) [10]. Among the most widely accepted and used are the Danish and Berlin databases [18].

The paper is structured as follows: Section 2 introduces the linguistic and paralinguistic parts of speech as well as the features extracted from speech for emotion recognition; Section 3 briefly presents the classification methods applied to the extracted features to recognize emotions. Section 4 defines HMMs and, in its first two parts, explains how they are used for speech emotion recognition by presenting work from the literature; its third part evaluates the HMM method by comparing it with the other methods and highlighting its advantages and disadvantages. The conclusion and discussion are presented in Section 5.

2 Speech emotion features

2.1 Linguistic and paralinguistic parts of speech

Human speech is used for communication and is composed of two channels: the linguistic one and the paralinguistic one. The first corresponds to “what is said” in a conversation; it is also called the syntactic-semantic part of speech, conveying linguistic information (semantics, syntax). The paralinguistic channel corresponds to “how something is said” in a conversation and conveys the emotional and physical state of a speaker (age, sex, facial expressions, gestures, voice quality, stress and nervousness) [7][8]. Although much can be inferred about the emotional state of a speaker from the linguistic part of speech, in the majority of cases emotion recognition is accomplished by extracting mainly paralinguistic information from the speech signal, because emotional content is carried by prosodic features [9].

2.2 Speech emotion extracted features

In order to identify the emotion that underlies a speech utterance, certain paralinguistic speech features need to be extracted from the speech signal that relate to the prosody of the speaker's voice (prosody is the rhythm, stress, and intonation of speech). These prosodic features are short-term features (features extracted in a short time window) and long-term features (statistics of the short-term features).

The most common short-term features are the following: the pitch (fundamental frequency) signal, which is produced by the vibration of the vocal folds; the speech energy; the Teager energy operator, which reflects the number of harmonics due to the non-linear air flow in the vocal tract that produces the speech signal; and the vocal tract features, namely the formants (the vocal tract's resonances), the coefficients derived from the signal's frequency transformations (mel-frequency cepstral coefficients, log frequency power coefficients, linear prediction cepstral coefficients), and the vocal tract cross-section areas [10].

The most common long-term features are the mean, range, and variance of the pitch and intensity contours, as well as the speech rate of the speaker [10].
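As a sketch of the short-term/long-term distinction, the snippet below computes a per-frame log-energy contour over sliding windows and then summarizes it with the long-term statistics named above. The frame length, hop size, and the random test signal are illustrative assumptions, not values from the cited work.

```python
import numpy as np

def short_term_energy(signal, frame_len=400, hop=160):
    """Split a speech signal into overlapping frames and compute
    the short-term (per-frame) log-energy contour."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def long_term_stats(contour):
    """Summarize a short-term contour with the long-term statistics
    typically fed to static classifiers: mean, range, variance."""
    return {"mean": float(np.mean(contour)),
            "range": float(np.ptp(contour)),
            "var": float(np.var(contour))}

# Hypothetical 1-second signal at 16 kHz (white noise as a stand-in)
signal = np.random.randn(16000)
energy = short_term_energy(signal)   # one value per 25 ms frame
stats = long_term_stats(energy)      # one summary per utterance
```

A dynamic classifier such as an HMM consumes the whole `energy` sequence, while a static classifier sees only the three numbers in `stats`.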

3 Emotional speech classifiers

There are two kinds of classification techniques for human emotion in speech. As stated by Ververidis et al. [10], the first group of algorithms takes as input sequences of short-time features (prosody contours) and takes into account how they change over time; these are therefore also called dynamic modeling methods. The techniques of this group are based on Artificial Neural Networks (ANN) and Hidden Markov Models (HMM). The second group of techniques takes as input the statistics of the short-time features, and has two classes. The first class uses the estimation of the probability density function (pdf) of the features as its main tool, and comprises the Bayes classifier and Gaussian Mixture Models (GMM). The second class does not take the distributions of the features into account, and comprises support vector machines (SVM), k-nearest neighbors (k-NN), Linear Discriminant Analysis (LDA), and artificial neural networks. The main difference between the two groups is that the first takes into account the timing information that speech can provide, which is claimed to be a major advantage over the second group [1][21][22].

4 Hidden Markov Models for emotion speech recognition

4.1 Definition of the Hidden Markov model

A Hidden Markov Model (fig. 1.) is the simplest Dynamic Bayesian Network. It is a set of states, each of which is modeled with a probability distribution. The transitions among the states are governed by a set of probabilities called transition probabilities. According to the probability distribution of each state, an observation can be generated. Only these observations are visible to an external observer, not the states, which are “hidden”; that is why it is called a Hidden Markov Model.

Fig. 1. An example of an HMM with three states, three observations of the states, a probability distribution for each state, and transition probabilities among the states.

To define a HMM we need:

• The number of states of the model

• The number of observations

• A set of state transition probabilities (connectivity probabilities)

• A probability distribution in each of the states

• The initial state distributions
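These five ingredients can be written down directly as arrays. The toy sizes and probability values below are hypothetical, chosen only to show the shapes involved and the sum-to-one constraints each distribution must satisfy.

```python
import numpy as np

# A tiny discrete HMM: 3 hidden states, 2 observation symbols.
n_states, n_obs = 3, 2

# Initial state distribution pi[i] = P(first state = i)
pi = np.array([0.6, 0.3, 0.1])

# State transition probabilities A[i, j] = P(next = j | current = i)
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

# Emission probabilities B[i, k] = P(observe symbol k | state = i)
B = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])

# Every probability distribution must sum to one.
assert np.allclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```

For continuous-density HMMs, each row of `B` would be replaced by a Gaussian mixture over the feature space instead of a discrete symbol table.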

The theory of HMMs makes the following assumptions: first, the next state depends only on the current state; second, the state transition probabilities are independent of the actual time at which the transitions take place; third, the current output (observation) is statistically independent of the previous outputs (observations).

For a HMM there are three problems of interest:

The Evaluation Problem:

Given an HMM and a sequence of observations, what is the probability that the observations were generated by this HMM? The solution is given by the Forward-Backward algorithm [20].

The Decoding Problem:

Given an HMM and a sequence of observations, what is the most likely state sequence in the model that produced the observations? The solution is given by Viterbi decoding [20].

The Learning Problem:

Given a sequence of observations, what is the maximum likelihood HMM that could have produced this sequence of observations? The solution is given by the Baum-Welch algorithm [20].
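As an illustration of the evaluation problem, the sketch below implements the forward pass of the Forward-Backward algorithm for a discrete HMM. The two-state parameters are hypothetical toy values, not taken from any cited system.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: P(obs | HMM) for a discrete HMM with
    initial distribution pi, transitions A, and emissions B."""
    alpha = pi * B[:, obs[0]]            # initialize with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate one step, then emit
    return alpha.sum()                   # sum over all final states

# Hypothetical 2-state, 2-symbol HMM
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

p = forward_likelihood(pi, A, B, [0, 1, 0])  # ≈ 0.10893
```

The recursion costs O(T·N²) for T observations and N states, instead of the O(N^T) of enumerating every hidden state path.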

In order to use HMMs for speech emotion recognition, a single HMM is trained for each specific emotion; that is, its transition and output probabilities are estimated from a training set of data. Then, given an unknown speech utterance to classify, we choose the HMM that best describes the sequence of features extracted from that utterance. To create an HMM for this purpose, we have to determine its architecture, which depends on the number of states, the connectivity, and the output probabilities (whether they are discrete or continuous, and the number of mixtures each one is composed of).
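The one-HMM-per-emotion scheme just described can be sketched as follows. The two hypothetical discrete models and the symbol sequences are illustrative stand-ins for HMMs that would, in practice, be trained with Baum-Welch on labeled utterances.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(obs | HMM) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def classify(models, obs):
    """Pick the emotion whose HMM assigns the highest likelihood
    to the observation sequence."""
    return max(models, key=lambda e: forward_likelihood(*models[e], obs))

# Two hypothetical 2-state discrete HMMs, one per emotion; "anger"
# prefers symbol 0, "sadness" prefers symbol 1.
models = {
    "anger":   (np.array([0.5, 0.5]),
                np.array([[0.6, 0.4], [0.4, 0.6]]),
                np.array([[0.8, 0.2], [0.3, 0.7]])),
    "sadness": (np.array([0.5, 0.5]),
                np.array([[0.6, 0.4], [0.4, 0.6]]),
                np.array([[0.2, 0.8], [0.1, 0.9]])),
}

print(classify(models, [0, 0, 1, 0, 0]))  # prints "anger"
```

Real systems compute log-likelihoods to avoid numerical underflow on long utterances, but the decision rule is the same argmax.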

The HMMs are used to represent emotions at the phoneme, word, or utterance segmentation level. They usually have 5 to 10 states per model, and their output probability densities are modeled with a low number of mixtures [19]. As far as their connectivity is concerned, the usual HMM topologies are linear models with forward connections, as well as with forward-backward connections; left-right models and ergodic (fully connected) topologies are also used (fig. 2.) [19][21].

Fig. 2. Different topologies of a 3-state HMM. Top: linear model with only forward connections, and model with additional backward connections. Bottom: left-right and fully connected (ergodic) model [21].

4.2 HMMs for emotion speech recognition

In recent literature, there have been several attempts to create an automatic speech emotion recognizer with the use of HMMs. Many researchers claim that the HMM method, being a dynamic modeling method, can take into account how the prosody contours of a speech signal change over time and thus produce better results than static classifiers. The inputs to the HMMs for training and testing can be prosodic features taken at the phoneme, word, or utterance segmentation level. To demonstrate the HMMs' superiority, they test the created models on existing speech databases, or on databases of their own (acted or spontaneous speech), comparing the HMMs with static classifiers or with human perception of speech emotion.

Nwe et al. [1] ran experiments on discrete HMMs. They propose an alternative method of extracting features from the speech utterance, called log frequency power coefficients (LFPC) (12 values). These features were extracted from short acted utterances in the 6 archetypal emotions, obtained from 12 non-professional speakers. All 12-value feature vectors are then passed to a vector quantization algorithm, which matches each vector to a certain cluster. The final input is thus a vector of n values (each value corresponding to a certain cluster), where n is the number of segments the speech signal was divided into. With this data, the 6 emotions were modeled with HMMs of up to 8 states. An ergodic topology was chosen instead of a left-right structure, because of the assumption that the emotional cues contained in an utterance may not occur strictly sequentially. The best results were achieved with four-state HMMs, showing that an average accuracy of 77.1% and a best accuracy of 89% can be achieved in classifying the six basic emotions individually, better than the accuracy achieved by human assessment of speech emotion (65.8%).
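The vector quantization step described above can be sketched as a nearest-codeword lookup. The frame count, the 12-dimensional vectors, and the 4-entry codebook below are hypothetical, chosen only to show how real-valued feature vectors become the symbol sequence a discrete HMM consumes; in practice the codebook would be learned, e.g. with k-means, from training data.

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook
    entry (Euclidean distance), turning a sequence of real-valued
    vectors into a discrete symbol sequence."""
    # Pairwise distances: shape (n_frames, n_codewords)
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

# Hypothetical data: 5 frames of 12 LFPC-like values, 4 codewords
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 12))
codebook = rng.normal(size=(4, 12))

symbols = vector_quantize(features, codebook)  # one symbol per frame
```

The resulting `symbols` array is exactly the kind of observation sequence the discrete HMMs in this section are trained on.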

Pao et al. [22] presented 3 methods for categorizing emotions in Mandarin speech: linear discriminant analysis (LDA), the k-nearest neighbor (k-NN) decision rule, and Hidden Markov models (HMMs). Two speech databases of 5 emotions were made for this purpose. For the first one, twelve native (non-professional) Mandarin speakers were asked to generate the emotional utterances. For the second one, two professional Mandarin speakers were employed. First the utterances were passed through an emotion feature selection algorithm, where 6 types of features were selected: 20 Mel-frequency cepstral coefficients (MFCC), 12 linear prediction cepstral coefficients (LPCC), 16 linear predictive coding (LPC) coefficients, 16 perceptual linear prediction (PLP) coefficients, 16 log frequency power coefficients (LFPC), and jitter were extracted from each frame, creating a feature vector of 81 parameters. To compress the data and accelerate classification, two different vector quantization methods were applied. The data was then passed as input to the three classifiers for training and testing. As far as the HMMs are concerned, a 4-state discrete ergodic architecture was chosen, which achieved better performance than the left-right structure. The HMMs' state transition probabilities and output symbol probabilities were uniformly initialized. The HMM classifier did relatively better, achieving approximately 8% higher accuracy than LDA and 4% higher than k-NN in recognizing the 5 emotions, for both vector quantization methods and both speech databases (1st database: LDA 79.8%, 78.8%; k-NN 83.6%, 83.9%; HMMs 86.9%, 88.3%. 2nd database: LDA 79.9%, 78.1%; k-NN 84.2%, 83.7%; HMMs 88.3%, 88.7%, for vector quantization methods 1 and 2 respectively).

Wagner et al. [21] made an attempt at a systematic comparison of different HMM designs for speech emotion recognition. Experiments were carried out on acted speech (the Berlin database [24], with recordings of 10 actors and 6 emotional states plus a neutral one) and spontaneous speech (the German Aibo Emotion Corpus [23], which contains spontaneous emotions from 51 children with three emotional states), at the word and utterance segmentation levels. Different HMM configurations were tried, examining all combinations of the most common settings of the individual parameters. In more detail, discrete probability distributions of size 64 and 256 were tested, as well as continuous ones with 1, 4 and 8 Gaussian mixtures. The HMM topologies tested were the forward transition, left-right, forward-backward, and fully connected (ergodic) models. As far as the number of states is concerned, HMMs with 5, 10, 15 and 20 states were tested. A total of 120 HMMs were created based on the above parameters. As input to these models, a feature set of 13 MFCC coefficients was used along with their first and second derivatives, resulting in 39 features in total. All 120 HMMs were trained and tested for both segmentation types and on both databases. For word segmentation there was a 52.18% success rate in recognizing the speaker's emotional state for the AIBO database and 42.29% for the BERLIN database. For utterance segmentation the success rate was 50.85% and 61.36% for the AIBO and BERLIN databases respectively. From the tests it was concluded that HMMs modeled by continuous densities with 4 Gaussian mixtures and 5 to 10 states give the best results in classifying emotions. There seemed to be no improvement from ergodic architectures or more than 15 states. Moreover, the network design seemed to be independent of the type of speech and the segmentation level.
The network types stated above to do relatively better were compared with a static classifier (support vector machines, SVM) on the same databases. The SVM had as input a feature set of 1053 MFCC and 137 energy features. At the word level, HMMs were more than 10% better than the SVM approach for both databases (SVM: 43.7% for AIBO, 36.6% for BERLIN; HMM: 55.5% for AIBO, 48.6% for BERLIN). At the utterance level, HMMs were still slightly better for the AIBO database, but inferior for the BERLIN database (SVM: 51.2% for AIBO, 73.3% for BERLIN; HMM: 52.5% for AIBO, 69.5% for BERLIN).

Another case in which HMMs show better performance than static classifiers (SVMs and ANNs) can be found in [25], where Fernandez et al. used autoregressive HMMs and hidden Markov decision trees for speech emotion classification.

4.3 Advantages and disadvantages of HMMs and the static classifiers

From the literature it appears that, to an extent, HMMs can be superior to static classifiers (especially at the word level [21]) because the latter ignore how the temporal features of speech change over time [1][21][22]. But this is not always the case. Studies have shown that classifiers of long-term speech features may sometimes perform equally well or even better than HMMs [19][26]. One reason is that HMM-based classifiers are less applicable in applications with utterances of variable length, and perform significantly better in applications with short utterances [19][26][21]. A second reason is that static classification can exploit more feature types (e.g. suprasegmental acoustic features like jitter and shimmer, to measure voice quality), so that overall the static classifiers perform better. However, when the feature set is restricted to the same feature types, for example only MFCCs and energy, HMMs often outperform static modeling [19]. A third reason why HMMs may not always perform better is that it is difficult to set up general guidelines as to which kind of HMM network is best suited for a certain application [21]. Among the static classifiers, SVM is the most popular and most often successfully applied algorithm, so it can be considered a kind of standard [19]. Some even support the idea that static classification has been more prevalent than dynamic classification in work on emotion recognition [19], although the latter is thought to be advantageous in better capturing the temporal activity incorporated in speech [27][21].

In fact, a direct comparison of static and dynamic classification is difficult, since the same features cannot be used for both; it is thus hard to say whether the features were chosen more favorably or whether the classifier itself was really superior [19]. Moreover, the classification rates reported in the related literature are not directly comparable with each other, because they were measured on different data collections using different experimental protocols [10].

To overcome such problems, hybrid methods have been proposed in which HMM dynamic modeling is merged with static classifiers for better performance [11][3]. It has been observed, however, that although sophisticated classifiers do achieve higher recognition rates than simple ones, the performance is not much better in most cases [19].

Finally, despite the respective advantages and disadvantages of HMMs and static classifiers, the automatic emotion classification results of both types of classifiers are equal to or even better than human rating performance in perceiving emotion from speech [19], so they can be considered valuable tools for emotion recognition.

5 Conclusion

In this paper we described emotional speech recognition with the use of HMMs. The HMM, being a dynamic model, takes into account how the prosodic contours of speech change over time. As a result, an HMM may yield better results than static techniques under certain circumstances, but not always. To improve performance, hybrid methods combining both types of classifiers can be used. Moreover, experiments have to be conducted on the same speech databases in order to allow a direct comparison of the different methods proposed for the problem of speech emotion recognition.