Speech occurs in communication between humans. According to information theory, speech can be represented in terms of its message content, or information, and the speech signal can be represented in digital form. Signals are usually processed in a digital representation. After the signal is acquired, it is analyzed, leading to a display of the samples in terms of time, amplitude, and frequency.
Speech processing is the process by which speech signals are interpreted, understood, and acted upon. It is important in the field of computer science, including the artificial intelligence industry, where attempts are made to transfer features of human thought into the design of machines. Speech recognition is one of the most important parts of speech processing, because the goal of processing speech is to understand and act on human spoken language.
There are various methods used for the identification of different types of sounds, such as human speech (speech recognition), the human voice (voice recognition), and the noise generated by certain objects (sound recognition). Although the methods differ, the purpose is the same: to process, identify, and classify sounds. In the field of speech recognition, accessible classifiers include the Hidden Markov Model (HMM), Artificial Neural Network (ANN), k-nearest neighbors (KNN), Gaussian Mixture Model (GMM), and Support Vector Machine (SVM).
1.1 Background Issues
In the speech recognition area, the focus is on capturing the human voice, or spoken words, as a digital sound wave and translating it into a computer-readable format. This is known as speech-to-text (STT) technology. On the other hand, there is also text-to-speech (TTS) technology, where written words are converted into voice output similar to human speech using speech synthesis techniques. Regardless of the particular study, the final purpose of any TTS system is the generation of perfectly natural synthetic speech from any input text.
Speech recognition has emerged as an important technology in the context of human-computer interaction. Human speech conveys a range of emotional states, such as happiness, sadness, fear, anger, and a neutral (unemotional) state. Although it is an intensively studied domain, its language dependency makes it less accessible for most languages. Interest in methods for incorporating emotions into machines has grown, for example in animated characters in e-learning systems, avatars in virtual environments or computer games, automatic speech services, and animated agents that can interact with a user in a natural way using gesture, facial, and speech expression.
There are speech features that contain the emotional information in a speech signal and can be used for speech emotion recognition, namely spectral features and prosodic features. Spectral features commonly used in feature extraction include linear predictive cepstral coefficients (LPCC) and mel-frequency cepstral coefficients (MFCC).
To characterize the different emotions, prosodic features such as speech intensity, glottal parameters, fundamental frequency, pitch, and loudness have been used.
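A prosodic feature such as the fundamental frequency can be estimated directly from the waveform. The following is only an illustrative sketch, not the front end of the system described here: the function name, parameters, and the synthetic 200 Hz tone are invented for the example, which estimates F0 of one voiced frame from its autocorrelation peak.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=50.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) from the autocorrelation peak."""
    frame = frame - frame.mean()
    # Keep only non-negative lags of the autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)     # plausible pitch-period lags
    lag = lo + np.argmax(ac[lo:hi])             # strongest periodicity in range
    return sr / lag

sr = 16000
t = np.arange(int(0.05 * sr)) / sr              # one 50 ms frame
frame = np.sin(2 * np.pi * 200 * t)             # synthetic 200 Hz "voiced" frame
print(estimate_f0(frame, sr))                   # close to 200 Hz
```

Real pitch trackers add voicing decisions and smoothing across frames; this sketch only shows the core periodicity measurement.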
Previous research has found that it is necessary to record an accurate emotional speech database, because the precision of the system depends heavily on the emotional speech database used. A study of speech signal processing is carried out here to obtain more detailed findings on the presence of emotion in speech signals. The Hidden Markov Model (HMM) technique will be used to recognize the different emotions that occur in a particular speech signal.
1.2 Problem Statement
Although an intensively studied domain, the language dependency of some emotion recognition systems makes them less accessible for most languages.
Furthermore, identifying the specific features in the speech signal that carry emotional information remains a frequently explored topic.
The recognition rate of emotion in a speech signal fluctuates depending on the features used in the experiment, as well as on the emotional speech database itself.
Many problems occur in speech synthesis. Several arise in text pre-processing, such as the handling of numerals, acronyms, and abbreviations. One of the major problems today is deriving proper pronunciation and prosody from written text: written text carries no explicit marking of emotion, and proper and foreign names are sometimes pronounced incorrectly. Contextual effects and discontinuities in waveform concatenation techniques are the most troublesome issues in low-level synthesis. Difficulties with female and child voices have also been found in speech synthesis.
One other problem that is a main focus of the international research community is the prosodic enhancement of synthetic speech. The output of most speech synthesizers still has a monotonous, unattractive intonation contour. This problem is usually addressed by modeling the fundamental frequency contour and controlling its parameters in a deterministic or statistical manner.
Most contour modeling or parameterization techniques are based on extended speech corpora and manual annotation of the intonation. Other solutions are language-dependent methods involving accent patterns or phrasing. Adapting these solutions to under-resourced languages is unfortunately impractical and hard to achieve.
This project is carried out to study how HMM-based speech synthesis can affect the intonation of written text when converting it to speech. Thus, the problem statement for the project is:
"How can speech synthesis using the HMM technique clarify the emotion, intonation, and pronunciation of written text?"
To address the above problem statement, answers to the following questions should be sought.
i. How can text samples be obtained along with their emotion, so that a TTS application can convey the right emotion, intonation, and pronunciation of the written text?
ii. How does the HMM technique translate written text into speech with the right emotion, intonation, and pronunciation?
There are three main objectives to be achieved from the study of emotion recognition in speech signal using HMM technique:
To analyze the efficiency of the HMM method in recognizing the occurrence of emotion in a particular speech signal.
To improve the recognition rate of emotion in speech signals using HMMs together with a reliable emotional speech database.
To observe the processes involved in the HMM technique for recognizing emotion in speech signals, and which speech signal features can be used to differentiate between several emotional states: happiness, sadness, fear, anger, and neutral (unemotional).
The goal of this project is to conduct a study of emotion recognition in speech signals using an HMM-based approach, in order to achieve the objectives stated above.
There have been many studies on emotion recognition in speech signals. Valery, in previous research, conducted two experimental studies on the recognition and vocal expression of emotion. The first study used 700 short utterances spoken by thirty non-professional actors expressing five emotional states: happiness, sadness, fear, anger, and normal (unemotional). A backpropagation training technique was used in the research. The recognizers achieved the following accuracy for each emotional state: happiness 60-70%, sadness 70-85%, fear 35-55%, anger 70-80%, and normal (unemotional) 60-75%, with an overall average accuracy of about 70%. This study examined how well both computers and humans recognize emotions in speech.
Another study was done by Albino et al., in which an approach to emotion recognition using RAMSES, the UPC's speech recognition system, was applied. The approach is based on standard speech recognition technology using hidden semi-continuous Markov models. The selection of low-level features and the design of the recognition system were addressed. The accuracy in recognizing seven different emotions (the six defined in MPEG-4 plus a neutral style) exceeds 80% using the best combination of low-level features and HMM structure. This result is very similar to that obtained with the same database in a subjective evaluation by human judges.
In addition, Chih-Yung et al. presented an approach to automatically synthesize the emotional speech of a target speaker based on a hidden Markov model of his or her neutral speech. Model interpolation between the neutral model of the target speaker and an emotional model selected from a set of candidates was used, and both the interpolation model selection and the interpolation weight computation were determined by a model distance measure. They proposed a monophone-based Mahalanobis distance (MBMD) and evaluated the synthesized emotional speech of anger, happiness, and sadness with several subjective tests. Experimental results show that the implemented system is able to synthesize speech with the emotional expressiveness of the target speaker.
4.1 Data Gathering
In this phase, speech data will be collected from the target speaker along with the speaker's emotion in various states, i.e., happiness, sadness, fear, anger, and normal (unemotional). These speech data will then be stored in a database for the preprocessing phase. The data collection process must also consider how many speakers and speech samples are needed before proceeding to the next phase. As mentioned above, it is necessary to record a precise emotional speech database, because the precision of the system depends heavily on the emotional speech database used.
In general, speech coding can be considered a particular specialty within the broader field of speech processing, which also includes speech analysis and speech recognition. The purpose of a speech coder is to convert an analogue speech signal into digital form for efficient transmission or storage, and to convert a received digital signal back to analogue. Figure 4.1 shows the flow of converting an analogue signal to a digital representation using an encoder and decoder.
Figure 4.1: Speech coding block diagram - encoder and decoder.
Moreover, Lawrence et al. state that most modern A-to-D converters function by sampling at a very high rate, applying a digital low-pass filter with its cutoff set to preserve a prescribed bandwidth, and then reducing the sampling rate to the desired rate, which can be as low as twice the cutoff frequency of the sharp-cutoff digital filter. The goal of speech coding is to compress the digital waveform representation of speech into a lower bit-rate representation.
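The filter-then-decimate flow described above can be sketched as follows. This is a minimal windowed-sinc illustration with invented parameters (48 kHz input decimated to 16 kHz), not a production A-to-D converter design.

```python
import numpy as np

def downsample(x, factor, numtaps=101):
    """Low-pass filter (windowed-sinc FIR), then keep every `factor`-th sample."""
    cutoff = 0.5 / factor                       # new Nyquist, normalized to input rate
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * n)    # ideal low-pass impulse response
    h *= np.hamming(numtaps)                    # window to reduce ripple
    h /= h.sum()                                # unity gain at DC
    y = np.convolve(x, h, mode="same")          # filter first...
    return y[::factor]                          # ...then reduce the sampling rate

sr_in, factor = 48000, 3                        # e.g. 48 kHz down to 16 kHz
t = np.arange(sr_in) / sr_in
x = np.sin(2 * np.pi * 440 * t)                 # 1 s, 440 Hz tone (well below 8 kHz cutoff)
y = downsample(x, factor)
print(len(y))                                   # 16000 samples, i.e. one second at 16 kHz
```

The filtering step before decimation is what prevents frequencies above the new Nyquist rate from aliasing into the reduced-rate signal.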
Feature extraction is an important process in speech emotion recognition because it carries the information about emotion from the speech signal. Mel-frequency cepstral coefficients (MFCC) are the spectral features that will be used as the feature extraction technique in this study.
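As an illustration of the MFCC pipeline (framing, power spectrum, mel filterbank, log compression, DCT), the following simplified NumPy sketch is a rough stand-in for a real toolkit such as librosa or HTK; the window sizes and filter counts are common defaults, not values fixed by this study.

```python
import numpy as np

def hz_to_mel(f):  return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m):  return 700 * (10 ** (m / 2595) - 1)

def mfcc(signal, sr, n_fft=512, n_mels=26, n_ceps=13):
    # Frame the signal into 25 ms windows with a 10 ms hop.
    flen, hop = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + (len(signal) - flen) // hop
    frames = np.stack([signal[i * hop:i * hop + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)    # log energy in each mel band
    # DCT-II to decorrelate the bands, keeping the first n_ceps coefficients.
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2 * n_mels)))
    return logmel @ dct.T                       # shape: (frames, n_ceps)

sr = 16000
sig = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)  # 1 s synthetic tone as a stand-in
feats = mfcc(sig, sr)
print(feats.shape)                                  # one 13-dimensional vector per frame
```

Each row is one frame's feature vector; in an HMM-based recognizer these per-frame vectors form the observation sequence.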
4.3 Training and Testing
In line with the selected research topic, an approach using the Hidden Markov Model is proposed as the emotion classifier for the training and testing phase. Statistical parametric speech emotion recognition based on hidden Markov models (HMMs) has expanded rapidly and gained attention among researchers over the last few years.
This system simultaneously models spectrum, excitation, and duration of speech using context-dependent HMMs and generates speech waveforms from the HMMs themselves . The block diagram of the HMM-based speech synthesis system is shown in Figure 4.2.
Figure 4.2: Overview of HMM-based speech synthesis system .
Hidden Markov Models (HMMs) have been used successfully in speech recognition for many years. HMMs can be used for recognizing isolated and connected words by constructing an HMM capable of generating an unlimited sequence of words from the library. Refer to Figure 4.3 for an illustration of the HMM structure.
Figure 4.3 : Sample of HMM structure for word recognition .
An HMM models a stochastic process defined by a set of states and transition probabilities between those states, where each state describes a stationary stochastic process and the transition from one state to another describes how the process changes its characteristics in time. Further research on the chosen technique will be conducted.
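The classification idea can be sketched with the forward algorithm: score an observation sequence under one HMM per emotion and pick the model with the highest likelihood. The toy discrete models and all their numbers below are invented purely for illustration; a real system would train continuous-density HMMs on MFCC vectors (e.g. with a toolkit such as HTK or hmmlearn).

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """log P(obs | model) for a discrete HMM (pi: initial-state probabilities,
    A: state transitions, B: emissions), via the scaled forward recursion."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()                 # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]           # propagate, then emit
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return loglik

# Two toy 2-state models over a 3-symbol alphabet of quantized features.
models = {
    "anger":   (np.array([0.8, 0.2]),
                np.array([[0.9, 0.1], [0.2, 0.8]]),
                np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])),
    "sadness": (np.array([0.5, 0.5]),
                np.array([[0.6, 0.4], [0.4, 0.6]]),
                np.array([[0.1, 0.6, 0.3], [0.3, 0.4, 0.3]])),
}
obs = [0, 0, 2, 2, 0]                           # a short quantized feature sequence
best = max(models, key=lambda m: forward_log_likelihood(obs, *models[m]))
print(best)                                     # the model that best explains obs
```

Training (Baum-Welch) would estimate `pi`, `A`, and `B` from labeled emotional speech; recognition then reduces to this per-model likelihood comparison.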
This research is expected to achieve a high recognition rate for the several emotions that exist in speech signals in different states, such as happiness, sadness, fear, anger, and neutral (unemotional), using the HMM technique. An initial experiment carried out to understand the HMM process with selected MFCC features showed that the neutral (unemotional) state in a given standard speech database has a recognition rate of about 90%. Moreover, this study will analyze the efficiency of the HMM method in recognizing the occurrence of emotion in a particular speech signal, and improve the recognition rate of emotion in speech signals using HMMs together with a reliable emotional speech database.
Importance and Justification
This study focuses on emotion recognition in speech signals using an HMM-based approach, in order to analyze the efficiency of the HMM method in recognizing the occurrence of emotion in a particular speech signal. The results of the analysis can be used to improve the recognition rate of the HMM by using a reliable emotional speech database. Furthermore, this study observes the processes involved in the HMM technique for recognizing emotion in speech signals, and which speech signal features can be used to differentiate between several emotional states: happiness, sadness, fear, anger, and neutral (unemotional). An intelligible text-to-speech program allows people with reading disabilities or visual impairments to listen to written works on a computer. Further studies will be able to improve the computer's usability for the visually impaired.
Eventually, speech synthesis is a technology that has revolutionized how people communicate. It gives the world an opportunity to hear the thoughts of brilliant individuals who would otherwise have been voiceless. A text-to-speech application can give people an opportunity to hear text, which is especially helpful in situations where reading is obtrusive or impossible. Adding emotion to such an application will make the speech more natural and closer to human speech. Advances in this area have dramatically improved the computer's usability for the visually impaired in multiple languages all over the world, bringing synthetic output closer to human speech with the correct emotion and intonation.