An Emotional Speech Synthesis Module Computer Science Essay


A Dutch emotional text-to-speech (TTS) synthesis module is designed by using concatenative speech synthesis technology. First of all, a Dutch diphone database with a limited number of diphones is built. Each diphone is recorded in 3 vocal efforts.

A demo emotional TTS synthesis module is a C++ implementation using this database. It is capable of processing a limited number of sentences. An interview has been conducted to evaluate this demo. According to the result of the interview, some suggestions are formulated to improve the

The goal of the thesis is to design an emotional speech synthesis module based on a diphone database recorded with 3 vocal efforts. This speech synthesis module can portray emotions along three dimensions (see section 2.2). Text-to-speech synthesizers (TTS, see section 2.1) convert plain text in a natural language into speech. They should have some degree of intelligibility, meaning that the speech produced by a synthesizer should always be recognizable, both in the meaning of the speech and, for an emotional speech synthesizer, in its emotional state. Diphone speech synthesis is used in this thesis. A comparison between diphone speech synthesis and unit selection speech synthesis is given in section 4.1. To build an emotional TTS system, it is common to add emotional modules on top of a traditional TTS system.

Figure 1.1: An overview of the workflow of the emotional modules

Figure 1.1 gives an overview of the workflow of the emotional modules. First, a list of phonemes and the corresponding acoustic parameters of each phoneme (duration and F0 targets, see section 2.4) is fed into the acoustic parameter modification module (section 3). The MBROLA [12] synthesizer then synthesizes the diphone signals from the diphone databases (section 4) and generates wave files. Finally, the DSP module processes these wave files with signal processing algorithms, which are described in section 5. Section 6 describes the GUI and the algorithm that calculates the spectrogram of the output speech signals. In section 7 the evaluations of the user tests are discussed. Section 8 formulates some suggestions for further implementation and improvement.

2. Background

2.1 Text-To-Speech synthesizer

General purpose state-of-the-art diphone TTS systems consist of an NLP (natural language processing) module, which converts the input text into a list of phonemes and the corresponding parameters for each phoneme, and a DSP module, which converts the output of the NLP module into a speech signal, e.g. a wave file. To build an emotional TTS system, the decision was made in this thesis to bypass the implementation of the traditional NLP and DSP modules by using an MBROLA [12] PHO file as input and an external MBROLA binary as the DSP synthesizer.

2.2 Dimensional expression of emotions

As suggested by [2a], a dimensional description of emotions is a useful representation. It captures conceptually important aspects across different methodologies, and provides a means of measuring similarity between emotional states. The emotion dimensions are named Activation, Evaluation, and Power. Chapter 2.6.1 of [2a] gives the origin of the dimensional expression of emotions [1], and suggests a 3-dimensional description of emotions obtained from factor analysis. Nevertheless, the dimensions from a factor analysis are not sufficient to describe all emotions. Some error is also introduced by scaling adjective scales, because they are perceptions.

Dimensional expression

The emotion dimensions that were used also have a link to the "real world" [2a]:

Activation: A simplified representation of action tendencies, such as attack or fight patterns.

Evaluation: A simplified representation of appraisal of a stimulus, which determines the significance of the stimulus for the individual.

Power: Power is also called dominance.

2.3 Vocal effort

Vocal effort is a subjective psychological quantity. When the distance between the listener and the speaker is large, the speaker speaks louder so that the listener can hear, which results in a high vocal effort. When the distance is small, the speaker speaks more softly, which results in a low vocal effort. The interpolation algorithm described in section 5 can interpolate between 2 vocal efforts, so in theory only 2 vocal efforts (high and low) are needed as sources. The problem is that this algorithm can only interpolate the vocal tract model (see section 5); interpolating the residuals at the same time results in quite noisy speech. Consequently, an ideal intermediate level of vocal effort cannot be obtained, because the residuals are always those of one of the source vocal efforts. By adding a medium vocal effort as a third source, a better intermediate level can be obtained by interpolating between the low and medium vocal efforts or between the medium and high vocal efforts. Hence, the decision was made to record three levels of vocal effort: low, medium, and high.

2.4 MBROLA speech synthesizer

MBROLA [12] is a diphone speech synthesizer (the reason for using a diphone synthesizer is given in section 4). Given a phoneme list and some acoustic parameters (durations, F0 targets), the MBROLA synthesizer extracts diphones from the database and concatenates them into synthesized speech using a time-domain algorithm called MBR-PSOLA. The acoustic parameters (phoneme durations and F0 targets) and the phoneme list are usually stored in an MBROLA PHO file. The MBROLA PHO file is organized as follows:

Phoneme duration target1, target2,...

Phoneme duration target1, target2,...


For example, the word "Hans" can be inputted into the PHO file using FONILEX [13] phonetics as

h 58

A 58 (33,148)

n 89 (84,180) (100,127)

s 172

The first field is the name of the specified phoneme. The second field is the phoneme duration, indicating how many milliseconds will be used to pronounce the phoneme. The remaining fields are grouped in pairs and each pair is in a bracket. These pairs of fields are called targets. They are optional. A target specifies the fundamental frequency of an utterance at a certain moment in time. A target is composed of a position of pitch value and pitch value. The field "pitch value" is expressed in Hz. It refers to the fundamental frequency around this target. The field "position of pitch value" is a percentage value, the length of time from the onset of a phoneme to the target divided by the duration of the phoneme.

For example, if the nth phoneme has the target (pn, F0n), and the duration of the ith phoneme is di, then the position of the target (pn, F0n) on the time axis is

$$t_n = \sum_{i=1}^{n-1} d_i + \frac{p_n}{100}\, d_n$$
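The timing rule can be illustrated with a small Python sketch that parses the "Hans" example and converts each target to an absolute time: the durations of all preceding phonemes, plus the percentage position applied to the duration of the current phoneme. The function names `parse_pho` and `target_times` are hypothetical helpers, not part of MBROLA, and the parser only handles the simple layout shown above (treating lines that start with ";" as comments is an assumption).

```python
import re


def parse_pho(text):
    """Parse PHO lines into (phoneme, duration_ms, [(position_pct, f0_hz), ...])."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):  # assumption: ';' starts a comment
            continue
        # Split on whitespace, brackets and commas: name, duration, then pairs.
        name, duration, *rest = re.findall(r"[^\s(),]+", line)
        values = [float(v) for v in rest]
        targets = list(zip(values[0::2], values[1::2]))
        entries.append((name, float(duration), targets))
    return entries


def target_times(entries):
    """Return (time_ms, f0_hz) for every target, measured from the utterance start."""
    times, onset = [], 0.0
    for _name, duration, targets in entries:
        for position_pct, f0 in targets:
            times.append((onset + position_pct / 100.0 * duration, f0))
        onset += duration  # the next phoneme starts where this one ends
    return times


HANS = """
h 58
A 58 (33,148)
n 89 (84,180) (100,127)
s 172
"""

entries = parse_pho(HANS)
times = target_times(entries)
```

For the "Hans" example, the single target of /A/ lies at 58 + 0.33 * 58 = 77.14 ms, and the last target of /n/ lies at 58 + 58 + 89 = 205 ms.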

3. Acoustic parameter modification

3.1 Some definitions

F0 curve: The F0 curve shows how the fundamental frequency of speech evolves with time.

ToDI tones: ToDI [9] stands for Transcription of Dutch Intonation. It is used to annotate the prosody of Dutch speech. Typical pitch accents are H*L and L*H, meaning a high fall from the accented syllable and a low rise from the accented syllable, respectively. The symbols H*, L*, H, and L denote high accent, low accent, upward movement after L*, and downward movement after H*, respectively [9].

Declination effect: According to Dutoit's book [1], "In many languages, F0 curves tend to evolve around or between average values that decrease with time." Mathematically, this declination effect is expressed with two declination lines: topline and baseline. The topline is the linear regression of the local maxima of the F0 curve, while the baseline is the linear regression of the local minima of the F0 curve [1].

Syllable: A syllable is a unit of organization for a sequence of speech sounds. A syllable can be made up of an onset, a nucleus, and a coda. A syllable must have a nucleus, which is in most cases a vowel. Sometimes the nucleus can also be a syllabic consonant, like /m/, /n/, /l/, etc.

Accented syllable: In linguistics, an accented syllable is the relatively emphasized syllable in a word, or in certain words of a phrase or sentence.

3.2 Preparation for this module

The goal for this step is to make a raw MBROLA PHO file for the neutral default voice. This PHO file will later be used as the input to the post processing of the acoustic parameters module to add some emotions. This emotion is dimensionally described.

Firstly, by looking up a Flemish Dutch lexicon database FONILEX [13], a sentence is translated to its phonetic transcription. Simply combining these phonetic representations of words neglects some inter-word rules such as assimilation. For example, in the English sentence "don't be silly", /n/ and /t/ in "don't" are assimilated to /m/ and /p/ by the following /b/ [12]. So this translation is not complete.

In order to get a good phonetic transcription of the sentence, as well as to get a "perfect" acoustic description (duration, fundamental frequency) of the speech signals for MBROLA, a recording of this sentence spoken by a real person is analyzed using Praat [14]. Praat is a program for speech analysis (spectral analysis, pitch analysis, formant analysis, voice breaks, etc.). By watching spectrogram and formant plots from Praat, the sentence is segmented into signal fragments of phonemes. By comparing the previous phonetic transcription and the signal fragments of phonemes, most of the phonemes are identified.

In this way, we get a list of phonemes. The duration of each phoneme is obtained from the length of the corresponding signal fragment. Then, by inspecting the fundamental frequency plots in Praat, every vowel or syllabic consonant [11] is marked with a target, since these are potential nuclei of accented syllables [11]. The fundamental frequency values are read from Praat. The targets are placed at the 50% position relative to the duration of the corresponding vowel. There are two reasons to keep the targets of every vowel. Firstly, the nucleus of a syllable is most often a vowel, and an accented syllable is a syllable. The target points of the nuclei of accented syllables ("*" targets) are used in the post processing module of the acoustic parameters to determine the "+" target from the accent slope, according to chapter 12.3.8 of [2b]. The "+" target indicates where a rising/falling tone ends. Secondly, many target points are needed to calculate the "baseline" and "topline". Fundamental frequency is only meaningful for voiced signals. Since vowels and syllabic consonants are voiced, their targets can be used as input to the linear regression.

Once the phonemes, durations, and targets are identified for the sentence, a MBROLA PHO file is generated accordingly. It then serves as an input to the post processing module of the acoustic parameters.

3.3 Duration modification part

This part modifies the duration of each category of phonemes (such as vowels, liquids, etc.) and it's implemented in the application. The new duration equals the original duration multiplied by a ratio (See Table 1). For a certain emotion, each category of phonemes has a fixed ratio defined by Activation, Evaluation, and Power. This ratio differs with different categories of phonemes (See Table 1).

3.4 F0 curve modification part

The F0 curve tracks the fundamental frequency of the speech signals over time. Our application implements emotion modification by applying rules which modify the list of targets, and thus change the shape of the F0 curve. The data in Table 1 describe these rules. The data come from the German emotional TTS research of Schroeder (2004), chapter 13 [2b], where they were fine-tuned to adapt MARY TTS (a German emotional TTS synthesizer) for a male NECA voice [2b]. We reuse them because of the complexity of running an experiment to obtain such data ourselves, and the lack of information about Dutch emotional TTS systems. The database we use is quite similar to the male NECA voice in terms of the gender of the speaker, the use of a diphone database, and the recording of 3 vocal efforts. The difference is the language. We implemented a Dutch emotional TTS system using these data and discuss in section 7 (Evaluations and Discussion) whether the data in Table 1 are applicable independently of the language. Suggestions for further implementation and improvement are given in section 8.

The following parameters are implemented in our application (Also see Table 1):

Pitch: Mean value of the ToDI baseline [2b]. Once the original baseline is obtained by applying linear regression, we raise/drop the constant term of the regression curve to fit the mean value of the baseline.

Range: Mean value of distance between topline and baseline [2b]. The constant term of the topline regression curve is modified to fit the range value.

Pitch-dynamics and range-dynamics: Slope of baseline and slope of distance between topline and baseline [2b]. The linear terms of baseline and topline are modified.

Preferred-accent-shape: The rough type of accent tones on accented syllables [2b]. Rising shape means most of the accented syllables are on the baseline, falling means that most of the accented syllables are on the topline, and alternating means the location of accented syllables is alternating.

Durations of categories of phonemes: The durations of voiced phonemes have more correlation with Evaluation and Power, while unvoiced phonemes (plosives and fricatives) have more correlation with Activation and Evaluation.

Rate: The speech rate, i.e. how fast the speech is spoken.

Accent slope: The rising/declining rate of F0 on the F0 curve around the nucleus of each accented syllable. The new accent slope is the original slope multiplied by a ratio defined by Table 1.

Volume: The volume of the speech.
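The Pitch rule above (fit the baseline, then change only its constant term) can be sketched in Python as follows. This is an illustration under assumptions: the targets are made-up numbers, the local minima and maxima are found by comparing each interior target with its two neighbours, and the mean of the baseline is evaluated at the times of the local minima. None of these details is confirmed by the thesis, and the function names are hypothetical.

```python
def linear_fit(points):
    """Least-squares line through (time, f0) points; returns (intercept, slope)."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_f = sum(f for _, f in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (f - mean_f) for t, f in points)
    slope = sxy / sxx
    return mean_f - slope * mean_t, slope


def local_extrema(points):
    """Split the interior targets into local minima and local maxima."""
    minima, maxima = [], []
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        if cur[1] < prev[1] and cur[1] < nxt[1]:
            minima.append(cur)
        elif cur[1] > prev[1] and cur[1] > nxt[1]:
            maxima.append(cur)
    return minima, maxima


def shift_baseline_mean(intercept, slope, times, new_mean):
    """Change only the constant term so that the baseline mean equals new_mean."""
    old_mean = sum(intercept + slope * t for t in times) / len(times)
    return intercept + (new_mean - old_mean), slope


# Hypothetical (time in ms, F0 in Hz) targets with a falling trend.
targets = [(0, 120), (100, 150), (200, 115), (300, 145),
           (400, 110), (500, 140), (600, 105)]

minima, maxima = local_extrema(targets)
base_c, base_m = linear_fit(minima)   # baseline: regression of local minima
top_c, top_m = linear_fit(maxima)     # topline: regression of local maxima
new_c, new_m = shift_baseline_mean(base_c, base_m, [t for t, _ in minima], 122.5)
```

Only the intercept changes in `shift_baseline_mean`; the slope, which the Pitch-dynamics parameter controls, is left untouched.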




[Table 1: table body not recoverable; surviving column header: "Prosodic Parameter"]
Table 1: The numeric data fields represent the linear coefficients quantifying the effect of the given emotion dimension on the acoustic parameter, i.e. the change from the neutral default value. As an example, the value 0.5% linking Activation to accent prominence means that for an activation level of +50, accent prominence increases by 25%, while for an activation level of -30, accent prominence decreases by 15% [2b] (Table 13.1).
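The arithmetic in the caption can be written out as a short Python sketch. Only the 0.5% Activation coefficient for accent prominence is taken from the caption; the zero coefficients for Evaluation and Power, and the neutral duration of 58 ms, are placeholders chosen purely for illustration.

```python
def relative_change(coefficients, activation, evaluation, power):
    """Percentage change from the neutral value for one acoustic parameter."""
    c_a, c_e, c_p = coefficients
    return c_a * activation + c_e * evaluation + c_p * power


def apply_change(neutral_value, change_percent):
    """Scale a neutral value by the given percentage change."""
    return neutral_value * (1.0 + change_percent / 100.0)


# 0.5 (% per unit of Activation) is from the caption; the zeros are placeholders.
accent_prominence = (0.5, 0.0, 0.0)

change_up = relative_change(accent_prominence, 50, 0, 0)     # +25 percent
change_down = relative_change(accent_prominence, -30, 0, 0)  # -15 percent

# The same rule applied to a hypothetical neutral duration of 58 ms.
longer = apply_change(58.0, change_up)
```

The duration ratios of section 3.3 follow the same pattern: the ratio is 1 plus the percentage change divided by 100.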

The application modifies the PHO file according to the rules described in Table 1. The MBROLA synthesizer reads the PHO file and produces 3 speech signals (wave files) by extracting diphones of different vocal efforts from the database. As mentioned in section 2.3, the vocal efforts are low, medium, and high.

4. Emotional diphones database

To achieve the quality of intelligibility and naturalness of the synthesized speech, we should use the concatenative speech synthesis technology. The basic idea is to concatenate prerecorded human voice signals and produce a new speech. A general purpose state-of-the-art concatenative synthesizer could be implemented as a diphone synthesizer or a unit selection synthesizer. In this section, the reason for using diphone synthesis technology and a procedure to build an emotional diphones database using MBROLA will be given.

4.1 Comparison of different database types

The two primary technologies for generating databases of synthetic speech are based on unit selection and on diphones.

A unit selection synthesizer produces very natural sound, but on the other hand, a unit selection database is usually very large. This type of database normally contains different types of prerecorded audio information (called 'units' below): syllables, morphemes, words, phrases and sentences. In the ideal case, if the unit selection database is big enough, all sentences can be perfectly reproduced, with high clarity and naturalness. A rough calculation shows that building a unit selection database for the Dutch language requires 850,000 units. Such a database takes 10GB if it is properly compressed.

Compared with unit selection synthesis, diphone synthesis uses less memory. Only diphones are stored in a diphone database. There are 55 FONILEX [13] phonemes, so roughly speaking, the maximal number of diphones is 55*55=3025. A diphone doesn't require many speech samples, usually about 1k of memory. So, the size of a diphone database is in the order of megabytes. The definition of a diphone is given by:

"Diphones are speech units that begin in the middle of the stable state of a phone and end in the middle of the following one" [12].

Figure 4.1 shows the diphone /ne/.

Figure 4.1: Diphone /ne/ is composed of part of /n/ and part of /e/

Recall the example of /hAns/, while the phoneme list is /h/ /A/ /n/ /s/, the diphone sequence is

/0h/ /hA/ /An/ /ns/ /s0/,

where 0 (zero) stands for no voice, and /0h/ starts from no voice. The word /hAns/ is concatenated with these 5 diphones as is shown above.
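The mapping from a phoneme list to a diphone sequence can be sketched in a few lines of Python. The function name `to_diphones` is a hypothetical helper; it pads the phoneme list with the silence symbol 0 on both sides and pairs every phoneme with its successor.

```python
def to_diphones(phonemes, silence="0"):
    """Turn a phoneme list into the diphone sequence, padded with silence."""
    padded = [silence, *phonemes, silence]
    return [left + right for left, right in zip(padded, padded[1:])]


hans = to_diphones(["h", "A", "n", "s"])
# hans == ["0h", "hA", "An", "ns", "s0"]
```

A word with n phonemes therefore always needs n + 1 diphones.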

To draw a conclusion, by using diphone speech synthesis, far less memory is needed. Nevertheless, the speech produced by a diphone synthesizer usually sounds less natural, because more signal processing is employed.

MBROLA is a very popular diphone synthesizer. Speech tempo and rising/falling of tones can be easily controlled. The MBROLA synthesizer is used in our application.

4.2 Building an emotional diphone database

Three steps are required to build such a database: preparing a Text corpus, recording the corpus, and segmenting the recorded corpus. [12]

The demo version of the Dutch emotional TTS can only synthesize 5 sentences, because recording all the diphones is impossible in this thesis: too much time is needed for the segmentation step. An incomplete database is built for diphones that have occurred in 5 sentences:

Omdat de bus te laat was heeft u dus tien minuten moeten wachten.

Hans had te veel gedronken die avond.

Geef de soep eens door naar de andere kant van de tafel.

Zaterdag morgen gaat zij altijd dansen.

Het optreden heeft ongeveer een uur geduurd.

They were just 5 random sentences.

4.2.1 Prepare a Text corpus

Once the phoneme list is known (discussed in section 3.2), a diphone list is immediately obtained, referring to the /hAns/ example in section 4.1.

The diphone list contains 190 diphones covering all the possible diphones that occur in the chosen sentences.

Each diphone is recorded twice, and the more monotonous of the two recordings is chosen as the source for the database.

380 words are chosen from the FONILEX [13] database, each of which contains at least one required diphone. These words were read out as monotonously as possible by a native Dutch speaking person.

4.2.2 Recording

The second step is to record the words from the word list. Preferably, a professional speaker reads the words with the most monotonic intonation possible in a sound-proof recording studio. The pitch of the diphones should be monotonous, as required by the MBR-PSOLA [1] algorithm, to avoid phase mismatch in the time-domain overlap-add, and ambient noise, especially low frequency noise, is very annoying in the recording.

First, a monotonous word, synthesized using another database, is played as a prompt. Then the Dutch speaking volunteer repeats the word as he heard it [2d]. The vocal effort is controlled by adjusting the volume of the prompt: for high vocal effort the volume is increased, and vice versa [2d]. Three vocal efforts are recorded: low, medium, and high.

The microphone used for the recording was a JVC Gumy HA-F120. The sampling frequency was 16000 Hz, with 16 bits per sample, as suggested by the MBROLA documentation. The recording was made in a quiet room at midnight. Nevertheless, we did not have a sound-proof room, which limits the quality of the recording.

Because the recording environment and the recording equipment are not ideal, a possible backup is made. The synthesized voice of an electronic dictionary produces 3 types of voices that have effects that are very similar to the effects of 3 different vocal efforts. The speech signals from the words were chosen as the source of the second database.

The discussion and comparison of both approaches are shown in section 4.2.4.

4.2.3 Segment the recorded Word List

The next step is segmenting the recordings and extracting the diphones from the words using the visualization tool Praat.

Praat Objects [14] is one of the most promising tools for segmenting Diphones. A segmentation of the recordings is needed to find the meaningful section of each diphone, i.e. where a diphone starts and where it ends. Figure 4.2 shows the Praat interface that provides information about pitch (blue points), intensity (yellow line) and formants (red points).

Figure 4.2: The spectrogram and time domain waveform

MBROLA needs sound files providing speech signals for diphones, and a SEG (segmentation) file providing the location of segmentation points. It has several rows. Each row contains the name of a diphone, the file name of the corresponding wave file (recording of the word), the position where the diphone starts, the position where the diphone ends, and the position of the boundary between two phonemes that constitute the diphone, as shown in Figure 4.3.

Figure 4.3: The SEG file

4.2.4 Generating diphones database & test

By following the instructions of the MBROLA tools, two databases are produced:

A database with recordings from a Dutch speaking person.

A database with recordings from synthesized words. The words are also synthesized in three modes.

Then the 5 chosen sentences are synthesized using the two databases. Two sets of speech signals are obtained, each corresponding to one of the databases. As the quality of the first database (recordings of a person) is poor, approximately half of the words cannot be recognized. The reason for the poor quality is the ambient noise, the variations of pitch in the recordings, and the use of an unprofessional microphone. A directional microphone with a preamp is preferable.

For the demo version, the second database is used. While by definition, the three modes that are inner-recorded are not 3 vocal efforts, the synthesized speech signals are perceived as speech recorded in 3 vocal efforts. For this reason, the second database is chosen. Note that the 3 modes are not simply speech signals amplified by a factor, otherwise the LP filters of the interpolation algorithm (see section 5.1) would be the same and this makes no sense for interpolation. (By amplifying the speech signals, residuals are also amplified with the same ratio, as a result the frequency responses of two filters are the same.)

5. The DSP module

5.1 Voice quality interpolation

Voice quality is an important factor in expressive and emotional speech synthesis. This section discusses the algorithm that numerically controls the voice quality: the intended vocal effort perceptual rating.

We have a database that contains 3 sets of diphones, which are spoken with different vocal efforts. By interpolating between speech signals that are synthesized with 2 out of these 3 vocal efforts, speech signals with an intermediate vocal effort can be obtained. The spectral interpolation algorithm was introduced in [3], together with data from a rating test for vocal effort interpolation. It has been shown [3] that the interpolated voice quality is perceived as intended and that the effect of language background on the effort rating is very small. Hence, changing the interpolation ratio for Dutch should have almost the same effect on the vocal effort perception as for the other languages mentioned in [3]. The algorithm is based on linear prediction (LP) and line spectral frequencies (LSF). A linear interpolation, with a ratio r, of two sets of LSFs obtained from speech signals of different vocal efforts yields a speech signal with an intermediate vocal effort.

According to table 11.8 of [2c], the vocal effort rating is given by

Vocal effort = 32.45+0.0346A+0.0191E-0.0207P (5.0)

where A, E, P are parameters for the dimensional expression of emotions (see section 2.2).
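As a worked example, formula 5.0 can be evaluated directly in Python. The coefficients are copied from the formula above; the input values are arbitrary illustrations, and how the resulting rating is mapped to an interpolation ratio is not shown here because the text above does not specify it.

```python
def vocal_effort(activation, evaluation, power):
    """Vocal effort rating according to formula 5.0."""
    return 32.45 + 0.0346 * activation + 0.0191 * evaluation - 0.0207 * power


neutral = vocal_effort(0, 0, 0)      # 32.45, the constant term
aroused = vocal_effort(100, 0, 0)    # a higher Activation raises the rating
dominant = vocal_effort(0, 0, 100)   # a higher Power lowers the rating
```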

Figure 5.1 shows the flowchart of the vocal effort interpolation part.

Figure 5.1: Flowchart of voice quality interpolation algorithm (from [3])

The algorithm has 8 steps:

The input wave files are divided into frames, with a window size of 20ms, and a skip-rate of 10ms (50% overlap). The frames are windowed with a Hamming window. Because the sampling frequency is 16 kHz, the frame size is 320 samples, and the skip-rate is 160 samples.

Select the required frames to interpolate. The input to this module is 2 out of the 3 wave files. The possible combinations are soft and modal vocal effort, or modal and loud vocal effort. Note that these 2 wave files are produced using the same set of acoustic parameters: the duration of each individual phoneme is the same in both wave files, so each phoneme is aligned at the same position. For example, the '/m/' phoneme of the "omdat…" sentence starts 34ms after the start of both wave files. So for each frame in voice 1, the corresponding frame in voice 2 is found at the same sample index. (Voice 1 and voice 2 are wave files of speech signals with a certain amount of vocal effort, see Figure 5.1.)
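The framing described in the first step can be sketched in Python as follows, using the values given there (frames of 320 samples, a skip-rate of 160 samples, Hamming window). Dropping a trailing partial frame is an assumption of this sketch, and the function names are hypothetical. Because both voices are aligned sample by sample, the same frame index can be used for voice 1 and voice 2.

```python
import math


def hamming(size):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (size - 1))
            for k in range(size)]


def split_frames(samples, frame_size=320, hop=160):
    """Cut a signal into overlapping, Hamming-windowed frames."""
    window = hamming(frame_size)
    frames = []
    # Assumption: a trailing partial frame is simply dropped.
    for start in range(0, len(samples) - frame_size + 1, hop):
        chunk = samples[start:start + frame_size]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


one_second = [1.0] * 16000          # 1 s of a constant signal at 16 kHz
frames = split_frames(one_second)   # 99 frames of 320 samples each
```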

3. LP analysis.

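We use LP analysis to model the vocal tract [1], based on "the hypothesis that the glottis and the vocal tract are fully decoupled". The sampling frequency is fs = 16 kHz, the LP order is p = 18, and the frame size is N = 320. LP of order p on a sequence of N speech samples x(0), ..., x(N-1) is represented as

$$\hat{\mathbf{x}}_0 = \sum_{i=1}^{p} w_i\, \mathbf{x}_i \qquad (5.1)$$

where $\mathbf{x}_i = [x(p-i),\, x(p-i+1),\, \dots,\, x(N-1-i)]^T$ for i = 0…p, so that $\mathbf{x}_0$ is the vector of signal values, of size (N-p), and $\hat{\mathbf{x}}_0$ is the vector of predicted values. The error generated by such prediction is called residual and the vector containing residuals is

$$\mathbf{e} = \mathbf{x}_0 - \hat{\mathbf{x}}_0 = \mathbf{x}_0 - \sum_{i=1}^{p} w_i\, \mathbf{x}_i \qquad (5.2)$$

The goal of the LP analysis is to find the set of prediction coefficients, w, which minimize the mean-square of the residuals, i.e. $\|\mathbf{e}\|^2$.

Let $\mathbf{x}_0 = \mathbf{x}_{\parallel} + \mathbf{x}_{\perp}$, where $\mathbf{x}_{\parallel}$ is the projection of $\mathbf{x}_0$ onto the subspace spanned by $\mathbf{x}_1, \dots, \mathbf{x}_p$, and hence $\mathbf{x}_{\perp}$ is orthogonal to this subspace.

So, $\|\mathbf{e}\|^2 = \|\mathbf{x}_{\parallel} - \hat{\mathbf{x}}_0\|^2 + \|\mathbf{x}_{\perp}\|^2$. The minimum can be achieved only when the first quadratic component is zero, i.e. $\hat{\mathbf{x}}_0 = \mathbf{x}_{\parallel}$ and $\mathbf{e} = \mathbf{x}_{\perp}$.

Hence, the vector of minimum mean-square residuals is simply obtained by projecting $\mathbf{x}_0$ orthogonally onto the prediction subspace [1]:

$$\langle \mathbf{e}, \mathbf{x}_j \rangle = 0 \quad \text{for } j = 1, 2, \dots, p \qquad (5.3)$$

From formula 5.2 and formula 5.3, we get the set of linear equations

$$\sum_{i=1}^{p} w_i \langle \mathbf{x}_i, \mathbf{x}_j \rangle = \langle \mathbf{x}_0, \mathbf{x}_j \rangle \quad \text{for } j = 1, 2, \dots, p \qquad (5.4)$$

Formula 5.4 is the Yule-Walker Equation [1].

Let $\phi(i) = \sum_{n=i}^{N-1} x(n)\, x(n-i)$ for i = 0, 1, 2…p; then $\phi$ is a sequence containing a part of the autocorrelation estimate of the speech samples. Given the fact that the frame size N is much larger than the LP order p,

$$\langle \mathbf{x}_i, \mathbf{x}_j \rangle \approx \phi(|i-j|) \qquad (5.5)$$

In appendix 1 it is shown how the autocorrelation is computed. According to formula 5.5, the set of linear equations 5.4 can be simplified to

$$\sum_{i=1}^{p} w_i\, \phi(|i-j|) = \phi(j) \quad \text{for } j = 1, 2, \dots, p \qquad (5.6)$$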

The left-side coefficient matrix is Toeplitz, so we can use the Levinson-Durbin algorithm to calculate the LP coefficients recursively. The pseudo code of the Levinson-Durbin algorithm is shown in appendix 2.
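A compact Python sketch of the Levinson-Durbin recursion is given here for orientation; it is not the pseudo code of appendix 2. It assumes the predictor convention x̂(n) = w_1 x(n-1) + ... + w_p x(n-p), so that the normal equations read sum_i w_i phi(|i-j|) = phi(j) for j = 1…p, and it uses the plain (unnormalized) autocorrelation sum as the estimate phi. Both function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """phi[lag] = sum_n frame[n] * frame[n + lag] for lag = 0..max_lag."""
    n = len(frame)
    return [sum(frame[k] * frame[k + lag] for k in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(phi, order):
    """Solve sum_{i=1..p} w_i * phi[|i - j|] = phi[j] for j = 1..p.

    Returns (w, error): w[i - 1] holds w_i, error is the final prediction
    error energy.
    """
    w = [0.0] * (order + 1)              # w[0] is unused padding
    error = phi[0]
    for m in range(1, order + 1):
        if error <= 0.0:                 # silent or degenerate frame: stop early
            break
        acc = phi[m] - sum(w[k] * phi[m - k] for k in range(1, m))
        reflection = acc / error
        updated = w[:]
        updated[m] = reflection
        for k in range(1, m):
            updated[k] = w[k] - reflection * w[m - k]
        w = updated
        error *= 1.0 - reflection * reflection
    return w[1:], error


# Autocorrelation of a first-order process: only w_1 should be non-zero.
w, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```

Each pass of the outer loop extends the solution from order m-1 to order m, which is why the Toeplitz structure reduces the cost from cubic to quadratic in p.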

4. Residual extraction

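From equation 5.2, it follows that, for each sample n,

$$e(n) = x(n) - \sum_{i=1}^{p} w_i\, x(n-i)$$

where x(n) are the speech samples and $w_i$ are the LP coefficients. The residual is thus calculated by filtering the speech signals with a p-th order FIR filter.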
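A minimal Python sketch of this inverse filtering is shown below. It assumes the convention e(n) = x(n) - sum_i w_i x(n-i) and treats samples before the start of the frame as zero; the function name `lp_residual` is hypothetical.

```python
def lp_residual(samples, w):
    """e(n) = x(n) - sum_{i=1..p} w[i-1] * x(n-i), with x(n) = 0 for n < 0."""
    residual = []
    for n, x_n in enumerate(samples):
        prediction = sum(w_i * samples[n - i]
                         for i, w_i in enumerate(w, start=1) if n - i >= 0)
        residual.append(x_n - prediction)
    return residual


# A decaying signal that a first-order predictor with w_1 = 0.5 explains fully:
signal = [1.0, 0.5, 0.25, 0.125]
excitation = lp_residual(signal, [0.5])   # only the first sample is non-zero
```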

5. LSF computation

Interpolation between the two sets of line spectral frequencies (LSFs) successfully modifies the amplitude spectrum envelope without degrading speech quality [3]. Interpolating two sets of LSFs always results in stable filters, and another advantage is that the interpolation of the LSFs avoids pole pairing [1].

Here is the mathematical foundation of LSFs:

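The LP polynomial A(z) of order p is decomposed into

$$F_1(z) = A(z) + z^{-(p+1)} A(z^{-1}), \qquad F_2(z) = A(z) - z^{-(p+1)} A(z^{-1})$$

so that $A(z) = \tfrac{1}{2}\left(F_1(z) + F_2(z)\right)$.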
The LSFs are the zeros of F1(z) and F2(z) that are located in the first two quadrants of the complex plane. Because all the zeros are on the unit circle as long as the LP filter is stable, the LSFs are represented in angular form ranging from 0 to π. By using the Chebyshev polynomial method [4] (see appendix 3), the problem of solving complex zeros on the unit circle is simplified to finding the real roots of a Chebyshev polynomial. In our application, we find the roots by dividing the angular interval [0, π] into 1000 equal-length sections. The values at both ends of each section are calculated, and their signs are used to determine whether there is a root in the corresponding section. In those sections with a root, we use the Brent-Dekker method to estimate the root [5]. The number of LSF values is the order p. If the number of LSFs we found is less than p, the number of sections (1000) is doubled.
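The bracketing strategy described above can be sketched in Python. Two simplifications are made and should be kept in mind: plain bisection is used in place of the Brent-Dekker method, and the function being searched is a generic stand-in rather than the actual Chebyshev-series form of F1 and F2. The names `find_roots` and `find_all_roots` are hypothetical, and the upper limit on the number of sections is an added safeguard that the text does not mention.

```python
import math


def find_roots(func, lower=0.0, upper=math.pi, sections=1000, tol=1e-12):
    """Bracket roots by sign changes on a uniform grid, then refine by bisection."""
    roots = []
    step = (upper - lower) / sections
    for k in range(sections):
        a, b = lower + k * step, lower + (k + 1) * step
        fa, fb = func(a), func(b)
        if fa == 0.0:
            roots.append(a)
        elif fa * fb < 0.0:              # sign change: exactly one bracketed root assumed
            while b - a > tol:
                mid = 0.5 * (a + b)
                fm = func(mid)
                if fa * fm <= 0.0:
                    b = mid
                else:
                    a, fa = mid, fm
            roots.append(0.5 * (a + b))
    return roots


def find_all_roots(func, expected, sections=1000, max_sections=1_000_000):
    """Double the number of sections until the expected number of roots is found."""
    roots = find_roots(func, sections=sections)
    while len(roots) < expected and sections < max_sections:
        sections *= 2
        roots = find_roots(func, sections=sections)
    return roots


# Stand-in function with two roots in [0, pi]: cos(2w) = 0.3.
roots = find_all_roots(lambda w: math.cos(2.0 * w) - 0.3, expected=2)
```

Two roots that fall into the same section cancel each other's sign change, which is why a finer grid is tried when fewer roots than expected are found.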

6. Weighted LSF interpolation

The LSF interpolation is defined by the following formula [3]:

$$L_o = (1 - r)\, L_s + r\, L_t \qquad (5.11)$$

where Lo is the output LSF vector, Ls and Lt are the LSF vectors of the source vocal effort and the target vocal effort, and r is the interpolation ratio, a number between 0 and 1.
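The interpolation is a per-element weighted average, sketched here in Python under the assumption that r = 0 reproduces the source LSFs and r = 1 the target LSFs. The LSF values in the example are made up.

```python
def interpolate_lsf(source, target, r):
    """Linear interpolation between two LSF vectors with ratio r in [0, 1]."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("interpolation ratio must lie between 0 and 1")
    return [(1.0 - r) * s + r * t for s, t in zip(source, target)]


# Hypothetical LSF vectors (radians, increasing) for two vocal efforts.
soft = [0.2, 1.0, 2.4]
loud = [0.4, 2.0, 2.8]
halfway = interpolate_lsf(soft, loud, 0.5)
```

Because both inputs are sorted in increasing order and the weights are non-negative, the output is also sorted, which is the property that keeps the interpolated LP filter stable.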

7. Converting LSFs to LP coefficients

It is the inverse of the LP-to-LSF conversion procedure. See appendix 4.

8. LP synthesis

The filter coefficients are the LP coefficients obtained from step 7. The filter is an allpole filter. The excitation signal inputted into this filter is the residual of voice 1. The filter is implemented as a direct form II structure, which will be discussed in section 5.3.

By going through the 8 steps above, we get voice 3, the speech signal with the interpolated vocal effort. Spectral slope is an important parameter of voice quality. A statistical relationship between the spectral slope and the 3 emotional dimensions is discussed in the literature [7]. In order to change the spectral slope to the value determined by the 3 emotional dimensions (according to table 11.8 of [2c]), a filtering module is designed.

5.2 Filtering

The basic idea is to first measure the spectral slope of voice 3 and then filter voice 3 to get the desired spectral slope. The spectral slope is measured through linear regression of a periodogram at higher frequencies, because the lower part of the spectrum mainly contains the phonetic information, such as the fundamental frequency and the lowest two formants, whereas the higher part contains information that is more closely related to voice quality [6].

The lowest frequency used in the linear regression calculation is called pivot frequency and the highest frequency used is 5 kHz. Note that the measurement of spectral slope is an alternative to measuring the Hammarberg Index, i.e. the difference of maximum energy in the [0, 2] kHz and [2, 5] kHz band.[6] It appears that the spectrum at frequencies higher than 5 kHz does not convey much information about voice quality. This is the reason why we choose the [pivot, 5] kHz band for the spectral slope measurement.

Figure 5.2: Flowchart of spectral slope measurement and filtering

This module has 4 steps:

Pivot Computation

Firstly we calculate the periodogram. Voice 3 is segmented in frames of 20ms, with 50% overlap. The frames are windowed with a Hamming window. So frame size N = fs*20ms = 320, frame shift P = 320/2 = 160. We set the number of frequency bins L = 512, i.e. each frame is zero-padded by appending 192 zeros. This value is large, hence the frequency resolution is high. As L is a power of 2, an efficient implementation of the FFT can be used.

Secondly, to obtain the pivot value, we compute I as the first value of i for which:

where X represents the periodogram vector of L bins, and =50, separating the lower and the higher part of the periodogram [6].

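The frequency of the pivot is calculated from the bin index I, the sampling frequency fs, and the number of frequency bins L as:

$$f_{pivot} = \frac{I \cdot f_s}{L}$$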


Spectral Slope Computation

Linear regression is performed within the [pivot, 5] kHz band. We use C function gsl_fit_linear_est from the GNU scientific library [5].
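The slope measurement can be sketched in Python with an ordinary least-squares fit instead of the GSL routine. Since the slopes are expressed in dB/oct, this sketch regresses the periodogram level in dB against the base-2 logarithm of the frequency; that choice of axes, the function name, and the skipping of empty bins are assumptions of the illustration.

```python
import math


def spectral_slope_db_per_octave(power, fs, pivot_hz, top_hz=5000.0):
    """Least-squares slope (dB per octave) of a periodogram in [pivot, top].

    `power` holds the L bins of an L-point periodogram; bin k is k * fs / L Hz.
    """
    n_fft = len(power)
    xs, ys = [], []
    for k in range(1, n_fft // 2 + 1):
        freq = k * fs / n_fft
        if pivot_hz <= freq <= top_hz and power[k] > 0.0:
            xs.append(math.log2(freq))                  # octaves
            ys.append(10.0 * math.log10(power[k]))      # level in dB
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic periodogram that falls by exactly 6 dB per octave.
fs, n_fft = 16000.0, 512
power = [1.0] + [10.0 ** (-0.6 * math.log2(k * fs / n_fft)) for k in range(1, n_fft)]
slope = spectral_slope_db_per_octave(power, fs, pivot_hz=2000.0)
```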

Filter Design

The slope of the filter is calculated as:

$$S_{filter} = S_{AEP} - S_{voice3}$$

where $S_{AEP}$ is the slope calculated by a linear combination of the 3 emotional dimension values A, E, and P, and $S_{voice3}$ is the spectral slope of voice 3. The unit for the three slopes is dB/oct.

To design such a filter, we use the frequency sampling method. The filter order is 64, which is 1/10 of the number of sampling points. The sampled response has the shape of an ideal low pass/band pass filter, but with a wide transition band [pivot, 5] kHz. The spectral slope in the transition band is the filter slope calculated above. Figure 5.3 shows an example of the frequency response of a filter that could result from this frequency sampling.

Figure 5.3: example of a low pass filter

Figure 5.4: example of a peak filter, the falling slope is not ideal (infinite), because the number of samples in frequency sampling is not infinite

The Nyquist frequency is 8 kHz and the pivot frequency is 2 kHz, which corresponds to a normalized frequency of 0.25 relative to the Nyquist frequency (0.125 cycles per sample). The highest frequency is 5 kHz. In both [0, 2] kHz and [5, 8] kHz the frequency response is flat. From 2 kHz to 5 kHz, the slope is the filter slope. The peak filter is similar, except that its magnitude response at the higher frequencies equals the magnitude response at the lower frequencies (see Figure 5.4).
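The filter design described above could look roughly like the following sketch. It samples the desired magnitude response on a dense grid, takes a zero-phase inverse DFT (a cosine sum), and keeps order + 1 taps under a Hamming window, which yields a linear-phase FIR filter. The text does not spell out the interpolation or windowing details, so those choices are assumptions, as are the function names `desiredGain` and `designSlopeFilter` and the `peak` flag.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Desired linear gain at frequency f (Hz): 0 dB up to the pivot, a constant
// slope in dB/oct between pivot and top, and flat above top. For a peak
// filter the gain above top returns to 0 dB.
double desiredGain(double f, double pivotHz, double topHz,
                   double slopeDbPerOct, bool peak) {
    if (f <= pivotHz) return 1.0;
    if (f > topHz && peak) return 1.0;
    const double fc = (f < topHz) ? f : topHz;
    const double gainDb = slopeDbPerOct * std::log2(fc / pivotHz);
    return std::pow(10.0, gainDb / 20.0);
}

// Windowed frequency-sampling design (sketch). `order` is assumed even and
// at least 2; `gridPoints` is the number of intervals between 0 Hz and the
// Nyquist frequency. Returns order + 1 symmetric taps.
std::vector<double> designSlopeFilter(std::size_t order, std::size_t gridPoints,
                                      double fs, double pivotHz, double topHz,
                                      double slopeDbPerOct, bool peak) {
    const double pi = std::acos(-1.0);
    const int half = static_cast<int>(order / 2);
    std::vector<double> h(order + 1);
    for (int n = -half; n <= half; ++n) {
        // Trapezoidal approximation of (1/pi) * integral of A(w) cos(w n).
        double acc = 0.0;
        for (std::size_t k = 0; k <= gridPoints; ++k) {
            const double f = 0.5 * fs * k / gridPoints;
            const double weight = (k == 0 || k == gridPoints) ? 0.5 : 1.0;
            acc += weight
                   * desiredGain(f, pivotHz, topHz, slopeDbPerOct, peak)
                   * std::cos(pi * k * n / gridPoints);
        }
        // Hamming window centred on n = 0 (value 1 at the centre tap).
        const double window = 0.54 + 0.46 * std::cos(pi * n / half);
        h[n + half] = window * acc / gridPoints;
    }
    return h;
}
```

With `designSlopeFilter(64, 640, 16000.0, 2000.0, 5000.0, -6.0, false)` the result has 65 symmetric taps, a gain close to 0 dB below 2 kHz, and a gain of roughly -8 dB above 5 kHz.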

4. Filtering

A 64th-order FIR filter is used. It is a low pass or a peak filter, depending on the sign of the filter slope.

5.3 Filter

The DSP part has been discussed except for the filter algorithm. Filters are used at two points in the processing scheme: in section 3.1 an allpole filter is used, and in section 3.2 an FIR filter is used. Both filters are implemented as a direct form II realization structure [8], as shown in Figure 5.5.

Figure 5.5: direct form II realization structure of a filter of order n-1.

If we use this structure for the allpole filter, then b[1]=1 and b[k]=0, k=2…n. If we use this structure for the FIR filter, then a[1]=1 and a[k]=0, k=2…n. In the FIR case the direct form II realization structure reduces to the transversal realization structure [8].

There are n delays, so in both cases n memory locations are needed, and in both cases n-1 additions are required. For the FIR filter, n multiplications are needed, giving 2n-1 operations per sample. For the allpole filter, n-1 multiplications are needed, giving 2n-2 operations per sample.

6. GUI of emotional TTS synthesis

This section describes the functions of the emotional TTS synthesis software and the principle of the spectrogram. The GUI is shown in Figure 6.1.

Figure 6.1: The GUI of the emotional speech synthesis module.

6.1 Model response of parameters

As is mentioned in section 2, the emotions are dimensionally controlled. There are three dimensional bipolar scales: activation (active-passive), evaluation (positive-negative), and power (very dominant-very submissive).

Activation and evaluation form a square, and clicking on a position in that square sets these two scales. Power is set by adjusting the slide bar. There are three corresponding text fields for these three dimensional scales, which show the current values. The dimensional scales can also be set by modifying the values in the text fields. This is realized with the signal/slot mechanism in Qt. The problem of signals recursively invoking each other is handled, so that an update does not trigger an endless loop.

6.2 Spectrogram

The spectrogram gives an impression of how the spectrum evolves with time. The horizontal axis of the spectrogram is the time axis and the vertical axis is the frequency axis, as shown in Figure 6.2. Both axes are divided into small slots, because the spectra are calculated with the FFT. Each time slot starts at the start of a frame and ends at the start of the next frame. The frequency bins run from 0 Hz to the Nyquist frequency. The number of time domain samples equals the number of frequency bins. The slots on the time axis and the frequency axis divide the visual area of the spectrogram into small blocks. The gray level of each block represents the squared magnitude spectrum of the corresponding frame in the corresponding frequency bin.

Figure 6.2: The spectrogram and time domain waveform

There are 5 steps to visualize the spectrogram:

1. Determine the parameters N (frame size), P (frame shift), and L (length of the zero-padded frame). To make the spectrogram fit the size of the visual area, the ratio between the width (W) and the height (H) of the visual area should be the same as the ratio between the number of time slots and the number of frequency bins. This way the spectrogram is not too detailed in one domain while being too coarse in the other, a trade-off that follows from the uncertainty principle. This is expressed in formula 6.1:

Where D represents the number of samples in the speech signal, P is the frame shift, N is the number of samples per frame, and L is the length of the zero-padded frame. We choose 50% overlap and use a Hamming window, so P=N/2. L should be no smaller than N, and for the efficiency of the FFT, L should be a power of 2. For these 2 reasons, L is the smallest power of 2 that is no smaller than N. Hence, N<=L<2N. Let k=W/H*L/N; from Formula 6.1 we have,

Where and D>>k

Hence, N=sqrt(2D/k). We are more concerned about the time resolution: because the durations of some phonemes are very short, a fine time resolution gives a better impression of how the spectrum evolves with time. So we can choose a smaller N. Note that L/N is not fixed as N varies, but 1<=L/N<2. As an estimate, we first assume L=2N, so that k=2W/H and N=sqrt(DH/W). In reality, L is a power of 2, so L=nextpower2(N).

2. Select a frame of length N and append it with L-N zeros.

3. Do an L point FFT for the zero-padded frame. Calculate the square magnitude spectrum in each frequency bin.

4. Store the squared magnitude spectrum in a matrix which has more entries than the spectrogram visual area has pixels. Downsample the matrix and display it on the visual area.

5. The start of the new frame is P samples after the start of the previous frame. Repeat steps 2-5 until no speech samples are left.
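The parameter choice in step 1 can be sketched as follows. This is a hedged illustration based on the estimate N = sqrt(DH/W) with P = N/2 and L taken as the smallest power of 2 not smaller than N; the struct and function names are hypothetical, and the lower bound of 2 on N is an assumption added to keep the frame shift positive.

```cpp
#include <cmath>
#include <cstddef>

// Frame size N, frame shift P and zero-padded length L.
struct SpectrogramParams {
    std::size_t frameSize;
    std::size_t frameShift;
    std::size_t fftSize;
};

// Smallest power of two that is not smaller than n.
std::size_t nextPowerOfTwo(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// numSamples is D; width and height are W and H of the visual area.
SpectrogramParams chooseSpectrogramParams(std::size_t numSamples,
                                          double width, double height) {
    // Estimate from the text, obtained by first assuming L = 2N.
    std::size_t n =
        static_cast<std::size_t>(std::sqrt(numSamples * height / width));
    if (n < 2) n = 2;  // assumption: keep the frame shift at least 1
    return {n, n / 2, nextPowerOfTwo(n)};
}
```

For example, a 2 second signal at 16 kHz (D = 32000) shown in an 800 by 400 area gives N = 126, P = 63 and L = 128.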

Like the other parts discussed in this paper, the spectrogram is implemented in C++. With the Qt library, it is easy to visualize the spectrogram. It is designed to be a wideband spectrogram: the purpose is to give a good time resolution and make it easier for users to distinguish different phonemes.
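Steps 2 to 5 of the spectrogram computation can be sketched in plain C++ as shown below. This is a minimal illustration and not the thesis implementation: it uses a textbook iterative radix-2 FFT, leaves out the Qt display and downsampling of step 4, and drops a trailing partial frame. The function names `fft` and `spectrogram` are hypothetical.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using Complex = std::complex<double>;

// In-place iterative radix-2 FFT; data.size() must be a power of two.
void fft(std::vector<Complex>& data) {
    const std::size_t n = data.size();
    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(data[i], data[j]);
    }
    const double pi = std::acos(-1.0);
    // Butterfly stages of increasing length.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const Complex wLen = std::polar(1.0, -2.0 * pi / len);
        for (std::size_t i = 0; i < n; i += len) {
            Complex w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const Complex u = data[i + k];
                const Complex v = data[i + k + len / 2] * w;
                data[i + k] = u + v;
                data[i + k + len / 2] = u - v;
                w *= wLen;
            }
        }
    }
}

// Returns one column of squared magnitudes (bins 0 .. L/2) per frame.
// frameShift must be positive and fftSize a power of two >= frameSize.
std::vector<std::vector<double>> spectrogram(const std::vector<double>& signal,
                                             std::size_t frameSize,
                                             std::size_t frameShift,
                                             std::size_t fftSize) {
    const double pi = std::acos(-1.0);
    std::vector<std::vector<double>> columns;
    for (std::size_t start = 0; start + frameSize <= signal.size();
         start += frameShift) {
        // Step 2: windowed frame followed by L - N zeros.
        std::vector<Complex> frame(fftSize, Complex(0.0, 0.0));
        for (std::size_t i = 0; i < frameSize; ++i) {
            const double w =
                0.54 - 0.46 * std::cos(2.0 * pi * i / (frameSize - 1));
            frame[i] = signal[start + i] * w;
        }
        // Step 3: L-point FFT and squared magnitude per bin.
        fft(frame);
        std::vector<double> column(fftSize / 2 + 1);
        for (std::size_t k = 0; k < column.size(); ++k) {
            column[k] = std::norm(frame[k]);
        }
        columns.push_back(column);
    }
    return columns;
}
```

Each returned column corresponds to one time slot; mapping the values to gray levels and downsampling to the visual area would be done in the display code.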

7. Evaluations and discussion

In total 11 people were interviewed. The interviewees listened to 25 synthesized speech signals (wave files), covering 5 sentences synthesized in 5 different emotions: Neutral, Sad, Angry, Afraid, and Happy. Each emotional state is expressed dimensionally by the parameters (A, E, P), where the neutral sound is (A, E, P) = (0, 0, 0). For each emotion, the dimensional expression is fixed. The free software Feeltrace defines parameters A and E, and the BEEVer study defines the parameter P [7].

The interviewees were asked to assign an emotional state to each of the 25 wave files. The results that are obtained by collecting and analyzing the questionnaires are shown in Figure 7.






Figure 7: Percentage of interviewees who agree that the emotions are the same as intended, with one panel per sentence (Sentence 1 to Sentence 5).

The synthesized speech signals with neutral emotion are always perceived as intended. The dimensional expression of the Neutral sound is (A=0, E=0, P=0), which means the rising/falling pitch slopes are not changed while all the targets are put on two declination lines. The smallest changes have been applied to the speech signals with neutral emotion. They inherit more properties from the acoustic parameters of the input PHO file.

Most of the synthesized speech signals with other emotions are not perceived as intended. The most obvious reason is that we applied results from a study of the German language (the figures in Table 1 are taken from Schroeder's study (2004)) to Dutch sentences. It appears that the relationship between the dimensions and parameters such as pitch and pitch dynamics (see Table 1) is different in Dutch than in German. A further study is needed to determine this relationship (see section 8).

8. Further implementation

To improve the quality of the Dutch emotional speech synthesizer, 2 things can be done:

Conducting an experiment to obtain the prosody modification rules (section 3) for the Dutch emotional synthesizer. A linear regression between the ACCESS [11] variables and the 3 dimensional parameters A, E, and P [7] needs to be identified. The ACCESS variables are in 4 categories: intonation, tempo, intensity, and voice quality. A new implementation could be based on this. The ACCESS variables are obtained from linear combinations of the dimensional parameters and will be used to modify the F0 curve. A corresponding PHO file is then obtained immediately.

A linear regression between the dimensional parameters and vocal effort also needs to be done. Recall that vocal effort is a subjective psychological quantity, so the measurement of vocal effort should be based on a listening test supervised by a well-trained person with sufficient knowledge in the field.

Carrying out these two experiments takes a lot of time. Many data points have to be collected for the regression analysis, and many TV recordings need to be analyzed to determine the three dimensional parameters [7].

To obtain a full version of the Dutch emotional speech synthesizer, NLP (natural language processing) should be added to generate the original PHO file. Then the input to the speech synthesizer can be ASCII text. To convert a text into a PHO file, at least 3 modules should be added: a grapheme-to-phoneme module, a duration module, and an intonation module.

The grapheme-to-phoneme module translates a plain text made up of words into a list of phonemes. The duration of each phoneme is determined by the duration module. Finally, the intonation module gives the F0 targets, which define the position and the pitch of the rising/falling tones.

9. Conclusion

The emotional states of the speech signals synthesized by the emotional Dutch speech synthesizer are generally not recognized. Besides the fact that results from a study of German were applied to Dutch sentences, the quality of the recordings is poor. It is almost impossible to obtain diphones that are both monotonous and at a desired level of vocal effort. The approach described in this paper requires a very professional Dutch-speaking volunteer, which does not seem feasible. A good recording costs too much time and money. The recording can be avoided if a database of diphones with 3 vocal efforts is not required. In that case, some free Dutch diphone databases could be used. Voice quality modification could be achieved by "decompose the speech signal into periodic and aperiodic parts, modify the amplitude spectrum of the periodic part by adaptive filtering and recombine with the aperiodic component with a new mixing level" [3].