Text-to-Speech (TTS) System


The basic goal of a text-to-speech (TTS) system is to synthesize speech for given input text; TTS is thus an automatic counterpart of a human being reading written text aloud. For visually impaired people, TTS systems are a helpful means of accessing written information. A limited-domain TTS, as the name suggests, is built to serve a specific purpose, e.g. a TTS used for announcement-related queries, whereas an unrestricted TTS system is capable of synthesizing good-quality speech across different domains. The Festival framework has been used for building the present TTS system, and as an initial step we have developed a TTS for the Gujarati language using the phoneme as the basic unit of concatenation. The technique called Hidden Markov Model (HMM) based speech synthesis has been demonstrated to be very effective in synthesizing acceptable speech. This paper first gives an overview of generic TTS techniques. Section II details the speech corpus used for developing the TTS, and explains the basic prerequisites and language-specific issues involved in building a TTS for Gujarati. The last section provides the conclusion of the present work, along with future work to be carried out for improving the quality of the synthesized speech.

Keywords: Hidden Markov Model, phoneme, feature extraction, prosody, Linguistic analysis, Letter to Sound rules.

I. Introduction

With the increase in the power and resources of computer technology, building natural-sounding synthetic voices has progressed from a knowledge-based approach to a data-based one. Rather than manually crafting each phonetic unit and its applicable contexts, high-quality synthetic voices may be built from sufficiently diverse single-speaker databases of natural speech. Techniques of unit-selection synthesis, where appropriate sub-word units are automatically selected from large databases of natural speech [9], are more resource-consuming than synthesis from fixed inventories, as found in diphone systems [13]. Unit-selection techniques have nevertheless evolved to become the dominant approach to speech synthesis. The quality of the output derives directly from the quality of the recordings, and it appears that the larger the database, the better the coverage. Unfortunately, recording large databases with sufficient variation is very difficult and costly [11].


Fig. 1: TTS System for Gujarati Language using Festival framework Overview

Statistical parametric speech synthesis has also grown in popularity in recent years [15]. Statistical parametric synthesis might be described as generating the average of some set of similarly sounding speech segments. This contrasts directly with unit-selection synthesis, which retains natural, unmodified speech units; however, using parametric models offers other benefits. The quality issue comes down to the fact that, given a parametric representation, it is necessary to reconstruct the speech from these parameters, and the process of reconstruction is still not ideal. Although modeling the spectral and prosodic features is relatively well defined, models of the residual/excitation have yet to be fully developed, even though composite models like STRAIGHT [10] are proving to be successful.

In the past few years, text-to-speech systems have been built for several Indian languages (Hindi, Telugu, Tamil and Kannada). All of these systems were developed using the Festival framework [3]. In the present work, the statistical parametric synthesis technique with the Festival framework is used for developing a TTS for the Gujarati language, using the phoneme as the basic unit of concatenation.

Using HMMs, the phones are aligned to the corresponding spoken sounds; this essentially finds the location of every phone. For every phone, a statistical model is trained so as to obtain the parametrized speech as a function of the phone context. At synthesis time, the statistical model is used to estimate the speech parameters and the corresponding sound unit. The main disadvantage of statistical parametric speech synthesis is the speech quality: due to the averaging involved in training the statistical models, the synthesized speech loses some of its sharpness, and some of the variability of speech is not captured by the parametric representation.

A text-to-speech system for the Gujarati language was developed using the Festival framework. A generic TTS is built to read anything: a document, news, stories and so on. The words and sentences this TTS can turn into synthesized speech are unrestricted.

The generic architecture of a TTS contains text normalizer, linguistic analyser, prosody handler and waveform generator modules. To understand the text normalization process, note that the written form of digits, dates and abbreviations differs from their pronunciation. For example, it is legal to write 100, but when we read the text we do not pronounce each digit individually, saying "one zero zero"; rather we say "one hundred". The same applies to symbols, dates and currencies. The process which handles conversion of digits, dates, currencies (Rs., $, etc.), acronyms (ડ।., મિ., Ms., etc.) and abbreviations (USA, UK) to word sequences is known as text normalization. Text normalization can also be seen as a process which converts non-standard words, i.e. words not found in a dictionary, to standard words. The output of the module will contain only pure word sequences instead of arbitrary numbers and symbols.
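As an illustration of the text normalization step described above, the following is a minimal, hypothetical Python sketch that expands digit strings into English number words. A real Festival front end would also have to handle dates, currencies, acronyms and abbreviations, and for Gujarati the word tables would of course differ.

```python
import re

# Number-word tables for English (illustrative only).
UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out 0..999 in words, e.g. 100 -> 'one hundred'."""
    if n < 20:
        return UNITS[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("" if rest == 0 else " " + UNITS[rest])
    hundreds, rest = divmod(n, 100)
    words = UNITS[hundreds] + " hundred"
    return words if rest == 0 else words + " and " + number_to_words(rest)

def normalize(text: str) -> str:
    """Replace every standalone digit string with its word sequence."""
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
```

For instance, `normalize("I paid 25 rupees")` expands the digits to "twenty five" rather than "two five", which is exactly the behaviour the module above is meant to guarantee.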

These word sequences are then passed through a linguistic analysis module. This module requires rigorous research into the phonology and phonetics of the particular language. A phone set needs to be developed in order to cover all the sound units in the language. Phones are the basic distinct units of sound that make up speech. While phones are associated with the letters of the written language, a phone may differ depending on context; these variants are called allophones. The relationship between a phoneme and its allophones is often captured by writing phonological rules.

Fig: 2: Overall System Architecture for developing TTS

The linguistic module includes a part-of-speech (POS) sub-module to tag the POS of each word for better phrase-break prediction. The POS indicates whether a given word is a noun, pronoun, adjective, adverb, preposition and so on. POS tagging can be done manually or by a statistically trained sub-module that predicts a POS tag for each word. The phrase-break sub-module indicates to the system when to introduce a pause in a sentence by referring to the POS tags of the words. This prediction can be made by applying rules such as: introduce a short pause if a preposition tag is encountered while synthesizing. The letter-to-sound sub-module takes the word sequences and converts them to the corresponding phone sequences by referring to the spelling of each word. To build the letter-to-sound rules we make use of a pronunciation dictionary, which contains each word and its representation as a phone sequence. This dictionary may also contain other linguistic information such as syllabification, schwa deletion, POS and stress pattern for all the words present in it. A statistical model like CART (Classification and Regression Trees) [6] can be used to predict the letter-to-phone sequences.
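The dictionary-first control flow of the letter-to-sound sub-module can be sketched as below. The lexicon entries and the one-phone-per-letter fallback table are invented for illustration; Festival actually trains a CART over letter contexts for out-of-vocabulary words.

```python
# Tiny illustrative pronunciation dictionary (word -> phone sequence).
LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "text":   ["t", "eh", "k", "s", "t"],
}

# Naive fallback: one phone per letter. Consonants map to themselves,
# vowels and a few special letters get hand-picked (made-up) phones.
FALLBACK = {c: c for c in "bdfghjklmnpqrstvwz"}
FALLBACK.update({"a": "ax", "e": "eh", "i": "ih", "o": "ao", "u": "uh",
                 "c": "k", "x": "k s", "y": "y"})

def letter_to_sound(word):
    """Dictionary lookup first; fall back to per-letter rules otherwise."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    phones = []
    for ch in word:
        phones.extend(FALLBACK.get(ch, "").split())
    return phones
```

In-vocabulary words come straight from the lexicon, so `letter_to_sound("speech")` returns the hand-entered pronunciation, while unknown words are spelled out letter by letter.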

These phone sequences then pass through the prosody module, where prosodic information such as phone duration, pitch (F0) contour and energy contour is retrieved for these phones. Prosodic information is particularly important for providing naturalness in the voice: if a voice in any TTS sounds robotic, most probably the prosodic information of the sound segments is lacking. All this information is then passed to the waveform generation module to produce synthetic speech.

The TTS system for the Gujarati language, with the phoneme as the basic unit of concatenation, was developed using the Festival framework. Festival, in conjunction with Festvox, supports the development of new voices. It has several modules which can be adjusted on the basis of inputs from linguists for generating better speech. This work is an extension of the TTS system developed for Gujarati numerals using a Java API [14], which takes Gujarati numbers as input, converts the digits into sentence form and generates speech for this sentence (words).

II. Speech Corpus

As part of developing the TTS for the Gujarati language, a speech corpus of about 90 minutes was recorded and used to develop the TTS system. The speech corpus was recorded in a noise-free studio environment, rendered by an ordinary speaker. The sentences and words used for recording were optimised to achieve maximal syllable coverage. The sentences were recorded in dual channel and sampled at 48 kHz. After recording, each speech file was downsampled to 16 kHz, converted to mono channel and segmented into sentences.

The .wav files of the sentences are named according to the requirements of the Festival framework. For building an unrestricted TTS, a large amount of typed text is collected from various domains such as news, sports, stories and education; around 30,000 sentences were collected. All these sentences cannot be used for building the TTS, because of limitations due to the size of memory and the response time between text input and synthesized speech output. Therefore, optimal text is derived using an optimal text selection algorithm, yielding a small-scale representation of the overall text. A total of 1000 sentences were obtained as the optimal text. These sentences were recorded by an ordinary speaker and used to build the unrestricted TTS for the Gujarati language. There are two phases in building the TTS for the Gujarati language.

The first phase is the training phase, or voice building.

The second phase is the synthesis phase, where the desired speech will be synthesized for the given text.

The processes of training phase are as follows:

selection of basic unit for synthesis,

collection of text corpus from several sources,

optimal text selection for recording the speech corpus,

deriving letter to sound rules,

recording and labeling the speech corpus and

organizing the speech corpus based on speech parameters and contextual information using clustering.

Selection of Basic Unit:

The first step in developing the TTS is to decide the basic unit of concatenation. Researchers have tried using phones, diphones, triphones, syllables, words, phrases and nonuniform units as basic units [11]. In our case we have chosen the phoneme as the basic unit of concatenation. Although the quality of the speech output is not fully natural, it is not absurd. With smaller units there are more concatenation points, which leads to perceptual distortion due to the lack of coarticulation at the joins; on the other hand, the size of the speech corpus required is limited. With larger units such as diphones, syllables and words, the quality of the synthesized speech is better since there are fewer join points, but the storage requirement of the speech corpus becomes very large in order to cover the maximum possible number of basic units. Therefore, in choosing the basic unit for developing a TTS system, there is a tradeoff between the quality of the synthesized speech and the size of the speech corpus.

Text Corpus Collection:

Good-quality text from several areas is collected so that the maximum number of words is covered. We have collected around 30,000 sentences from several books on poetry, literature and history. It was observed that the selected words should have few repetitions, as repetition results in wasted storage space and reduced system efficiency. Therefore, this large corpus should be represented efficiently on a small scale by using an appropriate optimal text selection criterion.

Optimal text collection:

The optimal text prompts are collected from several sources; the optimal text utterances, i.e. syllable units, are filtered in such a way that there is no repetition, or the frequency of repetition is low [5]. The following four parameters are applied for selecting the optimal text:

the number of unique phonemes occurring in a syllable which are not covered in previously selected sentences,

the number of unique phonemes whose frequency of repetition has not yet reached the desired threshold,

the number of new preceding-phoneme contexts with respect to each unique phoneme and

the number of new following-phoneme contexts with respect to each unique phoneme.

The score for each sentence is obtained by calculating the weighted sum of these parameters. The sentence with the highest score is selected from the text corpus. After a sentence is selected, counters which maintain the number of covered unique phonemes and the frequency of occurrence of unique phonemes are updated based on the selected sentence. The selected sentence is removed from the text corpus and stored separately. The four parameters are then recalculated for each of the remaining sentences, and the selection is repeated. This process continues until all the unique phonemes and their desired frequencies of occurrence are achieved. Finally, the selected optimal text contains all the unique phonemes present in the text corpus at the desired frequencies.
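The greedy selection loop described above can be sketched as follows. For simplicity this toy version scores a sentence only by the number of not-yet-covered unique phonemes it adds (one of the four weighted parameters); `phonemize` is a stand-in for the real grapheme-to-phoneme front end.

```python
def select_optimal(sentences, phonemize):
    """Greedy optimal-text selection: repeatedly pick the sentence that
    adds the most not-yet-covered unique phonemes, until no sentence
    adds anything new."""
    covered = set()
    selected = []
    pool = list(sentences)
    while pool:
        # score = number of new unique phonemes each sentence would add
        scores = [len(set(phonemize(s)) - covered) for s in pool]
        best = max(range(len(pool)), key=scores.__getitem__)
        if scores[best] == 0:      # nothing left to gain; stop early
            break
        sentence = pool.pop(best)  # move the winner to the optimal set
        selected.append(sentence)
        covered |= set(phonemize(sentence))
    return selected
```

With `phonemize=list` (treating each character as a phoneme for demonstration), `select_optimal(["ab", "abc", "cd"], list)` picks "abc" first, then "cd", and discards "ab" as fully redundant.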

For developing a TTS system, pronunciation or letter-to-sound (LTS) rules play an important role. Letter-to-sound rules indicate how the written text has to be spoken. Most American and European languages have their own specific letter-to-sound rules. For Indian languages, letter-to-sound rules are somewhat simpler: in languages like Telugu and Kannada the written and spoken forms are almost the same, and in Hindi the rules to be derived are reasonably simple and have been implemented in speech synthesis systems [8]. For building the TTS in Gujarati, letter-to-sound rules are developed in two ways:

deriving the pronunciation dictionary,

deriving the rules using the linguistic knowledge.

If a particular sequence of graphemes yields a particular sequence of phones most of the time, then this can be treated as a rule. These rules can be applied to similar grapheme sequences to obtain the corresponding phone sequences. Some special cases, which are exceptions to the previous rules and mostly involve conjuncts, are handled with specific rules. Schwa deletion rules are used for the deletion or retention of the inherent vowel /a/ in the graphemes of a word. The vowel /a/ can occur at any position within the word, and rules are framed to discriminate the situations where it should be deleted from those where it should be retained or modified.
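To make the idea concrete, here is a deliberately simplified, hypothetical schwa-handling sketch: every consonant grapheme carries an inherent /a/ which is retained before another consonant but deleted word-finally and before an explicit vowel. The real rules in the text are context-dependent and language-specific.

```python
# Romanized consonant graphemes (illustrative subset).
CONSONANTS = set("kgcjtdnpbmyrlvsh")

def apply_schwa_rules(graphemes):
    """graphemes: list of romanized letters; returns a phone list with
    the inherent schwa /a/ inserted or suppressed per the toy rules."""
    phones = []
    for i, g in enumerate(graphemes):
        phones.append(g)
        if g in CONSONANTS:
            is_last = (i == len(graphemes) - 1)
            next_is_vowel = (not is_last) and graphemes[i + 1] not in CONSONANTS
            # retain the inherent schwa only before another consonant;
            # delete it word-finally and before an explicit vowel sign
            if not is_last and not next_is_vowel:
                phones.append("a")
    return phones
```

So a bare consonant cluster spelling such as k-m-l surfaces as /k a m a l/, while an explicit vowel (as in r-a-m) suppresses the inserted schwa, mirroring the retention/deletion distinction described above.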

Recording and labeling Speech Corpus:

Once the optimal text is derived from the large text corpus, the speech corpus is recorded using this optimal text. Recording is carried out in a studio environment. The recorded speech corpus is initially segmented into sentences, and each sentence wav file is labeled according to the Festival framework. In the Festival framework, phone-level segmentation can be performed with ergodic hidden Markov models (HMMs) using forced alignment. Labeling and segmentation at the syllable level can be performed from the phone-level segmentation using appropriate syllabification rules.

After segmenting the speech corpus into basic units, each basic unit is parameterized using the parameterization module present in Festival, which is appropriate for phoneme-level units.

After segmenting and labeling the speech corpus at the different levels, the next step is to organize the speech data in a systematic manner. In the Festival framework, the speech corpus is organized in the form of clusters [4]. Each cluster consists of multiple realizations of a particular sound unit. Here the clusters are represented by CART trees, which are generated based on questions about the phonetic and prosodic contextual features of the sound unit. For each type of sound unit in the speech corpus, a decision tree is constructed whose leaves are the lists of candidate units best identified by the sequence of questions that leads to each leaf of the tree.

Decision trees contain a binary question (with a yes/no answer) about some feature at each node in the tree. The leaves of the tree contain the best prediction based on the training data. Decision lists are a reduced form of this, where one answer to each question leads directly to a leaf node. A tree's leaf node may be a single member of some class, a probability density function (over some discrete class), a predicted mean value for a continuous feature, or a Gaussian (mean and standard deviation for a continuous value).
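A minimal sketch of such a tree, with invented feature names and unit identifiers, might look like this:

```python
class Node:
    """One node of a unit-clustering decision tree: either an internal
    node holding a yes/no question, or a leaf holding candidate units."""
    def __init__(self, question=None, yes=None, no=None, candidates=None):
        self.question = question      # e.g. ("prev_is_vowel", True)
        self.yes, self.no = yes, no
        self.candidates = candidates  # non-None only at leaves

def find_candidates(node, features):
    """Walk from the root to a leaf using the unit's context features."""
    while node.candidates is None:
        key, value = node.question
        node = node.yes if features.get(key) == value else node.no
    return node.candidates

# A toy trained tree: feature names and unit ids are illustrative only.
tree = Node(question=("prev_is_vowel", True),
            yes=Node(candidates=["unit_12", "unit_47"]),
            no=Node(question=("word_final", True),
                    yes=Node(candidates=["unit_3"]),
                    no=Node(candidates=["unit_8", "unit_21"])))
```

At synthesis time this lookup replaces any search over the whole database: the target unit's contextual features answer each question in turn, and the reached leaf directly yields the candidate list.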


Fig. 3: An example of part of a trained decision tree. The questions are asked at each node, after which the tree splits. The number indicates the number of data points at each node. The double circles indicate leaf nodes, where the splitting process stops.

III. Hidden Markov Model Synthesis

To generate high-quality speech, the difficult part is determining which parameters to use for a given synthesis specification, rather than converting the parameters into speech. Determining these parameters by hand-written rules can produce fairly intelligible speech, but the inherent complexities of speech seem to place an upper limit on the quality that can be achieved in this way. The various second-generation synthesis techniques solve the problem by simply measuring the values from real speech waveforms. While this is successful to a certain extent, it is not a perfect solution: it is difficult to collect enough data to cover all the effects, and the coverage obtained from a database of natural speech is very uneven. Furthermore, the concatenative approach is limited to recreating what was recorded; in a sense it can only reorder the original data.

An alternative is to use statistical machine learning techniques to infer the specification-to-parameter mapping from data. While this and the concatenative approach can both be described as data-driven, the concatenative approach memorises the data, whereas the statistical approach learns the general properties of the data. Two advantages arise from statistical models: firstly, we require orders of magnitude less memory to store the parameters of the model than to memorise the data; secondly, we can modify the model in various ways, for example to convert the original voice into a different voice.

There are several statistical synthesis approaches that can be used, but the most successful is based on hidden Markov models (HMMs). HMMs are quite general models and, although originally developed for speech recognition, have been used for many tasks in speech and language technology.

The input to an automatic speech recognition system is a sequence of frames of speech, known as observations and denoted as

O = (o₁, o₂, …, o_T)

The frames are processed so as to remove the phase and source information, using mel-frequency cepstral coefficients (MFCCs). For each phone, a probabilistic model is built which gives the probability of observing a particular acoustic input.

The Gaussian distribution, also called the normal distribution or bell curve, is defined by two parameters. The mean, μ, describes its "average" value. The variance, denoted σ², describes whether the distribution is narrow or dispersed. The square root of the variance is called the standard deviation and is denoted σ. The Gaussian distribution is shown in Figure 4 and defined by the equation

N(x; μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))    …… (1)


Fig. 4: The Gaussian function

When dealing with vector data, where the observations are acoustic frames, a multivariate Gaussian is used. This is the natural extension of the univariate one: instead of a single mean there is a vector of means, with one value per component, and instead of a variance vector there is a covariance matrix Σ, because we wish to model not only the variance of each component but also the covariance between components. The pdf of an n-dimensional Gaussian is given by:

N(x; μ, Σ) = (1 / ((2π)^(n/2) |Σ|^(1/2))) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))    ….. (2)

where n is the dimensionality, μ is the vector of means and Σ is the covariance matrix.

In this way a system can be built with one model for every phone, each described by its multivariate Gaussian. For an unknown utterance, if the phone boundaries are known, we can test each phone model in turn and find which model gives the highest probability to the observed speech, and from this find the sequence of phones most likely to have given rise to the observations in the utterance.
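The phone-classification idea above can be sketched with diagonal-covariance Gaussians (a common simplification of the full covariance matrix in Eq. (2)); the two-dimensional phone models below are toy values, not trained parameters:

```python
import math

def log_gaussian(x, mean, var):
    """Log pdf of a diagonal-covariance multivariate Gaussian: the sum
    of independent univariate Gaussian log pdfs, one per dimension."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

# Toy 2-dimensional phone models: phone -> (mean vector, variance vector).
PHONE_MODELS = {
    "aa": ([1.0, 0.0], [0.5, 0.5]),
    "s":  ([-1.0, 2.0], [0.3, 0.8]),
}

def classify_frame(frame):
    """Return the phone whose Gaussian gives this frame the highest
    log probability, as described in the text."""
    return max(PHONE_MODELS,
               key=lambda p: log_gaussian(frame, *PHONE_MODELS[p]))
```

A frame near a model's mean scores highest under that model, so this implements "test each phone model in turn and find which gives the highest probability".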


Fig. 5: Schematic of how we can build a Gaussian model for a phone.

In order to improve the accuracy of these models, a mixture of Gaussians is adopted, as shown in the figure below, and the output probability is given by:

b(o) = Σ_{m=1..M} c_m N(o; μ_m, Σ_m),  with  Σ_{m=1..M} c_m = 1    …… (3)


Fig. 6: Mixtures of Gaussians. This shows how three weighted Gaussians, each with its own mean and variance, can be "mixed" (added) to form a non-Gaussian distribution. The total probability, given by the area under the curve, must equal 1.

Synthesis from Hidden Markov Models:

Hidden Markov models, in conjunction with a search algorithm such as the Viterbi algorithm, can be used in speech synthesis as well as in recognition. The HMM alone gives a specification such as the phone sequence and the duration of each phone. The phone sequence tells us which models to use and in which order, but not which states to use, or which observations to generate from the state Gaussians. The duration in the specification tells us how many observations we should generate from each model, but again not from which states to generate them.

When the most likely sequence of observations is generated from the sequence of models, each state will always generate its mean observation. The spectra are therefore constant during each state, meaning that, in one dimension, the generated speech is a flat line, followed by a discontinuity, followed by a different flat line. This does not look or sound like natural speech. In effect all variance information is ignored: if we were to retrain on speech generated by this model, each state would produce the same observation, and we would calculate the same mean but a zero variance in every case.

The approach described above generates the most likely observations from the models. The problem is that the states always generate their mean values, which results in jumps at state boundaries. The key idea of the following technique is to use the delta and acceleration coefficients as constraints on which observations can be generated.

Consider the set of observations

O = (o₁, o₂, …, o_T)

splitting each observation into its constituent parts of static coefficients and delta coefficients:

o_t = [c_t, Δc_t]

For synthesis it is required to find the static coefficients c, so the problem is to generate the highest-probability sequence of these that also obeys the constraints imposed by the delta coefficients Δc. To find both the observations and the state sequence, we use an algorithm specific to this problem. Consider a state sequence Q = (q₁, …, q_T). The probability of observing the sequence of acoustic vectors is therefore equal to

P(O|Q) = Π_{t=1..T} b_{q_t}(o_t)    …… (4)

The observation probability for a single n-dimensional Gaussian is given by

b_q(o) = N(o; μ_q, Σ_q) = (1 / ((2π)^(n/2) |Σ_q|^(1/2))) exp(−(1/2)(o − μ_q)ᵀ Σ_q⁻¹ (o − μ_q))    …… (5)

Taking log probabilities, Equation (4) becomes

log P(O|Q) = Σ_{t=1..T} log b_{q_t}(o_t)    …… (6)

Taking the log of the single-Gaussian observation probability in Eq. (5) gives:

log b_q(o) = −(1/2)(o − μ_q)ᵀ Σ_q⁻¹ (o − μ_q) − (1/2) log((2π)ⁿ |Σ_q|)    …… (7)

Substituting Eq. (7) into Eq. (6) and stacking the per-frame means and covariances into utterance-level μ and Σ, the log probability for the state sequence becomes

log P(O|Q) = −(1/2) Oᵀ Σ⁻¹ O + Oᵀ Σ⁻¹ μ + K    …… (8)

where K collects the terms that do not depend on O.


The rate-of-change or delta coefficients can be expressed as weighted versions of the static coefficients c, for example

Δc_t = (c_{t+1} − c_{t−1}) / 2    …… (10)

so that the whole observation sequence can be written as O = Wc. Hence we obtain the maximum-likelihood expression in terms of the static coefficients c:

log P(O|Q) = −(1/2) cᵀ Wᵀ Σ⁻¹ W c + cᵀ Wᵀ Σ⁻¹ μ + K    …… (11)

where W is a sparse matrix which expresses the weights of Eq. (10): each pair of its rows copies one static coefficient c_t and forms the corresponding delta Δc_t from the neighbouring static coefficients, so that O = Wc.


To maximize Eq. (11), we differentiate it with respect to c and set the result to zero, which gives

Wᵀ Σ⁻¹ W c = Wᵀ Σ⁻¹ μ    …… (13)

The above equation can be solved for c by any standard matrix solution technique.
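The closed-form solution above can be demonstrated with a toy, one-dimensional pure-Python sketch. It assumes the simple difference delta window Δc_t = (c_{t+1} − c_{t−1})/2 with zero rows at the utterance boundaries; real systems use richer window sets and per-state variances from the trained HMMs.

```python
def solve(A, b):
    """Naive Gaussian elimination with partial pivoting (small systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def mlpg(means, variances):
    """means/variances stack [static, delta] per frame; returns the
    trajectory c solving  W' P W c = W' P mu  with P = diag(1/var)."""
    T = len(means) // 2
    # Build W: static row copies c_t, delta row forms (c_{t+1}-c_{t-1})/2.
    W = [[0.0] * T for _ in range(2 * T)]
    for t in range(T):
        W[2 * t][t] = 1.0
        if 0 < t < T - 1:
            W[2 * t + 1][t + 1] = 0.5
            W[2 * t + 1][t - 1] = -0.5
    P = [1.0 / v for v in variances]
    A = [[sum(W[k][i] * P[k] * W[k][j] for k in range(2 * T))
          for j in range(T)] for i in range(T)]
    rhs = [sum(W[k][i] * P[k] * means[k] for k in range(2 * T))
           for i in range(T)]
    return solve(A, rhs)
```

When the static means and delta means are mutually consistent, the recovered trajectory matches the static means exactly; when they conflict, the solution smooths the statics toward the trajectory implied by the deltas, which is precisely what removes the state-boundary jumps.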

IV. Synthesis Stage

For synthesizing the speech, the input text is parsed into a sequence of sound units in Festival. For transforming the text to sound units, Festival makes use of the pronunciation dictionary. The rules derived from linguistic knowledge are applied for deriving the pronunciation (sequence of phonemes) of new words which are not present in the pronunciation dictionary. After the parser derives the sequence of sound units, the linguistic analysis module generates the phonetic and contextual features associated with each unit, and the prosodic analysis module generates the target prosody for the input text utterance.

The unit selection synthesis technique selects the best sequence of speech units from a database of speech units and concatenates them to produce speech. These selected speech units should satisfy the following two constraints.

They should best match the target specification given by the linguistic components of the text analysis module and

They must be the best units that join together smoothly when concatenated.

The cost associated with the first constraint is called the target cost and the cost associated with the second constraint is called the concatenation or join cost.

The target specification is a sequence of speech units along with features related to the phonetic and prosodic context for each unit. The phonetic context features include the identity of a particular speech unit, position of the speech unit in the word and the phonetic features of the previous and following speech units. The prosodic context features include the pitch, duration and stress of the particular unit and the prosodic features of the preceding and following units. The speech database is developed in the same way.

Festival uses a clustering technique [4] to organize the units in the speech database according to their phonetic and prosodic context. For example, if there is a speech unit /કામ/, all instances of the speech unit /કામ/ with different phonetic and prosodic contexts belong to the same class. Each class is organized as a decision tree whose leaves are the various instances of the speech unit. The branches of the decision tree are questions based on the prosodic and phonetic features that describe the units. During synthesis time, for each target unit in the target list, its decision tree is identified from the speech database. Using the target specification for each unit and the decision tree, a set of candidate units that best match the target specification is obtained.

For clustering, Festival defines an acoustic measure of distance between two units of the same type. The acoustic vector for a frame includes mel-frequency cepstral coefficients, fundamental frequency (F0), energy and delta cepstral coefficients. This acoustic distance is used to define the impurity of a cluster of units as the mean distance between all its members. The CART method is used to build the decision tree such that the branches correspond to questions that minimize the impurity of the subclusters. With the whole process of clustering the speech database done offline, the only task performed at synthesis time is to use the decision tree to find a set of possible candidate units. To join the consecutive candidate units from the clusters selected by the decision trees, Festival uses a join cost based on a frame-level Euclidean distance, where the frame information includes F0, mel-frequency cepstral coefficients, energy and delta cepstral coefficients.

In the synthesis phase, the text processing module of Festival first generates a target specification for the input text. For each target, based on questions from the target specification, the CART for that unit type gives the appropriate cluster, which provides a set of candidate units. A target cost function Cᵗ(uᵢ) is defined as the distance of a candidate unit to its cluster centre, and a join cost function Cʲ(uᵢ₋₁, uᵢ) is defined between a candidate unit uᵢ and the previous candidate unit uᵢ₋₁. A Viterbi search is then used to find the optimal path through the candidate units that minimizes the following expression:

C(u₁, …, uₙ) = Σᵢ [ Cᵗ(uᵢ) + w · Cʲ(uᵢ₋₁, uᵢ) ]

where w is a weight that can be set to give more importance to the join cost over the target cost.
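The search over candidate units can be sketched as a standard Viterbi dynamic program. The cost functions here are illustrative stand-ins for Festival's cluster-centre distance and frame-based Euclidean join cost:

```python
def viterbi_select(candidates, target_cost, join_cost, w=1.0):
    """candidates: one list of candidate units per target position.
    Returns the unit sequence minimizing sum of target costs plus
    w-weighted join costs between consecutive units."""
    # best[u] = (cost of cheapest path ending in unit u, that path)
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for i in range(1, len(candidates)):
        new_best = {}
        for u in candidates[i]:
            # cheapest predecessor for u, accounting for the join cost
            prev, (cost, path) = min(
                best.items(),
                key=lambda kv: kv[1][0] + w * join_cost(kv[0], u))
            new_best[u] = (cost + w * join_cost(prev, u) + target_cost(i, u),
                           path + [u])
        best = new_best
    return min(best.values(), key=lambda cp: cp[0])[1]
```

With numeric stand-in units, a target cost of distance-to-target and a join cost of absolute difference, the search trades off matching each target against smooth joins, exactly as the expression above prescribes.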

V. Conclusion and Future work:

In this work, a prototype Gujarati TTS using the phoneme as the basic unit was developed using the Festival/Festvox framework. Text corpus was collected from various domains in UTF-8 format, and an optimal text selection algorithm was applied to it. The given text in UTF-8 format is converted to IT3 format using a parser. Clustering of units is done based on syllable-specific positional, contextual and phonological features, and classification of phonemes is done based on their position within the word. An unrestricted TTS in the Gujarati language was thus developed using the statistical parametric technique with the phoneme as the basic unit.

In addition, the letter-to-sound rules need to be worked out further, taking inputs from linguists. Using the Festival framework, a TTS for the Gujarati language can also be developed using the diphone or the syllable as the basic unit of concatenation.