Speech Recognition Using Context Independent Word Modeling


Abstract

Building a speech recognition system for an Indian language like Tamil is a challenging task. Spoken Tamil has unique inherent features such as long and short vowels, the absence of aspirated stops and aspirated consonants, and many instances of allophones. The pronunciation of words and sentences is strictly governed by a set of rules. Like other Indian languages, Tamil is syllabic in nature. Stress and accent vary in the spoken language from region to region; in read Tamil speech, however, stress and accent are ignored. The objective of this paper is to build a small-vocabulary, context-independent, word-based continuous speech recognizer for the Tamil language.

In this experiment, a word-based context-independent acoustic model, a dictionary and a trigram statistical language model have been built for a small vocabulary of 341 unique words as integral components of the linguist, which is the heart of the speech recognizer. The entire vocabulary was drawn from a single domain. The recognizer gives reasonable word accuracy for test sentences read by trained and new speakers, and is suited to limited-vocabulary, domain-specific tasks. The results are encouraging; the recognizer is simple, robust and accuracy oriented since it deals with the word as the basic acoustic unit.

Keywords: Context Dependent, Context Independent, Continuous Speech Recognition, Hidden Markov Models, Tamil language, Word Model.

Introduction

Automatic Speech Recognition (ASR) deals with the automatic conversion of an acoustic signal into a text transcription of the speech utterance. Even after years of extensive research and development, accuracy in ASR remains a challenge to researchers. There are a number of well-known factors which determine accuracy; the prominent ones are variation in context, variation across speakers and noise in the environment. Research in automatic speech recognition therefore has many open issues with respect to small or large vocabulary, isolated or continuous speech, speaker dependence or independence, and environmental robustness.

Fundamentally, the problem of speech recognition can be stated as follows. Given an acoustic observation X = X1X2…Xn, the goal is to find the corresponding word sequence W = w1w2…wm that has the maximum posterior probability P(W|X), expressed using Bayes' theorem as shown in equation (1).

Ŵ = argmax_W P(W|X) = argmax_W [P(W) P(X|W)] / P(X)        (1)

where P(W) is the prior probability of the word sequence W being uttered, and P(X|W) is the probability of the acoustic observation X when W is uttered, also known as the class-conditioned probability. P(X) is the average probability that the observation X will occur; it is also called the normalization factor. Since the maximization in equation (1) is carried out with X fixed, it is enough to maximize the numerator alone in order to find W:

Ŵ = argmax_W P(W) P(X|W)        (2)

The first term in equation (2), P(W), is computed with the help of a language model. It describes the probability associated with a hypothesized sequence of words, and incorporates both the syntactic and semantic constraints of the language and of the recognition task. Generally the language model may take the form of a formal parser, a syntax analyzer, an N-gram model or a hybrid model; refer to [4] for more details. In this experiment, a statistical trigram language model has been built using Carnegie Mellon University's (CMU) statistical language modeling toolkit.
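To make the trigram term concrete, the sketch below shows plain maximum-likelihood trigram estimation over a list of training sentences. It is an illustration only: the model used in this work was produced by the CMU toolkit, which additionally applies discounting and back-off that are omitted here, and the function names are ours.

```python
# Illustrative maximum-likelihood trigram estimation (not the CMU SLM toolkit).
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their bigram histories over whitespace-split sentences."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(2, len(words)):
            bi[(words[i - 2], words[i - 1])] += 1
            tri[(words[i - 2], words[i - 1], words[i])] += 1
    return tri, bi

def p_trigram(tri, bi, w1, w2, w3):
    # P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2); real toolkits smooth this.
    history = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / history if history else 0.0
```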

The second term in equation (2), P(X|W), is computed using an acoustic model, which estimates the probability of a sequence of acoustic observations conditioned on the word W. The recognizer needs the class-conditioned probability P(X|W) from the acoustic model in order to compute the posterior probability P(W|X). Hidden Markov Models (HMM) offer a viable solution to statistical automatic speech recognition compared to other classification techniques such as Artificial Neural Networks (ANN) or Support Vector Machines (SVM). Mari Ostendorf et al. reviewed the different classification models [9].

HMMs have become the common structure of acoustic models because they can normalize the time variation of the speech signal and characterize it statistically, thus helping to parameterize the class-conditioned probabilities. The acoustic model therefore forms the core knowledge base, representing the various parameters of speech in an optimal sense. At present, all state-of-the-art commercial and most laboratory speech recognition systems are based on HMMs, which give very low Word Error Rates (WER) when tested on standard speech databases. For a detailed study, refer to [12] and [3].
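For a single word model, P(X|W) can be evaluated with the standard forward algorithm. The sketch below is a generic, minimal version (probabilities are kept in the linear domain for brevity, whereas a practical decoder works in log space); the per-frame emission likelihoods would come from the per-state densities of the trained HMM.

```python
import numpy as np

def forward_likelihood(A, pi, B):
    """P(X|W) for one word HMM via the forward algorithm.
    A : (N, N) state transition matrix
    pi: (N,)   initial state probabilities
    B : (T, N) emission likelihoods b_j(x_t) for T frames and N states
    """
    T, _ = B.shape
    alpha = pi * B[0]                # initialise with the first frame
    for t in range(1, T):
        alpha = (alpha @ A) * B[t]   # propagate through transitions, then emit
    return alpha.sum()               # total likelihood of the observation
```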

The Choice of Sub-word Units


The speech recognition process requires segmentation of the speech waveform into fundamental acoustic units. The phone is the preferred fundamental unit; other candidate units are the word and the syllable. The relative merits and demerits of the different acoustic units are presented here.

Variation in context is an important issue in speech recognition. Phones are short in duration and show high variation, and they are realized differently depending on their context: some phones are aspirated when they occur at the beginning of a word and unaspirated when they occur at the end of a word. The acoustic variability of basic phonetic units due to context is therefore large and not well understood for many languages. Hence the entire word may be treated as the basic acoustic unit. Word units have a well-defined acoustic representation, and acoustic variability occurs mainly at the beginning and end of a word, i.e. at word boundaries. Another major advantage is that no pronunciation dictionary is needed. However, word-based speech models have disadvantages. The first lies in obtaining reliable whole-word models from a reasonably sized training set. Secondly, for a large vocabulary the phonetic content of individual words overlaps, leading to redundancy in storing and comparing whole-word patterns.

In tasks like speech-driven automatic phone dialing, the digits (0-9) along with a few other words form the vocabulary. The acoustic model of such systems can be trained using Context Dependent (CD) word models with a reasonably sized training set. Context dependency means finding the likelihood of a given word (or acoustic unit) with respect to its left and right units. However, when the vocabulary grows even moderately, e.g. to around 500 words, CD word modeling becomes infeasible because the number of possible left and right context words grows combinatorially, which also demands an impractically large training set. In such situations Context Independent (CI) word modeling simplifies the training process: CI models parameterize individual units while ignoring their contexts. The motivation for choosing the word as the acoustic unit in this paper is that small-vocabulary, domain-specific recognition systems can be easily realized using CI word modeling, as the rough estimate below illustrates.
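A back-of-the-envelope calculation (ours, not a figure from the experiment) makes the contrast concrete: with a vocabulary of V = 500 words, each word could in principle occur with any of the 500 words to its left and any of the 500 to its right, so a fully context-dependent word inventory would approach 500 × 500 × 500 = 1.25 × 10^8 distinct units to train, whereas context-independent modeling needs only the 500 word models themselves.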

When dealing with large-vocabulary recognition tasks, it is more practical to train acoustic models at the phonetic level. However, at the phonetic level the detection of word boundaries in continuous speech becomes very difficult. Therefore, large vocabulary continuous speech recognition (LVCSR) systems for English have used the CD phone, or triphone, as the fundamental acoustic unit. Triphone models are powerful sub-word models because they account for the left and right phonetic contexts. Since there are only about 50 phones in English, they can be sufficiently trained with a reasonable amount of training data. Moreover, phones are vocabulary independent, so one can train on one set of data and test the model on another [8]. Triphones have been enormously successful in the acoustic modeling of LVCSR systems.

The Tamil Language

Tamil is a Dravidian language spoken predominantly in the Indian state of Tamil Nadu and in Sri Lanka. It is the official language of Tamil Nadu and also has official status in Sri Lanka and Singapore. With more than 77 million speakers, Tamil is one of the most widely spoken languages of the world.

Tamil alphabet

Some phonological features of interest to speech recognition research are discussed in this section. Tamil vowels are classified into short and long vowels (five of each type) plus two diphthongs. Consonants are classified into three categories with six consonants in each: hard, soft (also called nasal) and medium; the classification is based on the place of articulation. In total there are 18 consonants. The vowels and consonants combine to form 216 compound characters, formed by placing dependent vowel markers on one or both sides of the consonant. There is one more special letter, aytham (ஃ), used in classical Tamil and rarely found in modern Tamil. Summing up (12 vowels + 18 consonants + 216 compound characters + 1 aytham), there are 247 letters in the standard Tamil alphabet. In addition to the standard characters, six characters taken from the Grantha script are used in modern Tamil to represent sounds not native to Tamil, that is, in words borrowed from Sanskrit and other languages. Although Tamil is characterized by its use of retroflex consonants, like the other Dravidian languages, it also has a unique liquid zh (ழ்). Extensive research has been reported on the articulation of liquid consonants in Tamil; see [11] for more details.

Pronunciation in Tamil


Tamil has its own unique letter-to-sound rules. There is a very restricted number of consonant clusters. Tamil has neither aspirated nor voiced stops; unlike most other Indian languages, it does not have aspirated consonants. In addition, the voicing of plosives is governed by strict rules: plosives are unvoiced if they occur word-initially or doubled. The Tamil script does not have distinct letters for voiced and unvoiced plosives, although both are present in the spoken language as allophones.

Generally, languages structure the utterance of words by giving greater prominence to some constituents than to others. This is true of English, where one or more phones stand out as more prominent than the rest; this is typically described as word stress. The same holds for higher-level prosody in a sentence, where one or more constituents may bear stress or accent. As far as Tamil is concerned, it is usually assumed that there is no stress or accent at the word level and that all syllables are pronounced with the same emphasis. Other opinions hold that the position of stress in a word is by no means fixed to any particular syllable, although in connected speech stress is found more often on the initial syllable. Detailed studies of pronunciation in Tamil can be found in [5] and [2]. In our experiment, syllable stress is ignored because we are dealing with read speech.

Building a CI Model for Tamil Words

Building continuous speech recognizers for the Tamil language is a challenging task, since Indian languages like Tamil differ from English in several aspects of orthography, phonology, pronunciation and word stress, as described in section (3.2). As a first step towards building an LVCSR system for Tamil, the authors have attempted in this paper to build a small-vocabulary continuous speech recognizer using a CI word model and HMMs.

The important modules in speech recognition are the acoustic model, the dictionary and the language model. A statistical trigram language model was built using the CMU Statistical Language Modeling toolkit and trained on a text corpus containing the 341 unique words. Since a word model is being built, the dictionary component is created by mapping every word in the lexicon to itself.
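Because each dictionary entry is simply the word followed by itself as its "pronunciation" (the format described under the full dictionary later in this paper), the dictionary can be generated mechanically from the word list. The following sketch is ours, not the authors' script, and the file names are placeholders.

```python
# A minimal sketch that writes a Sphinx-style CI word dictionary in which
# every word is "pronounced" as itself. File names are placeholders.
def write_word_dictionary(vocab_path, dict_path):
    with open(vocab_path, encoding="utf-8") as src:
        words = sorted({line.strip() for line in src if line.strip()})
    with open(dict_path, "w", encoding="utf-8") as dst:
        for word in words:
            dst.write(f"{word} {word}\n")   # word <space> pronunciation (= the word)

write_word_dictionary("tamil_words.txt", "tamil_word.dict")
```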

Since speech databases are not available for Tamil, a speech corpus was created in-house. The corpus contains 12 hours of continuous read speech from 6 male and 5 female speakers for training, and 7.5 hours of speech from 75 male and 75 female speakers for testing. The recordings were carried out in a noise-free laboratory environment, and sentence-level transcriptions were produced manually.

The HMM-based acoustic model trainer from Carnegie Mellon University, SphinxTrain, was employed. The input file format and the details of front-end processing are summarized in Table 1.

Table 1. Front-end Processing Details

Parameter            Value
Input file format    WAV (Microsoft)
Sampling rate        8,000 Hz
Bit depth            16 bits
Channels             Mono
Window length        0.025625 s
FFT size             512
Number of filters    31
Minimum frequency    200 Hz
Maximum frequency    3,500 Hz
Number of cepstra    13
Output               Mel Frequency Cepstral Coefficients (MFCC)
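The front end used in this work was the executable shipped with SphinxTrain. As an approximation only, the parameters of Table 1 map onto a freely available MFCC implementation roughly as follows; python_speech_features is our substitution, and the input file name is a placeholder.

```python
import scipy.io.wavfile as wav
from python_speech_features import mfcc

rate, signal = wav.read("S011F.wav")   # 8,000 Hz, 16-bit, mono WAV
features = mfcc(signal,
                samplerate=rate,
                winlen=0.025625,       # window length in seconds
                numcep=13,             # 13 cepstra
                nfilt=31,              # 31 mel filters
                nfft=512,              # 512-point FFT
                lowfreq=200,           # minimum frequency, Hz
                highfreq=3500)         # maximum frequency, Hz
# features: one 13-dimensional MFCC vector per frame (10 ms step by default)
```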

The files used to create and train the acoustic model with sample data are as follows.

A set of feature files computed from the audio training data, one for every utterance in the training corpus. Each utterance is transformed into a sequence of Mel Frequency Cepstral Coefficient (MFCC) feature vectors using a front-end executable provided with SphinxTrain. Sample entries are listed below:

S011F.mfc

S031M.mfc

A control file containing the list of filenames of feature-sets. Examples of the entries of this file are

S011F

S031M

A transcript file in which the transcripts corresponding to the feature files are listed in exactly the same order as the feature filenames in the control file. Sample entries are shown in figure 1.

Figure 1. Sample entries in transcript file

A main dictionary which has all acoustic events and words in the transcripts mapped onto the acoustic units we want to train. Here each word is mapped to the word itself since it is word based training. Examples of the entries in this file are shown in figure 2.

Figure 2. Sample entries in dictionary

A filler dictionary, which usually lists the non-speech events as "words" and maps them to user-defined phones. This dictionary must at least have the entries

<s> SIL

<sil> SIL

</s> SIL

The entries stand for

<s> : beginning-utterance silence

<sil> : within-utterance silence

</s> : end-utterance silence

A phone list, which lists all the acoustic units for which models are to be trained. Examples are shown in figure 3.

Figure 3. Sample entries in phonelist

An HMM with 3 emitting states and one non-emitting state, using continuous Gaussian densities, has been used. The HMM topology is shown in figure 4.

Figure 4. HMM and its topology

The details of the training parameters are summarized in table 2.

With the speech corpus and the above files as input, training was carried out in SphinxTrain as follows:

First, training of full models using 15 iterations per step. This step involves the generation of monophone seed models with nominal values.

Force-aligning the training data against the models from step (1) by Baum-Welch training of the monophones and re-estimation of the single-Gaussian monophones using the Viterbi alignment process.

Using the aligned transcripts from step (2) to train new models, with the convergence ratio set to 0.02; this resulted in around 5-7 iterations per step.

After training, SphinxTrain generates the parameter files of the HMMs, namely the probability distributions and the transition matrices.

Table 2. Training Parameters

Parameter               Value
Type of training        Context independent (continuous density)
Input features          Mel Frequency Cepstral Coefficients
Feature type            Cepstra, delta and double delta
Dimensions              13
No. of states per HMM   3 emitting and one non-emitting
No. of Gaussians        1
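For readers who want to reproduce the idea behind Table 2 without SphinxTrain, the sketch below trains a single CI word model with hmmlearn: three emitting states in a left-to-right topology, one diagonal Gaussian per state, and Baum-Welch re-estimation stopped at a small likelihood gain. This is our substitution and simplification (hmmlearn has no explicit non-emitting final state), not the procedure actually used in the experiments.

```python
import numpy as np
from hmmlearn import hmm

def train_word_model(feature_seqs):
    """feature_seqs: list of (frames x dims) MFCC(+delta) arrays, one per
    training token of the word, cut out by forced alignment."""
    X = np.vstack(feature_seqs)
    lengths = [len(seq) for seq in feature_seqs]
    model = hmm.GaussianHMM(n_components=3,          # 3 emitting states (Figure 4)
                            covariance_type="diag",  # one diagonal Gaussian per state
                            n_iter=15,               # at most 15 Baum-Welch iterations
                            tol=0.02,                # stop on small log-likelihood gain
                            init_params="mc",        # seed means/covariances from data
                            params="stmc")           # re-estimate all parameters
    # Left-to-right (Bakis) constraints; zero entries stay zero during re-estimation.
    model.startprob_ = np.array([1.0, 0.0, 0.0])
    model.transmat_ = np.array([[0.5, 0.5, 0.0],
                                [0.0, 0.5, 0.5],
                                [0.0, 0.0, 1.0]])
    model.fit(X, lengths)
    return model
```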

Implementation

The CI word-model-based Tamil speech recognizer is implemented on Sphinx-4, a state-of-the-art HMM-based speech recognition system that has been developed as open source since February 2002. Sphinx-4 is the successor of Sphinx-3 and Sphinx-2 and was designed jointly by Carnegie Mellon University, Sun Microsystems Laboratories and Mitsubishi Electric Research Laboratories, USA. It is implemented in the Java programming language, making it portable across a growing number of computational platforms [10].

The Sphinx-4 framework

Fig. 5. Sphinx-4 Architecture. [Source: Carnegie Mellon University]

The Sphinx-4 framework has been designed with a high degree of flexibility and modularity. Figure 5 shows the overall architecture of the system. Each labeled element in the figure represents a module that can be easily replaced, allowing researchers to experiment with different module implementations without needing to modify other portions of the system. There are three primary modules in the Sphinx-4 framework: the Front-End, the Decoder and the Linguist. The Linguist comprises one or more acoustic models, a dictionary and a language model. Depending upon the linguist, different modules can be plugged into the system; this is done through the Configuration Manager module.

Decoding continuous Tamil speech using Sphinx-4 decoder

The language model, dictionary and acoustic model developed in section (4) were deployed on the Sphinx-4 decoder, which was configured to operate in CI mode with the following components:

Linguist : Flat Linguist

Dictionary : Full Dictionary

Search Manager : Simple Breadth First Search Manager

Flat linguist

This is the simplest form of linguist. The flat linguist takes a grammar graph and generates a search graph for that grammar. The following assumptions are made:

Zero or one word per grammar node

No fan-in allowed

Only unit, HMM state and pronunciation states are allowed

Only valid transitions are allowed

No tree organization of units

Full dictionary

This component creates a dictionary by reading a Sphinx-3 format dictionary file. In our experiment, each line specifies a word, followed by a space or tab, followed by its pronunciation; in our case the pronunciation is the word itself, since we are dealing with CI word models. The full dictionary reads all words and their pronunciations at startup and is therefore suitable for small-vocabulary tasks.

Simple breadth first search manager

With the acoustic features and the linguist as input, this module performs a simple breadth-first search on the search graph rendered by the flat linguist.

Results

The hypothesis word sequences from the decoder are aligned with the reference sentences, and the result is reported in terms of WER and word accuracy. Word errors are categorized into insertions, substitutions and deletions. Other performance measures are decoding speed and memory footprint.
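WER itself is the Levenshtein distance between the reference and hypothesis word sequences divided by the reference length. The numbers below were produced by the decoder's own aligner; the sketch here is only a minimal stand-in showing how such a score is computed.

```python
def word_error_rate(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of edits turning r[:i] into h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                   # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                   # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)                 # errors per reference word
```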

The system was tested in batch mode in three trials. Firstly, a test set of 13 utterances of trained sentences spoken by trained voices was applied. The results are tabulated in Table 3.

Table 3. Results for trained voices on trained sentences

Details     Values
Words       110
Errors      6 (Sub: 0, Ins: 0, Del: 6)
Accuracy    94.55 %
Sentences   13
Time        Audio: 26.80 s, Processing: 46.53 s
Speed       1.74 × real time
Memory      Average: 22.51 MB, Max: 26.40 MB

Secondly, a test set comprising 50 test utterances from trained voices was applied. The results are tabulated in table 4.

Table 4. Results for the CI model with trained voices

Details     Values
Words       387
Errors      254 (Sub: 35, Ins: 2, Del: 217)
Accuracy    34.9 %
Sentences   50
Time        Audio: 86.36 s, Processing: 232.79 s
Speed       2.70 × real time
Memory      Average: 21.93 MB, Max: 27.69 MB

Finally, a test set comprising 50 test utterances from new voices was applied. The results are tabulated in table 5.

Table 5. Results for the CI model with new voices

Details     Values
Words       341
Errors      297 (Sub: 39, Ins: 1, Del: 257)
Accuracy    13.2 %
Sentences   50
Time        Audio: 80.54 s, Processing: 198.26 s
Speed       2.46 × real time
Memory      Average: 22.14 MB, Max: 27.65 MB

Discussion and Conclusion

The accuracy of the system is better for trained voices than for untrained voices, and the accuracy for utterances of trained sentences by trained voices is very high. In scenarios where the vocabulary is limited, repetition is frequent and the set of speakers is small, this recognizer is highly suitable.

The word errors are dominated by deletions, which is due to the small training set. The speed of the recognition process is also low, owing to the word-level comparisons in the search graph. Nevertheless, the system works reasonably well for small-vocabulary, domain-dependent tasks, and the recognition accuracy for words and sentences improves further when the sentences are kept short.

For medium and large vocabularies, a triphone-based approach is a must. CD phone (triphone) and syllable-based modeling for the Tamil language are in progress and are expected to give good results. There are inherent features in the pronunciation of Tamil which could be exploited in acoustic modeling; in particular, it is believed that larger sub-word units like the syllable could improve system performance. Many attempts have been made for English, but an initial increase in WER has been reported in [1], [7] and [6], since pronunciation variation in English is high and its syllabification is fuzzy. Even with this increase in WER, the syllable remains a primary focus of research in speech recognition. Tamil, on the contrary, has well-defined syllabification and sandhi rules which could help in syllable modeling and in turn increase recognition rates.