Arabic Speech Recognition Systems English Language Essay

This paper investigates the adaptation of Arabic speech recognition systems to foreign-accented speakers. The adaptation is carried out using Maximum Likelihood Linear Regression (MLLR), Maximum a Posteriori (MAP) estimation, and a combination of the two. The HTK toolkit and the LDC West Point Modern Standard Arabic (MSA) corpus are used in all experiments. The systems were evaluated at both the word and the phoneme level. Results show that particular Arabic phonemes that are hard for non-native speakers to pronounce, such as pharyngeal and emphatic consonants, benefit from adaptation with the MLLR and MAP combination. An overall improvement of 7.37% was obtained.


1. Introduction

Numerous studies have been carried out to improve the automatic recognition of speech uttered by non-native speakers. Fakotakis [4] adapted standard Greek speech recognition systems to the Cypriot dialect using the HTK toolkit [12] with MLLR, MAP, and combined MLLR and MAP techniques [7][8]. The best accuracy improvement, about 2%, was obtained on a digit-string database with the combined MLLR and MAP technique. Bartkova and Jouvet [3] proposed multiple models for improved speech recognition of non-native French speakers. They addressed the problem of foreign accent by using acoustic models of the target-language phonemes (French phonemes in their case) adapted with speech data from three other languages: English, German, and Spanish. Their results, obtained for 11 language groups of speakers, showed that the error rate can be significantly reduced when standard acoustic models of phonemes are adapted using speech data from other languages. The highest error rate reduction, 40%, was obtained for native English speakers. Recognition performance improved for almost all language groups, even though only three foreign languages were available for acoustic model adaptation. Hui et al. [5] proposed a speaker adaptation method that modifies the principal mixtures for improved continuous speech recognition. This method reduces the complexity of the Hidden Markov Models (HMMs) by choosing only the principal mixtures corresponding to a particular speaker's characteristics. They reported that the new method improved both recognition accuracy (by 31%) and recognition speed (by 30%) compared to full-mixture speaker adaptation models. Other recent techniques require only the learners' utterances in their native language to adapt ASR to non-native speech [10].
In this context, it is important to note that, compared to other languages, Arabic has been the subject of a very limited number of research initiatives.

In previous work [1], we analyzed the results of both native and non-native speech recognition at the phonetic level in order to determine the phonemes that play a significant part in recognition performance. In this paper, we extend that work by investigating how adaptation techniques can improve a trained recognition system so that it can be used by non-native Arabic speakers with minimal degradation in accuracy. The adaptation is accomplished through the use of MLLR, MAP, and a combination of MLLR and MAP. The original (baseline) recognition system was trained on native Arabic speakers. Before adaptation, the system was tested on non-native Arabic speakers, and its performance was recorded for the sake of comparison with that of the adapted systems.

The paper is organized as follows. Section 2 gives a basic background on the Arabic language. Section 3 briefly presents the adaptation methods. Section 4 describes the experimental framework, and Section 5 discusses the obtained results. Finally, Section 6 concludes and outlines future work.

2. Basic Arabic language background

Modern Standard Arabic (MSA) has 34 basic phonemes, of which six are vowels and 28 are consonants. The Arabic language differs in many ways from European languages such as English. Among these differences are the uniqueness of some Arabic phonemes, particular phonetic features, and complicated morphological structures. A major difference lies in Arabic text, which is written without any information indicating short vowels, gemination, or pharyngealization. This can lead to many identical-looking forms in a large variety of contexts, which decreases predictability in correct word pronunciation, sentence meaning, and language model rules. Hence, accurate acoustic model testing, which depends on Arabic text, is difficult when, for example, the identity and location of short vowels are unknown [2][6].

The Arabic language has three long and three short vowels. Permissible syllables in Arabic include CV, CVC, and CVCC, where C indicates a consonant and V a long or short vowel. Arabic words (and speech) can only start with a consonant. The Arabic language is characterized by the presence of emphatic and pharyngeal phonemes. There are five pharyngeal phonemes in total, two of which are the fricatives /H/(ح) and /C/(ع). The main characteristic of these two phonemes is a constriction between the tongue and the lower pharynx, accompanied by a raising of the larynx. There are also three uvular pharyngeal phonemes, /x/(خ), /G/(غ), and /q/(ق), characterized by a constriction between the tongue and the upper pharynx for /x/ and /G/, and by a complete closure at the same level for /q/. These five consonants are considered the Arabic pharyngeal phonemes [11]. In addition, there are four emphatic phonemes: /S/(ص), /D/(ض), /T/(ط), and /Z/(ظ). These phonemes are emphatic versions of the oral dental consonants /s/(س), /d/(د), /t/(ت) and /TH/ (ث).
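For phoneme-level analysis it can help to have these classes in machine-readable form. The following is a minimal sketch that only encodes the lists given above; the names `PHARYNGEALS`, `EMPHATICS`, `phoneme_class`, and `is_permissible_syllable` are illustrative and do not come from the paper or from HTK.

```python
# Phoneme classes as listed in the text; symbols follow the paper's transliteration.
PHARYNGEAL_FRICATIVES = {"H": "ح", "C": "ع"}
UVULAR_PHARYNGEALS = {"x": "خ", "G": "غ", "q": "ق"}
PHARYNGEALS = {**PHARYNGEAL_FRICATIVES, **UVULAR_PHARYNGEALS}
EMPHATICS = {"S": "ص", "D": "ض", "T": "ط", "Z": "ظ"}

# Syllable skeletons the text lists as permissible (C = consonant, V = vowel).
PERMISSIBLE_SYLLABLES = frozenset({"CV", "CVC", "CVCC"})


def phoneme_class(symbol: str) -> str:
    """Return 'pharyngeal', 'emphatic', or 'other' for a transliterated phoneme."""
    if symbol in PHARYNGEALS:
        return "pharyngeal"
    if symbol in EMPHATICS:
        return "emphatic"
    return "other"


def is_permissible_syllable(skeleton: str) -> bool:
    """Check a consonant/vowel skeleton such as 'CVC' against the listed patterns."""
    return skeleton in PERMISSIBLE_SYLLABLES
```

Note that the symbols are case-sensitive: `"T"` is the emphatic /T/(ط), while `"t"` falls into the `"other"` class.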


3. Adaptation methods

MLLR is a widely used parameter transformation technique that has proven successful when only a small amount of adaptation data is available [8]. It aims to reduce the mismatch between the initial reference models and the adaptation data through a set of transformations. In our experiments, we use MLLR to determine a set of linear transformations for the means of the HMM Gaussian mixtures only. The goal of these transformations is to linearly modify the mean components so that each HMM state of the initial system is more likely to produce the new adaptation data. The new estimate of the adapted mean obtained through the MLLR transformation matrix is stated as:

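$$\hat{\mu} = W\xi, \tag{1}$$

where $W$ is the $n \times (n+1)$ transformation matrix (where $n$ is the dimensionality of the data) and $\xi$ is the extended mean vector defined as follows:

$$\xi = [\,w,\ \mu_1,\ \mu_2,\ \ldots,\ \mu_n\,]^T, \tag{2}$$

where $w$ represents a bias offset whose value here is fixed at 1. Hence $W$ can be decomposed into:

$$W = [\,b_c \;\; A_c\,], \tag{3}$$

where $A_c$ represents an $n \times n$ regression matrix and $b_c$ is an additive bias vector associated with the broad class $c$. The adapted $k$th mean vector for each state $i$ can be written as follows:

$$\hat{\mu}_{ik} = A_c\,\mu_{ik} + b_c. \tag{4}$$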
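The mean transformation can be illustrated with a short sketch. It assumes the usual convention that the bias offset is the first element of the extended mean vector, so the bias vector is the first column of the transformation matrix. The function names and the numeric values are illustrative only, and estimating the transformation from adaptation data (which HTK does internally) is not shown.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Vector = List[float]


def extended_mean(mu: Sequence[float], w: float = 1.0) -> Vector:
    """Build the extended mean vector [w, mu_1, ..., mu_n] with bias offset w."""
    return [w, *mu]


def mllr_adapt_mean(W: Matrix, mu: Sequence[float]) -> Vector:
    """Apply an n x (n+1) MLLR transform W to an n-dimensional mean."""
    xi = extended_mean(mu)
    return [sum(w_rc * x for w_rc, x in zip(row, xi)) for row in W]


def split_transform(W: Matrix) -> Tuple[Matrix, Vector]:
    """Split W = [b A] into the regression matrix A and the bias vector b."""
    b = [row[0] for row in W]
    A = [row[1:] for row in W]
    return A, b


# Toy 2-dimensional example with made-up numbers.
W_example = [[0.5, 1.1, 0.0],
             [-0.2, 0.0, 0.9]]
mu_example = [1.0, 2.0]
adapted_example = mllr_adapt_mean(W_example, mu_example)
```

Applying `W_example` to the extended vector gives the same result as computing the regression matrix times the mean plus the bias, which is the decomposed form of the transformation.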


System adaptation can also be accomplished using the Maximum a Posteriori (MAP) technique [7]. For MAP adaptation, the re-estimation formula for a Gaussian mean is a weighted sum of the prior mean and the maximum likelihood mean estimate. It is formulated as:

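$$\hat{\mu}_{ik} = \frac{\tau_{ik}\,\mu_{ik} + \sum_t \gamma_{ik}(t)\,x_t}{\tau_{ik} + \sum_t \gamma_{ik}(t)}, \tag{5}$$

where $\mu_{ik}$ is the prior mean, $\tau_{ik}$ is the weighting parameter for the $k$th Gaussian component in state $i$, and $\gamma_{ik}(t)$ is the occupation likelihood of the observed adaptation data $x_t$.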
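As a rough illustration of the MAP mean update, the sketch below assumes the standard form in which the prior mean, weighted by the weighting parameter, is combined with the occupation-weighted sum of the adaptation frames and normalized by the weighting parameter plus the total occupation. The function name `map_adapt_mean` and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def map_adapt_mean(prior_mean: Sequence[float],
                   frames: Sequence[Sequence[float]],
                   gammas: Sequence[float],
                   tau: float) -> List[float]:
    """MAP re-estimate of one Gaussian mean.

    prior_mean: mean of the component before adaptation
    frames:     adaptation observations x_t
    gammas:     occupation likelihood of each frame for this component
    tau:        weighting parameter (larger values trust the prior more)
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    occupancy = sum(gammas)
    dim = len(prior_mean)
    weighted_sum = [sum(g * x[d] for g, x in zip(gammas, frames)) for d in range(dim)]
    return [(tau * prior_mean[d] + weighted_sum[d]) / (tau + occupancy)
            for d in range(dim)]


# Toy 1-dimensional example: prior mean 0.0, ten frames at 2.0, tau = 10.
toy_estimate = map_adapt_mean([0.0], [[2.0]] * 10, [1.0] * 10, tau=10.0)
```

With equal weight on the prior (`tau = 10`) and the data (total occupation 10), the toy estimate lies halfway between the prior mean and the data mean. With no adaptation frames the function returns the prior mean unchanged, which is why MAP needs more data than MLLR to move the models.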

One of the drawbacks of MAP adaptation is that it requires more adaptation data than MLLR to be effective. Combining MLLR with MAP lets us benefit from both techniques. In theory, the combination offers the compact transformations of MLLR for rapid adaptation when only a limited amount of data is available, and the asymptotic efficiency of MAP adaptation as the amount of data increases. There are many ways to combine MLLR and MAP. We choose to use the MLLR-transformed means as the priors for MAP adaptation. Hence, the adapted means can be written as:

$$\hat{\mu}_{ik} = \frac{\tau_{ik}\,(A_c\,\mu_{ik} + b_c) + \sum_t \gamma_{ik}(t)\,x_t}{\tau_{ik} + \sum_t \gamma_{ik}(t)}, \tag{6}$$

where $A_c$ and $b_c$ are the MLLR regression matrix and bias vector, $\tau_{ik}$ is the MAP weighting parameter, and $\gamma_{ik}(t)$ is the occupation likelihood of the adaptation frame $x_t$.

The principal difficulty in MAP adaptation is determining the mixing parameters. As is common practice, we chose a single mixing parameter for each model that we built, i.e. $\tau_{ik} = \tau$.
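The behaviour of the combination at the two extremes of data availability can be sketched as follows. This is an illustrative reformulation, not the HTK implementation: it writes the update as an interpolation between the MLLR-transformed prior mean and the maximum likelihood mean of the adaptation data, with the interpolation weight set by the total occupation and a single mixing parameter `tau`. All names and numbers are hypothetical.

```python
from typing import List, Sequence


def mllr_map_mean(prior_mean: Sequence[float],
                  A: Sequence[Sequence[float]],
                  b: Sequence[float],
                  data_mean: Sequence[float],
                  occupancy: float,
                  tau: float) -> List[float]:
    """MAP update that uses the MLLR-transformed mean as its prior.

    occupancy is the total occupation of the component on the adaptation data;
    data_mean is the occupation-weighted mean of that data.
    """
    # Step 1: MLLR transform of the original mean (regression matrix plus bias).
    mllr_mean = [sum(a * m for a, m in zip(row, prior_mean)) + b_i
                 for row, b_i in zip(A, b)]
    # Step 2: MAP interpolation between the transformed prior and the data mean.
    data_weight = occupancy / (occupancy + tau)
    return [data_weight * d + (1.0 - data_weight) * m
            for d, m in zip(data_mean, mllr_mean)]


A_toy = [[1.0, 0.0], [0.0, 1.0]]
b_toy = [0.5, -0.5]
prior_toy = [1.0, 1.0]
data_toy = [3.0, 0.0]

no_data = mllr_map_mean(prior_toy, A_toy, b_toy, data_toy, occupancy=0.0, tau=10.0)
lots_of_data = mllr_map_mean(prior_toy, A_toy, b_toy, data_toy, occupancy=1e6, tau=10.0)
```

With zero occupation the result is exactly the MLLR-adapted mean, and with very large occupation it approaches the data mean, which matches the rapid-then-asymptotic behaviour described in the text.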


4. Experimental framework

4.1. Data

The LDC corpus consists of 8,516 speech files, totaling 1.7 gigabytes or 11.42 hours of speech data. Each speech file represents one person uttering one prompt. The files were recorded using 16-bit PCM low-byte-first ("little-endian") coding at a sampling rate of 22.05 kHz, and were then converted to the NIST SPHERE format. Approximately 7,200 of the recordings are from native speakers and 1,200 are from non-native speakers [9]. From the West Point corpus we selected four different, disjoint lists, all chosen randomly from non-native Arabic speakers only. The first list, AD100, contains 100 utterances; the second, AD150, contains 150 utterances; the third, AD200, contains 200 utterances; and the last, AD250, contains 250 utterances. The four lists are drawn randomly across all available scripts, speakers, and genders. The lists were used to adapt a system trained on native Arabic speakers so that it can deal with non-native Arabic speakers. For this purpose, three adaptation techniques were used: MLLR, MAP, and a combination of MLLR and MAP. Performance is analyzed both at the word level (by incorporating a language model) and at the phoneme level. The phoneme level lets us investigate the improvement (if any) in system accuracy on individual phonemes, and thus analyze, phoneme by phoneme, the weaknesses in non-native Arabic speakers' pronunciation. Accordingly, we refer to our experiments as AD100/MLLR, AD100/MAP, AD100/MLLRMAP, and so on.
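One way to draw four disjoint random lists of the stated sizes is sketched below. The paper does not describe its selection procedure beyond "randomly", so the approach, the function name `draw_disjoint_lists`, the seed, and the placeholder utterance identifiers are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence

# List names and sizes as given in the text.
LIST_SIZES: Dict[str, int] = {"AD100": 100, "AD150": 150, "AD200": 200, "AD250": 250}


def draw_disjoint_lists(utterance_ids: Sequence[str],
                        sizes: Dict[str, int],
                        seed: int = 0) -> Dict[str, List[str]]:
    """Randomly split a pool of utterance ids into disjoint lists of given sizes."""
    pool = sorted(set(utterance_ids))
    total = sum(sizes.values())
    if total > len(pool):
        raise ValueError("not enough utterances for disjoint lists")
    rng = random.Random(seed)
    # Sampling without replacement once, then slicing, guarantees disjointness.
    drawn = rng.sample(pool, total)
    lists: Dict[str, List[str]] = {}
    start = 0
    for name, size in sizes.items():
        lists[name] = drawn[start:start + size]
        start += size
    return lists


# Placeholder ids standing in for the roughly 1,200 non-native recordings.
non_native_ids = [f"utt{i:04d}" for i in range(1200)]
adaptation_lists = draw_disjoint_lists(non_native_ids, LIST_SIZES)
```

The four lists together use 700 utterances, which fits within the roughly 1,200 non-native recordings mentioned above.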

4.2. Recognition platform and parameters

In our experiments, we used the HTK toolkit [12] to design and test the speech recognition systems. A phoneme-level recognizer is taken as the baseline system. It is based on continuous, left-to-right HMMs with three active states. The reference HMMs are generated for the MSA phones as given by the LDC catalog. Context-dependent triphone models were created from monophone models, since most of the LDC words consist of more than two phonemes. In the training step, the HMMs are re-estimated using the Baum-Welch algorithm. For better accuracy, the decision-tree method is used to align and tie some models. We have shown in [1] that context-dependent phoneme models (triphones) are useful for characterizing formant transition information. This helps the HMMs discriminate effectively between confusable speech units and thus achieve better accuracy. The system parameters were a 22 kHz sampling rate with 16-bit sample resolution and a 25 millisecond Hamming window with a step size of 10 milliseconds. Each window yields 13 static MFCCs and their first and second derivatives. Pre-emphasis filtering is applied with a coefficient of 0.95.
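The front-end figures above translate into concrete frame sizes, as the following arithmetic sketch shows. It assumes the exact sampling rate is the 22.05 kHz stated for the corpus, and it uses a first-order pre-emphasis filter that leaves the first sample unchanged (toolkits differ on that detail). The constant and function names are illustrative; the 100-nanosecond conversion reflects the time unit HTK uses for its window and frame-rate settings.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 22050      # assumed exact value of the "22 kHz" sampling rate
WINDOW_MS = 25.0            # Hamming window duration
STEP_MS = 10.0              # frame step
NUM_STATIC_MFCC = 13        # static coefficients per frame
PREEMPHASIS_COEFF = 0.95

# Samples covered by one analysis window and by one frame step.
samples_per_window = SAMPLE_RATE_HZ * WINDOW_MS / 1000.0
samples_per_step = SAMPLE_RATE_HZ * STEP_MS / 1000.0

# Static coefficients plus first and second derivatives.
feature_dim = NUM_STATIC_MFCC * 3


def ms_to_htk_units(milliseconds: float) -> float:
    """Convert milliseconds to units of 100 nanoseconds."""
    return milliseconds * 1e4


def preemphasize(signal: Sequence[float], coeff: float = PREEMPHASIS_COEFF) -> List[float]:
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1]."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]
```

Each frame is therefore a 39-dimensional vector computed over about 551 samples, with consecutive frames starting roughly 220 samples apart.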


5. Results and discussion

All native Arabic speakers' data provided by the LDC West Point corpus was used to train the original recognition system. All non-native Arabic speakers' data provided by the same corpus was then used to test the system. In that test, the accuracy (correctness) of the system was 89.02% at the word level and 93.19% at the phoneme level. This performance is relatively low compared to that of the same system tested on data from native Arabic speakers. Table 1 shows the system performance for the four adaptation lists and the three adaptation techniques, compared against these pre-adaptation baseline figures.
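The text uses "accuracy" and "correctness" interchangeably, but HTK's scoring tool reports them as two distinct measures: correctness ignores insertion errors, while accuracy penalizes them. The sketch below shows both formulas. The paper does not say which of the two its percentages correspond to, and the error counts in the example are made up for illustration, not taken from the experiments.

```python
def percent_correct(num_labels: int, deletions: int, substitutions: int) -> float:
    """Correctness: share of reference labels recognized, ignoring insertions."""
    hits = num_labels - deletions - substitutions
    return 100.0 * hits / num_labels


def percent_accuracy(num_labels: int, deletions: int,
                     substitutions: int, insertions: int) -> float:
    """Accuracy: like correctness, but insertion errors are also subtracted."""
    hits = num_labels - deletions - substitutions
    return 100.0 * (hits - insertions) / num_labels


# Hypothetical counts: 1000 reference words, 20 deleted, 50 substituted, 30 inserted.
example_correct = percent_correct(1000, 20, 50)
example_accuracy = percent_accuracy(1000, 20, 50, 30)
```

For the same recognition output, accuracy is never higher than correctness, so the distinction matters when comparing figures across papers.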

As can be inferred from the results, the improvement in system performance grows as the size of the adaptation list increases. Performance improves rapidly, reaching its best at 96.39%, which represents a 7.37% absolute improvement (at the word level) over the original system. This result is obtained with the adaptation list AD250 and the combined MLLR and MAP adaptation (i.e., experiment AD250/MLLRMAP). In AD150 and AD250, the combined MLLR and MAP technique performed better than the others. However, there is no fixed rule governing the comparison of MLLR, MAP, and their combination. In some experiments, MLLR gave a better improvement than MAP; in others, MAP gave the better accuracy improvement. The combined MLLR and MAP technique sometimes gave less improvement than either MLLR or MAP alone. We believe this is due to the random choice of sentences used for adaptation. In some cases, more of the relevant and specific Arabic phonemes are included in the adaptation data, while in other cases the adaptation set contains fewer of these phonemes.
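The 7.37% figure is a difference between two percentages, that is, an absolute gain in percentage points. A short calculation, using only the two word-level figures quoted above, also gives the corresponding relative reduction in word error, which is a derived quantity not reported in the paper.

```python
BASELINE_WORD_PERCENT = 89.02   # non-adapted system, word level
BEST_WORD_PERCENT = 96.39       # experiment AD250/MLLRMAP, word level

# Absolute gain in percentage points.
absolute_gain = BEST_WORD_PERCENT - BASELINE_WORD_PERCENT

# Error rates before and after adaptation.
baseline_error = 100.0 - BASELINE_WORD_PERCENT
adapted_error = 100.0 - BEST_WORD_PERCENT

# Fraction of the original errors removed by adaptation.
relative_error_reduction = (baseline_error - adapted_error) / baseline_error

print(f"absolute gain: {absolute_gain:.2f} points")
print(f"relative error reduction: {100 * relative_error_reduction:.1f}%")
```

In other words, the word error drops from 10.98% to 3.61%, so roughly two thirds of the baseline errors are removed.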

Looking at the system performance for individual phonemes, we notice that the phonemes /H/(ح), /TH/(ث), /g/(ج), /q/(ق) and /z/(ز) gained the most improvement in all experiments. Table 2 shows the performance increases for these phonemes in all conducted experiments. Except for /z/(ز), these are Arabic phonemes that cannot be found in English.

It is worth noting that the adaptation process enhances the performance on phonemes that are hard for non-native Arabic speakers to pronounce, especially /H/(ح), which is an unvoiced, non-emphatic pharyngeal fricative. In the non-adapted system, these sounds and other particular Arabic phonemes produced errors because of the phonological and acoustic changes induced by the pronunciation of non-native speakers. The obtained results therefore confirm that the adaptation process can significantly improve the automatic recognition of foreign-accented Arabic speech.


6. Conclusion

An automatic Arabic speech recognition system was trained on native Arabic speakers' speech data provided by the LDC West Point corpus for MSA. The system was then adapted to non-native Arabic speakers' speech data from the same corpus. The adaptation techniques were MLLR, MAP, and a combination of the two. The accuracies of the non-adapted system were 89.02% and 93.19% at the word and phoneme levels, respectively. The best system accuracy improvement was 7.37%, obtained in experiment AD250/MLLRMAP. The specific Arabic phonemes /H/(ح), /TH/(ث), and /q/(ق), known to be hard for non-native Arabic speakers to pronounce, showed better accuracy improvements in all experiments. This work will be continued by investigating an evolutionary-based technique that gives the Arabic speech recognition system an auto-adaptation capability in the context of more foreign accents.