Adaptive Feature Extraction Method for Malay Vowels


Speech recognition is the process of transforming a speech signal into a sequence of words. Speech is the most natural form of human communication, and speech recognition technology has made it possible for computers to follow human voice commands and understand human languages. The main goal of the field is to develop techniques and systems for speech input to machines. Within a recognizer, the main goal of the feature extraction step is to compute a parsimonious sequence of feature vectors that provides a compact representation of the input signal [19].

Many techniques for speech recognition have been studied. In this paper, Linear Predictive Coding (LPC) will be used as the feature extraction method. LPC is a tool used mostly in audio and speech signal processing to represent the spectral envelope of a digital speech signal in compressed form, using the information of a linear predictive model. LPC analyzes the speech signal and can be used to estimate the formants. Inverse filtering is the process of removing the formants from the signal, and the signal that remains after the modeled (filtered) signal is subtracted is called the residue [21].
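The inverse-filtering step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `lpc_residual` is hypothetical, and the predictor convention (each sample is approximated by a weighted sum of the previous samples) is assumed rather than taken from [21].

```python
def lpc_residual(signal, coeffs):
    """Return the residue e(n) = s(n) - sum_i a_i * s(n - i).

    `coeffs` holds the predictor coefficients a_1..a_p (assumed convention).
    Samples before the start of the signal are treated as zero.
    """
    p = len(coeffs)
    residual = []
    for n, s_n in enumerate(signal):
        # Use only the past samples that actually exist (at most p of them).
        prediction = sum(coeffs[i] * signal[n - 1 - i] for i in range(min(p, n)))
        residual.append(s_n - prediction)
    return residual


# Toy signal that follows s(n) = 0.9 * s(n - 1) exactly after the first sample.
toy_signal = [1.0]
for _ in range(49):
    toy_signal.append(0.9 * toy_signal[-1])

toy_residual = lpc_residual(toy_signal, [0.9])
```

For this toy signal the predictor explains everything except the first sample, so the residue is essentially a single impulse. Real speech leaves a richer residue that carries the excitation.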

In speech, formants and pitch are the characteristics that identify vowels to listeners. Vowel identity is strongly correlated with the values of the first two or three formant frequencies, which are formed by resonances in the vocal tract that produce peaks of energy in the vowel spectrum. The first two formants are the most important in determining vowel quality, and vowels are commonly displayed as a plot of F1 against F2, as shown in Figure 1 below. Pitch, in contrast, represents the perceived fundamental frequency of a sound. Pitch allows the construction of melodies; pitches are compared as "higher" and "lower" and are quantified as frequencies.

Figure 1: An example plot of formant one (f1) and formant two (f2)
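Since pitch is tied to the fundamental frequency, one common way to estimate it from a voiced frame is to look for the strongest peak of the autocorrelation. The sketch below is a minimal illustration, not the method of this study: the function name `estimate_pitch` and the 60 to 400 Hz search range are assumptions chosen to cover typical adult voices.

```python
import math


def estimate_pitch(frame, fs, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame.

    Searches the autocorrelation for its largest value over the lags
    that correspond to the assumed pitch range [f_min, f_max].
    """
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(frame) - 1)
    best_lag, best_value = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag if best_lag else None


# Synthetic test tone: a 200 Hz sinusoid sampled at 8 kHz (100 ms frame).
fs = 8000
tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]
f0 = estimate_pitch(tone, fs)
```

On real speech the frame would normally be windowed and checked for voicing first; those steps are left out here to keep the idea visible.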

This research proposes an LPC-based adaptive feature extraction method for speech recognition, specifically for Malay vowels. Vowel recognition for male and female subjects across their pitch ranges will be evaluated with this method, based on formant frequencies.


Over the last ten years, major advances have been made in the speech recognition area. Although one may think that isolated vowel recognition is a simple problem in this context, that assumption is far from true [4]. Much research has been done on the topic of vowel recognition. Qin Yan and Vaseghi [15] studied the formant features of frequency, bandwidth and intensity to classify accent conversions between British, American and Australian speakers. Carlson [17] analyzed formant amplitude for vowel classification, while Vuckovic and Stankovic [14] researched automatic vowel classification based on two-dimensional formant Euclidean distance. Liu and Ng [9] obtained the first three formant values, F1, F2 and F3, using Praat's linear predictive coding algorithm to study the formant characteristics of vowels produced by Mandarin esophageal speakers.

According to Hillenbrand and Houde [5], the majority of vowel identification models have assumed that the recognition process is driven either by the formant frequency pattern of the vowel (with or without a normalizing factor of fundamental frequency) or by the gross shape of the smoothed spectral envelope. Excellent reviews of this literature can be found in [3, 7, 8, 16]. The main idea underlying formant representations is that the recognition of vowel identity is controlled not by the detailed shape of the spectrum but by the distribution of formant frequencies, chiefly the three lowest formants (F1, F2 and F3).

Vowels are produced by passing air through the mouth without a major obstruction in the vocal tract [13]. Vowels are voiced sounds, and they are described in terms of formants. More generally, the vocal folds vibrate to generate a glottal wave, which can be illustrated as a series of spectra, and the vocal tract then acts as a resonator that modifies the shape of those spectra. The peaks of the resulting acoustic spectra are referred to as formants, which result from the resonant frequencies of the acoustic system. A formant is thus a concentration of acoustic energy around a particular frequency in the speech wave.

The vocal tract shape determines the sound energy transfer function from the glottis to the lips and can be described in terms of resonances (formants) and anti-resonances (zeros). Each formant can be described by its resonance frequency (formant frequency) and its resonance bandwidth (formant bandwidth). Nasal sounds, like the "m" in "meet", produce anti-resonances where energy is trapped in the vocal tract. The lowest formant frequency is called the first formant (F1), the second lowest the second formant (F2), and so on.

Formant frequencies are properties of the vocal tract system and need to be inferred from the speech signal. In practice, only the lowest three or four formants are of interest [20]. For example, [11] found that the vocal tract resonance frequencies for the vowel /a/ as in "Bob", averaged over all male talkers of American English in that study, were 730 Hz, 1090 Hz and 2440 Hz. Formant bandwidths of female speakers are wider, and formant locations higher, than those of male speakers [12].
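One way to infer formants from an all-pole model is to evaluate its spectral envelope and pick the peaks. The sketch below is an illustration under assumptions: the function names are hypothetical, the coefficients follow the convention that each sample is predicted as a weighted sum of past samples, and the test model is synthetic, built with resonances placed at the /a/ values quoted above and an arbitrarily chosen pole radius of 0.97.

```python
import cmath
import math


def lpc_envelope(a, fs, n_points=800):
    """Magnitude 1 / |1 - sum_i a_i e^{-j w i}| on a grid from 0 Hz up to fs/2."""
    freqs, mags = [], []
    for k in range(n_points):
        f = k * (fs / 2) / n_points
        w = 2 * math.pi * f / fs
        denom = 1 - sum(a_i * cmath.exp(-1j * w * (i + 1)) for i, a_i in enumerate(a))
        freqs.append(f)
        mags.append(1.0 / abs(denom))
    return freqs, mags


def formant_peaks(a, fs, n_points=800):
    """Frequencies (Hz) of the interior local maxima of the envelope."""
    freqs, mags = lpc_envelope(a, fs, n_points)
    return [freqs[k] for k in range(1, n_points - 1)
            if mags[k] > mags[k - 1] and mags[k] > mags[k + 1]]


def coeffs_from_resonances(formants_hz, fs, radius=0.97):
    """Build synthetic predictor coefficients with one pole pair per resonance."""
    poly = [1.0]
    for f in formants_hz:
        theta = 2 * math.pi * f / fs
        section = [1.0, -2 * radius * math.cos(theta), radius ** 2]
        product = [0.0] * (len(poly) + 2)
        for i, p in enumerate(poly):
            for j, s in enumerate(section):
                product[i + j] += p * s
        poly = product
    # Denominator is 1 - sum_i a_i z^{-i}, so a_i is minus the polynomial term.
    return [-c for c in poly[1:]]


a_synthetic = coeffs_from_resonances([730, 1090, 2440], fs=8000)
peaks = formant_peaks(a_synthetic, fs=8000)
```

The recovered peaks land close to the three resonances used to build the model. With coefficients estimated from real speech, the peak list may also contain spurious or merged peaks, which is one reason formant tracking is harder in practice than this toy case suggests.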


Speech recognition is an important task in a variety of applications, such as education, telephony and medicine. Many feature extraction techniques have been developed for vowel recognition, but few focus on adaptive feature extraction techniques that are more accurate and more robust. Formant bandwidths of female speakers are wider, and formant locations higher, than those of male speakers [2]. These differences in the first three formants between male and female speakers, together with pitch differences, may affect recognition accuracy when a common feature extraction method is used for both. This study will therefore attempt to investigate and enhance a formant-peak-based feature extraction method with gender-adaptive capability.


There are two main objectives in this research project:

1. To develop a gender-adaptive feature extraction algorithm based on LPC for Malay vowels.

2. To classify Malay vowels using an Artificial Neural Network (ANN).


5.1 LPC Technique

Linear Predictive Coding or LPC is a method of predicting a sample of a speech signal based on several previous samples. LPC coefficients can be used to separate a speech signal into two parts which are the transfer function and the excitation. The transfer function of a speech signal is the part dealing with the voice quality which distinguishes one person's voice from another. The excitation component of a speech signal is the part dealing with the particular sounds and words that are produced. In the time domain, the excitation and transfer function are convolved to create the output voice signal [22].

Human speech is produced in the vocal tract which can be approximated as a variable diameter tube. The LPC model is based on a mathematical approximation of the vocal tract represented by this tube of a varying diameter [23]. LPC is one of the most powerful speech analysis techniques, and one of the most useful methods for encoding good quality speech at a low bit rate. It provides extremely accurate estimates of speech parameters, and is relatively efficient for computation [21].

The all-pole LPC vocal-tract model can be interpreted as a modified piecewise-cylindrical acoustic-tube model [1, 10], and this interpretation is most explicit when the vocal-tract filters are realized as ladder filters [10]. The LPC method considers a speech sample at time n, s(n), and approximates it by a linear combination of the past p speech samples, as shown in equation (1).

$$s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \cdots + a_p s(n-p) \qquad (1)$$

where the coefficients a_1, a_2, ..., a_p are assumed constant over the speech analysis frame. Equation (1) can be transformed into equation (2) by including an excitation term G u(n):

$$s(n) = \sum_{i=1}^{p} a_i s(n-i) + G u(n) \qquad (2)$$

where u(n) is a normalized excitation and G is the gain of the excitation. Expressing equation (2) in the z-domain gives equation (3).

$$S(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) + G U(z) \qquad (3)$$

Consequently, the transfer function H(z) becomes equation (4).

$$H(z) = \frac{S(z)}{G U(z)} = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}} \qquad (4)$$

The main parameters that can be obtained with the LPC model are the voiced/unvoiced classification, the pitch period, the gain G and the coefficients a_i. H(z) is an autoregressive (AR) transfer function, so it has only poles and no zeros.
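The coefficients a_i and the gain G of such a model are often estimated with the autocorrelation method and the Levinson-Durbin recursion. The source does not say which estimator is used, so the following is a sketch under that assumption, with hypothetical function names and a synthetic test frame.

```python
import math


def autocorrelation(frame, max_lag):
    """Short-time autocorrelation r(0..max_lag) of one analysis frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + k] for t in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns ([a_1..a_p], error energy)."""
    a = [0.0] * (order + 1)  # a[0] is unused, a[1..p] are the predictor terms
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
    return a[1:], err


def lpc(frame, order):
    """Predictor coefficients and gain G for one frame."""
    coeffs, err = levinson_durbin(autocorrelation(frame, order), order)
    return coeffs, math.sqrt(err)


# Synthetic frame: response of s(n) = 1.3 s(n-1) - 0.6 s(n-2) + u(n)
# to a unit impulse u(n), long enough for the response to die out.
frame = []
for n in range(200):
    s1 = frame[n - 1] if n >= 1 else 0.0
    s2 = frame[n - 2] if n >= 2 else 0.0
    frame.append(1.3 * s1 - 0.6 * s2 + (1.0 if n == 0 else 0.0))

est_coeffs, est_gain = lpc(frame, order=2)
```

Because the synthetic frame was generated by an exact second-order all-pole system, the estimate recovers coefficients close to 1.3 and -0.6 with a gain near 1. Real speech frames would usually be pre-processed and windowed first, and the model order would be larger.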

5.2 ANN as Vowel Classifier

An artificial neural network (ANN), commonly called "neural network" (NN), is a mathematical or computational model that is inspired by the structure or functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Modern neural networks are non-linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs or to find patterns in data [18].
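To make the idea of an adaptive network concrete, here is a minimal sketch of a one-hidden-layer network trained by gradient descent. Everything specific in it is an assumption for illustration: the layer sizes, the learning rate, the use of only three classes instead of six vowels, and above all the six training points, which are made-up placeholder values (two features in kHz), not measurements from any corpus.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def forward(x, W1, b1, W2, b2):
    """Return hidden activations (tanh) and class probabilities (softmax)."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = softmax([sum(w * hi for w, hi in zip(row, h)) + b
                 for row, b in zip(W2, b2)])
    return h, y


def mean_loss(data, params):
    """Average cross-entropy of the network over (features, label) pairs."""
    return -sum(math.log(forward(x, *params)[1][t]) for x, t in data) / len(data)


def train(data, n_hidden=4, n_classes=3, lr=0.1, epochs=300, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_classes)]
    b2 = [0.0] * n_classes
    for _ in range(epochs):
        for x, t in data:  # one gradient step per training sample
            h, y = forward(x, W1, b1, W2, b2)
            dz2 = [y[k] - (1.0 if k == t else 0.0) for k in range(n_classes)]
            dh = [sum(dz2[k] * W2[k][j] for k in range(n_classes))
                  for j in range(n_hidden)]
            dz1 = [dh[j] * (1.0 - h[j] ** 2) for j in range(n_hidden)]
            for k in range(n_classes):
                for j in range(n_hidden):
                    W2[k][j] -= lr * dz2[k] * h[j]
                b2[k] -= lr * dz2[k]
            for j in range(n_hidden):
                for i in range(n_in):
                    W1[j][i] -= lr * dz1[j] * x[i]
                b1[j] -= lr * dz1[j]
    return W1, b1, W2, b2


# Placeholder (made-up) two-feature points for three classes.
toy_data = [([0.70, 1.10], 0), ([0.75, 1.20], 0),
            ([0.30, 2.30], 1), ([0.35, 2.20], 1),
            ([0.30, 0.90], 2), ([0.35, 0.80], 2)]

untrained = train(toy_data, epochs=0)
trained = train(toy_data, epochs=300)
loss_before = mean_loss(toy_data, untrained)
loss_after = mean_loss(toy_data, trained)
```

Comparing `loss_before` and `loss_after` shows the adaptive behaviour described above: the weights change during the learning phase so that the outputs fit the training labels better.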

5.3 Data Collection

The data used in this project come from the Malay vowel corpus collected by Dr. Shahrul Azmi. The data were collected from a total of 100 individuals, consisting of students and staff from Universiti Malaysia Perlis (UniMAP) and Universiti Utara Malaysia (UUM). The recordings were made using a conventional microphone and a laptop computer. The words "ka, ke, ki, ko, ku, kə" are used to represent the six vowels /a/, /e/, /i/, /o/, /u/ and /ə/ because vowels have significantly more energy than consonants. Based on [5, 6, 12, 14], the first three formants of vowels, which are their main characteristics, lie within 4 kHz. A sampling frequency of 8 kHz (8000 Hz) is therefore used, since it captures frequencies up to 4 kHz. Each speaker was recorded 2 to 4 times, depending on convenience.

5.4 Proposed Methodology


[Figure 2: Flow Chart of the System. Steps shown: Data Collection; Algorithm Development; Development of ANN Classifier; decision "Classifier Accuracy Satisfactory?"; Vowel Recognition.]

5.5 Expected Result

This study will attempt to improve the formant-peak-based feature extraction method by adding gender-adaptive capability. The resulting adaptive technique is expected to make vowel recognition more accurate and more robust.


This Malay vowel recognition method can benefit the following areas:

The education sector, by teaching overseas students to pronounce Malay correctly.

Speech-language therapy, particularly for the treatment of children.

The telephone and communication sector, for example telephone directory enquiries without operator assistance.