Implementation Of Lpc Based Vocal Tract Biology Essay

H. Kinugasa et al. have reported on spoken word recognition using vocal tract shapes. Their method uses LPC to estimate the vocal tract shape in 16 sections.

The calculated co-efficients are normalized to lie between 0 and 1. Furthermore, the speech in each frame is passed through a high pass filter with a cut-off frequency of 2.5 kHz, and the average power of the filtered speech is divided by the average power of the original speech waveform. Similarly, a low pass filter with a cut-off frequency of 500 Hz is applied. The extracted feature parameters are used for the recognition experiment: eighteen co-efficients per frame are used as input to dynamic programming (DP) matching. This approach demands tedious calculation of 18 parameters to recognize a word. Hence we propose a new technique that uses fewer parameters as features to recognize the vowels spoken by male and female subjects.

4.1 IMPLEMENTATION OF LPC BASED VOCAL TRACT SHAPE ESTIMATION FOR VOWELS

The auto-regression method for speech analysis based on linear prediction, solved with Durbin's recursive algorithm, has been used. This method is known as LP modeling and is also referred to as AR modeling. The model depends only on the previous outputs of the system. The simplest model of the vocal tract consists of many co-axially linked cylindrical tubes, producing an all-pole transfer function. The vocal tract shape is estimated from the reflection co-efficients obtained from LPC analysis of speech signals, using Wakita's speech analysis model and Durbin's algorithm for optimum inverse filtering. The vocal tract of an adult male is normally about 17 cm long, from the glottis to the lips. Vocal tract area values are obtained for natural vowels from male and female speakers who participated voluntarily. The participants spoke standard Indian English vowels without distinct accents or special speech habits. They are aged between 20 and 21 years and do not suffer from any speech or hearing disorder.

Each speaker is asked to record the required speech as naturally as possible. The speech of each speaker is recorded individually in a speech laboratory, with a portable digital recorder and a small collar microphone. The distance between the microphone and the mouth of each speaker is kept at approximately 10 cm [32]. Samples are acquired at a sampling frequency of 22,100 Hz, in 30 ms blocks, with the LPC order fixed at 25.

From the speech production model [52], it is known that speech undergoes a spectral tilt of -6 dB/octave. To counteract this drop, a pre-emphasis filter is used to boost the higher frequencies and flatten the spectrum. The pre-emphasis follows a 6 dB per octave rate. The pre-emphasized speech signal, free of the effects of glottal pulse flattening and lip radiation, is windowed using Hamming windows, as shown in Fig. 4.1.

Fig. 4.1 Block diagram for vocal tract shape calculation.

The procedure implementing the LPC method is summarized as follows.

1). Let s(n) denote the sampled speech data obtained by sampling the analog speech waveform s(t) at a sampling rate of Fs. The signals are sampled at 22,100 Hz and divided into frames of 30 ms, which gives 663 samples per frame, as shown in Fig. 4.2. We have taken 19 frames for analysis.

Fig. 4.2 A 663-sample speech segment from the 22,100 Hz signal.

4.1.1 Pre-emphasis

The area function obtained using reflection co-efficients is taken as an estimate of the area function of the human vocal tract [52]. If pre-emphasis is applied prior to linear predictive analysis to remove the effects of glottal pulse flattening and lip radiation, the resulting area functions are often very similar to the vocal tract configurations used in human speech. Pre-emphasis is carried out at a rate of 6 dB per octave, that is, the amplitude increases by 6 dB for each doubling of frequency. Fig. 4.3 shows the pre-emphasized signal. It is observed that the spectrum of the speech signal is flattened, which leads to better results when the co-efficients are calculated using LPC.

Fig. 4.3 The Pre-emphasized Signal.
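The pre-emphasis step can be sketched as a first-order high pass filter, y(n) = x(n) - a x(n-1), whose gain rises at roughly 6 dB per octave over most of the band. The following is only a minimal illustration: the co-efficient a = 0.95 and the function name pre_emphasis are assumptions, since the text does not state which filter was actually used.

```python
def pre_emphasis(signal, coefficient=0.95):
    """Apply y[n] = x[n] - coefficient * x[n-1] to a list of samples.

    The value 0.95 is an assumed, commonly used setting; it is not
    taken from the text. The first sample is passed through unchanged.
    """
    if not signal:
        return []
    output = [float(signal[0])]
    for n in range(1, len(signal)):
        output.append(signal[n] - coefficient * signal[n - 1])
    return output


if __name__ == "__main__":
    # A constant (low frequency) input is strongly attenuated,
    # while an alternating (high frequency) input is boosted.
    print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
    print(pre_emphasis([1.0, -1.0, 1.0, -1.0]))
```

A constant input is reduced to a small residue while a rapidly alternating input is amplified, which is the spectral flattening effect described above.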

4.1.2 Window Analysis

The window function w(k-n) is a real window sequence used to isolate the portion of the input sequence that is analyzed at a particular time index k. An ideal window function has a frequency response with a narrow main lobe, which increases resolution, and no side lobes, since side lobes cause frequency leakage [50].

Here, LP analysis is performed on frames weighted with the Hamming window. This window, w(n), is chosen because it provides a good balance between main lobe width and side lobe attenuation.

w(n) = 0.54 - 0.46 cos(2πn / (N-1)), 0 ≤ n ≤ N-1 (4.1)

The Hamming window is adequate for accurately approximating the transfer function of the vocal tract. To calculate the reflection co-efficients for quantization purposes, Hamming windows of length 30 ms are used, with an overlap of 10 ms, to obtain smooth estimates. Fig. 4.4 shows the windowed speech signal.

Fig. 4.4 The windowed signal.
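As an illustration of the windowing step, the sketch below builds a Hamming window in its standard form, 0.54 - 0.46 cos(2πn/(N-1)), and applies it to one frame. The function names are hypothetical, and the 663-sample length in the usage lines simply follows the 30 ms frame at 22,100 Hz mentioned in the text.

```python
import math


def hamming_window(length):
    """Return the standard Hamming window of the given length."""
    if length < 2:
        return [1.0] * length
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def apply_window(frame):
    """Multiply a frame sample by sample with a Hamming window."""
    window = hamming_window(len(frame))
    return [sample * weight for sample, weight in zip(frame, window)]


if __name__ == "__main__":
    window = hamming_window(663)  # 30 ms at 22,100 Hz
    print(window[0], window[331], window[-1])
```

The window tapers each frame towards 0.08 at both edges and leaves the centre sample unscaled, which reduces the discontinuities at the frame boundaries.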

4.1.3 Auto-Correlation Analysis

Each frame of the windowed signal is then autocorrelated [1]. In the autocorrelation method, the analysis segment sn(m) is identically zero outside the interval 0 ≤ m ≤ N-1, and is expressed as shown in equation 4.2.

sn(m) = s(m+n) w(m) (4.2)

where n is the index of the first sample of the frame and w(m) is a finite length window (Hamming window) that is identically zero outside the interval 0 ≤ m ≤ N-1. The segment sn(m) is therefore nonzero only for 0 ≤ m ≤ N-1. The corresponding prediction error, en(m), for a pth order predictor is nonzero over the interval 0 ≤ m ≤ N-1+p.
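Because the windowed segment is zero outside the frame, its short-time autocorrelation reduces to a finite sum over the overlapping samples. A minimal sketch of that computation is given below; the function name and the tiny test frame are illustrative only.

```python
def autocorrelation(frame, max_lag):
    """Short-time autocorrelation R(0..max_lag) of one windowed frame.

    R(lag) = sum over m of frame[m] * frame[m + lag], where samples
    outside the frame are treated as zero.
    """
    length = len(frame)
    return [
        sum(frame[m] * frame[m + lag] for m in range(length - lag))
        for lag in range(max_lag + 1)
    ]


if __name__ == "__main__":
    print(autocorrelation([1.0, 2.0, 3.0], 2))
```

For an LPC order p, the values R(0) to R(p) of each frame are the only inputs required by the Levinson-Durbin recursion.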

4.1.4 Partial Auto-Correlation Co-efficients (PARCOR) Extraction

The PARCOR co-efficients, which are intermediate values in the Levinson-Durbin recursion [52], are extracted as the processing parameters. We have observed that quantizing these intermediate values is less problematic than quantizing the predictor co-efficients directly, because small changes in the predictor co-efficients lead to relatively large changes in the pole positions.

To ensure stability of the all-pole filter, its poles must lie within the unit circle in the z-plane. When the predictor co-efficients are quantized directly, a high accuracy of 8-10 bits per co-efficient is therefore needed. For the PARCOR co-efficients, the necessary and sufficient condition for stability is that each co-efficient lies strictly between -1 and +1. Fig. 4.5 shows the PARCOR co-efficients for a speech signal.

Fig. 4.5 The reflection (PARCOR) co-efficients for a frame.
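The way the PARCOR co-efficients appear as intermediate values can be seen in a compact sketch of the Levinson-Durbin recursion. This is a generic textbook form, not the authors' code. It assumes the prediction polynomial A(z) = 1 + a1 z^-1 + ... + ap z^-p; sign conventions for the PARCOR co-efficients differ between references, so the signs returned here may be the negative of those in another formulation.

```python
def levinson_durbin(autocorr, order):
    """Solve the LPC normal equations from autocorrelation values.

    Returns (polynomial, parcor, error) where polynomial is
    [1, a1, ..., ap], parcor holds the reflection co-efficient
    produced at each step, and error is the final prediction error.
    Assumes autocorr[0] > 0.
    """
    polynomial = [1.0] + [0.0] * order
    error = autocorr[0]
    parcor = []
    for i in range(1, order + 1):
        acc = autocorr[i] + sum(
            polynomial[j] * autocorr[i - j] for j in range(1, i)
        )
        k = -acc / error
        parcor.append(k)
        updated = polynomial[:]
        for j in range(1, i):
            updated[j] = polynomial[j] + k * polynomial[i - j]
        updated[i] = k
        polynomial = updated
        error *= 1.0 - k * k
    return polynomial, parcor, error


if __name__ == "__main__":
    # Autocorrelation of a first-order process with pole at 0.5.
    a, k, e = levinson_durbin([1.0, 0.5, 0.25], 2)
    print(a, k, e)
    print(all(abs(value) < 1.0 for value in k))  # stability check
```

Checking that every returned co-efficient has magnitude below 1 is the stability test described above.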

4.2 RESULTS

4.2.1 The Normalized Speech Waveform for Vowels

In the analysis of speech, variations in amplitude or energy occur between different utterances of the same vowel, or of different vowels, by different speakers or even by the same speaker at different times; these variations need to be eliminated or at least reduced [53]. A common source of variation, both between speakers and for a single speaker over time, is change in the glottal waveform of vowels and in the energy of high frequency fricatives.

When a speech signal is recorded using a microphone, the sound pressure level of the input varies between speakers and between utterances of the same speaker. To overcome these problems, the speech signals are normalized before processing. The normalized speech waveforms for vowels /a/, /e/, /i/, /o/ and /u/ are shown in Fig. 4.7, Fig. 4.24, Fig. 4.31, Fig. 4.38 and Fig. 4.45 respectively.
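One simple way to carry out such a normalization is to scale each recording by its peak absolute amplitude. The sketch below assumes this peak normalization; the text does not specify which normalization was applied, so this is only one plausible choice, and the function name is hypothetical.

```python
def normalize_peak(signal):
    """Scale a signal so that its largest absolute sample equals 1.

    Peak normalization is an assumed choice; an all-zero signal is
    returned unchanged to avoid division by zero.
    """
    peak = max((abs(sample) for sample in signal), default=0.0)
    if peak == 0.0:
        return list(signal)
    return [sample / peak for sample in signal]


if __name__ == "__main__":
    print(normalize_peak([0.5, -2.0, 1.0]))
```

After this step, recordings made at different loudness levels occupy the same amplitude range, so later stages are less sensitive to the recording level.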

4.2.2 Voiced Signal

It is known [51] that speech production can be viewed as a filtering operation in which a sound source excites a vocal tract filter. The source may be periodic, resulting in voiced speech. The voicing source is in the larynx, at the base of the vocal tract, where airflow is interrupted periodically by the vibrating vocal folds.

The pulses of air produced by the opening and closing of the vocal folds, which generate a periodic excitation for the vocal tract, are shown for vowel /a/ in Fig. 4.9. Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, producing quasi-periodic pulses of air that excite the vocal tract. The voiced signals for vowels /e/, /i/, /o/ and /u/ are shown in Fig. 4.26, Fig. 4.33, Fig. 4.40 and Fig. 4.47 respectively.

4.2.3 Voiced Signal with Noise Elimination

As noted in Section 4.2.1, speech amplitude or energy varies between different utterances of the same vowel by different speakers, or even by the same speaker at different times [53], and a common source of this variation is change in the glottal waveform of vowels and in the energy of high frequency fricatives. We have reduced this variation by filtering the signal with a band pass filter with cut-off frequencies of 250 Hz and 3500 Hz.

The voiced signals with noise elimination for vowels /a/, /e/, /i/, /o/ and /u/ are shown in Fig. 4.10, Fig. 4.27, Fig. 4.34, Fig. 4.41 and Fig. 4.48 respectively.
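A band pass filter with the stated 250 Hz and 3500 Hz cut-off frequencies could be realized in many ways. The sketch below uses a Hamming-windowed sinc FIR design, which is an assumption made purely for illustration; the filter type, the 401-tap length and all function names are not taken from the text.

```python
import math


def bandpass_fir(low_hz, high_hz, sample_rate, num_taps=401):
    """Design a linear-phase band pass FIR filter (windowed sinc).

    num_taps should be odd so that the filter has a centre tap.
    """
    centre = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        t = n - centre
        if t == 0:
            ideal = 2.0 * (high_hz - low_hz) / sample_rate
        else:
            ideal = (
                math.sin(2.0 * math.pi * high_hz * t / sample_rate)
                - math.sin(2.0 * math.pi * low_hz * t / sample_rate)
            ) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def fir_filter(signal, taps):
    """Filter a signal by direct convolution (zero initial state)."""
    output = []
    for n in range(len(signal)):
        upper = min(n + 1, len(taps))
        output.append(sum(taps[k] * signal[n - k] for k in range(upper)))
    return output


def gain_at(taps, freq_hz, sample_rate):
    """Magnitude of the filter's frequency response at one frequency."""
    omega = 2.0 * math.pi * freq_hz / sample_rate
    real = sum(h * math.cos(omega * n) for n, h in enumerate(taps))
    imag = sum(h * math.sin(omega * n) for n, h in enumerate(taps))
    return math.hypot(real, imag)


if __name__ == "__main__":
    taps = bandpass_fir(250.0, 3500.0, 22100.0)
    for freq in (0.0, 1000.0, 8000.0):
        print(freq, round(gain_at(taps, freq, 22100.0), 4))
```

The printed gains indicate how strongly components below 250 Hz and above 3500 Hz are suppressed relative to the pass band, which is the noise elimination effect referred to above.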

4.2.4 Frame Blocking

In this step, the pre-emphasized speech signal for vowel /a/, shown in Fig. 4.11, is divided into frames of N samples, with adjacent frames separated by n samples [54]. N is chosen to correspond to 30 ms of speech, and n to between 15 and 25 ms.
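The frame blocking step can be sketched as follows. The 30 ms frame length and the 22,100 Hz sampling rate come from the text; the 20 ms frame shift is an assumed value chosen from the stated 15 to 25 ms range, and the function name is hypothetical.

```python
def block_into_frames(signal, sample_rate=22100, frame_ms=30, shift_ms=20):
    """Split a signal into overlapping frames.

    frame_ms and sample_rate follow the text (30 ms at 22,100 Hz gives
    663 samples); shift_ms = 20 is an assumed value within the stated
    15 to 25 ms range. Trailing samples that do not fill a complete
    frame are discarded.
    """
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += shift
    return frames


if __name__ == "__main__":
    one_second = list(range(22100))
    frames = block_into_frames(one_second)
    print(len(frames), len(frames[0]))
```

With these settings, consecutive frames share 10 ms of signal, which matches the 10 ms overlap mentioned in the window analysis.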

4.2.5 Reflection Co-efficients

Linear predictive analysis gives insight into the shape of the vocal tract that produced the acoustic speech signal (Wakita, 1972, 1973; Wakita & Gray, 1974, 1975).

We have assumed plane wave propagation along the axis of the cylindrical tubes, of equal length and different cross-sectional areas [1]. The cylindrical tubes are lossless, implying that no input energy of the sound wave is dissipated in the cylindrical tubes themselves.

A model of the vocal tract for sound generation is a set of concatenated lossless tubes. In reality, energy is lost due to factors such as heat conduction, friction and vocal tract wall vibrations (Fant, 1960). In spite of this limitation, a lossless tube model derived from LPC co-efficients gives a good approximation to the vocal tract shape for oral vowels. Linear predictive analysis is linked to the lossless tube model through the reflection co-efficients, which are equal to the partial-correlation (PARCOR) co-efficients (Saito, 1973). In the lossless tube model, the reflection co-efficients determine the portion of the travelling wave that is reflected at each junction between cylinders.

In the lossless tube model of the vocal tract, a wave, parameterized either as air pressure or volume velocity variations, travels from one end of the connected cylinders (the glottal end) to the other end (the lip end). At each junction between the cylinders, part of the wave is propagated forward, and part is reflected back. The proportion of the wave that is propagated and reflected depends on the reflection co-efficient between each pair of abutting cylinders. The reflection co-efficient, in turn, depends entirely on the ratio of the cross-sectional areas of the abutting cylinders, as follows.

rk = (Ak+1 - Ak) / (Ak+1 + Ak) (4.3)

where rk is the reflection co-efficient for the kth junction, between cylinders k and k+1 with cross-sectional areas Ak and Ak+1. As the areas have positive values, the reflection co-efficient lies between -1 and 1. Fig. 4.13 shows the reflection co-efficients obtained for vowel /a/.
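Equation 4.3 can be evaluated directly for a list of tube areas. The short sketch below does this for every junction; the function name and the example areas are illustrative only and are not measured values.

```python
def reflection_from_areas(areas):
    """Reflection co-efficient at each junction, following Eq. (4.3).

    For areas A1..AN this returns N-1 values, one per junction:
    r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k). All areas must be positive.
    """
    if any(area <= 0.0 for area in areas):
        raise ValueError("areas must be positive")
    return [
        (areas[k + 1] - areas[k]) / (areas[k + 1] + areas[k])
        for k in range(len(areas) - 1)
    ]


if __name__ == "__main__":
    # Illustrative areas only (arbitrary units).
    print(reflection_from_areas([1.0, 3.0, 3.0, 1.0]))
```

A junction between equal areas gives a co-efficient of zero, and the magnitude grows as the relative change in area grows, while always staying below 1 for positive areas.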

4.2.6 Area Function

Rabiner [1] describes a speech production model based on the assumption that the vocal tract can be represented as a concatenation of lossless acoustic tubes. The cross-sectional areas {Ak} of the tubes are chosen to approximate the area function, A(x), of the vocal tract. If a large number of short tubes is used (25 here), the resonant frequencies of the concatenated tubes are close to those of a tube with a continuously varying area function.

For 'vowel like sounds', the area function is the link between the position of the articulators and the acoustical quantities of the speech wave [48].

The articulatory parameters are determined from the speech wave (via the area function) rather than by direct measurement. Fig. 4.14 and Fig. 4.15 show the vocal tract shape for one frame of the speech segment sampled at 22,100 Hz, without and with normalization respectively. We have taken 19 frames of the speech segment for vowel /a/, as shown in Fig. 4.16 to Fig. 4.20, and then calculated the average of these 19 frames, as shown in Fig. 4.21. The same procedure is repeated for vowels /e/, /i/, /o/ and /u/, and the results are shown in Fig. 4.28, Fig. 4.35, Fig. 4.42 and Fig. 4.49 respectively.
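The step from reflection co-efficients to an area function, followed by averaging over frames, can be sketched by inverting Eq. (4.3): Ak+1 = Ak (1 + rk) / (1 - rk). The reference area of 1.0 for the first tube, the normalization to a maximum of 1 and the function names are assumptions for illustration; the text does not state which reference or normalization was used.

```python
def areas_from_reflection(reflection, first_area=1.0):
    """Recover tube areas from reflection co-efficients.

    Inverts Eq. (4.3): A_{k+1} = A_k * (1 + r_k) / (1 - r_k).
    first_area is an assumed reference, since only area ratios
    are determined by the reflection co-efficients.
    """
    areas = [first_area]
    for r in reflection:
        if not -1.0 < r < 1.0:
            raise ValueError("reflection co-efficients must lie in (-1, 1)")
        areas.append(areas[-1] * (1.0 + r) / (1.0 - r))
    return areas


def normalize_shape(areas):
    """Scale an area vector so that its largest area equals 1."""
    largest = max(areas)
    return [area / largest for area in areas]


def average_shapes(shapes):
    """Element-wise mean of several equal-length area vectors."""
    count = len(shapes)
    return [sum(column) / count for column in zip(*shapes)]


if __name__ == "__main__":
    frame_shapes = [
        areas_from_reflection([0.5, 0.0]),
        areas_from_reflection([0.0, 0.5]),
    ]
    print(frame_shapes)
    print(average_shapes(frame_shapes))
```

In the setting described above, average_shapes would be applied to the 19 per-frame shapes of a vowel to obtain its averaged vocal tract shape.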

4.2.7 Dynamic Vocal Tract Shape

The dynamic model is obtained by concatenating 25 lossless cylindrical tubes, for which 24 reflection co-efficients are calculated. Fig. 4.22 shows a stem plot of the reflection co-efficients [49].

Using the reflection co-efficients, we calculate the denominator co-efficients of the transfer function in Eq. (4.4).

V(z) = VL(z) / VG(z) (4.4)

This transfer function mathematically models the flow of sound as it travels through the vocal tract and is ultimately radiated as the human voice. Its resonances are called formants.
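The denominator of an all-pole transfer function can be built from reflection co-efficients with the step-up form of the Levinson recursion. The sketch below is a generic version that assumes a denominator A(z) = 1 + a1 z^-1 + ... + ap z^-p and unit gain. Sign conventions for reflection co-efficients vary between references, so a sign flip may be needed to match a particular formulation; the function names are hypothetical.

```python
import math


def reflection_to_polynomial(reflection):
    """Step-up recursion: reflection co-efficients to [1, a1, ..., ap]."""
    polynomial = [1.0]
    for k in reflection:
        previous = polynomial + [0.0]
        last = len(previous) - 1
        polynomial = [
            previous[j] + k * previous[last - j] for j in range(len(previous))
        ]
    return polynomial


def transfer_magnitude(polynomial, freq_hz, sample_rate):
    """Magnitude of 1 / A(z) on the unit circle at the given frequency."""
    omega = 2.0 * math.pi * freq_hz / sample_rate
    real = sum(a * math.cos(omega * n) for n, a in enumerate(polynomial))
    imag = -sum(a * math.sin(omega * n) for n, a in enumerate(polynomial))
    return 1.0 / math.hypot(real, imag)


if __name__ == "__main__":
    denominator = reflection_to_polynomial([0.5, -0.3])
    print(denominator)
    print(transfer_magnitude(denominator, 0.0, 22100.0))
```

Evaluating transfer_magnitude over a grid of frequencies and locating its peaks gives an estimate of the resonances of the modelled tract.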

Fig. 4.22 shows the result for N = 25 and 1/T = 22,100 Hz [1]. The area function data is sampled to give a 25-tube approximation for the vowel /a/, which yields a set of 24 reflection co-efficients for an area A = 0.9632 cm2.

This gives a reflection co-efficient at the lips of rk = 0.9032. It is observed that the largest reflection co-efficients occur where the relative change in area is greatest [1]. The same procedure is repeated for vowels /e/, /i/, /o/ and /u/, and the results are shown in Fig. 4.29, Fig. 4.36, Fig. 4.43 and Fig. 4.50 respectively. We obtain a reflection co-efficient rk = 0.8800 for the area 0.8982 cm2 for vowel /e/, rk = 0.9642 for the area 0.8780 cm2 for vowel /i/, rk = 0.9334 for the area 0.9803 cm2 for vowel /o/, and rk = 0.8835 for the area 0.7168 cm2 for vowel /u/.

Fig. 4.6 The speech waveform for vowel /a/.

Fig. 4.7 The normalized speech waveform for vowel /a/.

Fig. 4.8 The pre-emphasis signal for vowel /a/.

Fig. 4.9 The voiced signal for vowel /a/.

Fig. 4.10 The voiced signal with noise elimination for vowel /a/.

Fig. 4.11 First frame speech segment from 22100 Hz signal for vowel /a/.

Fig. 4.12 The windowed signal for first frame speech segment from 22100 Hz signal for vowel /a/.

Fig. 4.13 The reflection co-efficients for first frame speech segment from 22100 Hz signal for vowel /a/.

Fig. 4.14 The vocal tract shape for first frame speech segment from 22100 Hz signal for vowel /a/.

Fig. 4.15 The normalized vocal tract shape for first frame speech segment from 22100 Hz signal for vowel /a/.

Fig. 4.16 Approximate vocal tract shapes for vowel /a/, frames 1 to 4.

Fig. 4.17 Approximate vocal tract shapes for vowel /a/, frames 5 to 8.

Fig. 4.18 Approximate vocal tract shapes for vowel /a/, frames 9 to 12.

Fig. 4.19 Approximate vocal tract shapes for vowel /a/, frames 13 to 16.

Fig. 4.20 Approximate vocal tract shapes for vowel /a/, frames 17 to 19.

Fig. 4.21 The averaged vocal tract shape for vowel /a/.

Fig. 4.22 The dynamic vocal tract model for vowel /a/.

Fig. 4.23 The speech waveform for vowel /e/.

Fig. 4.24 The normalized speech waveform for vowel /e/.

Fig. 4.25 The pre-emphasis signal for vowel /e/.

Fig. 4.26 The voiced signal for vowel /e/.

Fig. 4.27 The voiced signal with noise elimination for vowel /e/.

Fig. 4.28 The averaged vocal tract shape for vowel /e/.

Fig. 4.29 The dynamic vocal tract model for vowel /e/.

Fig. 4.30 The speech waveform for vowel /i/.

Fig. 4.31 The normalized speech waveform for vowel /i/.

Fig. 4.32 The pre-emphasis signal for vowel /i/.

Fig. 4.33 The voiced signal for vowel /i/.

Fig. 4.34 The voiced signal with noise elimination for vowel /i/.

Fig. 4.35 The averaged vocal tract shape for vowel /i/.

Fig. 4.36 The dynamic vocal tract model for vowel /i/.

Fig. 4.37 The speech waveform for vowel /o/.

Fig. 4.38 The normalized speech waveform for vowel /o/.

Fig. 4.39 The pre-emphasis signal for vowel /o/.

Fig. 4.40 The voiced signal for vowel /o/.

Fig. 4.41 The voiced signal with noise elimination for vowel /o/.

Fig. 4.42 The averaged vocal tract shape for vowel /o/.

Fig. 4.43 The dynamic vocal tract model for vowel /o/.

Fig. 4.44 The speech waveform for vowel /u/.

Fig. 4.45 The normalized speech waveform for vowel /u/.

Fig. 4.46 The pre-emphasis signal for vowel /u/.

Fig. 4.47 The voiced signal for vowel /u/.

Fig. 4.48 The voiced signal with noise elimination for vowel /u/.

Fig. 4.49 The averaged vocal tract shape for vowel /u/.

Fig. 4.50 The dynamic vocal tract model for vowel /u/.

4.3 INTRA-SPEAKER VOCAL TRACT SHAPE VARIABILITY ESTIMATION FOR VOWELS USING LPC

4.3.1 INTRODUCTION

Phonetic distinctiveness and speaker individuality are deeply ingrained in the vocal tract shapes estimated from vowels. This is demonstrated by the acoustic model of vowels and by speaker-based area function approximations of the vocal tract shape. Here we propose a new technique to approximate the area function of a person at different times and in different contexts. The variability of the resulting shapes is measured on an intra-speaker and inter-speaker basis. Such vocal tract shapes are obtained for each subject for a pre-defined set of phonemes, namely /a/, /e/, /i/, /o/ and /u/. We calculate the vocal tract shape correlation graphs of a vowel superimposed on itself versus the discrimination provided against the other phonemes pronounced by male speakers. The time averages of the worst and the best patterns of the ensemble are plotted. The results are shown in Fig. 4.65. The results of the repeated tests for the other vowels are shown in Fig. 4.69, Fig. 4.73, Fig. 4.77 and Fig. 4.81 respectively.

4.3.2 Intra-speaker Vocal Tract Shape Estimation Algorithm for Vowels

Fig. 4.51 shows the block diagram of intra-speaker vocal tract shape variability for vowels of a male speaker. We propose the following algorithm to find the vocal tract shape variability:

1). Thirty samples are taken from each of 30 subjects at different times for the vowels /a/, /e/, /i/, /o/ and /u/.

2). Vocal tract shape area values are obtained for each subject from the 30 sets of data recorded at different times, for the predefined set of phonemes /a/, /e/, /i/, /o/ and /u/.

3). Using LPC and correlation analysis, the vocal tract shape variability of each subject is determined. The variability of the vocal tract shape across the 30 subjects is studied to highlight and identify intra-speaker variability.

4). The variability identified above is useful as a cue for personal voice identification and as an auditory signature. It can be named the "Vocal Tract Signature of an Individual".

5). The worst pattern (least variability) and the best pattern (maximum variability) are found for each speaker (subject).

6). The time averages of the worst and the best patterns of the 30 subjects are found.

7). The resultant worst pattern and resultant best pattern for the phoneme /a/ are plotted.

8). The above steps are repeated for the remaining phonemes /e/, /i/, /o/ and /u/.
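The selection and averaging steps of this algorithm can be sketched as below. The text does not define how the worst and best patterns are scored, so this sketch makes a loud assumption: each recording of a subject is scored by its Pearson correlation with that subject's mean shape, the lowest-scoring and highest-scoring recordings are taken as the two extreme patterns, and these are then averaged across subjects. All function names and the toy data are hypothetical.

```python
import math


def mean_shape(shapes):
    """Element-wise mean of equal-length vocal tract shape vectors."""
    count = len(shapes)
    return [sum(column) / count for column in zip(*shapes)]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if flat)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def extreme_patterns(recordings):
    """Least and most typical recordings of one subject.

    ASSUMPTION: typicality is scored by correlation with the
    subject's own mean shape; the text does not specify the measure.
    """
    reference = mean_shape(recordings)
    ranked = sorted(recordings, key=lambda shape: pearson(shape, reference))
    return ranked[0], ranked[-1]


def ensemble_patterns(subjects):
    """Average the two extreme patterns over all subjects."""
    lows, highs = zip(*(extreme_patterns(recs) for recs in subjects))
    return mean_shape(lows), mean_shape(highs)


if __name__ == "__main__":
    # Toy data: one subject, three recordings of a three-tube shape.
    subject = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.5], [3.0, 2.0, 1.0]]
    print(ensemble_patterns([subject]))
```

Other scoring rules, such as the element-wise minimum and maximum over recordings, would fit the same structure by replacing extreme_patterns.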

[Figure: block diagram with subject samples S1 to S30, their vocal tract shapes vts 1 to vts 30, and the resulting minimal vts and maximal vts.]

Fig. 4.51 Block diagram of intra-speaker vocal tract shape variability for vowels where S1 to S30 are the subject samples and vts is vocal tract shape.

4.3.3 The Minimal and Maximal Vocal Tract Shape for Vowel /a/

There are a variety of ways to study phonetics and the distinctive features or characteristics of phonemes. For our purposes, it is sufficient to consider an acoustic characterization of the various sounds, including the place and manner of articulation, the waveforms, and the role of back reflection co-efficients in modeling the vowels [1].

Vowels are produced by exciting a fixed vocal tract with quasi-periodic pulses of air caused by the vibration of the vocal cords. The way the cross-sectional area varies along the vocal tract determines the resonant frequencies of the tract, and thus the sound that is produced. The dependence of the cross-sectional area upon the distance along the tract is called the area function of the vocal tract.

Fig. 4.52 shows the time varying minimal and maximal vocal tract shape variability for different speakers. The databases of 30 samples of vocal tract shapes for the different subjects are tabulated.

[Figure: block diagram with speakers Sp1 to Sp30, the Min vts and Max vts of each speaker (Min vts1 and Max vts1 to Min vts30 and Max vts30), and the resulting minimal average vts and maximal average vts.]

Fig. 4.52 Time varying minimal and maximal vocal tract shape variability for different speakers where Sp1 to Sp30 are the subjects, vts is vocal tract shape, Min vts is minimal vocal tract shape and Max vts is maximal vocal tract shape.

Tables 4.1 to 4.10 show the minimal and maximal vocal tract shapes for vowels /a/, /e/, /i/, /o/ and /u/ for male and female speakers. Fig. 4.53 shows the male speaker model vocal tract shape with its bounds for vowel /a/. We have observed from the results that for the vowel /a/ the vocal tract is open at the front, the tongue is raised at the back, and there is a low degree of constriction by the tongue against the palate. The tongue is positioned as far as possible from the roof of the mouth.

For the front vowels, such as /e/ and /i/, the tongue is positioned forward in the mouth during articulation, as shown in Fig. 4.54 and Fig. 4.55. The vowel /i/ is a front vowel, named for the position of the tongue during articulation. For the vowel /i/ the vocal tract is open at the back, and the tongue is raised at the front. In addition, there is a high degree of constriction of the tongue against the palate. As Fig. 4.56 shows for the vowel /o/ [54], the lips are generally "pursed" outward (exolabial rounding), so that the insides of the lips are visible, whereas in mid to high rounded front vowels the lips are "compressed", with the lip margins pulled in and drawn towards each other (endolabial rounding).

For vowel /u/, shown in Fig. 4.57, it is observed that it is a back vowel, named for the position of the tongue during articulation relative to the back of the mouth [54]. In back vowels such as /u/, the tongue is positioned towards the back of the mouth. Further, the lips are rounded and protrude. The last peak in the figure shows reflection at the teeth, followed by radiation and loss at the lips. This procedure is repeated for female speakers. The results are shown in Figs. 4.58 to 4.62.