Acoustic Tube Model Of The Vocal Tract English Language Essay


The acoustic tube model is a widely accepted model of the vocal tract (Chiba and Kajiyama, 1941; Dunn, 1950; Stevens et al., 1953; Fant, 1965; Stevens, 1972, 1989). The vocal tract is modeled as a coaxial concatenation of lossless acoustic tubes of different lengths and diameters. The cross-sectional area of each tube can be varied independently to simulate the changing shape of the vocal tract. The first tube starts at the glottis and the last tube ends at the lips or the nostrils. Most acoustic tube models follow the formulation of Schroeter and Sondhi (1994). The vocal tracts of an average adult male and female speaker are approximately 17 cm and 15 cm long, respectively (Dew and Jensen, 1977). The vocal tract comprises the oral cavity, the pharyngeal cavity, and the nasal cavity, and is the most important component in speech production. The number of tubes, and the diameter and length of each tube, determine the resonances and anti-resonances of the vocal tract and the place of articulation.
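As a sanity check on the tube abstraction, the resonances of a single uniform lossless tube, closed at the glottis and open at the lips, fall at odd quarter-wavelength frequencies, F_n = (2n - 1)c/(4L). A minimal sketch, assuming a speed of sound of roughly 350 m/s in the warm, humid tract:

```python
# Resonances of one uniform lossless tube, closed at the glottis and open at
# the lips: F_n = (2n - 1) * c / (4 * L). A sketch only; real vocal tracts
# are non-uniform, so measured formants deviate from these values.

C = 35000.0  # assumed speed of sound in the vocal tract, cm/s

def uniform_tube_formants(length_cm, n_formants=3):
    """Return the first n resonance frequencies (Hz) of a uniform tube."""
    return [(2 * n - 1) * C / (4.0 * length_cm) for n in range(1, n_formants + 1)]

male = uniform_tube_formants(17.0)    # average adult male tract length
female = uniform_tube_formants(15.0)  # average adult female tract length
```

For L = 17 cm this gives roughly 515, 1544, and 2574 Hz, close to the formants of a neutral, schwa-like vowel; the shorter 15 cm female tract shifts every resonance up by the ratio 17/15.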

The vibration of the vocal cords classifies speech into voiced and unvoiced sounds based on periodicity [1]. Vowel sounds are produced with periodic vocal fold vibrations, the velum being raised to prevent entry of the pressurized airstream into the nasal path. In voiced sounds the waveform is periodic, and the speech energy is concentrated in the lower half of the spectrum, around the formants, due to the resonances of the vocal tract. The vocal tract and the speech production mechanism are the distinguishing aspects of individual speech. The vocal tract acts as a filter with various resonances and anti-resonances determined by the manner and place of articulation. As air flows through the vocal tract, the filter amplifies energy around the formant frequencies (the resonances) while attenuating energy around the anti-resonances between the formants. Finally, the air flows out of the lips, and the resulting pressure variations constitute the speech. In the acoustic tube model, different sounds are produced when air is pushed out of the lungs, through the vocal folds and the vocal tract, by the coordinated action of the diaphragm, abdominal muscles, chest muscles, and ribcage, exciting resonances at the natural frequencies of the vocal tract. By adjusting the shape of the vocal tract, the natural resonant frequencies are changed and specific frequency components of the sound are amplified, resulting in different sounds. The vocal tract articulators, such as the tongue, soft palate, hard palate, hyoid bone, and lips, modify the voiced sounds to produce recognizable words. The resonance frequencies of the vocal tract are called formant frequencies, or simply formants, and depend on the shape and dimensions of the vocal tract.

The vocal tract shape is characterized by a set of formant frequencies, the lowest formant for an average adult male being about 500 Hz. The formants are numbered from low to high frequency: the first formant F1, the second formant F2, the third formant F3, the fourth formant F4, and so on (Pickel 1980). Their values depend on the vocal tract shape.

Each vowel has its own sound spectrum and hence a unique combination of formants. The first four formants have been found sufficient for classifying the different vowels. The shape of the vocal cavity modifies the spectrum of the excitation signal to create recognizable speech sounds. This forward transformation, determined by the shape of the vocal tract from glottis to lips, forms the acoustic characteristics of the sound (Fant, 1960; Flanagan, 1972). Hence, from a given description of the vocal tract shape, the resulting sound can be estimated accurately. The inverse transformation, from the speech acoustics to the vocal tract shape, is not yet well understood; the problem of estimating the vocal tract shape from the speech sound is called the inverse vocal tract problem, and it has a variety of practical and theoretical applications. Various estimation techniques, such as measurement of the acoustic impedance at the lips, measurement of formant frequencies, and LPC-based analysis, have been used to estimate vocal tract shapes. LPC-based analysis is the most widely preferred, as it can provide real-time estimation of the vocal tract shape directly from the speech signal with current computational and hardware capabilities. Furthermore, the LPC coefficients can be transformed into other parameter sets useful for investigating and estimating the intra-speaker vocal tract shape.
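The LPC-to-area step can be sketched as follows. This is a hedged illustration of the standard Wakita-style approach, not necessarily the exact procedure used later in this work: the Levinson-Durbin recursion converts the autocorrelation of a windowed speech frame into reflection (PARCOR) coefficients, which then map to relative cross-sectional areas of a lossless tube chain.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: autocorrelation r[0..order] ->
    (prediction coefficients a[0..order] with a[0] = 1, reflection coeffs k)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    k = np.zeros(order)
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k_i = -acc / err
        k[i - 1] = k_i
        a[1:i] = a[1:i] + k_i * a[1:i][::-1]  # update inner coefficients
        a[i] = k_i
        err *= (1.0 - k_i * k_i)              # prediction error shrinks each step
    return a, k

def areas_from_reflection(k, lips_area=1.0):
    """Relative area function from reflection coefficients, using one common
    convention A_{m+1} = A_m * (1 - k_m) / (1 + k_m). Only the area ratios
    are meaningful; absolute scale cannot be recovered from the speech alone."""
    areas = [lips_area]
    for k_m in k:
        areas.append(areas[-1] * (1.0 - k_m) / (1.0 + k_m))
    return np.array(areas)
```

In practice the autocorrelation sequence comes from a pre-emphasized, Hamming-windowed vowel frame, and the model order is chosen near the sampling rate in kHz plus a small constant.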

Determining the intra-speaker vocal tract shape from the speech signal is a practical necessity, since the phonetic distinctiveness and speaker individuality of the vowels uttered by an individual male or female speaker correlate strongly with the formants F1 and F2.

The gross vocal tract shapes estimated from the F1-F2 values of vowels are found to create the widest spread among the vowels of individual speakers.

We also propose to examine the spread of formants for individual speakers across different utterances, and to investigate the role of this variability in speaker identification and speaker-specific recognition, which is useful in low-cost ASR (Automatic Speech Recognition) applications such as home and bank security, tele-banking, and application-specific access to data and databases.


Determining the intra-speaker vocal tract shapes from the speech signal is an important problem. There is evidence that phonetic distinctiveness and speaker individuality are deeply ingrained in the vocal tract shapes estimated from vowels using formants and an area-function approximation of the vocal tract. Its solution is useful in speech applications such as speaker identification, forensic analysis, speech recognition, and speech coding.

In practice, the output signal is observed without direct measurement of the input (the glottal excitation). Ambiguities in the estimated vocal tract shapes arise from the limited bandwidth of the speech signal. To resolve these difficulties in determining the intra-speaker vocal tract shapes from the speech signal, additional prior knowledge of the speech production mechanism is deployed.

This research employs a vocal tract model and determines the set of vocal tract shapes for the same vowel utterance at different intervals of time, capturing the minimal and maximal movements of the articulator parameters.

Minimal-to-maximal variation of acoustic features appears in speech due to changes in the status of the speaker, the speech environment, speaker health, emotion, and intentional imitation or disguise. Previous studies [28], [32], [41], [42] indicate that speaker variation results in a spread of the spectral amplitudes, pitch, formant frequencies, formant bandwidths, turbulent noise, etc.

These features are usually characterized in terms of the voice source and the filter. Features such as the spectrum, the formant vector, and the F0 vector provide high-probability measures enabling discrimination among speakers. Formant frequencies, as one of these features, are important parameters that are typically measured and compared in actual forensic and other speech-related applications.

The higher formants F3 and F4 carry further speaker-specific information. Gross vocal tract shape estimation from the lower formants produces the largest spread among the vowels of individual speakers.

We propose to use area-function approximations of a person's vocal tract taken at different times and in different contexts. Steady-state vowels of adult male and female speakers are recorded at different times, and the variability of the resulting vocal tract shapes and the spread of the formants are measured on an intra- and inter-speaker basis.

We also propose to examine the spread of formants for individual male and female speakers across different utterances, to investigate the role of this variability in speaker-specific recognition applications.


The research objective is to study the variation in vocal tract shapes and formant spread for educated male and female speakers of Andhra Pradesh, and to establish the use of this variation in vocal tract shapes and formant spread for speaker recognition purposes. The model is also used to analyze isolated, non-contextual vowels, obtaining spectrograms, formants, pitch, and vocal tract shape information. Error minimization is carried out using an all-pole LPC filter. The analysis is performed in the above manner to obtain the vocal tract shape for the vowel /a/ of a male speaker, taking 30 samples from each of 30 subjects at different times. Vocal tract shapes are likewise obtained for each subject from 30 sets of data recorded at different times for a predefined set of phonemes, namely /e/, /i/, /o/, and /u/. Using LPC together with correlation analysis, the vocal tract shape variability of each individual subject is determined. The variability of these vocal tract shapes across the 30 speakers is then studied to highlight intra-speaker variability. This identified variability can be used as a cue for personal identification in speaker-specific recognition, and as a vocal tract signature of an individual in forensic and other applications. The time averages of the worst and the best patterns for the 30 subjects have been found, and the resulting worst and best patterns per subject have been plotted for the different phonemes of a male speaker. The same procedure has been repeated for female speakers.
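The correlation analysis mentioned above can be illustrated with a small sketch: a Pearson correlation coefficient between two estimated area functions serves as a shape-similarity score. This is an illustration of the general idea, not the exact procedure used in the experiments:

```python
import numpy as np

def area_correlation(area_a, area_b):
    """Pearson correlation between two vocal tract area functions sampled
    over the same number of tube sections. Values near 1 indicate similar
    tract shapes; lower values indicate greater shape variability."""
    a = np.asarray(area_a, dtype=float)
    b = np.asarray(area_b, dtype=float)
    a = (a - a.mean()) / a.std()  # normalize out overall scale and offset
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))
```

Intra-speaker variability can then be summarized by the distribution of pairwise correlations among the 30 repetitions of a vowel, and compared with the (typically lower) correlations across different speakers.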


1.4.1 Introduction

The purpose of speech is communication. Speech is characterized as a signal carrying a message or information [1]. It is an acoustic waveform that carries temporal information from a speaker to a listener. Acoustic transmission and reception of speech work efficiently, but only over limited distances. At the frequencies used by the vocal tract and the ear, radiated acoustic energy spreads spatially and its intensity diminishes rapidly.

Even if the source could produce large amounts of acoustic power, the medium supports only a fraction of it without distortion; the rest is dissipated by particles suspended in the air, molecular disturbance, and viscous losses. The sensitivity of the ear is limited by ambient acoustic noise in the environment and by physiological noise in and around the eardrum.

Speech is the acoustic end product of voluntary, formalized motions of the respiratory and masticatory apparatus. It is developed, controlled, maintained, and corrected by the closed-loop acoustic feedback of the hearing mechanism and the kinesthetic feedback of the speech musculature. Information from these senses is organized and coordinated by the central nervous system and used to direct the desired, linguistically determined motion of the vocal articulators and the resulting acoustic speech.

1.4.2 The Speech Communication Pathway

A simplified view of the speech communication pathway [2], from the speaker to the listener, is given in Fig. 1.1. At the linguistic level of communication, an idea first originates in the mind of the speaker. The idea is then transformed into words, phrases, and sentences according to the grammatical rules of the language.

Fig. 1.1 Utterance "SHOP" waveform [2].

Simulated longitudinal acoustic wave propagation in air, from the speaker's lips to the listener's ears, is shown in Fig. 1.2.

Fig. 1.2 Chain reaction.

At the physiological level of communication, the brain creates electrical signals that travel along the motor nerves. These electrical signals activate the muscles of the vocal cords and the vocal tract.

This vocal cord movement produces pressure changes within the vocal tract, and in particular at the lips, initiating a sound wave that propagates through space as a sequence of compressions and rarefactions of air molecules, ultimately producing temporal pressure variations at the listener's outer ear. The funnel-shaped structure of the outer ear collects this acoustic energy efficiently and carries the vibrations to the eardrum.

The pressure variations at the lips of the speaker constitute the sound. This sound propagates with channel losses and results in pressure variations at the listener's outer ear. The vibration of the eardrum induces electrical signals that travel along the sensory nerves to the brain.

At the perceptual level, the listener's brain decodes these electrical signals, filtering them into recognized patterns, and this results in the perception of speech in a known language.

1.4.3 The Mechanism of Speech Production

The Mechanism of Speech Production is shown in Fig. 1.3

Fig. 1.3 The mechanism of speech production.

Lungs

The purpose of the lungs is the inhalation and exhalation of air. Inhalation enlarges the chest cavity by expanding the ribcage surrounding the lungs and by lowering the diaphragm that sits at the bottom of the lungs and separates them from the abdomen; this lowers the air pressure in the lungs, causing air to rush in through the vocal tract and down the trachea into the lungs. The trachea, commonly called the "windpipe", is a pipe about 10-12 cm long and 1.5-2 cm in diameter that runs from the lungs to the epiglottis. The epiglottis is a small flap of tissue which, during swallowing and eating, deflects food away from the trachea. Exhalation reduces the volume of the chest cavity by contracting the muscles in the ribcage, thus increasing the lung air pressure; this increase in pressure then causes air to flow through the trachea into the larynx [2].

Larynx

The larynx is a complicated system of cartilages, muscles, and ligaments whose primary purpose, in the context of speech production, is to control the vocal cords, or vocal folds [1]. The vocal folds are two masses of flesh, ligament, and muscle which stretch between the front and back of the larynx. The folds are about 15 mm long in men and 13 mm long in women. The glottis is the slit-like orifice between the two folds. The folds are fixed at the front of the larynx, where they are attached to the stationary thyroid cartilage.

Vocal Tract

The vocal tract comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum. The vocal tract takes on many different lengths and cross sections [4] as the tongue, lips, and lower jaw move; it has an average length of about 17 cm in a typical adult male, a little shorter in a female, and a spatially varying cross section of up to 20 cm², as shown in Fig. 1.4.

Fig. 1.4 Vocal tract shape.

Spectral Shaping

Under certain conditions, the relationship between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances, much like the resonances of organ pipes and wind instruments. In speech science, the resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

A formant is a concentration of acoustic energy around a particular frequency in the speech wave. There are several formants, each at a different frequency, roughly one in each 1000 Hz band. Each formant corresponds to a resonance in the vocal tract, and the formants change with different vocal tract configurations [1].
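The link between formants and an all-pole (linear filter) model can be made concrete with a hypothetical round-trip sketch: build a predictor polynomial A(z) with poles at chosen resonance frequencies and bandwidths, then recover the formants from the angles of its complex roots. The 8 kHz sampling rate is an assumption for illustration:

```python
import numpy as np

FS = 8000.0  # assumed sampling rate (Hz)

def lpc_poly_from_formants(formants_hz, bandwidths_hz):
    """Build an all-pole predictor polynomial A(z) whose conjugate pole
    pairs sit at the given resonance frequencies and bandwidths."""
    a = np.array([1.0])
    for f, bw in zip(formants_hz, bandwidths_hz):
        r = np.exp(-np.pi * bw / FS)      # pole radius from bandwidth
        theta = 2 * np.pi * f / FS        # pole angle from frequency
        pair = np.array([1.0, -2 * r * np.cos(theta), r * r])
        a = np.convolve(a, pair)
    return a

def formants_from_lpc(a):
    """Formant estimates: angles of the complex roots of A(z) in the
    upper half-plane, converted back to Hz."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]     # one root of each conjugate pair
    freqs = np.angle(roots) * FS / (2 * np.pi)
    return np.sort(freqs)
```

On real speech, A(z) would come from LPC analysis of a windowed frame, and roots with very large bandwidths (radii far inside the unit circle) are usually discarded before being reported as formants.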

Speech is produced as a sequence of sounds. Hence the state of the vocal cords, as well as the positions, shapes, and sizes of the various articulators, change over time to reflect the sound being produced.

The "spoken word" is the result of three components:

Voice production = voiced sound + resonance + articulation

Voiced Sound

The basic sound generated by vocal fold vibration with positioned articulators is called "voiced sound". The voiced sound in singing differs significantly from the voiced sound in speech [3].

Unvoiced Sounds

These sounds are generated by forming a constriction at some point in the vocal tract, towards the mouth end, and forcing air through the constriction at high velocity to produce turbulence. This creates broad-spectrum noise that excites the vocal tract.

When the vocal cords are tensed and closed, the airflow is obstructed and air pressure builds up behind the constriction. This highly compressed air passes through the constriction in the vocal tract and becomes turbulent, producing the so-called unvoiced sound.

Resonance

Voiced sound, frequency-selected, modified, and amplified by the vocal tract resonators (the throat, the mouth cavity, and the nasal passages), is shaped by the articulators to produce a person's recognizable voice [1].

Articulation

The vocal tract articulators (the tongue, the jaws, the cheeks, the soft palate, the lips, and the hyoid bone) modify the voiced or unvoiced sound and, through articulation, produce recognizable words [1].


1.5.1 Classification of Speech Signal

Speech signals are composed of a sequence of sounds. These sounds, and the transitions between them, serve as a symbolic representation of information [1]. The refined, distinctly recognizable, classified sounds of speech are called 'phonemes', and their study constitutes 'phonemics'.

A specific phoneme class provides a distinct meaning to a word. Within a phoneme class there exist many sound variations that convey the same meaning; the study of these sound variations is called 'phonetics'. Phonemes, the basic building blocks of a language, are concatenated as discrete elements into words according to phonemic, grammatical, and language-specific rules. American English has 42 phonemes, including vowels, diphthongs, semivowels, and consonants [1].

Most languages have their own distinctive set of phonemes, numbering between 30 and 50. Each phoneme is classified as either a continuant or a non-continuant sound. Continuant sounds are produced by a fixed, time-invariant vocal tract configuration excited by an appropriate source. The class of continuant sounds includes the vowels, the fricatives (both unvoiced and voiced), and the nasals.

The remaining sounds (diphthongs, semivowels, stops, and affricates) are produced by a changing vocal tract configuration and are therefore classed as non-continuants.

1.5.2 Vowels

Vowels are produced by exciting a fixed vocal tract with quasi-periodic pulses of air forced through the vibrating vocal cords. The source is a quasi-periodic puff of airflow through the vibrating vocal folds at a certain fundamental frequency; the term "quasi" is used because perfect periodicity is never achieved, and henceforth the term "periodic" will be used in this sense. The cross-sectional area of the vocal tract varies along the tract [2]. The shape of the vocal tract, from the glottis to the lips, determines the resonant frequencies of the tract and thus defines the sound that is produced [2].

The variation of the cross-sectional area normal to the axis of the tube, as a function of distance along the tube and of time, is called the area function of the vocal tract. The area function of a particular vowel is determined primarily by the positions of the tongue, teeth, jaws, and lips and, to a small extent, by the velum [1].
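The forward map from an area function to resonances can be sketched numerically. Assuming ideal lossless terminations (glottis closed, lips open) and treating each tube section as an acoustic transmission line, the resonances occur where the (2,2) element of the overall chain matrix crosses zero; the constants and section counts below are illustrative assumptions:

```python
import numpy as np

C = 35000.0    # assumed speed of sound, cm/s
RHO = 1.2e-3   # air density, g/cm^3; cancels out of the resonance condition here

def chain_matrix(areas_cm2, section_len_cm, f_hz):
    """Chain (ABCD) matrix of concatenated lossless tube sections at one frequency,
    ordered from glottis to lips."""
    K = np.eye(2, dtype=complex)
    w = 2 * np.pi * f_hz
    for A in areas_cm2:
        Z = RHO * C / A                   # characteristic impedance of section
        th = w * section_len_cm / C       # phase across one section
        Ksec = np.array([[np.cos(th), 1j * Z * np.sin(th)],
                         [1j * np.sin(th) / Z, np.cos(th)]])
        K = K @ Ksec
    return K

def resonances(areas_cm2, section_len_cm, fmax=4000.0, df=1.0):
    """Resonances (glottis closed, lips open at zero pressure): frequencies
    where the real-valued K[1,1] element changes sign. Coarse, +-df estimates."""
    freqs = np.arange(df, fmax, df)
    vals = np.array([chain_matrix(areas_cm2, section_len_cm, f)[1, 1].real
                     for f in freqs])
    crossings = np.where(np.sign(vals[:-1]) != np.sign(vals[1:]))[0]
    return freqs[crossings]
```

A uniform 17 cm tract split into ten 1.7 cm sections recovers the quarter-wavelength resonances near 515, 1544, and 2574 Hz; a non-uniform area function (e.g. constricted at the back for /a/) shifts these in the way the figure illustrates.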

Fig. 1.5 Generating the vowels.

As shown in Fig. 1.5, in generating the vowel /a/, as in "father," the vocal tract is open at the front and somewhat constricted at the back by the main body of the tongue. In contrast, the vowel /i/, as in "eve," is generated by raising the tongue towards the upper palate, causing a constriction at the front and widening the opening at the back of the vocal tract.

Thus, each vowel sound is characterized by the vocal tract configuration that is used during its generation [2].


The thesis is structured into six main chapters: Introduction, Literature Survey, Theoretical Background, Technical Design, Discussion of Results, and Conclusions with Recommendations for Future Work.

Chapter 2 presents a literature review of the techniques used by different researchers for estimating dynamic vocal tract shapes, and their limitations in human vocal tract modeling; this is followed by mechanical measurement methods and their limitations, and a brief description of the mechanical-to-electrical synthesis of vowels. It then explains the different techniques used for estimating vocal tract shapes from acoustic measurements and speech signals, and their limitations. Phonetic distinctiveness and formant frequency spread are important parameters that are typically measured and compared on an intra- and inter-speaker basis in actual speaker identification and forensic applications. Nevertheless, both between- and within-speaker variations in the 'F' pattern are still not well established, and an attempt is therefore made to examine them further.

Chapter 3 focuses on the basic uniform lossless acoustic tube model and on sound wave propagation in concatenated lossless tube models. Emphasis is placed on time-dependent processing of the speech signal. In addition, it introduces the LPC-based signal processing methods for speech analysis that are used in this work.

Chapter 4 elaborates the implementation of LPC-based dynamic vocal tract shape estimation for vowels of male and female speakers. It describes intra-speaker vocal tract shape variability estimation for vowels using LPC, including time-varying minimal and maximal vocal tract shape variability estimation over 30 samples from each of 30 male and female subjects. The chapter then discusses the bounds on the average maximal and average minimal vocal tract shape estimates for the test samples. It also presents a correlation analysis of each vowel superimposed on itself versus the discrimination it provides against the other vowels pronounced by the same speakers, with the corresponding percentages.

Chapter 5 is devoted to the classification of vowels based on formants, using peak detection on the LPC spectrum. It presents an algorithm for intra-speaker formant estimation for vowels, using 30 samples from each of 30 subjects, both male and female, with the averages of F1 and F2 plotted. It further presents an algorithm for inter-speaker formant estimation over the same data, calculating the mean and standard deviation of the vowel formants. The proposed technique is compared with the Praat software, and the formants F1 and F2 of the male speakers are compared with those of the female speakers. Independent speaker recognition for vowels using Euclidean distance is also discussed in this chapter.
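The Euclidean-distance recognition step can be sketched as a nearest-centroid rule in the (F1, F2) plane. The centroid values below are illustrative placeholders, loosely based on published male vowel averages, not the measured data of this thesis:

```python
import math

# Hypothetical (F1, F2) centroids in Hz; illustrative values only.
CENTROIDS = {
    "/a/": (730.0, 1090.0),
    "/i/": (270.0, 2290.0),
    "/u/": (300.0, 870.0),
}

def classify(f1, f2):
    """Assign a measured (F1, F2) pair to the nearest centroid by
    Euclidean distance in the formant plane."""
    return min(CENTROIDS,
               key=lambda v: math.hypot(f1 - CENTROIDS[v][0],
                                        f2 - CENTROIDS[v][1]))
```

The same rule extends to speaker recognition by replacing the vowel centroids with per-speaker formant (or area-function) templates and picking the nearest speaker.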

Chapter 6, the last chapter, gives a summary of the investigations, conclusions drawn from the results and some suggestions for further work.

The vocal tract area values obtained from 30 samples of a speaker for the vowel /a/ are tabulated in Appendix A. The formant frequencies obtained from 30 samples of 30 different speakers for the vowel /a/ are tabulated in Appendix B.