Acoustic Processing Of Speech English Language Essay


Chapter 3

This section presents a brief overview of the kind of acoustic processing commonly called feature extraction or signal analysis in the speech recognition literature. The term features refers to the vector of numbers that represents one time slice of a speech signal. All of the features discussed here are spectral features, which means that they represent the waveform in terms of the distribution of the different frequencies that make it up; such a distribution of frequencies is called a spectrum. We begin with a brief introduction to the acoustic waveform and how it is digitized, summarize the idea of frequency analysis and spectra, and then sketch out different kinds of extracted features.

Machine recognition of spoken speech is one of the most difficult problems in signal processing. The key parameters are the size of the vocabulary and the number of speakers. The fundamental problems are:

(1) The vocabulary is very large

(2) A phrase is seldom said the same way twice

(3) Words are not spoken separately, but in streams of connected sound

(4) There is a large variety of speaker accents and rhythms

Sound Waves

The input to a speech recognizer is a complex series of changes in air pressure. These changes in air pressure originate with the speaker, and are caused by the specific way that air passes through the glottis and out the oral or nasal cavities. We represent sound waves by plotting the change in air pressure over time.

Figure A waveform of the vowel [iy] taken from "she just had a baby".

The y-axis shows the change in air pressure above and below normal atmospheric pressure, and the x-axis shows time. Notice that the wave repeats regularly. Two important characteristics of a wave are its frequency and amplitude. The frequency is the number of times per second that the wave repeats itself, or cycles. A high value on the vertical axis (amplitude) indicates that there is more air pressure at that point in time, a zero value means normal (atmospheric) air pressure, and a negative value means lower than normal air pressure (rarefaction).

Two important perceptual properties are related to frequency and amplitude. The pitch of a sound is the perceptual correlate of frequency; in general if a sound has a higher frequency we perceive it as having a higher pitch, although the relationship is not linear, since human hearing has different acuities for different frequencies. Similarly, the loudness of a sound is the perceptual correlate of the power, which is related to the square of the amplitude. So sounds with higher amplitude are perceived as louder, but again the relationship is not linear.
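The relationship between amplitude and power described above can be illustrated with a short sketch. This is a toy sine-wave generator, not real speech, and the function names are ours:

```python
import numpy as np

def make_wave(freq_hz, amplitude, duration_s=1.0, sample_rate=8000):
    """Generate a pure sine wave: a single frequency and amplitude."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return amplitude * np.sin(2 * np.pi * freq_hz * t)

def power(wave):
    """Power is related to the square of the amplitude (here: mean squared value)."""
    return np.mean(wave ** 2)

quiet = make_wave(freq_hz=100, amplitude=1.0)
loud = make_wave(freq_hz=100, amplitude=2.0)

# Doubling the amplitude quadruples the power.
print(power(loud) / power(quiet))  # 4.0
```

Because power grows with the square of amplitude, perceived loudness changes much more slowly than the raw amplitude numbers suggest.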

How To Interpret A Waveform

Since humans can transcribe and understand speech given just the sound wave, the waveform must contain enough information to make the task possible. In most cases this information is hard to unlock just by looking at the waveform, but visual inspection is still sufficient to learn some things. The difference between vowels and most consonants is relatively clear on a waveform: vowels, which are voiced, tend to be long and relatively loud.

Figure A waveform of a sentence "She just had a baby".

Length in time manifests itself directly as length in space on a waveform plot. Loudness manifests itself as high amplitude. Voicing is caused by regular openings and closings of the vocal folds; when the vocal folds are vibrating, we can see regular peaks in amplitude. During a stop consonant, such as the closure of a [p], [t] or [k], we expect no peaks at all; in fact we expect silence. Notice in the figure the places where there are regular amplitude peaks indicating voicing: from second .46 to .58 (the vowel [iy]), from second .65 to .74 (the vowel [ax]), and so on. The places where there is no amplitude indicate the silence of a stop closure: for example from second 1.06 to 1.08 (the closure for the first [b]), and from second 1.26 to 1.28 (the closure for the second [b]). Fricatives like [sh] can also be recognized in a waveform; they produce an intense irregular pattern. The [sh] from second .33 to .46 is a good example of a fricative.
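The cues just described (loud periodic stretches for vowels, near-silence for stop closures) can be picked up automatically with a simple short-time energy measure. The following is a minimal illustrative sketch on a synthetic signal, not a production voicing detector; the frame sizes and threshold are arbitrary choices:

```python
import numpy as np

def short_time_energy(wave, frame_len=400, hop=200):
    """Mean squared amplitude per frame: high in vowels, near zero in closures."""
    frames = [wave[i:i + frame_len] for i in range(0, len(wave) - frame_len + 1, hop)]
    return np.array([np.mean(f ** 2) for f in frames])

sr = 8000
t = np.arange(sr) / sr
# Toy signal: a 'vowel' (loud 250 Hz tone) followed by a 'stop closure' (silence).
signal = np.concatenate([np.sin(2 * np.pi * 250 * t[:sr // 2]), np.zeros(sr // 2)])

energy = short_time_energy(signal)
voiced = energy > 0.01  # simple threshold separating loud frames from silence
print(bool(voiced[0]), bool(voiced[-1]))  # True False
```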


While some broad phonetic features can be interpreted from a waveform, more detailed classification requires a different representation of the input in terms of spectral features. Spectral features are based on Fourier's insight that every complex wave can be represented as a sum of many simple waves of different frequencies.

Figure The waveform of part of the vowel [á´‚] from the word had cut out from the waveform shown in Figure

Consider the figure above, which shows part of the waveform for the vowel [á´‚] of the word had at second .9 of the sentence. Note that there is a complex wave which repeats about nine times in the figure, and also a smaller wave which repeats four times for every cycle of the larger pattern. The complex wave has a frequency of about 250 Hz (we can work this out since it repeats roughly 9 times in .036 seconds, and 9 cycles / .036 seconds = 250 Hz). The smaller wave should then have a frequency of roughly four times the frequency of the larger wave, about 1000 Hz. There is also a tiniest wave, which repeats roughly twice for every cycle of the 1000 Hz wave, and hence must have a frequency of roughly 2000 Hz.
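The frequency estimates above are just cycle counting; as a quick check:

```python
# Frequency = repetitions / duration, as estimated from the figure.
cycles = 9
window_s = 0.036
freq_hz = cycles / window_s
print(round(freq_hz))          # 250

# The smaller wave repeats four times per cycle, the tiniest twice again:
print(round(freq_hz) * 4)      # 1000
print(round(freq_hz) * 4 * 2)  # 2000
```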

A spectrum is a representation of these different frequency components of a wave. It can be computed by a Fourier transform, a mathematical procedure which separates out each of the frequency components of a wave. Rather than using the Fourier transform spectrum directly, most speech applications use a smoothed version of the spectrum called the Linear Predictive Coding (LPC) spectrum. LPC is a way of coding the spectrum that makes it easier to see where the spectral peaks are.
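The idea that a Fourier transform separates out the frequency components can be demonstrated with NumPy. This is a toy three-component wave, illustrating the raw transform only, not LPC smoothing:

```python
import numpy as np

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate  # one second of signal

# A toy 'complex wave': 250 Hz fundamental plus weaker 1000 Hz and 2000 Hz components.
wave = (1.0 * np.sin(2 * np.pi * 250 * t)
        + 0.5 * np.sin(2 * np.pi * 1000 * t)
        + 0.25 * np.sin(2 * np.pi * 2000 * t))

# The Fourier transform separates out each frequency component.
spectrum = np.abs(np.fft.rfft(wave))
freqs = np.fft.rfftfreq(len(wave), d=1 / sample_rate)

# The largest peak lands on the 250 Hz fundamental.
print(freqs[np.argmax(spectrum)])  # 250.0
```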

Figure An LPC spectrum for the vowel [á´‚].

The spectrum is useful because the spectral peaks that are easily visible in it are very characteristic of different sounds. By looking at the spectrum of a waveform, we can detect the characteristic signature of the different phones that are present. This use of spectral information is essential to both human and machine speech recognition. In human audition, the function of the cochlea, or inner ear, is to compute a spectrum of the incoming waveform. Similarly, the features used as input to the HMMs in speech recognition are all representations of spectra, usually variants of LPC spectra.

A spectrum shows the frequency components of a wave at one point in time; a spectrogram is a way of visualizing how the different frequencies which make up a waveform change over time.

Figure A spectrogram of the waveform shown in Figure

In the figure, the darkness of a point on the spectrogram corresponds to the amplitude of the frequency component at that point; the dark horizontal bars on the spectrogram represent spectral peaks.

Feature Extraction

Let us start the feature extraction process with the sound wave itself and end with a feature vector. An input sound wave is first digitized. This process of analog-to-digital conversion has two steps: sampling and quantization.

A signal is sampled by measuring its amplitude at particular points in time; the sampling rate is the number of samples taken per second. To measure a wave accurately, it is necessary to have at least two samples in each cycle: one measuring the positive part of the wave and one measuring the negative part. Taking more than two samples per cycle increases the amplitude accuracy, but with fewer than two samples per cycle the frequency of the wave is completely missed. The maximum frequency that can be measured at a given sampling rate is called the Nyquist frequency; it is half the sampling rate.

The process of representing a real-valued number as an integer is called quantization, because there is a minimum granularity (the quantum size) and all values which are closer together than this quantum size are represented identically. Once the waveform has been digitized, it is converted to some set of spectral features. An LPC spectrum is represented by a vector of features: each formant is represented by two features, plus two additional features to represent spectral tilt. It is possible to use LPC features directly as the observation symbols of an HMM; however, further processing is often done to the features.
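Sampling and quantization can be sketched as follows. This is an illustrative 8-bit quantizer over a synthetic tone; real A/D converters differ in detail:

```python
import numpy as np

def quantize(wave, n_bits=8):
    """Map real-valued samples onto 2**n_bits integer levels (the quantum size)."""
    levels = 2 ** (n_bits - 1)
    return np.clip(np.round(wave * levels), -levels, levels - 1).astype(int)

sample_rate = 8000            # telephone-bandwidth rate; Nyquist frequency = 4000 Hz
t = np.arange(sample_rate) / sample_rate   # sampling: amplitudes at discrete times
wave = np.sin(2 * np.pi * 440 * t)

digitized = quantize(wave)
print(digitized.min(), digitized.max())  # -128 127
```

Any two amplitudes closer together than one quantum (here 1/128 of full scale) map to the same integer, which is exactly the granularity loss described above.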

One popular feature set is cepstral features, which are computed from the LPC coefficients by taking a further Fourier transform of the spectrum. Another is Perceptual Linear Predictive (PLP) analysis, which takes the LPC features and modifies them in ways consistent with human hearing: the spectral resolution of human hearing is worse at high frequencies, and the perceived loudness of a sound is related to the cube root of its intensity. So PLP applies various filters to the LPC spectrum and takes the cube root of the features.
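The cepstral idea can be sketched directly from a spectrum. This is a minimal illustration of one textbook definition (inverse transform of the log magnitude spectrum); real recognizers typically derive cepstra from LPC or mel-filterbank coefficients instead:

```python
import numpy as np

def cepstrum(frame):
    """Cepstrum sketch: inverse Fourier transform of the log magnitude spectrum."""
    spectrum = np.abs(np.fft.fft(frame))
    log_spectrum = np.log(spectrum + 1e-10)   # small offset avoids log(0)
    return np.fft.ifft(log_spectrum).real

sr = 8000
t = np.arange(512) / sr
frame = np.sin(2 * np.pi * 250 * t)           # a toy 'voiced' frame
c = cepstrum(frame)
print(len(c) == len(frame))  # True
```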

How Does Speech Recognition Work?

Step 1: Extract Phonemes

Phonemes are the basic linguistic units of pronunciation: the sounds that group together to form our words. Quite how a phoneme converts into sound depends on many factors, including the surrounding phonemes, the speaker's accent and the speaker's age.

Phonemes are often extracted by running the waveform through a Discrete Fourier Transform, so that the waveform can be analyzed in the frequency domain. This idea is probably easiest to understand by looking at a spectrograph. A spectrograph is a 3D plot of a waveform's frequency and amplitude versus time; in many cases the amplitude of each frequency is expressed as a colour:

Figure 3.2.1: Spectrograph of someone saying "Generation5"

Figure 3.2.2: Spectrograph of the [ss] bit

Comparing the two figures, note that the timescales are slightly different on the two spectrographs. Even so, it is relatively easy to match up the amplitudes and frequencies of a template phoneme with the corresponding phoneme in a word.
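The spectrograph computation described here, running short stretches of the waveform through a Discrete Fourier Transform, can be sketched as follows; a toy two-tone signal stands in for speech:

```python
import numpy as np

def spectrograph(wave, frame_len=256, hop=128):
    """Stack short-time DFT magnitudes: rows are time frames, columns frequencies."""
    window = np.hanning(frame_len)   # taper each frame to reduce spectral leakage
    frames = [wave[i:i + frame_len] * window
              for i in range(0, len(wave) - frame_len + 1, hop)]
    return np.array([np.abs(np.fft.rfft(f)) for f in frames])

sr = 8000
t = np.arange(sr) / sr
# A tone that jumps from 500 Hz to 2000 Hz halfway through.
wave = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 2000 * t))

spec = spectrograph(wave)
freqs = np.fft.rfftfreq(256, d=1 / sr)
# Dominant frequency of the first and last frames:
print(freqs[np.argmax(spec[0])], freqs[np.argmax(spec[-1])])  # 500.0 2000.0
```

Plotting `spec` with time on one axis and frequency on the other, with magnitude as colour, gives exactly the kind of picture shown in the figures.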

Step 2: Markov Models

The computer generates a list of phonemes that then have to be converted into words, and perhaps even the words into sentences. This step can be very complicated indeed, especially for systems designed for speaker-independent and continuous dictation. Most systems use a Hidden Markov Model (HMM) for this. The theory behind HMMs is complicated, but a brief look at simple Markov Models will give an understanding of how such a system operates. Basically, a Markov Model (in a speech recognition context) is a chain of phonemes that represents a word; the chain can branch, and if it does, the branches are statistically weighted. For example:

In speech recognition, a computer has to be able to deal with tens of thousands of words, all of which can be pronounced differently. A brute-force search is simply too inefficient and time-consuming to be of any use. Imagine that the three nodes shown above (N1, N2, and N3) were actually small chunks of data which, put together in various ways, produced phonemes (small parts of words, such as [iy] in heat), and that the Markov Model represented the different ways those three data chunks could produce valid phonemes (along with their likelihood of occurring).
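A branching phoneme chain of this kind can be written down directly. The phoneme names and probabilities below are purely illustrative, not taken from any real recognizer:

```python
import random

# A toy Markov model for one word: each state is a phoneme, and the chain
# branches with statistically weighted alternatives (here, two possible vowels).
transitions = {
    "start": [("hh", 1.0)],
    "hh":    [("iy", 0.7), ("ih", 0.3)],   # branch point
    "iy":    [("t", 1.0)],
    "ih":    [("t", 1.0)],
    "t":     [("end", 1.0)],
}

def sample_path(model, state="start"):
    """Walk the chain from start to end, choosing branches by probability."""
    path = []
    while state != "end":
        nexts, probs = zip(*model[state])
        state = random.choices(nexts, weights=probs)[0]
        if state != "end":
            path.append(state)
    return path

print(sample_path(transitions))  # ['hh', 'iy', 't'] or ['hh', 'ih', 't']
```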

This knowledge can be extended up to the level of sentences, and can critically improve recognition. For example:

Recognize speech

Wreck a nice beach

These two phrases sound surprisingly similar but have wildly different meanings. A program using a Markov Model at the sentence level might be able to ascertain which of the two phrases the speaker actually used through statistical analysis of the phrase that preceded it.
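This kind of statistical disambiguation can be sketched with a toy bigram model; the word-pair probabilities below are made up for the illustration:

```python
# Toy bigram probabilities: P(word | previous word). "<s>" marks sentence start.
bigram_prob = {
    ("<s>", "recognize"): 0.01, ("recognize", "speech"): 0.2,
    ("<s>", "wreck"): 0.0001,   ("wreck", "a"): 0.05,
    ("a", "nice"): 0.01,        ("nice", "beach"): 0.001,
}

def score(words, lm, floor=1e-8):
    """Multiply bigram probabilities across the phrase; unseen pairs get a floor."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= lm.get((prev, w), floor)
        prev = w
    return p

p1 = score("recognize speech".split(), bigram_prob)
p2 = score("wreck a nice beach".split(), bigram_prob)
print(p1 > p2)  # True: the model prefers the statistically more likely phrase
```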

Speech Recognition And Synthesis

Speech recognition, or speech-to-text, involves capturing and digitizing the sound waves and converting them to basic language units, or phonemes. It then involves constructing words from phonemes and contextually analyzing the words to ensure correct spelling for words that sound alike. Speech recognition systems use recognizers, or speech recognition engines: software that converts the acoustic signal to a digital signal. Speech recognition systems can be categorized as discrete or continuous. A discrete speech recognition system requires that the user or speaker pause briefly between words, so that each word is an individually identifiable unit. In a continuous speech recognition system the user can speak fluently.

Systems are said to be either speaker dependent or speaker independent; the former is trained for a single voice. Such a system is trained to understand that speaker's pronunciations, inflections, and accents, and can run more efficiently and accurately because it is tailored to the speaker. Speaker-independent systems are designed to deal with anyone, as long as they are speaking the particular language the system was built for. The major components of a typical speech recognition system include signal representation, where the digitized speech signal is transformed into a set of useful measurements; a modeling facility; and a searching facility, where the measurements are used to search for the most likely word candidate.

Figure 3.3.1: Process Flow of Speech Recognition

Speech synthesis, or text-to-speech, is the process of converting text into spoken language. This involves breaking the words down into phonemes; analyzing the text for special handling such as numbers, currency amounts, inflection, and punctuation; and generating the digital audio for playback. Speech synthesis systems use software drivers called synthesizers, or text-to-speech voices. These perform the speech synthesis, handling the complexity of converting text and generating spoken language. Although speech synthesis generates sounds similar to those created by the human vocal cords, the sound produced by synthesis technology tends to sound less human.

Figure 3.3.2: Process Flow of Speech Synthesis

Two main technologies are used for generating synthetic speech waveforms:

Concatenative synthesis is based on stringing together segments of recorded speech and is further divided into:

Diphone synthesis, which uses a minimal speech database containing all the diphones occurring in a given language

Unit selection, which uses a large speech database

Formant synthesis does not use any human speech samples at run time; instead, the output synthesized speech is created using an acoustic model. Many systems based on formant technology generate artificial, robotic-sounding speech that can be clearly differentiated from human speech.
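A crude illustration of the formant idea follows: it sums decaying sinusoids at formant frequencies rather than exciting a resonant filter as real formant synthesizers do, and the formant values given are only rough, commonly cited figures for an [iy]-like vowel:

```python
import numpy as np

def formant_vowel(formants_hz, duration_s=0.3, sample_rate=8000):
    """Sketch of formant-style synthesis: no recorded human speech is used;
    the waveform is built entirely from a simple acoustic model."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    # Sum a decaying sinusoid at each formant frequency.
    wave = sum(np.sin(2 * np.pi * f * t) * np.exp(-3 * t) for f in formants_hz)
    return wave / np.max(np.abs(wave))   # normalize to [-1, 1]

vowel = formant_vowel([270, 2290, 3010])   # illustrative [iy]-like formants
print(len(vowel))  # 2400 samples (0.3 s at 8000 Hz)
```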

Design Flow Of Speech Recognition

Speech recognition using continuous-mixture HMMs for output probability computation employs two types of HMMs of the same class:

a first type of HMM having a small number of mixtures

a second type of HMM having a large number of mixtures and the same number of states as the first

Comprising Steps

Speech Recognition Embodiment

The speech recognition system includes a speech input unit, comprising a microphone and an A/D converter, through which unknown speech is input.

Figure A block diagram of first embodiment

Figure A flow chart of a process executed in the first embodiment



Sound processing

Determining voice parameters from the voice input through the voice input unit

Output probability computation

Determining the output probability by comparing the voice parameters obtained in the sound processing section with HMMs, which are used as dictionary data for speech recognition

HMM storing

Used as a dictionary section for speech recognition, from which the output probabilities are calculated:

Storing rough HMMs for estimating how much a phoneme will contribute to the recognition of the input voice

Storing detailed HMMs for calculating the precise output probability

Language search

Executing language processing

Grammar and dictionary

Storing a grammar and a dictionary used for the language processing

Display

Displaying the speech recognition results in the form of character strings.

The speech supplied from the speech input unit is analyzed and processed into speech parameters in the sound processing section. The output probability is first determined in the output probability computation section using the rough HMMs stored in the HMM storing section. From this, the states of the HMM of each phoneme which are likely to contribute to the recognition results are determined. The precise output probabilities are then re-determined using the detailed HMMs, stored in the HMM storing section, corresponding to the states of the phonemes which have been determined to contribute to the recognition results. Based on the precise output probabilities thus obtained, language processing is performed in the language search section using the grammar and dictionary stored in the grammar and dictionary section. The language processing reflects the data stored in the HMM storing section. Finally, the recognition results are output to the display section.
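The coarse-to-fine flow described above can be sketched as follows. The "models" here are stand-in callables returning scores, not real HMM output densities, and all names are illustrative:

```python
import numpy as np

def two_pass_scores(features, rough_models, detailed_models, top_k=3):
    """Score every phoneme with a cheap rough model, then re-score only the
    most promising candidates with the expensive detailed model."""
    rough = {ph: m(features) for ph, m in rough_models.items()}
    best = sorted(rough, key=rough.get, reverse=True)[:top_k]
    precise = {ph: detailed_models[ph](features) for ph in best}
    # Phonemes outside the top-k keep their rough scores.
    return {**rough, **precise}

# Toy stand-in models: a 'model' is the negative distance to a prototype vector.
protos = {"iy": np.array([1.0, 0.0]), "aa": np.array([0.0, 1.0]),
          "sh": np.array([1.0, 1.0]), "t": np.array([-1.0, 0.0])}
rough = {ph: (lambda x, p=p: -np.sum(np.abs(x - p))) for ph, p in protos.items()}
detailed = {ph: (lambda x, p=p: -np.sum((x - p) ** 2)) for ph, p in protos.items()}

scores = two_pass_scores(np.array([0.9, 0.1]), rough, detailed, top_k=2)
print(max(scores, key=scores.get))  # iy
```

Only the top-k candidates pay the cost of the detailed model, which is the source of the speed-up claimed for the embodiment.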

State diagram of the first embodiment.

State diagram of the second embodiment

The second embodiment will now be described in more detail. In this embodiment, the phonemes which are likely to contribute to the recognition results are estimated using phonemic context-independent HMMs. Then only the output probabilities of the estimated phonemes are re-determined using phonemic context-dependent HMMs, which achieve high recognition accuracy. Therefore the overall number of mixtures involved in the actual processing can be reduced, thereby achieving faster speech recognition. HMMs each formed of three states are employed, for reasons similar to those given for the first embodiment.

The output probability computation method proceeds according to:

First Embodiment

Second Embodiment

In the first embodiment, a small-mixture HMMS and a large-mixture HMML are independently trained according to the well-known EM algorithm, using a sound database of phonemic labels and feature parameters. The mixture number of the small-mixture HMMS is denoted by s, and the mixture number of the large-mixture HMML by l (s < l).

In the second embodiment, a sound database of phonemic context-independent labels and feature parameters and of phonemic context-dependent labels and feature parameters is used. Phonemic context-independent HMMI (m models) and phonemic context-dependent HMMD (M models, with m << M) are independently trained according to the EM algorithm.

The structures of both stages and both types of HMMs are identical: for example, n-state n-loop, with n = 3 in this embodiment.
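The n-state, n-loop left-to-right topology can be written down as a transition matrix; the probabilities below are illustrative, not from the embodiments:

```python
import numpy as np

# A 3-state, 3-loop left-to-right topology: each state may loop on itself
# or advance to the next state.
n_states = 3
A = np.zeros((n_states, n_states))
for i in range(n_states):
    A[i, i] = 0.6                      # self-loop
    if i + 1 < n_states:
        A[i, i + 1] = 0.4              # advance to the next state
A[-1, -1] = 1.0                        # final state keeps all remaining mass

# Each row is a probability distribution over next states, so rows sum to 1.
print(A.sum(axis=1))  # [1. 1. 1.]
```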