The Basic Properties Of Speech Communications Essay

Speech is created when air is forced from the lungs through the vocal cords and along the vocal tract. The vocal tract introduces short-term correlations (of the order of 1 ms) into the signal and can be thought of as a filter with broad resonances called formants. By varying the shape of the tract, for instance by moving the position of the tongue, the frequencies of these formants can be controlled. Speech sounds can be divided into three classes depending on their mode of excitation: voiced, unvoiced and plosive.

Voiced and unvoiced sounds produce different waveforms and spectra because of their different modes of excitation.

  • Voiced sounds are quasi-periodic in the time domain and harmonically structured in the frequency domain. Their short-time spectrum is characterized by its fine structure and its formant structure. The fine harmonic structure is due to the vibration of the vocal cords; the pitch period is typically between 2 and 20 ms. The spectral envelope is characterized by a set of peaks, the formants. For the average vocal tract there are 3 to 5 formants below 5 kHz. The amplitudes and locations of the first 3 formants, usually occurring below 3 kHz, are quite significant in both speech synthesis and perception. Figure 1.1 shows a segment of voiced speech sampled at 8 kHz; here the pitch period is about 8 ms, or 64 samples. The power spectral density of this segment is shown in Figure 1.2.
  • Unvoiced sounds are produced when the excitation is noise-like turbulence, created by forcing air at high velocity through a constriction in the vocal tract. Figures 1.3 and 1.4 illustrate such sounds, which show little long-term periodicity, although short-term correlations due to the vocal tract are still present. The time-domain samples lose their periodicity, and the power spectral density does not display the clear resonant peaks found in voiced sounds.
  • Plosive sounds are produced when a complete closure is formed in the vocal tract and the air pressure built up behind it is released suddenly. Examples include /p/ and /b/.

Some sounds cannot be placed in any one of the three classes above but are a mixture. Voiced fricatives (such as /z/) result when both vocal cord vibration and a constriction in the vocal tract are present.

Although many different speech sounds can be produced, the shape of the vocal tract and its mode of excitation change comparatively slowly, so speech can be considered quasi-stationary over short periods of time (of the order of 20 ms). From Figures 1.1, 1.2, 1.3 and 1.4 we can see that speech signals show a high degree of predictability, owing to the quasi-periodic vibration of the vocal cords and to the resonances of the vocal tract. Speech coders exploit this predictability to reduce the information rate needed for good-quality voice transmission.
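
As a rough illustration of this short-term predictability, the following Python sketch computes the normalized autocorrelation of one analysis window and picks the lag with the strongest long-term correlation, which for a voiced segment falls near the pitch period. It assumes numpy is available and that a variable `speech` holds mono samples at 8 kHz loaded elsewhere; the names and numbers are illustrative, not from the essay.

    import numpy as np

    def normalized_autocorrelation(x, max_lag):
        """Autocorrelation of a quasi-stationary segment, normalized to r[0]."""
        x = x - np.mean(x)
        r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(max_lag + 1)])
        return r / (r[0] + 1e-12)            # guard against silent segments

    fs = 8000                                 # sampling rate (Hz)
    window = speech[:int(0.04 * fs)]          # 40 ms window (two 20 ms frames)
    r = normalized_autocorrelation(window, max_lag=160)
    # Pitch periods of 2-20 ms correspond to lags of 16-160 samples at 8 kHz.
    pitch_lag = 16 + int(np.argmax(r[16:]))
    print("strongest long-term correlation at lag", pitch_lag, "samples")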

2. Waveform Encoders

2.1 The general encoding of waveforms

Generally, waveform encoders are designed to be signal-independent, which means they can also be applied to signals other than speech. As a result, their quality does not depend on a model of the original signal. A further benefit is that the speech quality remains stable over a wide range of inputs and in different noisy environments.

Waveform encoders can be subdivided into time-domain and frequency-domain waveform coding. They try to produce a reconstructed signal whose waveform is as close as possible to the original, without using any knowledge of how the signal was generated. Generally, they are low-complexity encoders producing high-quality speech at rates above about 16 kbps. When the information rate falls below this level, the reconstructed speech quality degrades rapidly [L. R. Rabiner and R. W. Schafer, 1978].

2.2 Discussion

Pulse Code Modulation (PCM) is generally used on standard voice-grade circuits. PCM encodes into eight-bit words Pulse Amplitude Modulated (PAM) signals that have been sampled at the Nyquist rate for the voice channel (8000 samples per second, i.e. twice the channel bandwidth). The PCM signal therefore requires a 64 kbps transmission channel. This is wasteful over communication channels where bandwidth is at a premium, and it is inefficient when the traffic is primarily voice, which exhibits a certain amount of predictability, as seen in the periodic formant structure. The increasing use of bandwidth-limited transmission media such as radio and satellite links, and of limited voice storage resources, demands more efficient coding methods. Special encoders have therefore been designed that assume the input signal is voice only. These encoders use speech production models to reproduce only the intelligible content of the original signal waveform.
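
As a sketch of the 64 kbps PCM format described above, the code below implements mu-law companding to eight-bit words. The mu-law formula with mu = 255 is the standard companding law of 64 kbps telephony (ITU-T G.711), but this toy version quantizes the companded value uniformly rather than reproducing G.711's exact segmented encoding, so treat it as an approximation.

    import numpy as np

    MU = 255.0  # mu-law constant used in 64 kbps telephony

    def mulaw_encode(x):
        """Map samples in [-1, 1] to 8-bit codewords (0..255)."""
        y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
        return np.round((y + 1.0) * 127.5).astype(np.uint8)

    def mulaw_decode(code):
        """Invert the companding back to samples in [-1, 1]."""
        y = code.astype(np.float64) / 127.5 - 1.0
        return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

    # 8000 samples/s x 8 bits/sample = 64 kbps, the channel rate quoted above.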

3. Operation and comparison of three types of modern voice encoders

3.1 The operation of three types of modern voice encoders

In a channel vocoder, the signal is divided into several sub-bands by band-pass filters implemented in digital signal processors. After rectification, the envelope of each band is detected, sampled, and transmitted, together with a signal that represents a model of the vocal tract excitation. Reception is essentially the same process in reverse. These encoders typically operate between 1 and 2 kbps. Although they are efficient, they produce synthetic-sounding speech and are therefore not generally used in commercial systems. Since the speech information is primarily contained in the formants, an encoder that can estimate the positions and bandwidths of the formants could achieve high quality at very low bit rates. A formant encoder transmits the locations and amplitudes of the spectral peaks instead of the entire spectrum. Such encoders typically operate in the range of 1000 bit/s. Formant encoders are not very popular because the formants are difficult to estimate reliably.
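
The sub-band analysis side of such a vocoder can be sketched as below: band-pass filters split the signal, each band is rectified, and a low-pass filter extracts the slowly varying envelope that is then sampled at a low rate for transmission. The band edges and filter orders here are illustrative assumptions, not values from the text.

    import numpy as np
    from scipy.signal import butter, lfilter

    fs = 8000
    bands = [(200, 600), (600, 1200), (1200, 2400), (2400, 3400)]  # Hz, illustrative

    def band_envelopes(x):
        """Return one envelope signal per sub-band."""
        b_lp, a_lp = butter(2, 50 / (fs / 2))        # ~50 Hz envelope bandwidth
        envelopes = []
        for lo, hi in bands:
            b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            sub = lfilter(b, a, x)                    # band-pass filtering
            env = lfilter(b_lp, a_lp, np.abs(sub))    # rectify, then smooth
            envelopes.append(env)
        return envelopes  # each can be decimated and quantized for transmission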

3.1.1 Linear Predictive Coding (LPC)

Linear Predictive Coding is one of the most powerful speech analysis techniques and one of the most useful methods for encoding good-quality speech at a low bit rate for transmission or storage. Its importance lies in its ability to provide accurate estimates of speech parameters such as pitch, formants, spectra and vocal tract area functions, while being relatively efficient to compute. The LPC model is illustrated in Figure 3.1 below:

An impulse train (voiced speech) or random noise (unvoiced speech) forms the input, or excitation signal, of the filter. Depending on the voiced or unvoiced state of the signal, the switch is set to the appropriate position. The output energy level is controlled by the gain factor. In parametric synthesis, the speech signal is divided into short frames (10-30 ms) of samples; over such a short frame, the properties of the signal are essentially unchanged. In each frame, the model parameters are estimated from the speech samples. These parameters are listed below (a minimal synthesis sketch in code follows the list):

  • Voicing: whether the frame is voiced or unvoiced.
  • Gain: controls the energy level of the frame.
  • Filter coefficients: define the response of the synthesis filter.
  • Pitch period: for voiced frames, the time between consecutive excitation impulses.
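
A minimal sketch of this synthesis model, assuming numpy and scipy are available: an impulse train (voiced) or white noise (unvoiced) excites an all-pole filter whose coefficients come from LP analysis. The parameter values are placeholders to be supplied per frame.

    import numpy as np
    from scipy.signal import lfilter

    def lpc_synthesize_frame(a, gain, voiced, pitch_period, n=160):
        """Synthesize one frame of n samples from the four LPC parameters."""
        if voiced:
            excitation = np.zeros(n)
            excitation[::pitch_period] = 1.0   # impulse train at the pitch period
        else:
            excitation = np.random.randn(n)    # white-noise excitation
        # All-pole synthesis filter H(z) = gain / (1 - a1 z^-1 - ... - ap z^-p)
        denominator = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
        return lfilter([gain], denominator, excitation)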

The parameter estimation process is repeated for each frame, with the results representing the information in that frame. Instead of transmitting the PCM samples, the model parameters are sent. By allocating bits to each parameter so as to minimize the distortion, an efficient compression ratio can be obtained. For example, the bit-rate of 2.4 kbps for the FS1015 coder is 53.3 times lower than the corresponding bit-rate for 16-bit PCM (16 bits x 8000 samples per second = 128 kbps; 128/2.4 = 53.3).

3.1.2 Regular-Pulse-Excited Coding

In general, this coder is an ADPCM system in which a predictor is calculated from the signal; the prediction error is then quantized using an adaptive scheme. The predictor is realized as a cascade connection of a short-term and a long-term predictor. The long-term predictor greatly increases the average prediction gain and therefore raises the overall performance.

The parameters of each frame and sub-frame are extracted and packed into a bit-stream. As in other coders, the encoder divides the input speech samples into frames for processing. Each frame has a length of 160 samples (20 ms) and is subdivided into four sub-frames of 40 samples. Figure 3.2 illustrates the GSM Full-Rate LPC-RPE codec.

The encoder has three major parts:

  • Linear prediction analysis (short-term prediction).
  • Long-term prediction.
  • Excitation analysis.

The linear prediction uses a transfer function of order 8. Altogether, 36 bits are used by the linear predictor part of the codec. The long-term predictor estimates pitch and gain four times per frame, at 5 ms intervals. Each estimate yields a lag coefficient of 7 bits and a gain coefficient of 2 bits, so the four estimates together require 4 x (7 + 2) = 36 bits. The gain factor applied to the predicted speech samples ensures that the synthesized speech has the same energy level as the original speech signal.
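
The long-term prediction step can be sketched as follows: for each 40-sample sub-frame of the short-term residual, the lag that maximizes the normalized correlation with the past residual is selected, and the gain follows from a least-squares fit. The lag range of 40-120 samples is the one commonly quoted for the GSM full-rate codec's 7-bit lag; the rest of the code is an illustrative assumption.

    import numpy as np

    def ltp_estimate(residual, start, sub_len=40, min_lag=40, max_lag=120):
        """Return (lag, gain) for the sub-frame residual[start:start+sub_len].

        Assumes start >= max_lag so the past residual exists.
        """
        cur = residual[start:start + sub_len]
        best_lag, best_score = min_lag, -np.inf
        for lag in range(min_lag, max_lag + 1):
            past = residual[start - lag:start - lag + sub_len]
            score = np.dot(cur, past) ** 2 / (np.dot(past, past) + 1e-12)
            if score > best_score:
                best_lag, best_score = lag, score
        past = residual[start - best_lag:start - best_lag + sub_len]
        gain = np.dot(cur, past) / (np.dot(past, past) + 1e-12)
        return best_lag, gain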

3.1.3 Codebook Excited Linear Prediction (CELP)

Codebook Excited Linear Prediction is widely used in speech coding and builds on the concepts of LPC. The improvement is that both the encoder and the decoder maintain a codebook of different excitation signals. The encoder searches for the most suitable excitation signal and sends its index to the decoder, which uses it to reproduce the signal. A generic CELP encoder operates in the following steps (the codebook search is sketched in code after the list):

  • The input signal is divided into frames and sub-frames, with four sub-frames per frame. Each frame is around 20 to 30 ms long, while each sub-frame is around 5 to 7.5 ms long.
  • Short-term LP analysis is performed on each frame to obtain the LPC coefficients; long-term analysis is then applied to each sub-frame. Normally, the input to short-term LP analysis is the original (or pre-emphasized) speech, and the input to long-term LP analysis is the (short-term) prediction error. After this step, the coefficients of the perceptual weighting filter, the pitch synthesis filter and the modified formant synthesis filter are computed.
  • The excitation sequence is then determined. Each excitation code vector has the same length as a sub-frame, so an excitation codebook search is performed for each sub-frame. The search starts by generating an ensemble of filtered excitation sequences with their corresponding gains; the mean-square error is calculated for every sequence, and the code vector and gain that yield the lowest error are selected.
  • The excitation codebook index, gain, long-term LP parameters and LPC coefficients are encoded, packed and transmitted as the CELP bit-stream [W. Chu, 2003].
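
The closed-loop codebook search in the last two steps can be sketched as follows: each codevector is passed through the synthesis filter, the optimal gain is computed in closed form, and the candidate with the lowest mean-square error against the target signal wins. The codebook, filter coefficients and target here are placeholders; a real CELP coder would also apply the perceptual weighting filter described above.

    import numpy as np
    from scipy.signal import lfilter

    def celp_codebook_search(codebook, a, target):
        """Return (index, gain) minimizing ||target - gain * H(z){codevector}||^2."""
        denominator = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
        best_index, best_gain, best_err = 0, 0.0, np.inf
        for i, codevector in enumerate(codebook):   # codebook: shape (L, 40)
            synth = lfilter([1.0], denominator, codevector)
            gain = np.dot(target, synth) / (np.dot(synth, synth) + 1e-12)
            err = np.sum((target - gain * synth) ** 2)
            if err < best_err:
                best_index, best_gain, best_err = i, gain, err
        return best_index, best_gain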

3.2 Comparison

LPC

The LPC coder provides intelligible reproduction at a low bit rate; however, the use of only two types of excitation signal gives an artificial quality to the decoded speech. This problem is aggravated in noisy environments, where the encoder can be confused by the background noise and classify a frame as unvoiced even though it is voiced.

The limitations of LPC are [W. Chu, 2003]:

  • In many cases, a speech frame cannot be categorized as either voiced or unvoiced; in particular, the LPC model can fail to classify transition frames (voiced to unvoiced and vice versa) correctly. This produces annoying artefacts such as buzzes and tonal noises.
  • Using random noise or a periodic impulse train as excitation does not match practical observations of real speech signals. In general, the excitation for unvoiced frames can be reasonably approximated with white noise. For voiced frames, however, the excitation signal is a combination of a quasi-periodic component with noise, so an impulse train is an approximation that degrades the naturalness of synthetic speech. In addition, the quasi-periodic excitation component usually has a changing period, and the pulse shape is not exactly an impulse.
  • No phase information of the original signal is maintained.
  • The approach used to synthesize voiced frames, where an impulse train is used as excitation to a synthesis filter with coefficients obtained by LP analysis, violates the foundations of AR modelling.

RPE-LTP

Generally, the RPE-LTP coder provides good quality and is quite robust under various channel-error and noise conditions. Because of its open-loop operation, however, its full potential has not been achieved: a large number of bits is allocated to the excitation signal, which is quite inefficient on account of the scalar quantization technique involved. The CELP algorithm improves on this by using vector quantization schemes.

CELP

Compared with other coders, CELP has the following advantages:

  • A strict voiced/unvoiced classification is eliminated. One of the main limitations of the LPC coder is this classification of each speech frame; using two cascaded synthesis filters instead allows efficient and accurate modelling of transition frames with smoothness and continuity, producing much more natural-sounding synthetic speech.
  • Partial phase information of the original signal is preserved. CELP captures some phase information through its closed-loop analysis-by-synthesis method: the excitation sequence chosen from the codebook is the one that generates synthetic speech as close as possible to the original. The synthetic speech is therefore matched not only in the magnitude spectrum but also in the time domain, where differences in phase play a significant role.

Although CELP speech coders deliver reasonably good voice quality at low bit rates, they require significant computational effort.

4. Two test methods used to determine voice quality

4.1 Mean Opinion Score (MOS)

In general, the MOS test is the most widely used method for evaluating speech quality and one of the most popular subjective tests for assessing speech processing systems. It is also suitable for the overall evaluation of synthetic speech. The test uses a five-level scale from bad (1) to excellent (5), known as Absolute Category Rating (ACR). The listener rates the speech under test on the scale described in Table 4.1 below.

MOS (ACR)   Rating
5           Excellent
4           Good
3           Fair
2           Poor
1           Bad

Table 4.1: Scales used in MOS
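
Computing the MOS itself is just averaging the listeners' ACR ratings; the snippet below shows this in Python with made-up example ratings.

    ratings = [4, 5, 3, 4, 4, 5, 3, 4]   # one ACR rating (1-5) per listener
    mos = sum(ratings) / len(ratings)
    print(f"MOS = {mos:.2f}")            # 4.00 for these example ratings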

4.2 Diagnostic Rhyme Test (DRT)

The Diagnostic Rhyme Test was introduced by Fairbanks in 1958. It uses a set of isolated words to test consonant intelligibility in the initial position [Goldstein 1995, Logan et al. 1989]. The test consists of 96 word pairs that differ in a single acoustic feature of the initial consonant. The word pairs are chosen to measure the six phonetic characteristics listed in Table 4.2. The listener hears one word at a time and marks on the answer sheet which of the two words he or she thinks was spoken. Finally, the results are summarized by averaging the error proportions across the answer sheets. Usually only the total error percentage is reported, but individual consonants, and how they are confused with one another, can also be investigated with confusion matrices.

Characteristic   Description               Examples
Voicing          voiced - unvoiced         veal - feel, dense - tense
Nasality         nasal - oral              reed - deed
Sustension       sustained - interrupted   vee - bee, sheat - cheat
Sibilation       sibilated - unsibilated   sing - thing
Graveness        grave - acute             weed - reed
Compactness      compact - diffuse         key - tea, show - sow

Table 4.2: The DRT characteristics
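
The scoring described above, averaging error proportions across answer sheets, can be sketched as follows. The per-sheet counts are made-up examples; 96 is the number of word pairs in the test.

    correct_per_sheet = [90, 88, 92, 85]     # made-up correct answers per sheet
    TOTAL_PAIRS = 96
    error_rates = [(TOTAL_PAIRS - c) / TOTAL_PAIRS for c in correct_per_sheet]
    mean_error = sum(error_rates) / len(error_rates)
    print(f"mean error proportion = {100 * mean_error:.1f}%")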

The DRT is used quite widely and provides a great deal of valuable diagnostic information about how well the initial consonant is recognized, which makes it very useful as a development tool. However, vowels and prosodic features are not tested, so the method is not suitable for any kind of overall quality evaluation. Another shortcoming is that the test material is quite limited and the test items do not occur with equal probability, so the test does not cover all possible confusions between consonants; confusions presented as matrices are therefore difficult to interpret [Carlson et al. 1990].