
Encoding

Background and Introduction to Encoding

Properties of Speech

The two types of speech sounds, voiced and unvoiced, have different waveforms and spectra because they are formed differently. With voiced speech, air pressure from the lungs forces the normally closed vocal cords to open and vibrate. The vibrational frequency (pitch) varies from about 50 to 400 Hz (depending on the person's age and sex) and forms resonances in the vocal tract at odd harmonics. These resonance peaks are called formants and can be seen in the voiced speech figures 1 and 2 below [1].

Unvoiced sounds, called fricatives (e.g., s, f, sh), are formed by forcing air through a narrow opening (hence the term, derived from the word “friction”). Fricatives do not vibrate the vocal cords and therefore do not show the periodicity seen in the formant structure of voiced speech; unvoiced sounds appear more noise-like (see figures 3 and 4 below). The time domain samples lose periodicity, and the power spectral density does not display the clear resonant peaks that are found in voiced sounds.

The spectrum for speech (combined voiced and unvoiced sounds) has a total bandwidth of approximately 7000 Hz with an average energy at about 3000 Hz. The auditory canal optimizes speech detection by acting as a resonant cavity at this average frequency. Note that the power of speech spectra and the periodic nature of formants drastically diminish above 3500 Hz.
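The difference in periodicity between voiced and unvoiced sounds can be illustrated with a small sketch. It is not taken from the original article: the 8000 Hz sample rate, the 100 Hz pitch, and the helper names (`synth_voiced`, `synth_unvoiced`, `normalized_autocorr`) are all assumptions chosen for illustration, and the "speech" frames are toy signals (a few harmonics for voiced, white noise for unvoiced) rather than real recordings. The autocorrelation at a lag of one pitch period should be close to 1 for the periodic frame and close to 0 for the noise-like frame.

```python
import math
import random

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)


def normalized_autocorr(x, lag):
    """Autocorrelation of x at `lag`, normalized by the zero-lag energy."""
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return 0.0
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag)) / energy


def synth_voiced(pitch_hz, n_samples):
    """Toy 'voiced' frame: the pitch frequency plus two harmonics."""
    return [
        sum(
            math.sin(2 * math.pi * k * pitch_hz * n / SAMPLE_RATE) / k
            for k in (1, 2, 3)
        )
        for n in range(n_samples)
    ]


def synth_unvoiced(n_samples, seed=0):
    """Toy 'unvoiced' frame: white noise, standing in for a fricative."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


pitch_hz = 100                      # hypothetical pitch within the 50-400 Hz range
lag = SAMPLE_RATE // pitch_hz       # one pitch period, in samples

voiced_score = normalized_autocorr(synth_voiced(pitch_hz, 800), lag)
unvoiced_score = normalized_autocorr(synth_unvoiced(800), lag)

print(f"voiced frame:   {voiced_score:.2f}")
print(f"unvoiced frame: {unvoiced_score:.2f}")
```

A simple threshold on this score is one way a coder might make a voiced/unvoiced decision, although practical systems generally combine several measures.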

Speech encoding algorithms can be less complex than general-purpose encoders by concentrating (through filters) on the region below about 3500 Hz. Furthermore, since voice-grade telephone lines employ filters that pass frequencies up to only 3000-4000 Hz, the high frequencies produced by fricatives are removed. A caller will often have to spell or otherwise distinguish these sounds to be understood (e.g., “F as in Frank”).

General Encoding of Arbitrary Waveforms

Waveform encoders typically use Time Domain or Frequency Domain coding and attempt to accurately reproduce the original signal. These general encoders do not assume any prior knowledge about the signal, so the decoder output waveform is very similar to the signal input to the coder. Examples of these general encoders include Uniform Binary Coding for music Compact Discs and Pulse Code Modulation for telecommunications.

Pulse Code Modulation (PCM) is a general encoder used in standard voice grade circuits. PCM encodes Pulse Amplitude Modulated (PAM) signals into eight-bit words. The PAM signals have been sampled at the Nyquist rate for the voice channel (8000 samples per second, or twice the channel bandwidth). The PCM signal therefore requires a 64 kbit/s transmission channel.
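The 64 kbit/s figure follows directly from 8000 samples per second times 8 bits per sample. The sketch below works through that arithmetic and shows one way a sample could be mapped to an eight-bit word. The function names are hypothetical, and the quantizer is a plain uniform one chosen for simplicity; telephone PCM in practice applies companding (mu-law or A-law) before quantizing, which is not modeled here.

```python
import math

SAMPLE_RATE_HZ = 8000   # Nyquist rate for the voice channel, per the text
BITS_PER_SAMPLE = 8     # eight-bit code words, per the text


def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate in bit/s for a PCM stream."""
    return sample_rate_hz * bits_per_sample


def quantize(sample, bits=BITS_PER_SAMPLE):
    """Map a sample in [-1.0, 1.0] to an integer code in [0, 2**bits - 1].

    Uniform quantization (an assumption of this sketch, not companded PCM).
    """
    levels = 2 ** bits
    clipped = max(-1.0, min(1.0, sample))
    code = int((clipped + 1.0) / 2.0 * levels)
    return min(code, levels - 1)   # keep +1.0 inside the top level


def dequantize(code, bits=BITS_PER_SAMPLE):
    """Return the midpoint of the quantization interval for `code`."""
    levels = 2 ** bits
    return (code + 0.5) / levels * 2.0 - 1.0


bit_rate = pcm_bit_rate(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
print(f"PCM bit rate: {bit_rate} bit/s")

# Encode one period of a 1 kHz tone (8 samples at 8000 samples per second).
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE_HZ) for n in range(8)]
codes = [quantize(s) for s in tone]
print("8-bit codes:", codes)
```

With 256 uniform levels over [-1, 1], the reconstruction error of each sample is at most half a step, or 1/256 of full scale.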

However, a 64 kbit/s channel is not feasible over communication links where bandwidth is at a premium. It is also inefficient when the communication is primarily voice, which exhibits a certain amount of predictability, as seen in the periodic structure from formants. The increasing use of limited transmission media such as radio and satellite links, together with limited voice storage resources, requires more efficient coding methods. Special encoders have been designed that assume the input signal is voice only. These vocoders use speech production models to reproduce only the intelligible quality of the original signal waveform. The most popular vocoders used in digital communications are presented below.

Types of Voice Encoders

The channel vocoder uses a bank of bandpass filters or digital signal processors to divide the signal into several sub-bands. After rectification, the signal envelope in each sub-band is detected with low-pass filters, sampled, and transmitted. (The power levels are transmitted together with a signal that represents a model of the vocal tract.) Reception is basically the same process in reverse.
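The per-band analysis step (rectify, smooth, sample) can be sketched as follows. This is a simplified illustration rather than a full channel vocoder: it assumes the band splitting has already been done, uses a moving average as a stand-in for the smoothing filter, and tests on a synthetic 1000 Hz tone whose amplitude drops by a factor of four halfway through. The function names, window length, and hop size are all hypothetical choices.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)


def rectify(signal):
    """Full-wave rectification: take the absolute value of each sample."""
    return [abs(v) for v in signal]


def moving_average(signal, window):
    """Simple smoothing filter over the most recent `window` samples."""
    out = []
    running = 0.0
    for i, v in enumerate(signal):
        running += v
        if i >= window:
            running -= signal[i - window]
        out.append(running / min(i + 1, window))
    return out


def band_envelope(band_signal, window, hop):
    """Rectify, smooth, then keep one envelope value every `hop` samples."""
    smoothed = moving_average(rectify(band_signal), window)
    return smoothed[window - 1::hop]


# Toy sub-band signal: a 1000 Hz tone, loud for 0.1 s and then quiet for 0.1 s.
band = [
    (1.0 if n < 800 else 0.25) * math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE)
    for n in range(1600)
]

envelope = band_envelope(band, window=80, hop=160)
print([round(v, 3) for v in envelope])
```

Only the slowly varying envelope values would be transmitted for each band, which is why the data rate is so much lower than sending the waveform itself.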

These vocoders typically operate between 1 and 2 kbit/s. Even though these coders are efficient, they produce speech with a synthetic quality and therefore are not generally used in commercial systems.

Since speech signal information is primarily contained in the formants, a vocoder that can predict the positions and bandwidths of the formants could achieve high quality at very low bit rates. A formant vocoder transmits the location and amplitude of the spectral peaks instead of the entire spectrum. These typically operate in the range of 1000 bit/s. Formant vocoders are not very popular because the formants are difficult to predict.
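The idea of sending only the spectral peaks can be sketched with a toy spectrum. Everything below is illustrative: the spectrum is a sum of smooth bumps at made-up formant positions (500, 1500, and 2500 Hz), and `pick_peaks` is a naive local-maximum search, not the estimation method of any particular formant vocoder. Real speech spectra are far less clean, which is the difficulty the text refers to.

```python
import math


def toy_spectrum(freqs_hz, formants):
    """Sum of Gaussian-shaped bumps.

    `formants` is a list of (center_hz, amplitude, width_hz) tuples;
    the values used below are hypothetical.
    """
    return [
        sum(a * math.exp(-((f - c) / w) ** 2) for c, a, w in formants)
        for f in freqs_hz
    ]


def pick_peaks(freqs_hz, magnitudes, max_peaks):
    """Return up to `max_peaks` (frequency, amplitude) local maxima,
    keeping the strongest ones and ordering them by frequency."""
    peaks = [
        (freqs_hz[i], magnitudes[i])
        for i in range(1, len(magnitudes) - 1)
        if magnitudes[i - 1] < magnitudes[i] >= magnitudes[i + 1]
    ]
    peaks.sort(key=lambda p: p[1], reverse=True)
    return sorted(peaks[:max_peaks])


freqs = list(range(0, 4001, 50))   # 0 to 4000 Hz in 50 Hz steps
spectrum = toy_spectrum(
    freqs, [(500, 1.0, 120), (1500, 0.6, 150), (2500, 0.3, 180)]
)
peaks = pick_peaks(freqs, spectrum, max_peaks=3)

print("spectrum values:", len(spectrum))
print("transmitted values:", 2 * len(peaks))
for freq, amp in peaks:
    print(f"  peak at {freq} Hz, amplitude {amp:.2f}")
```

In this toy case three (location, amplitude) pairs replace 81 spectrum values, which gives a sense of where the low bit rate comes from.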

Linear Predictive Encoder (LPC)

Linear Predictive Encoders are the most popular today and are used mainly in digital

Personal Communications