Harmonic vector excitation coding


1.1. Coding of Speech

Harmonic Vector eXcitation Coding (HVXC) supports speech coding at bit rates from 2 kbit/s to 4 kbit/s. Code Excited Linear Prediction (CELP) supports speech coding at bit rates from 4 kbit/s to 24 kbit/s. Both coders provide bit rate scalability, but the HVXC speech coder additionally offers playback-speed and pitch modification at the decoder. CELP coding uses two sampling rates, 8 kHz and 16 kHz, to support narrowband and wideband speech respectively.

1.2. Coding of General Audio

General audio coding uses transform coding techniques covering a range from low quality (low bit rates) to high quality [9,10]. These tools span a wide range of bandwidths and bit rates. Mono and multichannel audio can be served, starting at a bit rate of 6 kbit/s and a bandwidth of about 3.8 kHz. The TwinVQ (Vector Quantization) tool is used at low bit rates, while an extended version of AAC (Advanced Audio Coding) is used at higher bit rates. This extended version adds a Perceptual Noise Substitution (PNS) module and a Long Term Prediction (LTP) module while remaining backward compatible with the AAC used in MPEG-2. Both AAC and TwinVQ provide scalability, and the AAC scheme allows CELP coding for the base-layer bit stream.

1.3. Structured Audio

Structured representations can be converted into synthetic sound signals with the help of the structured audio tools [11,12,13]. The language used for decoding is called SAOL (Structured Audio Orchestra Language). These tools can also apply effects such as chorus and reverb to the decoded objects (natural and synthetic audio) and compose them to rebuild the audio scene presented to the listener.

1.4. Text to Speech

Text To Speech synthesis (TTS) generates synthetic speech from text transmitted at bit rates from 200 bit/s up to 1.2 kbit/s. Face animation can be synchronized with the text and audio by means of special parameters. MPEG-4 provides by default a text-to-speech interface (TTSI) for the operation of a text-to-speech decoder.

1.5. Levels and Profiles

A wide range of tools is available in MPEG-4 for the coding of audio objects. Subsets of these tools were defined so that the standard works correctly and efficiently in different applications. These subsets are called Profiles, and each of them has one or more levels that bound the computational complexity. In Version 1 of MPEG-4 four profiles were defined:

  • Synthesis Profile: It uses the SAOL language, wavetables and a text-to-speech interface for the generation of speech and sound at low bit rates.
  • Speech Profile: It provides a text-to-speech interface, the CELP speech coder (narrowband and wideband) and HVXC, a parametric speech coder that works at very low bit rates.
  • Scalable Profile: It is used for scalable coding of audio and speech over networks such as the Internet. It is in fact a superset of the Speech Profile; the bit rates used vary from 6 kbit/s to 24 kbit/s, with bandwidths from 3.5 kHz to 9 kHz.
  • Main Profile: It is a rich superset of all the previous profiles, containing tools for both synthetic and natural audio.
In Version 2 of MPEG-4 four more profiles were added to the existing ones (they are listed here rather than in the next chapter so that all profiles appear together):

  • High Quality Audio Profile: It includes the CELP speech coder and the low-complexity AAC coder (with long term prediction). It is capable of scalable coding by using the AAC scalable object type.
  • Low Delay Audio Profile: It contains the HVXC coder seen in the Speech Profile as well as the CELP speech coders, the TTSI text-to-speech interface and the low delay AAC coder.
  • Natural Audio Profile: It contains all the coding tools MPEG-4 provides for natural audio, except those used for synthesis.
  • Mobile Audio Internetworking Profile: This profile aims to extend communication applications by allowing speech coding algorithms not specified by MPEG to be combined with MPEG tools capable of very high quality audio coding. The low delay AAC and scalable AAC object types are included in this profile, as well as TwinVQ and BSAC.

2. MPEG-4 Version 2 New Tools

The extension to MPEG-4 Version 2 introduces new functionalities that extend the capabilities of the MPEG-4 framework. In this section some of the new tools are examined.

2.1. Error Resilience

These tools improve performance over transmission channels that are prone to errors. Two classes of tools are found here: the first provides error robustness in the source coding (for example in the Huffman coding), and the second provides error protection for the MPEG-4 audio coding schemes.

Three tools are included in the first class. One of them is the VCB11 tool, which extends the section information of an AAC bit stream; in this way many errors in the spectral data of an AAC bit stream can be detected. Another tool in the first class is the RVLC (Reversible Variable Length Coding) tool, which uses symmetric code words to enable forward and backward decoding of the scale factor data. The number of bits in the bit stream is transmitted so that the decoder has a starting point for the backward decoding. The third tool is based on the well known Huffman code with a few changes; it is named the HCR (Huffman Codeword Reordering) tool and it extends the coding of spectral data in an MPEG-4 AAC bit stream. Error propagation into the "priority code words" can be avoided if Huffman code words are placed at known positions. This technique requires segments of known length to be defined, with the priority code words placed at the beginning of these segments. The remaining code words fill the gaps using an algorithm that minimizes error propagation for the non-priority code words; this algorithm does not increase the size of the spectral data. The procedure requires the code words to be sorted by importance before the algorithm is applied, in order to determine each code word's priority.
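As a rough illustration of the reordering idea, the sketch below places each priority codeword at the start of a fixed-length segment and fills the leftover room with the other codewords. The function name, the bit-string representation and the simple fill strategy are illustrative assumptions, not the normative AAC algorithm:

```python
def hcr_reorder(codewords, is_priority, segment_len):
    """Toy HCR-style reordering: codewords are bit strings, is_priority is a
    parallel list of booleans. Every priority codeword starts at a known
    segment boundary, so a bit error cannot propagate into it."""
    prio = [cw for cw, p in zip(codewords, is_priority) if p]
    rest = [cw for cw, p in zip(codewords, is_priority) if not p]
    # one fixed-length segment per priority codeword
    segments = [list(cw) for cw in prio]
    # fill the leftover room with the non-priority codewords
    for cw in rest:
        bits = list(cw)
        while bits:
            seg = min(segments, key=len)          # segment with most room
            room = segment_len - len(seg)
            if room == 0:
                raise ValueError("codewords do not fit in the segments")
            seg.extend(bits[:room])
            bits = bits[room:]
    # pad each segment to the fixed length
    return "".join("".join(s).ljust(segment_len, "0") for s in segments)
```

With a segment length of 5 and priority codewords "110" and "0111", the priority codewords land at bit offsets 0 and 5 regardless of what the other data looks like.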

The second class of Error Resilience tools contains the Error Protection tool, which gives MPEG-4 audio Unequal Error Protection (UEP). This is a method that improves the error robustness of a source coding scheme. It is widely used in mobile networks and, in general, in every error-prone channel used for speech and audio coding. The encoded bits of the signal representation are grouped into classes depending on their sensitivity to errors, and error protection is then applied to each class according to its sensitivity; in this way the sensitive bits receive better protection than the less sensitive ones. Bit stream reordering is necessary in order to group the bits into the different sensitivity classes. The source coding tools in MPEG-4 audio use different numbers of sensitivity classes, but most of them use 4 to 5 classes. Forward Error Correction codes can be applied, as well as Cyclic Redundancy Check codes for error detection. Coding schemes like AAC do not have a simple fixed structure, which is why the sizes of the error sensitivity classes can differ from frame to frame. The figure below shows a block diagram of a complete error protection encoder.
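A minimal sketch of the grouping step, assuming bits are handled as strings and using a toy 8-bit checksum in place of the standard's real CRC and FEC codes (which are deliberately simplified away here):

```python
import zlib

def uep_pack(bit_classes, protected):
    """bit_classes: list of bit strings ordered from most to least
    error-sensitive; protected: set of class indexes that receive a
    check value. Returns the reordered payload with a toy 8-bit CRC
    appended to each protected class; real UEP would add FEC as well."""
    out = []
    for idx, bits in enumerate(bit_classes):
        out.append(bits)
        if idx in protected:
            crc = zlib.crc32(bits.encode()) & 0xFF   # toy 8-bit check value
            out.append(format(crc, "08b"))
    return "".join(out)
```

Protecting only class 0 means the most sensitive bits can be verified on reception while the less sensitive ones are sent unprotected, which is the essence of unequal protection.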

2.2. Low delay Audio Coding

The general audio coder of MPEG-4 works well at low bit rates but has an algorithmic delay of around 100 msec, which creates problems for real-time applications such as bi-directional communication. Several factors cause this delay. One is the frame length, which affects any block-based processing. Another is the filterbank delay caused by the analysis and synthesis filterbanks. The bit reservoir introduces an additional delay, but it is necessary to facilitate a locally varying bit rate. In practice this means that for a coder working at a 24 kHz sampling rate and 24 kbit/s, the delay would be about 110 msec, plus up to about 210 msec for the bit reservoir. Version 2 of MPEG-4 enables the coding of general audio signals with an algorithmic delay of no more than 20 msec with the help of a low delay audio coder [17,18]. This coder runs at a sampling rate of 48 kHz and uses a frame length of 512 samples, compared to the 1024 samples used before. The window length of the analysis and synthesis filterbank is likewise halved.
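The frame-buffering part of these numbers is easy to check. The helper below computes only that one component of the delay from the frame length and sampling rate; filterbank overlap and the bit reservoir account for the rest of the roughly 100 msec figure:

```python
def frame_delay_ms(frame_len, sample_rate_hz):
    """Delay contributed by buffering one frame, in milliseconds."""
    return 1000.0 * frame_len / sample_rate_hz

# Standard general audio coder: 1024-sample frames at 24 kHz
aac_frame = frame_delay_ms(1024, 24000)   # about 42.7 ms per frame
# Low delay coder: 512-sample frames at 48 kHz
ld_frame = frame_delay_ms(512, 48000)     # about 10.7 ms per frame
```

Halving the frame length while doubling the sampling rate cuts this component of the delay to a quarter, which is what makes the 20 msec total feasible.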

2.3. Small Step Scalability

The general audio coder in MPEG-4 Version 1 provides bit rate scalability in large steps, where an enhancement layer bit stream can be combined with the base layer bit stream, each layer improving the audio quality. In a configuration with a 24 kbit/s base layer and two enhancement layers of 16 kbit/s each, the possible decoded bit rates are 24 kbit/s, 40 kbit/s or 56 kbit/s. Because side information is carried in each of the layers, small enhancement layers are not supported efficiently in MPEG-4 Version 1.

This problem is solved in Version 2 of MPEG-4 with the help of the BSAC (Bit Sliced Arithmetic Coding) tool [19], which provides small step scalability. Steps of 1 kbit/s per audio channel (2 kbit/s for stereo) provide the necessary granularity. BSAC uses many small enhancement layer bit streams on top of one base layer bit stream.
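The bit-sliced idea can be sketched as follows: the quantized spectral values are split into bit planes, and each enhancement layer contributes one more plane, so decoding any prefix of the layers yields a progressively finer approximation. This toy version omits BSAC's arithmetic coding of the planes and assumes non-negative values:

```python
def bit_slices(values, num_planes):
    """Split non-negative quantized values into bit planes, MSB first.
    Decoding the first k planes yields a coarse approximation that is
    refined as more planes (enhancement layers) arrive."""
    planes = []
    for p in range(num_planes - 1, -1, -1):
        planes.append([(v >> p) & 1 for v in values])
    return planes

def reconstruct(planes, num_planes):
    """Rebuild values from however many planes were received."""
    vals = [0] * len(planes[0])
    for i, plane in enumerate(planes):
        shift = num_planes - 1 - i
        vals = [v | (b << shift) for v, b in zip(vals, plane)]
    return vals
```

Receiving only the top two of three planes already reconstructs [4, 2, 6] from the original values [5, 2, 7]; the final plane restores them exactly.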

2.4. Parametric Audio Coding

These tools combine low bit rates for general audio with the possibility of modifying the playback speed without needing a separate effects processing unit. Combined with the MPEG-4 Version 1 coding tools for speech and audio, this improves efficiency in object-based applications because it allows different coding techniques to be selected. General audio signals are coded with two different techniques: one is the HILN (Harmonic and Individual Lines plus Noise) technique, which codes signals at bit rates of 4 kbit/s and above using a parametric representation of the audio signal [20,21], and the second is the classic harmonic technique. The basic idea behind the new HILN technique is to decompose the signal into objects with the help of appropriate source models and then represent these objects by their model parameters.
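As a toy stand-in for this parametric decomposition, the sketch below picks the strongest spectral line of a frame with a naive DFT and returns it as model parameters (frequency and amplitude). A real HILN encoder extracts many harmonic and individual lines plus a noise model, all of which is deliberately ignored here:

```python
import math

def dominant_line(frame, sample_rate):
    """Return (freq_hz, amplitude) of the strongest spectral line in the
    frame, found by a naive O(n^2) DFT peak pick."""
    n = len(frame)
    best = (0.0, 0.0)
    for k in range(1, n // 2):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        amp = 2 * math.hypot(re, im) / n          # amplitude of this line
        if amp > best[1]:
            best = (k * sample_rate / n, amp)
    return best
```

For a pure 1 kHz tone sampled at 8 kHz, this returns roughly (1000.0, 1.0); transmitting those two numbers instead of the waveform is the essence of the parametric representation.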

2.5. Environmental Spatialisation

These tools support the modelling of a sound environment by permitting the composition of audio scenes containing natural sound sources. Two approaches to spatialisation are supported:

The first, physical approach is based on a description of the acoustical properties of the environment, including the position of the sound source, the room geometry, material properties and more. This can be used in a virtual reality application [29].

In the perceptual approach, the audio and visual scenes can be composed separately, as a movie application requires [30]. This approach allows a perceptual description of the scene based on parameters like those of a reverberation effect unit.

These tools are known as the advanced AudioBIFS of MPEG-4 Version 2 because, although they are related to audio, they are part of the Binary Format for Scene description, or BIFS.

2.6. CELP Silence Compression

In situations where there is no voice activity, a silence compression tool is used in MPEG-4 Version 2 [33]. A voice activity detector recognises whether there is speech activity or silence combined with background noise, in order to distinguish the two cases. When speech is detected everything works as in Version 1 of MPEG-4, with CELP coding being used. If no speech is detected, a SID (Silence Insertion Descriptor) is sent, which enables a Comfort Noise Generator (CNG); the spectral shape and amplitude of this noise are determined from the LPC parameters and the energy, as in the normal case of a CELP frame. The SID's parameters are optional and can be updated as required. A CELP decoder with a silence compression tool is shown in the following figure.
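The decision logic can be sketched with a simple energy-based voice activity detector. The threshold, the frame format and the SID contents below are illustrative assumptions, not the standard's normative VAD:

```python
def is_speech(frame, threshold=0.01):
    """Toy energy-based voice activity detector over a frame of floats."""
    energy = sum(x * x for x in frame) / len(frame)
    return energy > threshold

def encode_frame(frame):
    """Emit a full CELP frame when speech is present, otherwise a small
    SID carrying just the parameters needed for comfort noise."""
    if is_speech(frame):
        return ("CELP", frame)                 # coded as in Version 1
    energy = sum(x * x for x in frame) / len(frame)
    return ("SID", {"energy": energy})         # drives the CNG at the decoder
```

During long pauses only the occasional tiny SID frame is sent, which is where the bit rate saving of silence compression comes from.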

3. MPEG audio

MPEG-4 is an "asymmetric" standard, as MPEG-1 and MPEG-2 were: the syntax of the bit stream and the decoding process are fixed in the normative part of the standard, while an informative annex describes a possible encoding process. This permits the quality of the coding scheme to keep improving even after the standard's finalisation, through the latest encoder optimisations. MPEG-4 audio is about describing the coding of audio objects and composing them into an "audio scene", which offers a high degree of flexibility for encoding and authoring. With this flexibility also comes the improved coding efficiency that characterizes this object-based coding system.