MPEG4 Multimedia Standard Computer Science Essay

MPEG 4 is a multimedia standard based on media objects. Media objects can represent images, synthesis instructions, video segments, visual graphics and so on. The MPEG 4 Audio and Video standards specify the coding methods for each type of media object. In this paper we examine the sound capabilities of the MPEG 4 standard and how sound is composed in it. The sound coding tools and the technical details of the standard are described in full in ISO/IEC 14496-3, the official reference for all aspects of MPEG 4 Audio. The scope of this paper is to present the main capabilities and tools of the MPEG 4 Audio standard (mainly versions 1 and 2) and to examine its functionalities briefly.

1. Introduction

As multimedia applications evolved, new demands arose from the audiovisual content found on the World Wide Web, in games, in digital broadcasting and in many other areas. A new, flexible representation of audiovisual content was needed, one that could offer high coding efficiency and at the same time cope with the limited bandwidth of the Internet and of communication channels in general. New functionalities were needed as well, such as the ability of the recipient to handle and manipulate the coded data. With these requirements in mind the first MPEG 4 standardisation activities were started, and the solution was the coding of audiovisual objects. In 1998 the first version of the MPEG 4 Audio standard was finalized and provided the first tools able to code natural and synthetic audio objects and compose them into an "audio scene" [1,2]. Natural audio objects (music and speech) could now be coded at bit rates from 2 kbit/s up to 64 kbit/s and more by means of speech coding and general audio coding. Bit rate scalability was also supported for natural audio. Structured audio synthesis tools and text to speech interfaces were used to represent synthetic audio objects. Such tools were also used to mix different audio objects and to add effects to the final audio scene presented to the listener.

2. Sound Coding in MPEG 4 version 1

Two groups of sound coding tools can be found in MPEG 4: natural tools [4,5], used for the compression and transmission of digital audio, and synthetic tools [6,7], which enable parametric sound descriptions that are synthesized after reception. After compression, the natural audio tools cover bit rates from 6 kbit/s (low bit rate) up to 64 kbit/s (high quality sound). The quality at the high end is so good that in a psychoacoustic evaluation [8] very few skilled listeners could distinguish the original signal from the coded one.

Many promising tools proposed for MPEG 4 were not ready in time for the first version of the standard because the schedule was very tight, but many of them were developed further and included in version 2. Version 2 added many new tools as extensions to the standard, but none of the version 1 tools were replaced, so that version 2 remains fully backward compatible. This can be seen in the figure below.

Figure 1: Relation between MPEG 4 Versions

In the subsections below the audio tools found in MPEG 4 version 1 are described.

2.1. Coding of Speech

Harmonic Vector eXcitation Coding (HVXC) supports speech coding at bit rates from 2 kbit/s to 4 kbit/s. Code Excited Linear Prediction (CELP) supports speech coding at bit rates from 4 kbit/s to 24 kbit/s. Both coders provide bit rate scalability, and the HVXC coder additionally allows speed and pitch modification in the decoder. CELP coding uses two sampling rates, 8 kHz and 16 kHz, to support narrowband and wideband speech respectively.
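The bit rate ranges quoted above can be written down as a small lookup. The following Python sketch is only an illustration: the table and the function name are hypothetical, and a real encoder would weigh many more factors than the target bit rate.

```python
# Bit rate ranges in kbit/s, taken from the figures quoted in the text.
SPEECH_CODERS = {
    "HVXC": (2.0, 4.0),
    "CELP": (4.0, 24.0),
}


def candidate_speech_coders(bitrate_kbps):
    """Return the coders whose quoted range covers the requested bit rate."""
    return [
        name
        for name, (low, high) in SPEECH_CODERS.items()
        if low <= bitrate_kbps <= high
    ]


print(candidate_speech_coders(2))   # ['HVXC']
print(candidate_speech_coders(4))   # ['HVXC', 'CELP']
print(candidate_speech_coders(16))  # ['CELP']
```

At 4 kbit/s the two ranges meet, so both coders are returned and the choice between them would depend on the application.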

2.2. Coding of General Audio

Transform coding techniques are used for general audio coding, covering the range from very low bit rates up to high quality [9,10]. These tools cover a wide range of bandwidths and bit rates, from 6 kbit/s with about 3.8 kHz of bandwidth up to broadcast quality, for mono as well as multichannel audio. The TwinVQ (vector quantization) tool is used for the low bit rates, and an extended version of AAC (Advanced Audio Coding) is used for the high bit rates. This extended version adds a Perceptual Noise Substitution (PNS) module and a Long Term Prediction (LTP) module while remaining backward compatible with the AAC used in MPEG-2. Both AAC and TwinVQ provide scalability, and the scalable AAC scheme allows CELP coding to be used for the base layer bitstream.

2.3. Structured Audio

The structured audio tools convert structured representations into synthetic sound signals [11,12,13]. The language used in the decoding process is called SAOL (Structured Audio Orchestra Language). These tools can also apply effects such as chorus and reverb to the decoded objects (natural and synthetic audio) and compose them to rebuild the audio scene that is presented to the listener.

2.4. Text to Speech

Text To Speech (TTS) synthesis generates synthetic speech from text that is transmitted at bit rates from 200 bit/s up to 1.2 kbit/s. Face animation can be synchronized with the text and audio by means of special parameters. MPEG-4 provides a standard text to speech interface (TTSI) for operating a text to speech decoder.

2.5. Levels and Profiles

A wide range of tools is available in MPEG-4 for the coding of audio objects. Subsets of these tools were defined so that the standard can be implemented correctly and efficiently in different applications. These subsets are called Profiles, and each of them has one or more levels that limit the computational complexity. In Version 1 of MPEG-4 four profiles were created:

Synthesis Profile: It uses the SAOL language, wavetables and a text to speech interface for the generation of speech and sound. Low bit rates are used for the speech and sound.

Speech Profile: It provides a text to speech interface, a CELP speech coder (narrowband and wideband) and HVXC, a parametric speech coder that works at very low bit rates.

Scalable Profile: It is used for scalable coding of audio and speech, mostly for networks such as the Internet. It is a superset of the Speech Profile; bit rates range from 6 kbit/s to 24 kbit/s and bandwidths from 3.5 kHz to 9 kHz.

Main Profile: It is a rich superset of all the previous profiles and contains tools for both synthetic and natural audio.

In Version 2 of MPEG-4 four more profiles were added to the existing ones (they are listed here rather than in the next chapter so that all profiles appear together):

High Quality Audio Profile: It includes the CELP speech coder and the low complexity AAC coder (long term prediction included). Scalable coding is possible by using the AAC scalable object type.

Low Delay Audio Profile: It contains the HVXC coder seen in the Speech Profile as well as the CELP speech coders, the TTSI text to speech interface and the low delay AAC coder.

Natural Audio Profile: It contains all the natural audio coding tools that MPEG-4 provides, but none of the synthesis tools.

Mobile Audio Internetworking Profile: This profile aims to extend communication applications that use non-MPEG speech coding algorithms with capabilities for very high quality audio coding. It includes the low delay AAC and scalable AAC object types as well as TwinVQ and BSAC.

3. MPEG 4 Version 2 New Tools

The version 2 extension of MPEG 4 introduced new functionalities that extend the capabilities of the MPEG 4 framework. This section examines some of the new tools.

3.1. Error Resilience

These tools improve performance on error-prone transmission channels. They fall into two classes: the first provides error robustness for the source coding (such as the Huffman coding), and the second provides error protection for the audio coding schemes of MPEG 4.

Three tools are included in the first class. The first is the VCB11 tool, which extends the sectioning information of an AAC bitstream so that many errors in the spectral data of the bitstream can be detected. The second is the RVLC (Reversible Variable Length Coding) tool, which uses symmetric code words to enable both forward and backward decoding of the scale factor data. The number of bits used is transmitted so that the decoder has a starting point for backward decoding. The third tool, HCR (Huffman Codeword Reordering), is based on the well known Huffman code with a few changes and extends the coding of spectral data in an MPEG 4 AAC bitstream. Error propagation into the "priority code words" can be avoided if these code words are placed at known positions. The technique therefore defines segments of known length and places the priority code words at the beginning of these segments. The remaining code words fill the gaps by means of an algorithm that minimizes error propagation for the non-priority code words. This algorithm does not increase the size of the spectral data. The code words must be sorted before the algorithm is applied so that the priority of every code word can be determined according to its importance.
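The idea behind reversible variable length coding can be illustrated with a toy example. The Python sketch below assumes a small hypothetical table of symmetric (palindromic) code words; it is not the code table defined in the standard. Because the table is prefix-free and every code word reads the same in both directions, the same table can be used to decode the bit string from either end.

```python
# Hypothetical symmetric code table, chosen only for illustration.
# It is prefix-free, and since every code word is a palindrome it is
# suffix-free as well, which makes backward decoding unambiguous.
CODE = {"a": "0", "b": "11", "c": "101", "d": "1001"}
DECODE = {bits: symbol for symbol, bits in CODE.items()}


def encode(symbols):
    """Concatenate the code words of all symbols into one bit string."""
    return "".join(CODE[s] for s in symbols)


def decode_forward(bits):
    """Decode from the first bit onwards; trailing undecodable bits are dropped."""
    decoded, word = [], ""
    for bit in bits:
        word += bit
        if word in DECODE:
            decoded.append(DECODE[word])
            word = ""
    return decoded


def decode_backward(bits):
    """Decode from the last bit backwards, reusing the same symmetric table."""
    return decode_forward(bits[::-1])[::-1]


message = list("abcdab")
bitstream = encode(message)

# Flip one bit in the middle of the stream to simulate a channel error.
corrupted = bitstream[:4] + ("1" if bitstream[4] == "0" else "0") + bitstream[5:]

print(decode_forward(bitstream) == message)   # True
print(decode_backward(bitstream) == message)  # True
print(decode_forward(corrupted))   # symbols before the error are still correct
print(decode_backward(corrupted))  # symbols after the error are still correct
```

With a single bit error in the middle, forward decoding still recovers the symbols before the error and backward decoding recovers the symbols after it, so less data is lost than with forward decoding alone.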

The second class of Error Resilience tools consists of the Error Protection tool, which gives MPEG 4 audio Unequal Error Protection (UEP). UEP is a method for improving the error robustness of a source coding scheme, and it is widely used for speech and audio coding systems in mobile networks and on error-prone channels in general. The encoded bits of the signal representation are grouped into classes according to their sensitivity to errors, and error protection is then applied to each class according to that sensitivity, so that the sensitive bits are better protected than the less sensitive ones. The bitstream must be reordered in order to group the bits into the different sensitivity classes. The source coding tools of MPEG 4 audio use different numbers of sensitivity classes, but most of them use 4 to 5 classes. Forward Error Correction codes can be applied, as well as Cyclic Redundancy Check codes for error detection. Coding schemes such as AAC do not have a simple structure, so the sizes of their error sensitivity classes can vary from frame to frame. The figure below shows a block diagram of a complete error protection encoder.

Figure 2: Block diagram of an Error Protection Encoder
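The principle of unequal error protection can be sketched in a few lines of Python. This is a toy model under stated assumptions: a simple repetition code stands in for the forward error correction codes of the actual tool, and the two sensitivity classes and their contents are hypothetical.

```python
def protect(bits, repeat):
    """Repetition code: transmit every bit `repeat` times (1 = no protection)."""
    return [b for b in bits for _ in range(repeat)]


def recover(coded, repeat):
    """Majority vote over each group of `repeat` received bits."""
    groups = [coded[i:i + repeat] for i in range(0, len(coded), repeat)]
    return [1 if 2 * sum(g) > len(g) else 0 for g in groups]


def flip(bits, index):
    """Return a copy of `bits` with one bit inverted (a channel error)."""
    out = list(bits)
    out[index] ^= 1
    return out


# Hypothetical frame, already reordered into two sensitivity classes.
sensitive = [1, 0, 1, 1]   # errors here would be clearly audible
robust = [0, 1, 1, 0]      # errors here matter less

sent_sensitive = protect(sensitive, 3)  # strong protection
sent_robust = protect(robust, 1)        # no protection

# One bit error hits each class on the channel.
got_sensitive = recover(flip(sent_sensitive, 0), 3)
got_robust = recover(flip(sent_robust, 0), 1)

print(got_sensitive == sensitive)  # True: the error is corrected
print(got_robust == robust)        # False: the error remains
```

The redundancy is spent only where an error would do the most damage, which keeps the overall overhead lower than protecting every bit equally.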

3.2. Low Delay Audio Coding

The general audio coder of MPEG 4 works well at low bit rates but has an algorithmic delay of around 100 ms, which causes problems for real time applications such as bi-directional communication. Several factors contribute to this delay. One is the frame length, which is inherent in any block based processing. Another is the filterbank delay caused by the analysis and synthesis filterbank. The bit reservoir adds further delay but is necessary to allow a locally varying bit rate. In practice, a coder working at a 24 kHz sampling rate and 24 kbit/s has a delay of about 110 ms, not counting about 210 ms for the bit reservoir. Version 2 of MPEG 4 enables the coding of general audio signals with an algorithmic delay not exceeding 20 ms by means of a low delay audio coder [17,18]. This coder operates at a sampling rate of 48 kHz and uses a frame length of 512 samples, compared to the 1024 samples used before. The size of the window used in the analysis and synthesis filterbank is halved as well.
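The effect of frame length and sampling rate on the delay can be checked with simple arithmetic. The Python sketch below uses hypothetical function names and assumes that the overlapping filterbank adds one further frame of delay on top of the framing delay. It models only these two contributions, so it does not reproduce the full 110 ms figure, which includes additional factors that are not modelled here.

```python
def frame_duration_ms(frame_length, sample_rate_hz):
    """Time needed to collect one frame of samples, in milliseconds."""
    return 1000.0 * frame_length / sample_rate_hz


def framing_plus_overlap_ms(frame_length, sample_rate_hz):
    """Framing delay plus one frame of filterbank overlap (an assumption)."""
    return 2 * frame_duration_ms(frame_length, sample_rate_hz)


# General audio coder: 1024 samples per frame at 24 kHz.
print(round(frame_duration_ms(1024, 24000), 1))        # 42.7
print(round(framing_plus_overlap_ms(1024, 24000), 1))  # 85.3

# Low delay coder: 512 samples per frame at 48 kHz.
print(round(frame_duration_ms(512, 48000), 1))         # 10.7
print(round(framing_plus_overlap_ms(512, 48000), 1))   # 21.3
```

Halving the frame length and doubling the sampling rate each cut the delay in half, which brings the result down to roughly the 20 ms region mentioned in the text.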

3.3. Small Step Scalability

The general audio coder in MPEG 4 version 1 provides bit rate scalability in large steps, where an enhancement layer bitstream is combined with the base layer bitstream to improve the audio quality. For example, a configuration with a 24 kbit/s base layer and two enhancement layers of 16 kbit/s each allows decoding at total bit rates of 24 kbit/s (mono), 40 kbit/s (stereo) or 56 kbit/s (stereo). Because side information is carried in every layer, small enhancement layers are not supported efficiently in MPEG 4 version 1.

Figure 3: Low Delay audio decoder Block Diagram

Version 2 of MPEG 4 solves this problem with the BSAC (Bit Sliced Arithmetic Coding) tool [19], which provides small step scalability. The scalability comes in steps of 1 kbit/s per audio channel (2 kbit/s for stereo). BSAC uses one base layer bitstream together with many small enhancement layer bitstreams.
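The difference between large step and small step scalability shows up in the set of bit rates at which a stream can be decoded. The Python sketch below lists these operating points. The large step configuration is the 24 + 16 + 16 kbit/s example from the text; the small step configuration (a 24 kbit/s base layer followed by sixteen 2 kbit/s stereo layers) is a hypothetical choice made only so that both cover the same range, and the mono/stereo distinction is ignored.

```python
from itertools import accumulate


def decodable_rates(base_kbps, enhancement_kbps):
    """Total bit rates reachable by decoding the base layer plus 0..n layers."""
    return list(accumulate([base_kbps] + list(enhancement_kbps)))


# Large steps: base layer and two enhancement layers (example from the text).
large_step = decodable_rates(24, [16, 16])

# Small steps: hypothetical configuration with 2 kbit/s stereo layers.
small_step = decodable_rates(24, [2] * 16)

print(large_step)       # [24, 40, 56]
print(len(small_step))  # 17
print(small_step[:4])   # [24, 26, 28, 30]
```

Over the same range, the fine-grained layering gives 17 operating points instead of 3, so the decoded bit rate can follow the available channel capacity much more closely.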

3.4. Parametric Audio Coding

These tools combine low bit rate coding of general audio with the possibility of modifying the playback speed without the need for an effects processing unit. Combined with the version 1 coding tools for speech and audio, this improves efficiency in object based applications because different coding techniques can be selected. General audio signals are coded with two different techniques. The first is the HILN (Harmonic and Individual Lines plus Noise) technique, which codes signals at bit rates of 4 kbit/s and above using a parametric representation of the audio signal [20,21]; the second is the classic harmonic technique. The basic idea behind HILN is to decompose the signal into objects that are described by appropriate source models and represented by model parameters.

3.5. Environmental Spatialisation

These tools support modelling of the sound environment, permitting audio scenes to be composed with natural-sounding sound sources. Two approaches to spatialisation are supported:

The first, the physical approach, is based on a description of the acoustical properties of the environment, including the position of the sound source, the room geometry, material properties and more. It can be used in a virtual reality application [29].

The perceptual approach, in contrast, allows the visual and audio scenes to be composed separately, as a movie application requires [30]. It uses a perceptual description of the scene based on parameters like those of a reverberation effects unit.

These tools are known as the advanced AudioBIFS of MPEG 4 version 2 because, although they relate to audio, they are part of the Binary Format for Scene description (BIFS).

3.6. CELP Silence Compression

When there is no voice activity, MPEG 4 version 2 uses a silence compression tool [33]. A voice activity detector distinguishes between periods of speech activity and periods of silence with background noise. When speech is detected, CELP coding is used exactly as in version 1 of MPEG 4. When no speech is detected, a SID (Silence Insertion Descriptor) is sent, which enables a Comfort Noise Generator (CNG); the spectral shape and amplitude of this noise are determined from LPC parameters and energy, as in a normal CELP frame. The SID parameters are optional and can be updated as required. The following figure shows a CELP decoder with the silence compression tool.

Figure 4: CELP decoder with a Silence compression tool Block Diagram.
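The control flow of silence compression can be sketched as follows. This Python example is a toy model under stated assumptions: the voice activity detector is a plain energy threshold, the threshold value and packet layout are hypothetical, the "CELP" packet simply carries the samples instead of real CELP parameters, and the comfort noise is white noise scaled to the signalled energy without any LPC spectral shaping.

```python
import math
import random

ENERGY_THRESHOLD = 0.01  # hypothetical threshold for the toy detector


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def encode_frame(frame):
    """Send a speech packet for active frames, otherwise a small SID packet."""
    energy = frame_energy(frame)
    if energy > ENERGY_THRESHOLD:
        return ("CELP", list(frame))  # placeholder for real CELP parameters
    return ("SID", energy)            # only the noise energy is transmitted


def decode_frame(packet, frame_length, rng):
    """Reproduce speech frames; generate comfort noise for SID packets."""
    kind, payload = packet
    if kind == "CELP":
        return list(payload)
    rms = math.sqrt(payload)
    return [rng.gauss(0.0, rms) for _ in range(frame_length)]


rng = random.Random(0)
speech = [0.5, -0.5] * 80   # loud frame, 160 samples
silence = [0.001] * 160     # near-silent background frame

packets = [encode_frame(speech), encode_frame(silence)]
print([kind for kind, _ in packets])  # ['CELP', 'SID']

noise = decode_frame(packets[1], 160, rng)
print(len(noise))  # 160
```

The saving comes from the SID packet, which carries only a handful of parameters instead of a full coded frame while the listener still hears a plausible background.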

4. MPEG 4 Audio Advantage

MPEG 4 is an "asymmetric" standard, as MPEG 1 and 2 were: the normative part of the standard fixes the bitstream syntax and the decoding process, whereas the encoding process is only described as one possibility in an informative annex. This allows the quality of the coding scheme to be improved by later encoder optimisation even after the standard has been finalised.

MPEG 4 audio describes the coding of audio objects and their composition into an "audio scene", which offers a high degree of flexibility for encoding and authoring. With this flexibility comes the improved coding efficiency that characterizes an object based coding system. An example makes this easier to understand. Assume that a radio program is coded in one case as a single audio object by a general audio coder, and in the other case as two different objects: the speech coded by a speech coder, and the "background music" coded by a parametric coder or synthesized with the Structured Audio tools. In the second case the two decoded objects are mixed in the decoder to produce the final radio program. From the previous chapters it should be clear that the second case can achieve better overall efficiency than the first, especially when very low bit rates are required [25]. This advantage is easily realised when the objects are available separately during authoring. There is a disadvantage, however: decomposing an existing "audio scene" into objects is a very difficult task and is still a subject of research.

5. Conclusions

This paper briefly examined the MPEG 4 Audio standard, mostly versions 1 and 2. Newer versions have been released since, but the key elements are covered by these two versions and the later versions mostly refine them. The functionalities of the standard and the tools it uses were presented and described. Version 2 was finalized at the end of 2000. Many applications use the standard today, and many emerging applications will use it in the future.