# The psychoacoustic model


### The psychoacoustic model

The psychoacoustic model calculates just-noticeable distortion (JND) profiles for each band in the filterbank. This noise level is used to determine the actual quantizers and quantizer levels. The standard defines two psychoacoustic models, and either can be applied to any layer of the MPEG/Audio algorithm. In practice, Model 1 has been used for Layers I and II and Model 2 for Layer III. Both models compute a signal-to-mask ratio (SMR) for each band (Layers I and II) or group of bands (Layer III).

The more sophisticated of the two, Model 2, is discussed here. The steps leading to the computation of the JND profiles are as follows:

- Time-alignment of the audio data.
- Spectral analysis and normalization.
- Grouping of spectral values into threshold calculation partitions.
- Estimation of tonality indices.
- Simulation of the spread of masking on the basilar membrane (BM).
- Setting of a lower bound for the threshold values.
- Determination of the masking threshold per sub-band.
- Pre-echo detection and window switching decision.
- Calculation of the signal-to-mask ratio (SMR).

The psychoacoustic model estimates the masking thresholds for the audio data that are about to be quantized. It must account for both the delay through the filterbank and a data offset so that the relevant data is centered within the psychoacoustic analysis window. For the Layer III algorithm, time-aligning the psychoacoustic model with the filterbank requires that the data fed to the model be delayed by 768 samples.

A high-resolution spectral estimate of the time-aligned data is essential for an accurate estimation of the masking thresholds in the critical bands. The low frequency resolution of the filterbank leaves no option but to compute an independent time-to-frequency mapping via a fast Fourier transform (FFT). A Hann (Hanning) window is applied to the data to reduce the edge effects of the transform window. Layer III operates on 1152-sample data frames, while Model 2 uses a 1024-point window for spectral estimation. Ideally, the analysis window should completely cover the samples to be coded. Because it cannot, the model performs two 1024-point psychoacoustic calculations. On the first pass, the first 576 samples are centered in the analysis window. The second pass centers the remaining samples. The model combines the results of the two calculations by using the more stringent of the two JND estimates for bit or noise allocation in each sub-band. Since playback levels are unknown [3], the sound-pressure level (SPL) needs to be normalized. This implies clamping the lowest point in the fixed threshold-of-hearing curve to an amplitude of +/- 1 bit.
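The windowing and two-pass placement described above can be sketched as follows. This is a minimal illustration, not the encoder's actual routine: the function names are hypothetical, a direct DFT on a short 64-sample block stands in for the 1024-point FFT, and the window offsets are simply what follows from centering each 576-sample half of the frame in a 1024-sample window.

```python
import cmath
import math

def hann_window(n):
    """Return an n-point Hann window (periodic form)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def power_spectrum(block):
    """Hann-window the block and return |X[k]|^2 for k = 0 .. n/2 (direct DFT)."""
    n = len(block)
    windowed = [x * w for x, w in zip(block, hann_window(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def window_start(segment_start, segment_len=576, window_len=1024):
    """Index of the first sample of a window centered on the given segment."""
    center = segment_start + segment_len // 2
    return center - window_len // 2

# Offsets of the two analysis windows relative to the start of a 1152-sample frame.
first_pass = window_start(0)      # -224: the window reaches back into earlier samples
second_pass = window_start(576)   # 352

# Small demonstration block (64 samples instead of 1024 to keep the direct DFT cheap).
n = 64
tone = [math.cos(2.0 * math.pi * 8 * i / n) for i in range(n)]
spectrum = power_spectrum(tone)
peak_bin = spectrum.index(max(spectrum))  # the test tone was placed in bin 8
```

The negative offset of the first window shows why the model needs buffered samples from before the current frame.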

The uniform frequency decomposition and poor selectivity of the filterbank do not reflect the response of the basilar membrane (BM). To correctly model the masking phenomena characteristic of the BM, the spectral values are grouped into a large number of partitions. The exact number of threshold partitions depends on the choice of sampling rate. This transformation provides a resolution of approximately either one FFT line or 1/3 of a critical band, whichever is wider. At low frequencies, a single line of the FFT constitutes a partition, while at high frequencies many lines are grouped into one.
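A rough sketch of this grouping is shown below. The standard specifies the partitions through tables for each sampling rate. This example instead derives them from a widely used closed-form approximation of the Bark scale, which is an assumption of the sketch, as are the function names and default parameters.

```python
import math

def hz_to_bark(freq_hz):
    """Closed-form Bark approximation (an assumption; the standard uses tables)."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

def build_partitions(sample_rate=44100, fft_size=1024, width_bark=1.0 / 3.0):
    """Group FFT lines 0 .. fft_size/2 into half-open (start, end) ranges.

    Each partition spans about width_bark on the Bark scale but always
    contains at least one FFT line.
    """
    n_lines = fft_size // 2 + 1
    line_hz = sample_rate / fft_size
    partitions = []
    start = 0
    while start < n_lines:
        z_start = hz_to_bark(start * line_hz)
        end = start + 1  # a partition always holds at least one line
        while end < n_lines and hz_to_bark(end * line_hz) - z_start < width_bark:
            end += 1
        partitions.append((start, end))
        start = end
    return partitions

partitions = build_partitions()
widths = [end - start for start, end in partitions]
```

With these defaults the lowest partitions contain a single FFT line, and the widths grow toward the top of the spectrum.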

It is necessary to identify tonal and non-tonal (noise-like) components because the masking abilities of the two types of signal differ. Model 2 does not explicitly separate tonal and non-tonal components. Instead, it computes a tonality index as a function of frequency. This index indicates whether a spectral component is more tone-like or noise-like, and it is based on a measure of predictability. Linear extrapolation is used to predict the component values of the current window from the previous two analysis windows. Tonal components are more predictable and so have a higher tonality index. Model 2 uses this index to interpolate between pure tone-masking-noise and noise-masking-tone values. Because this process has memory, it is likely to discriminate better between tonal and non-tonal components than psychoacoustic Model 1 [16].
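The predictability measure can be sketched per FFT line as below. This is a simplified illustration: the actual model weights and spreads the measure over partitions before converting it to a tonality index. The mapping constants (-0.299 and -0.43) and the masking offsets of 18 dB and 6 dB are the values commonly quoted for Model 2 and should be treated as assumptions here, as should all function names.

```python
import cmath
import math

def unpredictability(prev2, prev1, current):
    """Distance between the actual line and its linear extrapolation, in [0, 1].

    Magnitude and phase are each extrapolated from the two previous windows.
    """
    r_pred = 2.0 * abs(prev1) - abs(prev2)
    phi_pred = 2.0 * cmath.phase(prev1) - cmath.phase(prev2)
    predicted = cmath.rect(r_pred, phi_pred)
    denom = abs(current) + abs(r_pred)
    if denom == 0.0:
        return 0.0
    return abs(current - predicted) / denom

def tonality_index(c, floor=1e-6):
    """Map unpredictability c to a tonality index clamped to [0, 1]."""
    t = -0.299 - 0.43 * math.log(max(c, floor))
    return min(1.0, max(0.0, t))

def required_snr_db(tonality, tmn_db=18.0, nmt_db=6.0):
    """Interpolate between tone-masking-noise and noise-masking-tone offsets."""
    return tonality * tmn_db + (1.0 - tonality) * nmt_db

# A steady tone: constant magnitude, constant phase advance per window.
c_tone = unpredictability(cmath.rect(1.0, 0.1), cmath.rect(1.0, 0.4), cmath.rect(1.0, 0.7))
# A line whose phase flips unexpectedly.
c_noise = unpredictability(complex(1.0, 0.0), complex(1.0, 0.0), complex(-1.0, 0.0))
```

A perfectly predictable line yields an unpredictability near zero and therefore a tonality index of 1, while the phase-flipped line yields an index of 0.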

A strong signal component affects the audibility of weaker components in the same critical band and in the adjacent bands. Model 2 simulates this phenomenon by applying a spreading function that spreads the energy of any critical band into its surrounding bands. On the Bark scale, the spreading function has a constant shape as a function of partition number, with slopes of +25 and -10 dB per Bark.
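A minimal two-slope version of such a spreading function is sketched below. It assumes that the +25 dB per Bark slope is the rising edge below the masker and the -10 dB per Bark slope is the falling edge above it. The function in the standard is smoother than this triangular approximation, and the names here are hypothetical.

```python
def spreading_db(dz_bark, lower_slope=25.0, upper_slope=10.0):
    """Attenuation in dB at a distance dz_bark (target minus masker) on the Bark scale."""
    if dz_bark >= 0.0:
        return -upper_slope * dz_bark  # masking decays slowly toward higher frequencies
    return lower_slope * dz_bark       # and quickly toward lower frequencies

def spread_energy(energies, bark_positions):
    """Add up, for every partition, the attenuated energy of all partitions."""
    spread = []
    for z_target in bark_positions:
        total = 0.0
        for energy, z_source in zip(energies, bark_positions):
            total += energy * 10.0 ** (spreading_db(z_target - z_source) / 10.0)
        spread.append(total)
    return spread

# One active partition at 1 Bark, with silent neighbours one Bark below and above.
result = spread_energy([0.0, 1.0, 0.0], [0.0, 1.0, 2.0])
```

In this example the partition above the masker receives one tenth of its energy (-10 dB), while the partition below receives far less (-25 dB).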

An empirically determined absolute masking threshold, the threshold in quiet, is used as a lower bound on the audibility of sound.

At low frequencies, the minimum of the masking thresholds within a sub-band is chosen as the threshold value. At higher frequencies, the average of the thresholds within the sub-band is selected as the masking threshold. Model 2 has the same accuracy for the higher sub-bands as for the low-frequency ones because it does not concentrate non-tonal components [16].

The SMR is calculated as the ratio of the signal energy within the sub-band (for Layers I and II) or a group of sub-bands (Layer III) to the minimum masking threshold for that sub-band. The masking threshold itself is computed from the spread energy and the tonality index. The SMR is the final output of the psychoacoustic model.
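The last three steps (lower bound, per-sub-band threshold, SMR) can be illustrated with a short sketch. It assumes energies and thresholds are given as linear power values and uses a single threshold-in-quiet value per sub-band for simplicity. The function names and the `use_minimum` flag are hypothetical.

```python
import math

def subband_threshold(partition_thresholds, quiet_threshold, use_minimum):
    """Combine the partition thresholds that fall inside one sub-band.

    Each threshold is first bounded from below by the threshold in quiet.
    use_minimum selects the low-frequency rule (minimum) instead of the
    high-frequency rule (average).
    """
    bounded = [max(t, quiet_threshold) for t in partition_thresholds]
    if use_minimum:
        return min(bounded)
    return sum(bounded) / len(bounded)

def smr_db(signal_energy, threshold):
    """Signal-to-mask ratio in dB."""
    return 10.0 * math.log10(signal_energy / threshold)

low_band = subband_threshold([1.0, 4.0, 0.1], quiet_threshold=0.5, use_minimum=True)
high_band = subband_threshold([1.0, 4.0, 0.1], quiet_threshold=0.5, use_minimum=False)
ratio = smr_db(100.0, 1.0)
```

A larger SMR means the signal stands further above its masking threshold, so that band needs finer quantization.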

### Quantization and Coding

A system of two nested iteration loops is the common solution for quantization and coding in a Layer-3 encoder. Quantization is done via a power-law quantizer. In this way, larger values are automatically coded with less accuracy, and some noise shaping is already built into the quantization process. The quantized values are coded by Huffman coding. To adapt the coding process to different local statistics of the music signals, the optimum Huffman table is chosen from a number of choices. The Huffman coding works on pairs and, in the case of very small numbers to be coded, on quadruples. To get even better adaptation to signal statistics, different Huffman code tables can be selected for different parts of the spectrum. Since Huffman coding is essentially a variable code length method and because noise shaping has to be done to keep the quantization noise below the masking threshold, a global gain value (which determines the quantization step size) and scalefactors (which determine the noise-shaping factors for each scalefactor band) are applied before actual quantization. The process of finding the optimum gain and scalefactors for a given block, bit-rate and output from the perceptual model is usually done by two nested iteration loops in an analysis-by-synthesis way:

- Inner iteration loop (rate loop)
- Outer iteration loop (noise control loop)
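To see why a power-law quantizer codes larger values less accurately, consider the sketch below. The exponent of 0.75 is the value commonly cited for Layer III and is an assumption here, since the text only says "power-law". The function names and the plain rounding rule are likewise illustrative.

```python
def quantize(value, step, exponent=0.75):
    """Compress the magnitude with a power law, then round to an integer."""
    magnitude = (abs(value) / step) ** exponent
    q = int(magnitude + 0.5)
    return -q if value < 0 else q

def dequantize(q, step, exponent=0.75):
    """Invert the power law to recover an approximate spectral value."""
    magnitude = abs(q) ** (1.0 / exponent) * step
    return -magnitude if q < 0 else magnitude

# Spacing between neighbouring reconstruction levels at small and large magnitudes.
gap_small = dequantize(11, 1.0) - dequantize(10, 1.0)
gap_large = dequantize(101, 1.0) - dequantize(100, 1.0)
```

The gap between adjacent reconstruction levels widens as the magnitude grows, so the quantization error scales with the signal. Increasing `step` shrinks the quantized integers, which is the lever the rate loop uses.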

The Huffman code tables assign shorter code words to (more frequent) smaller quantized values. If the number of bits resulting from the coding operation exceeds the number of bits available to code a given block of data, this can be corrected by adjusting the global gain to give a larger quantization step size, leading to smaller quantized values. This operation is repeated with different quantization step sizes until the resulting bit demand for Huffman coding is small enough. The loop is called the rate loop because it modifies the overall coder rate until it is small enough.

To shape the quantization noise according to the masking threshold, scalefactors are applied to each scalefactor band. The system starts with a default factor of 1.0 for each band. If the quantization noise in a given band is found to exceed the masking threshold (allowed noise) supplied by the perceptual model, the scalefactor for this band is adjusted to reduce the quantization noise. Since achieving a smaller quantization noise requires a larger number of quantization steps and thus a higher bit-rate, the rate adjustment loop has to be repeated every time new scalefactors are used. In other words, the rate loop is nested within the noise control loop. The outer (noise control) loop is executed until the actual noise (computed from the difference between the original spectral values and the quantized spectral values) is below the masking threshold for every scalefactor band (i.e., critical band).

While the inner iteration loop always converges (if necessary, by setting the quantization step size large enough to zero out all spectral values), this is not true for the combination of both iteration loops. If the perceptual model requires quantization step sizes so small that the rate loop always has to increase them to enable coding at the required bit-rate, both loops can go on forever. To avoid this situation, several conditions can be checked to stop the iterations early. However, for fast encoding and good coding results, such a situation should be avoided in the first place. This is one reason why an MPEG Layer-3 encoder usually needs tuning of the parameter sets of the perceptual model for each bit-rate.
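The interaction of the two loops can be sketched as follows. Everything specific in this example is an assumption made for illustration: `estimate_bits` is a crude stand-in for Huffman coding, the 0.75 exponent and the step and amplification factors are illustrative, and the stopping conditions (all bands too noisy, or an iteration cap) are only one possible choice.

```python
import math

def quantize_band(values, step, scale):
    """Power-law quantize one scalefactor band after amplifying it by scale."""
    return [int((abs(v) * scale / step) ** 0.75 + 0.5) * (1 if v >= 0 else -1)
            for v in values]

def dequantize_band(quantized, step, scale):
    return [math.copysign(abs(q) ** (4.0 / 3.0) * step / scale, q) for q in quantized]

def estimate_bits(quantized_bands):
    """Hypothetical bit cost: larger magnitudes cost more, a zero costs one bit."""
    return sum(1 + 2 * abs(q).bit_length() for band in quantized_bands for q in band)

def rate_loop(bands, scales, bit_budget, step=1.0):
    """Inner loop: enlarge the step size until the bit estimate fits the budget."""
    if bit_budget < sum(len(b) for b in bands):
        raise ValueError("budget is below the cost of an all-zero block")
    while True:
        quantized = [quantize_band(b, step, s) for b, s in zip(bands, scales)]
        if estimate_bits(quantized) <= bit_budget:
            return step, quantized
        step *= 2.0 ** 0.25

def noise_loop(bands, allowed_noise, bit_budget, max_iterations=32):
    """Outer loop: amplify bands whose noise exceeds the allowed noise."""
    scales = [1.0] * len(bands)
    for iteration in range(max_iterations):
        step, quantized = rate_loop(bands, scales, bit_budget)
        noise = []
        for band, q, s in zip(bands, quantized, scales):
            rec = dequantize_band(q, step, s)
            noise.append(sum((x - y) ** 2 for x, y in zip(band, rec)) / len(band))
        too_noisy = [i for i, (n, a) in enumerate(zip(noise, allowed_noise)) if n > a]
        # Stop on success, when every band is too noisy, or at the iteration cap.
        if (not too_noisy or len(too_noisy) == len(bands)
                or iteration == max_iterations - 1):
            break
        for i in too_noisy:
            scales[i] *= 2.0 ** 0.5
    return step, scales, quantized, noise

bands = [[100.0, -80.0, 60.0], [5.0, -4.0, 3.0]]
step, scales, quantized, noise = noise_loop(bands, [50.0, 0.05], bit_budget=20)
easy_step, easy_scales, _, _ = noise_loop([[10.0, -3.0]], [1e9], bit_budget=1000)
```

The iteration cap is what guarantees termination when the perceptual model asks for less noise than the bit budget can deliver. With a generous budget and a loose threshold, the loops exit on the first pass with the default step size and scalefactors.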