
A New Bit-Allocation Scheme for Coding of Digital Audio Using Psychoacoustic Modeling

Abstract:

In this paper we present a modified adaptive bit-allocation scheme for coding audio signals using psychoacoustic modeling. Statistical redundancy reduction (lossless coding) generally gives compression ratios of up to 2. We increase this ratio by removing perceptual irrelevancy (lossy coding). The MPEG-1 Layer-1 psychoacoustic model is used, and compression is improved by modifying the bit-allocation scheme; without a second-stage (lossless) compression, it gives results comparable to those of MP3 (which uses Huffman coding as a second stage). To increase the ratio further, some of the high-frequency sub-bands that contribute little to the signal are removed. Subjective evaluation is the best method of assessing audio quality: original and reconstructed signals are presented to listeners, who grade and/or compare them according to perceived quality. Without quality degradation, the achieved bit rate is 90 kbps, down from an original bit rate of 706 kbps.

I. INTRODUCTION

Audio compression is concerned with the efficient transmission of audio data with good quality. The audio data on a compact disc is sampled at 44.1 kHz, which at 16 bits/sample requires an uncompressed data rate of about 706 kbps for mono. At this data rate, audio signals require large memory and bandwidth.

Digital audio broadcasting, communication systems and the internet demand high-quality signals at low bit rates. Digital audio-coding techniques can be classified into two groups: time-domain and frequency-domain processing. Time-domain techniques can be implemented with low complexity, but they require more than 10 bits/sample to maintain high quality. Most of the known techniques belong to the frequency domain. Traditional speech coders, designed specifically for speech signals, achieve compression by using models of speech production based on the human vocal tract. However, these coders are not effective when the signal to be coded is not human speech but some other signal such as music. Hence, in order to code a variety of audio signals, the characteristics of the final receiver, i.e., the human hearing system, are exploited. To achieve a high compression ratio, information that is perceptually irrelevant to the ear is discarded.

Perceptual audio coders use models of auditory masking to identify the inaudible parts of audio signals. This is called psychoacoustic masking, in which the masked signal must lie below the masking threshold. The noise-masking phenomenon has been observed in a wide variety of psychoacoustic experiments. Masking occurs whenever the presence of a strong audio signal makes weaker signals in its spectral neighborhood imperceptible.

With the present algorithm, high quality can be achieved at bit rates of 2 bits/sample (i.e., about 90 kbps) or above. Sub-band coding is used to map the audio signal into the frequency domain. This technique by itself yields some compression because of its frequency-decomposition capabilities.
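The data rates quoted above follow from simple arithmetic, which the following minimal sketch reproduces. The function name `bit_rate_kbps` is only illustrative; 1 kbps is taken as 1000 bit/s.

```python
def bit_rate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed PCM bit rate in kbps (1 kbps = 1000 bit/s)."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0

# Mono CD audio at 16 bits/sample, and the 2 bits/sample operating point.
print(bit_rate_kbps(44100, 16))
print(bit_rate_kbps(44100, 2))
```

The two values come out at 705.6 kbps and 88.2 kbps, which the text rounds to 706 kbps and 90 kbps.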

II. OVERALL STRUCTURE OF THE PROCESS

1. The audio signal is decomposed into M=32 equal sub-bands using a perfect-reconstruction cosine-modulated M-band analysis filter bank, and each sub-band is decimated by M.

2. Psychoacoustic modeling is performed in parallel to calculate the global threshold of hearing.

3. The bit-allocation to different sub-bands is calculated using this threshold.

4. The signals in different sub-bands are quantized using bit-allocation-I.

5. Then bit-allocation-II evaluates the actual number of bits needed, and the signals are coded accordingly. The coded signals are sent to the decoder.

6. The decoder simply passes these signals, after M-fold expansion, through the synthesis filters to reconstruct the output signal.

[Figure: the process at the encoder and decoder]

The following section describes the sub-band decomposition and filter banks. Section IV describes the psychoacoustic modeling of standard MPEG-1, and Section V gives details of the bit-allocation and quantization scheme.

III. SUBBAND DECOMPOSITION AND FILTER BANKS

The filter banks divide the time-domain input into frequency sub-bands and generate a time-indexed series of coefficients representing the frequency-localized signal power within each band. By providing explicit information about the distribution of signal power, and hence masking power, over the time-frequency plane, the filter bank plays an essential role in identifying perceptual irrelevancies when used in conjunction with a psychoacoustic model. The filter bank also facilitates perceptual noise shaping. In addition, by decomposing the signal into its constituent frequency components, the filter bank assists in the reduction of statistical redundancies.

Filter banks for audio coding: Design considerations

The choice of an appropriate filter bank is essential to the success of a perceptual audio coder. The properties of analysis filter banks should adequately match the input signal. Algorithm designers face an important and difficult trade-off between time and frequency resolution. Failure to choose a suitable filter bank can result in perceptible artifacts in the output (e.g., pre-echoes) or impractically low coding gain and attendant high bit rates.

Most common audio content is highly non-stationary and contains tonal and atonal energy, as well as both steady-state and transient intervals. As a rule, signal models tend to remain constant for a long time and then change suddenly. Therefore the ideal analysis filter bank should have time-varying resolution in both the time and frequency domains. Filter banks emulating the properties of the human auditory system, i.e., those containing non-uniform critical-bandwidth sub-bands, have proven highly effective in the coding of highly transient signals. For dense, harmonically structured signals, on the other hand, critical-band filter banks have been less successful because of their reduced coding gain relative to filter banks with a large number of sub-bands. The following filter-bank characteristics are highly desirable for audio coding:

• Signal-adaptive time-frequency tiling

• Good channel separation

• Low-resolution, critical-band mode, e.g., 32 sub-bands

• Efficient resolution switching

• Minimum blocking artifacts

• High-resolution mode, up to 4096 sub-bands

• Strong stop-band attenuation

• Perfect reconstruction

• Critical sampling

• Availability of fast algorithms

Good channel separation and stop-band attenuation are particularly desirable for signals containing very little irrelevancy. Maximum redundancy removal is essential to maintaining high quality at low bit rates for these signals. Blocking artifacts in time-varying filter banks can lead to audible distortion in the reconstruction.

Although pseudo-QMF banks have been used quite successfully in perceptual audio coders, the overall system design must compensate for the inherent distortion induced by the lack of perfect reconstruction to avoid audible artifacts in the codec output. The compensation strategy may be a simple one (e.g., increased prototype filter length), but perfect reconstruction is actually preferable because it confines the sources of output distortion to the quantization stage. For the special case of filter length L=2M, filter banks with perfect reconstruction and low complexity can be obtained. The perfect-reconstruction properties of these banks were first demonstrated by Princen and Bradley, using time-domain arguments, in the development of the Time Domain Aliasing Cancellation (TDAC) filter bank. The analysis filter impulse responses are given by

$$h_k(n) = w(n)\sqrt{\frac{2}{M}}\cos\left[\frac{(2n+M+1)(2k+1)\pi}{4M}\right]$$

for k = 0, 1, ..., M-1 and n = 0, 1, ..., L-1,

and the synthesis filters, to satisfy the overall linear-phase constraint, are obtained by time reversal, i.e.,

$$g_k(n) = h_k(2M-1-n)$$

where w(n) is an FIR prototype lowpass filter that must satisfy the following linear-phase and Nyquist constraints (PR constraints):

$$w(2M-1-n) = w(n)$$

$$w^2(n) + w^2(n+M) = 1$$

In this algorithm a sine window is used, which is defined as

$$w(n) = \sin\left[\frac{(2n+1)\pi}{4M}\right]$$

The signal is decomposed into 32 sub-bands using this filter bank, and each band is decimated by 32 (i.e., critically sampled).
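As a minimal sketch of the definitions above, the following Python code builds the sine window and the analysis and synthesis impulse responses for L = 2M, and checks the two PR constraints numerically. The function names are illustrative and not taken from the original implementation, which was written in Matlab.

```python
import math

def sine_window(M):
    """Sine window of length L = 2M: w(n) = sin((2n + 1) * pi / (4M))."""
    return [math.sin((2 * n + 1) * math.pi / (4 * M)) for n in range(2 * M)]

def analysis_filter(k, M, w):
    """Impulse response h_k(n) of sub-band k, for n = 0 .. 2M-1."""
    scale = math.sqrt(2.0 / M)
    return [
        w[n] * scale * math.cos((2 * n + M + 1) * (2 * k + 1) * math.pi / (4 * M))
        for n in range(2 * M)
    ]

def synthesis_filter(k, M, w):
    """Time-reversed analysis filter: g_k(n) = h_k(2M - 1 - n)."""
    return analysis_filter(k, M, w)[::-1]

M = 32
w = sine_window(M)
# Largest deviations from the symmetry and power-complementarity constraints.
symmetry_error = max(abs(w[2 * M - 1 - n] - w[n]) for n in range(2 * M))
nyquist_error = max(abs(w[n] ** 2 + w[n + M] ** 2 - 1.0) for n in range(M))
print(symmetry_error, nyquist_error)
```

Both errors should be at the level of floating-point rounding, since w(n+M) equals the cosine of the same argument as w(n).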

IV. PSYCHOACOUSTIC MODELING

High-precision engineering models for high-fidelity audio do not exist. Therefore, audio coding algorithms must rely upon generalized receiver models to optimize coding efficiency. In the case of audio, the receiver is ultimately the human ear, and sound perception is affected by its masking properties. The field of psychoacoustics has made significant progress toward characterizing human auditory perception, particularly the time-frequency analysis capabilities of the inner ear. Although applying perceptual rules to signal coding is not a new idea, most current audio coders achieve compression by exploiting the fact that irrelevant signal information is not detectable even by a well-trained or sensitive listener. Irrelevant information is identified during signal analysis by incorporating into the coder several psychoacoustic principles, including the absolute threshold of hearing, critical-band frequency analysis, simultaneous masking, the spread of masking along the basilar membrane, and temporal masking. Combining these psychoacoustic notions with basic properties of signal quantization has led to the development of perceptual entropy, a quantitative estimate of the fundamental limit of transparent audio signal compression. This section reviews psychoacoustic fundamentals and gives the details of the psychoacoustic model.

Absolute Threshold of Hearing:

ATH characterizes the amount of energy needed in a pure tone such that it can be detected in quiet. It is typically expressed in terms of dB SPL. The frequency dependence of this threshold was quantified as early as 1940 when Fletcher reported test results for a range of listeners which were generated in a National Institutes of Health (NIH) study of typical American hearing acuity. The quiet threshold is well approximated by the non-linear function given by

$$T_q(f) = 3.64\left(\frac{f}{1000}\right)^{-0.8} - 6.5\,e^{-0.6\left(\frac{f}{1000}-3.3\right)^2} + 10^{-3}\left(\frac{f}{1000}\right)^4 \quad \text{(dB SPL)}$$

where f is in Hz. When applied to signal compression, Tq(f) could be interpreted naively as a maximum allowable energy level for coding distortions introduced in the frequency domain.
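The quiet-threshold approximation can be evaluated directly. The sketch below is a straightforward transcription of the formula; the function name is illustrative, and f must be strictly positive because of the negative exponent.

```python
import math

def threshold_in_quiet_db(f_hz):
    """Approximate absolute threshold of hearing Tq(f) in dB SPL, for f > 0 in Hz."""
    khz = f_hz / 1000.0
    return (
        3.64 * khz ** -0.8
        - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
        + 1e-3 * khz ** 4
    )

for f in (100, 1000, 3300, 10000):
    print(f, round(threshold_in_quiet_db(f), 2))
```

The curve is high at low frequencies, dips below 0 dB SPL near 3.3 kHz where the ear is most sensitive, and rises again toward high frequencies.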

Critical Bands:

Considered on its own, the absolute threshold is of limited value in the coding context. The detection threshold for quantization noise is a modified version of the absolute threshold, with its shape determined by the stimuli present at any given time. Since stimuli are in general time-varying, the detection threshold is also a time-varying function of the input signal. In order to estimate this threshold, we must first understand how the ear performs spectral analysis. A frequency-to-place transformation takes place in the cochlea (inner ear), along the basilar membrane. Distinct regions in the cochlea, each with a set of neural receptors, are tuned to different frequency bands. In fact, the cochlea can be viewed as a bank of highly overlapping bandpass filters. The magnitude responses are asymmetric and non-linear (level-dependent). Moreover, the cochlear filter passbands are of non-uniform bandwidth, and the bandwidths increase with increasing frequency. The critical bandwidth is a function of frequency that quantifies the cochlear filter passbands. The underlying notion is that the loudness (perceived intensity) remains constant for a narrowband noise source presented at a constant SPL even as the noise bandwidth is increased up to the critical bandwidth. For any increase beyond the critical bandwidth, the loudness begins to increase. The critical bandwidth tends to remain constant (about 100 Hz) up to 500 Hz, and above that increases to approximately 20% of the center frequency. Its approximate expression is given by

$$BW_c(f) = 25 + 75\left[1 + 1.4\left(\frac{f}{1000}\right)^2\right]^{0.69} \quad \text{(Hz)}$$

Although the function BWc is continuous, it is useful when building practical systems to treat the ear as a discrete set of band pass filters. A distance of one critical band is commonly referred to as one bark in the literature. The function

$$z(f) = 13\arctan(0.00076 f) + 3.5\arctan\left[\left(\frac{f}{7500}\right)^2\right] \quad \text{(Bark)}$$

is often used to convert from frequency in Hz to the Bark scale.
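Both expressions are simple to evaluate. The following sketch transcribes them into Python; the function names are illustrative.

```python
import math

def critical_bandwidth_hz(f_hz):
    """Approximate critical bandwidth BWc(f) in Hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

def hz_to_bark(f_hz):
    """Convert frequency in Hz to the Bark scale, z(f)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

for f in (100, 500, 1000, 5000, 15000):
    print(f, round(critical_bandwidth_hz(f), 1), round(hz_to_bark(f), 2))
```

At very low frequencies the bandwidth stays close to 100 Hz, and the Bark value grows monotonically with frequency, which is what allows masker distances to be measured in Bark later in the model.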

Simultaneous masking and the Spread of masking:

Masking refers to a process in which one sound is rendered inaudible because of the presence of another sound. Simultaneous masking is a frequency-domain phenomenon that can be observed whenever two or more stimuli are presented simultaneously to the auditory system. Although arbitrary audio spectra may contain complex simultaneous masking scenarios, for the purpose of shaping coding distortions it is convenient to distinguish between only two types of simultaneous masking, namely tone-masking-noise and noise-masking-tone. Inter-band masking has also been observed, i.e., a masker centered within one critical band has some predictable effect on detection thresholds in other critical bands. This effect, also known as the spread of masking, is often modeled in coding applications by an approximately triangular spreading function that has slopes of +25 and -10 dB per Bark.

Psychoacoustic model:

The model uses a 1024-point FFT for high-resolution spectral analysis (43.07 Hz per bin), then estimates, for each input frame, individual simultaneous masking thresholds due to the presence of tone-like and noise-like maskers in the original spectrum. A global masking threshold is then estimated for a subset of the original 512 frequency bins by additive combination of the tonal and atonal masking thresholds. The five steps leading to the computation of the global masking threshold are as follows:

Step 1: Spectral analysis and SPL normalization

First the incoming audio samples s(n) are normalized according to FFT length N, and the number of bits per sample, b, using the relation

$$x(n) = \frac{s(n)}{N\,2^{b-1}}$$

Normalization references the power spectrum to a 0-dB maximum. The normalized input x(n) is then segmented into frames (1024 samples) using a 1/16th overlapped Hann window. A power spectral density estimate P(k) is then obtained using a 1024-point FFT, i.e.,

$$P(k) = PN + 10\log_{10}\left|\sum_{n=0}^{N-1} w(n)\,x(n)\,e^{-j2\pi kn/N}\right|^2, \quad 0 \le k \le \frac{N}{2}$$

where the power normalization term, PN, is fixed at 96.3 dB and the Hann window, w(n), is defined as

$$w(n) = \frac{1}{2}\left[1 - \cos\left(\frac{2\pi n}{N}\right)\right]$$
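Step 1 can be sketched for a single frame as follows. This is an illustrative pure-Python version that evaluates the DFT sum directly (a practical implementation would use an FFT routine); the function names and the tiny floor added before the logarithm are assumptions, while the normalization, the Hann window and PN = 96.3 dB follow the text.

```python
import cmath
import math

def hann(N):
    """Hann window w(n) = 0.5 * (1 - cos(2*pi*n/N))."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N)]

def psd_db(samples, bits=16, pn=96.3):
    """PSD estimate P(k) in dB for bins k = 0 .. N/2 of one frame."""
    N = len(samples)
    w = hann(N)
    # Normalize by FFT length and full-scale amplitude, as in the text.
    x = [s / (N * 2 ** (bits - 1)) for s in samples]
    psd = []
    for k in range(N // 2 + 1):
        acc = sum(
            w[n] * x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)
        )
        # The 1e-30 floor (an assumption) avoids log10(0) for empty bins.
        psd.append(pn + 10.0 * math.log10(abs(acc) ** 2 + 1e-30))
    return psd

# Usage: a short test tone centred exactly on bin 8 of a 64-sample frame.
N = 64
tone = [10000.0 * math.cos(2.0 * math.pi * 8 * n / N) for n in range(N)]
P = psd_db(tone)
print(max(range(len(P)), key=lambda k: P[k]))
```

A 64-sample frame is used in the usage lines only to keep the direct DFT fast; the model in the text uses N = 1024.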

Step 2: Identification of Tonal and Noise maskers

Local maxima in the sample PSD that exceed neighboring components within a certain bark distance by at least 7 dB are classified as tonal. Specifically, the tonal set, ST, is defined as

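$$S_T = \{\, P(k) \mid P(k) > P(k \pm 1),\; P(k) > P(k \pm \Delta k) + 7\ \text{dB} \,\}$$

where

$$\Delta k \in \begin{cases} [2,4], & 4 \le k < 126 \quad (0.17\text{--}5.5\ \text{kHz}) \\ [2,6], & 126 \le k < 254 \quad (5.5\text{--}11\ \text{kHz}) \\ [2,12], & 254 \le k < 512 \quad (11\text{--}22\ \text{kHz}) \end{cases}$$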

Tonal maskers, PTM(k), are computed from the spectral peaks listed in ST as follows

$$P_{TM}(k) = 10\log_{10}\sum_{j=-1}^{1} 10^{0.1P(k+j)} \quad \text{(dB)}$$

A single noise masker for each critical band, PNM(g), is then computed from the (remaining) spectral lines not within the ±Δk neighborhood of a tonal masker using the sum

$$P_{NM}(g) = 10\log_{10}\sum_{j} 10^{0.1P(j)} \quad \text{(dB)},$$

where g is defined to be the geometric mean spectral line of the critical band, i.e.,

$$g = \left(\prod_{j=l}^{u} j\right)^{1/(u-l+1)}$$

and l and u are the lower and upper spectral boundaries of the critical band, respectively. The idea behind the expression for PNM(g) is that residual energy within a critical band not associated with a tonal masker must, by default, be associated with a noise masker.
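The tonal part of Step 2 can be sketched as below. The code assumes a PSD list `P` in dB indexed by bin 0..N/2 with N = 1024, combines the three lines around a peak in the power domain (the same convention as the noise-masker sum), and, as a simplifying assumption, skips bins whose ±Δk neighborhood would fall outside the spectrum. Noise maskers are not computed here because they need a table of critical-band boundaries that the text does not list.

```python
import math

def delta_k_range(k):
    """Neighbour offsets to test for bin k, per the three ranges in the text."""
    if 4 <= k < 126:
        return range(2, 5)     # offsets 2..4
    if 126 <= k < 254:
        return range(2, 7)     # offsets 2..6
    if 254 <= k < 512:
        return range(2, 13)    # offsets 2..12
    return range(0)            # bins outside the ranges given in the text

def find_tonal_maskers(P):
    """Return {bin: P_TM in dB} for the tonal peaks found in P."""
    maskers = {}
    for k in range(1, len(P) - 1):
        offsets = delta_k_range(k)
        if len(offsets) == 0 or k + offsets[-1] >= len(P):
            continue  # assumption: skip bins too close to the band edge
        if not (P[k] > P[k - 1] and P[k] > P[k + 1]):
            continue  # not a local maximum
        if all(P[k] > P[k - d] + 7 and P[k] > P[k + d] + 7 for d in offsets):
            power = sum(10 ** (0.1 * P[k + j]) for j in (-1, 0, 1))
            maskers[k] = 10 * math.log10(power)
    return maskers

# Usage: a flat 0 dB spectrum with a single peak at bin 100.
P = [0.0] * 513
P[99], P[100], P[101] = 50.0, 60.0, 50.0
print(find_tonal_maskers(P))
```

For this toy spectrum only bin 100 qualifies, and its masker level is slightly above 60 dB because the two neighbouring lines add a little power.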

Step 3: Decimation and Reorganization of maskers

In this step, the number of maskers is reduced using two criteria. First any tonal or noise maskers below the absolute threshold are discarded, i.e., only maskers which satisfy

$$P_{TM,NM}(k) \ge T_q(k)$$

are retained, where Tq(k) is the SPL of the threshold in quiet at spectral line k. Next, a sliding 0.5-Bark-wide window is used to replace any pair of maskers occurring within a distance of 0.5 Bark by the stronger of the two. After the sliding-window procedure, masker frequency bins are reorganized according to the subsampling scheme

$$\text{if } P_{TM,NM}(i) > P_{TM,NM}(k): \quad P_{TM,NM}(k) = 0$$

where k ranges over the 0.5 Bark neighborhood of i.
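The two decimation criteria can be sketched as follows. The representation of a masker as a `(bark, level_db, quiet_threshold_db)` tuple and the greedy left-to-right pass are assumptions made for illustration; the original implementation may organize this differently.

```python
def decimate_maskers(maskers, window_bark=0.5):
    """Drop inaudible maskers, then keep the stronger of any close pair.

    maskers: list of (bark, level_db, quiet_threshold_db) tuples.
    """
    # Criterion 1: keep only maskers at or above the threshold in quiet.
    audible = sorted((m for m in maskers if m[1] >= m[2]), key=lambda m: m[0])
    kept = []
    for m in audible:
        # Criterion 2: within 0.5 Bark of the last kept masker, keep the stronger.
        if kept and m[0] - kept[-1][0] <= window_bark:
            if m[1] > kept[-1][1]:
                kept[-1] = m
        else:
            kept.append(m)
    return kept

# Usage: two close maskers, one inaudible masker, one isolated masker.
example = [(5.0, 60.0, 10.0), (5.3, 70.0, 10.0), (8.0, 5.0, 10.0), (9.0, 40.0, 10.0)]
print(decimate_maskers(example))
```

In the usage example the masker at 8.0 Bark is discarded for being below the quiet threshold, and the pair at 5.0 and 5.3 Bark collapses to the stronger one.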

Step 4: Calculation of individual masking thresholds

Having obtained a decimated set of tonal and noise maskers, individual tone and noise masking thresholds can be computed. Each individual threshold represents a masking contribution at frequency bin i due to the tone or noise masker located at bin j (reorganized during step 3). Tonal masking thresholds, TTM(i,j), are given by

$$T_{TM}(i,j) = P_{TM}(j) - 0.275\,z(j) + SF(i,j) - 6.025 \quad \text{(dB SPL)}$$

where PTM(j) denotes the SPL of the tonal masker in frequency bin j, z(j) denotes the Bark frequency of bin j, and the spread of masking from masker bin j to maskee bin i, SF(i,j), is modeled by the expression

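$$SF(i,j) = \begin{cases} 17\,\Delta z - 0.4\,P_{TM}(j) + 11, & -3 \le \Delta z < -1 \\ \left(0.4\,P_{TM}(j) + 6\right)\Delta z, & -1 \le \Delta z < 0 \\ -17\,\Delta z, & 0 \le \Delta z < 1 \\ \left(0.15\,P_{TM}(j) - 17\right)\Delta z - 0.15\,P_{TM}(j), & 1 \le \Delta z \le 8 \end{cases} \quad \text{(dB SPL)}$$

i.e., as a piecewise linear function of masker level, PTM(j), and Bark maskee-masker separation, Δz = z(i) - z(j).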

Individual noise-masking thresholds, TNM(i,j), are given by

$$T_{NM}(i,j) = P_{NM}(j) - 0.175\,z(j) + SF(i,j) - 2.025 \quad \text{(dB SPL)}$$

where PNM(j) denotes the SPL of the noise masker in frequency bin j, z(j) denotes the Bark frequency of bin j, and SF(i,j) is obtained by replacing PTM(j) with PNM(j) everywhere in the spreading-function expression above.
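The spreading function and the two individual thresholds can be sketched as below. The function names are illustrative, the maskee-masker separation in Bark is passed in as `dz`, and returning `None` outside the -3 to 8 Bark range (meaning no masking contribution) is an assumption of this sketch.

```python
def spreading_db(dz, masker_db):
    """Piecewise-linear spread of masking for a Bark separation dz."""
    if -3 <= dz < -1:
        return 17 * dz - 0.4 * masker_db + 11
    if -1 <= dz < 0:
        return (0.4 * masker_db + 6) * dz
    if 0 <= dz < 1:
        return -17 * dz
    if 1 <= dz <= 8:
        return (0.15 * masker_db - 17) * dz - 0.15 * masker_db
    return None  # assumption: no contribution outside the modeled range

def tonal_threshold_db(p_tm, z_masker, z_maskee):
    """Threshold at the maskee due to a tonal masker, or None if out of range."""
    sf = spreading_db(z_maskee - z_masker, p_tm)
    return None if sf is None else p_tm - 0.275 * z_masker + sf - 6.025

def noise_threshold_db(p_nm, z_masker, z_maskee):
    """Threshold at the maskee due to a noise masker, or None if out of range."""
    sf = spreading_db(z_maskee - z_masker, p_nm)
    return None if sf is None else p_nm - 0.175 * z_masker + sf - 2.025

# Usage: a 60 dB masker at 10 Bark, evaluated at several maskee positions.
for z in (8.0, 9.5, 10.0, 10.5, 12.0):
    print(z, tonal_threshold_db(60.0, 10.0, z), noise_threshold_db(60.0, 10.0, z))
```

The pieces join continuously at Δz = -1 and Δz = 1, and for the same level a noise masker yields a higher threshold than a tonal masker because of the smaller offsets in its formula.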

Step 5: Calculation of global masking thresholds

In this step, individual masking thresholds are combined to estimate a global masking threshold for each frequency bin. The model assumes that masking effects are additive. The global masking threshold, Tg(i), is therefore obtained by computing the sum

$$T_g(i) = 10^{T_q(i)/10} + \sum_{l=1}^{L} 10^{T_{TM}(i,l)/10} + \sum_{m=1}^{M} 10^{T_{NM}(i,m)/10} \quad \text{(amplitude units)}$$

where Tq(i) is the absolute hearing threshold for frequency i, TTM(i,l) and TNM(i,m) are the individual masking thresholds from step 4, and L and M are the number of tonal and noise maskers, respectively, identified during step 3. In other words, the global threshold for each frequency bin represents a signal dependent, additive modification of the absolute threshold due to the basilar spread of all tonal and noise maskers in the signal power spectrum.
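The additive combination for one frequency bin can be sketched as follows. The function name and argument layout are illustrative; the inputs are the quiet threshold and the lists of individual tonal and noise thresholds for that bin, all in dB, and the result is in the linear units of the sum above.

```python
def global_threshold(tq_db, tonal_db, noise_db):
    """Combine dB thresholds additively into one linear-domain value."""
    total = 10 ** (tq_db / 10)
    total += sum(10 ** (t / 10) for t in tonal_db)
    total += sum(10 ** (t / 10) for t in noise_db)
    return total

# Usage: a 0 dB quiet threshold, one tonal (10 dB) and one noise (20 dB) term.
print(global_threshold(0.0, [10.0], [20.0]))
```

With no maskers present the result reduces to the quiet threshold alone, which is the sense in which the global threshold is a signal-dependent modification of the absolute threshold.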

Step 1: Obtain the PSD and express it in dB SPL. The absolute threshold is superimposed.

Step 2: Tonal maskers are identified and denoted by the 'O' symbol; noise maskers are identified and denoted by the 'X' symbol.

Steps 3, 4: Spreading functions are associated with each of the individual tonal maskers satisfying the rules outlined in the text. Spreading functions are also associated with each of the individual noise maskers that were extracted after the tonal maskers had been eliminated from consideration, as described in the text.

Step 5: A global masking threshold is obtained by combining the individual thresholds as described in the text. The maximum of the global threshold and the absolute threshold is taken at each point in frequency to be the final global threshold. The figure clearly shows that some portions of the input spectrum require SNRs of better than 20 dB to prevent audible distortion, while other spectral regions require less than 3 dB SNR. In fact, some high-frequency portions of the signal spectrum are masked and therefore perceptually irrelevant, ultimately requiring no bits for quantization without the introduction of artifacts.

V. BIT-ALLOCATION AND QUANTIZATION

The bit allocation for the different sub-bands is calculated using the global threshold obtained from the psychoacoustic model. First the minimum masking threshold for each sub-band is determined and is used to shape the distortion in quantizing the sub-band samples. Since the noise introduced by quantization is at most (step-size)/2, this maximum error is set equal to the minimum of the calculated global threshold in the sub-band (min_thresh), which gives a step size of 2·min_thresh. The number of bits required to quantize a given sub-band is then calculated using

$$b = \log_2\left(\frac{R}{\text{min\_thresh}}\right) - 1$$

where R is the range of input samples = 65536. The samples of each sub-band are then quantized with this bit allocation.
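The first-stage allocation can be sketched as follows. The function names are illustrative, `min_thresh` is assumed to be in the same amplitude units as the samples, and rounding the bit count up and clamping it at zero are assumptions, since the text does not say how fractional or negative values are handled.

```python
import math

def bits_stage1(min_thresh, full_range=65536):
    """Bits for one sub-band: b = log2(R / min_thresh) - 1, rounded up, >= 0."""
    b = math.log2(full_range / min_thresh) - 1
    return max(0, math.ceil(b))

def quantize(samples, min_thresh):
    """Uniform quantizer with step 2 * min_thresh, so the error is <= min_thresh."""
    step = 2.0 * min_thresh
    return [round(s / step) for s in samples]

# Usage: a sub-band whose minimum masking threshold is 8 amplitude units.
print(bits_stage1(8.0))
print(quantize([100, -100, 3], 8.0))
```

A larger masking threshold gives a coarser step and fewer bits, which is how the psychoacoustic model steers the quantization noise below audibility.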

To increase the compression ratio obtained using psychoacoustic modeling, we use a modified bit-allocation scheme that is adaptive in nature. In this scheme, after quantization has been performed according to the bit allocation described above, the maximum absolute value of the samples (i.e., the range of values) in each sub-band is determined, and from this the number of bits actually needed to transmit the samples of that sub-band without loss of information is found. Since the decoder must know the number of bits transmitted, both the number of bits calculated from the above equation and the number of bits actually needed are sent to the decoder for each sub-band. In this way the compression ratio is approximately doubled, depending on the dynamic characteristics of the audio signal. To increase the ratio still further, we use an adaptive high-frequency band elimination process. Since high frequencies contribute a smaller share of the signal than low ones, we do not transmit the samples of those sub-bands that have fewer than a specified number of non-zero samples. This is the scheme employed in this paper to bring the compression ratio close to that of methods involving two-stage compression.
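The second-stage idea can be sketched as below for one block of quantized sub-band samples. The function names, the sign-plus-magnitude bit count and the `min_nonzero` parameter are assumptions made for illustration; the text does not specify the binary representation or the elimination threshold.

```python
def bits_stage2(quantized):
    """Bits actually needed per sample: magnitude bits of the peak plus a sign bit."""
    peak = max((abs(v) for v in quantized), default=0)
    if peak == 0:
        return 0  # an all-zero band needs no sample bits
    return peak.bit_length() + 1

def keep_band(quantized, min_nonzero):
    """Adaptive band elimination: transmit only bands with enough non-zero samples."""
    nonzero = sum(1 for v in quantized if v != 0)
    return nonzero >= min_nonzero

# Usage: a band allocated 12 bits by stage one whose samples fit in far fewer.
band = [6, -6, 0, 1, -3, 0, 0, 2]
print(bits_stage2(band), keep_band(band, min_nonzero=3))
```

The saving comes from the gap between the first-stage allocation, which depends only on the threshold, and the range the quantized samples actually occupy in a given block.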

The encoded samples are then sent to the decoder. The decoder simply applies synthesis filters to the encoded sub-band samples (after M-fold expansion), to reconstruct the output signal.

VI. RESULTS

We experimented with a set of different audio signals (sampled at 44.1 kHz) in Matlab (MathWorks, Inc.), and the results (average compression ratios) are summarized below.

| Audio type | Compression ratio with psychoacoustic model only (MPEG-1 Layer 1) | Compression ratio with modified bit allocation | Compression ratio with adaptive high-frequency band elimination | Compression ratio with half of the sub-bands eliminated |
|---|---|---|---|---|
| Pop music | 1.74 | 6.38 | 7.59 | 8.99 |
| Rock music | 1.82 | 6.94 | 8.52 | 9.70 |
| Male voice | 1.78 | 6.66 | 8.01 | 9.16 |
| Female voice | 1.79 | 7.10 | 8.18 | 9.78 |
| Gun fighting | 1.84 | 7.35 | 8.40 | 10.34 |
| Trumpet | 1.73 | 6.83 | 9.55 | 10.61 |

VII. CONCLUSIONS

We showed how psychoacoustic modeling can be used to compress audio data by reducing the perceptual irrelevancies present in the signals. On its own, however, this does not give a good compression ratio. To obtain a higher compression ratio, we presented a novel adaptive bit-allocation method that lets the algorithm compete with MP3 (which uses advanced coding algorithms). MP3 adds a very efficient noise-shaping algorithm, which together with Huffman coding gives superior results. Had the same type of filter bank been used in this work, even better results might have been obtained. We have implemented the same coding blocks as in MP1/MP2, but with a changed bit-allocation scheme. Compared with them, our algorithm is very efficient.
