Audio compression gets better and more complex
There's no shortage of methods for compressing digital audio. The best approach depends on your storage, your fidelity needs, and the amount of processing at hand.
The terrific popularity of portable multimedia players and Internet media services such as iTunes has generated a lot of interest in audio coding. Audio coding is the art and science of compressing audio signals for efficient storage (small file size) and high-quality streaming (low bandwidth). In this article, I'll briefly cover the principles of audio coding and describe at a high level the popular MPEG audio codecs (MP3, AAC) as well as a few proprietary alternatives. The main focus of this article is on recent advances in audio coding technology and ongoing work in the MPEG audio committee. Specifically, we'll take a look at a number of recently developed techniques such as spectral band replication (SBR), integer MDCT (IntMDCT), parametric audio coding, and binaural cue coding (BCC), each of which is enabling new applications and improved quality at lower bit rates.
Basics of audio coding
A "raw" multichannel digital audio signal consists of sequences of 16-bit samples (one sequence per channel). Typical sampling frequencies in high-quality audio are 44.1KHz or 48KHz, although lower sampling frequencies are used for sub-woofer channels. Raw audio needs huge space for storage, or equivalently, high bandwidth for streaming. As an example, one minute of DVD-quality audio data requires almost 30MB of space or 3.5Mbps of bandwidth for real-time streaming. Audio coding can reduce this data rate by a factor of 20 with negligible impact on perceived audio quality.
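The arithmetic behind figures like these can be sketched in a few lines of Python. The channel layout used here (five full-band channels at 48KHz, with the low-rate sub-woofer channel ignored) is an assumption for illustration only; the text above does not state which layout its numbers are based on.

```python
def raw_audio_rate(channels, sample_rate_hz, bits_per_sample=16):
    """Return the raw (uncompressed) PCM bit rate in bits per second."""
    return channels * sample_rate_hz * bits_per_sample

# Assumed layout: five full-band channels, 48KHz sampling, 16-bit samples.
bps = raw_audio_rate(channels=5, sample_rate_hz=48_000)
megabits_per_second = bps / 1e6
megabytes_per_minute = bps * 60 / 8 / 1e6

print(f"{megabits_per_second:.2f} Mbps, {megabytes_per_minute:.1f} MB per minute")
```

Under this assumption the raw stream comes to 3.84Mbps, or 28.8MB per minute, so a factor-of-20 reduction would bring it down to roughly 192Kbps.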
Figure 1: Frequency masking
Audio coding relies on an in-depth understanding of the human hearing system. As an example, when we listen to a strong tone of a particular frequency, we tend to be insensitive to the presence or absence of weaker sounds at nearby frequencies. This is known as frequency masking and is illustrated in Figure 1. The tone at frequency ƒ1 is "masked" (and therefore inaudible) by the loud tone at frequency ƒ0. The tone at frequency ƒ2 is audible, however, since it lies outside the "masking region" of frequency ƒ0.
The frequency masking phenomenon is exploited extensively in audio coding. Each frame of audio (a collection of samples, for example 1,024 samples) is first transformed to the frequency domain, thereby decomposing it into a collection of tones at various frequencies. The signal is then analyzed to determine which tones would be irrelevant because of masking by nearby louder tones. By coding only the tones that are audible to the human ear, tremendous compression ratios can be achieved.
Masking is a complex phenomenon and what I just described is only a simplified model. In practical audio codecs, a comprehensive psychoacoustic model that mimics the properties of the human ear will capture the comparative relevance of the audible frequency tones while deeming other tones as inaudible. A good psychoacoustic model is at the heart of all high-quality audio codecs.
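To make the simplified model concrete, here is a minimal Python sketch of a masking decision. The triangular threshold and the values of slope_db_per_hz and offset_db are illustrative assumptions, not figures from any standard; real psychoacoustic models work on a perceptual frequency scale and are far more elaborate.

```python
def audible_tones(tones, slope_db_per_hz=0.05, offset_db=10.0):
    """Return the tones that are not masked by a louder neighbor.

    tones: list of (frequency_hz, level_db) pairs.
    Assumed rule: a masker at level L hides any weaker tone whose level is
    below L - offset_db - slope_db_per_hz * (frequency distance in Hz).
    """
    kept = []
    for freq, level in tones:
        masked = False
        for m_freq, m_level in tones:
            if m_level <= level:
                continue  # only louder tones act as maskers in this sketch
            threshold = m_level - offset_db - slope_db_per_hz * abs(freq - m_freq)
            if level < threshold:
                masked = True
                break
        if not masked:
            kept.append((freq, level))
    return kept

# f0 is a loud tone, f1 a weak close neighbor, f2 a weak tone farther away.
tones = [(1000.0, 80.0), (1100.0, 40.0), (3000.0, 40.0)]
print(audible_tones(tones))
```

With these made-up numbers the weak tone at 1,100Hz is dropped, while the equally weak tone at 3,000Hz survives because it lies outside the assumed masking region of the loud tone.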
Step by step
Figure 2 shows the block-level organization of a typical audio encoder. The input audio signal is first transformed to the frequency domain to enable analysis. Reversible transforms are best used in this step. In the first-generation audio codecs such as Musicam, filter banks were employed. Of late, the modified discrete cosine transform (MDCT) is most preferred. As mentioned earlier, a psychoacoustic model is used to determine the relative importance (represented by a "step size") of each transform coefficient. A smaller step size indicates that the corresponding tone is more important. The next step is quantization, where each transform coefficient (equivalently, the tone amplitude) is scaled down by the step size and converted to an integer called a quantization index. Quantization is a lossy process and reduces the amount of information content significantly, thereby contributing to a large reduction in bit rate. Naturally, more important tones suffer less quantization error.
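The quantization step can be illustrated with a short Python sketch. The coefficient values and step sizes are invented for the example; in a real encoder the step sizes would come from the psychoacoustic model.

```python
def quantize(coefficients, step_sizes):
    """Scale each coefficient down by its step size and round to an integer index."""
    return [round(c / s) for c, s in zip(coefficients, step_sizes)]

def dequantize(indices, step_sizes):
    """Decoder side: scale each quantization index back up by its step size."""
    return [i * s for i, s in zip(indices, step_sizes)]

# Hypothetical transform coefficients; smaller step = more important tone.
coefficients = [10.3, -4.7, 0.9, 0.2]
step_sizes = [0.5, 1.0, 2.0, 4.0]

indices = quantize(coefficients, step_sizes)
reconstructed = dequantize(indices, step_sizes)
errors = [abs(c - r) for c, r in zip(coefficients, reconstructed)]
print(indices, reconstructed)
```

Here the indices come out as [21, -5, 0, 0]. The error on each coefficient is bounded by half its step size, which is why tones given a small step are reproduced more faithfully, while the two least important coefficients collapse to zero and cost almost nothing to code.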
Figure 2: Audio encoder
Figure 3: Audio decoder
Next, the quantization indices are entropy coded. Entropy coding is a lossless technique that uses fewer bits to represent more likely quantization indices. Huffman coding is the most commonly used technique in entropy coding. Finally the resulting data is formatted based on the standard specifications and packetized for storage or streaming.
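As a rough illustration of entropy coding, the following Python sketch builds a Huffman code for a short run of quantization indices. The input values are invented, and standard codecs use predefined code tables rather than building one per signal, so treat this as a sketch of the principle only.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code table {symbol: bit string} from a symbol sequence."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (count, unique tie breaker, {symbol: partial code}).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n0, _, table0 = heapq.heappop(heap)
        n1, _, table1 = heapq.heappop(heap)
        # Merge the two least likely groups, prefixing one bit to each code.
        merged = {s: "0" + c for s, c in table0.items()}
        merged.update({s: "1" + c for s, c in table1.items()})
        heapq.heappush(heap, (n0 + n1, next_id, merged))
        next_id += 1
    return heap[0][2]

# Hypothetical quantization indices: zero is by far the most likely value.
indices = [0, 0, 0, 0, 0, 1, 1, -1, -1, 3]
table = huffman_code(indices)
bitstream = "".join(table[i] for i in indices)
print(table, len(bitstream))
```

For this input the most frequent index (zero) gets a one-bit code and the whole sequence takes 18 bits, against 20 bits for a fixed two-bit code.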
It's easy to see that the decoder can follow a precise reverse process to reconstruct the audio signal. Figure 3 illustrates the typical decoder.
The basic block diagram of most standard audio codecs closely mimics the typical structure described above. However, each standard may add some extra blocks (also called tools) to support new features or improve compression performance.
As an example, one possible feature of an audio codec is support for scalability. In scalable audio codecs, the output bit stream is divided into multiple "layers" of bit streams. The first bit stream is called the base layer, while the other bit streams are called enhancement layers. Although the superset of all the bit streams represents the complete audio codec output, the base-layer bit stream can independently represent the audio signal at a reduced quality. Scalability is useful for streaming applications, where a client, connected via a narrow-band modem, may listen to lower-quality music by decoding only the low-bit-rate base layer, while a broadband client may consume bits at all layers to enjoy higher-quality audio.
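One simple way to picture layered coding is a coarse quantizer for the base layer plus a finer quantizer applied to what the base layer missed. The Python sketch below follows that idea; the step sizes and coefficient values are assumptions for illustration, and standardized scalable codecs build their layers differently.

```python
def encode_layers(coefficients, base_step=4.0, enh_step=0.5):
    """Return (base layer indices, enhancement layer indices)."""
    base = [round(c / base_step) for c in coefficients]
    # The enhancement layer codes the residual left over by the base layer.
    residual = [c - b * base_step for c, b in zip(coefficients, base)]
    enhancement = [round(r / enh_step) for r in residual]
    return base, enhancement

def decode_layers(base, enhancement=None, base_step=4.0, enh_step=0.5):
    """Decode the base layer alone, or refine it if the enhancement layer arrived."""
    out = [b * base_step for b in base]
    if enhancement is not None:
        out = [o + e * enh_step for o, e in zip(out, enhancement)]
    return out

coefficients = [10.3, -4.7, 0.9, 6.2]
base, enhancement = encode_layers(coefficients)
low_quality = decode_layers(base)                # narrow-band client
high_quality = decode_layers(base, enhancement)  # broadband client
print(low_quality, high_quality)
```

A client that receives only the base layer still gets a usable, if coarse, reconstruction; a client that receives both layers ends up with a much smaller error.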
Let's look at some of the early audio codecs that have been internationally standardized and deployed in popular products and services.
Early MPEG audio
The Moving Picture Experts Group (MPEG) audio committee has standardized a series of audio codecs as shown in Figure 4.
Figure 4: Early MPEG audio standards
MPEG1 audio was the first-generation standard from the MPEG audio committee. Standardized in 1993, MPEG1 supports one or two channels (mono or stereo) at three sampling frequencies (specifically, 32KHz, 44.1KHz, and 48KHz). MPEG1 defines three layers, namely L1, L2 (also known as Musicam), and L3 (also known as MP3), with the layers representing tradeoff points between quality and computational complexity. The L1 layer is the least complex of the three while MP3 provides the highest quality. MP3 has been shown to be perceptually transparent at 192Kbps for stereo audio. That is, an expert listener cannot distinguish between raw audio and MP3 audio coded at this bit rate.
In 1994, the MPEG2 standard was announced, which extended MPEG1 to support multichannel audio (greater than two channels) and lower sampling frequencies of 16KHz, 22.05KHz, and 24KHz. The MPEG2 extension is backward compatible with MPEG1 audio. As an aside, a nonstandard extension of MPEG2 known unofficially as MPEG 2.5 allows even lower sampling rates (8 to 12KHz) at bit rates from 16 to 32Kbps.
Among all the standards introduced so far, MP3 is by far the most commercially successful, with the widest range of applications. MPEG1 layer 2 is also widely used, notably in DVD audio and some broadcast audio standards.
MPEG2 Advanced Audio Coding (AAC), introduced in 1998, was the next major technological milestone in the audio coding field. While MPEG2 AAC is capable of perceptually transparent quality at 128Kbps for most stereo signals, it is not backward compatible with the earlier standards. AAC defines three profiles, namely low complexity (LC) for embedded devices, main profile for high coding gains, and scalable sampling rate (SSR) for bandwidth scalability. MPEG2 AAC LC has become the most popular of these profiles.
MPEG4 Audio Version 1 (introduced in 1999) extends MPEG2 AAC by adding a perceptual noise substitution (PNS) tool to improve quality at intermediate bit rates (around 32Kbps). MPEG4 AAC is currently one of the most widely used AAC formats; it is used by popular services such as iTunes and by the multimedia messaging service (MMS) in 2.5G cellular systems.
To mitigate the complexity of the MPEG2 AAC main profile, an alternate tool called long-term prediction (LTP) has been proposed in MPEG4. LTP promises quality comparable to AAC main profile but with lower computation and memory costs. The AAC-LTP combination is one of the optional choices in the MMS specifications defined by 3rd Generation Partnership Project (3GPP). Today, however, very little content is available in the AAC-LTP format.
MPEG4 also standardized two other codecs that are much less popular in the content marketplace: for low bit rates around 6 to 16Kbps, a so-called TwinVQ codec that uses a vector quantization scheme was defined. Also defined was the AAC scalable profile that offers scalability in bandwidth as well as quality.
MPEG4 Audio Version 2 (a 2001 standard) standardized four new profiles in addition to those in MPEG4 version 1 audio.
To support error-prone environments like wireless channels, it defined the error resilience (ER) profile for AAC. AAC-ER uses new tools like reversible variable length codes (RVLC), virtual codebooks, and codeword reordering to protect against errors during transmission. The Digital Radio Mondiale (DRM) service uses the AAC-ER profile for audio coding.
For real-time conversation applications such as conferencing, MPEG4 v2 standardized the AAC low-delay (LD) profile. AAC-LD has an algorithmic delay of 20ms, which is much lower than the 130ms delay of standard AAC. There are few applications, however, that use the AAC-LD profile.
To support fine-grained scalability, MPEG4 v2 defined the bit-slice arithmetic coding (BSAC) profile. The AAC-BSAC codec is used in digital media broadcast (DMB) applications in Korea.
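The idea of slicing coefficients into bit planes, which is what makes this kind of fine-grained scalability possible, can be sketched in Python as follows. This shows only the slicing and a truncated decode; the arithmetic coding stage of BSAC is omitted, and the magnitudes are invented for the example.

```python
def to_bit_planes(magnitudes, num_bits):
    """Split non-negative integers into bit planes, most significant plane first."""
    return [[(m >> bit) & 1 for m in magnitudes]
            for bit in range(num_bits - 1, -1, -1)]

def from_bit_planes(planes, num_bits):
    """Rebuild magnitudes from however many leading planes were received."""
    values = [0] * len(planes[0])
    for k, plane in enumerate(planes):
        bit = num_bits - 1 - k
        values = [v | (b << bit) for v, b in zip(values, plane)]
    return values

magnitudes = [13, 6, 1, 9]  # hypothetical quantized magnitudes
planes = to_bit_planes(magnitudes, num_bits=4)
coarse = from_bit_planes(planes[:2], num_bits=4)  # only the top two planes
exact = from_bit_planes(planes, num_bits=4)       # all four planes
print(coarse, exact)
```

Decoding only the two most significant planes gives [12, 4, 0, 8], and each further plane refines the result until all four planes reproduce the original values. The stream can therefore be cut at almost any point and still be decoded.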
Finally, MPEG4 v2 introduced a first version of a parametric coding scheme called HILN (harmonic and individual lines plus noise) to support low bit rates (for example, 4 to 16Kbps).
Advances in audio coding
The major achievements of the MP3 and AAC era have been followed by some key technical improvements that I'll highlight here. Jointly they provide a big leap forward in the capabilities of audio coding.
Spectral Band Replication (SBR)
Spectral Band Replication (SBR) is a "bandwidth expansion" approach invented by Coding Technologies, which has emerged as one of the important contributors to the field of audio coding. SBR addresses a typical drawback of transform coding; that is, the bandwidth of the reproduced audio signal generally must decline as bit rate is reduced. The SBR approach overcomes this drawback. SBR retains the full bandwidth of the reconstructed audio. To make up for the shortage of bits to represent the full signal, SBR exploits the correlation that exists between the energy of the audio signal at high and low frequencies as shown in Figure 5. It uses a well-guided transposition approach to predict the energy at higher frequencies, and uses a few additional bits to encode the prediction error. The perceived audio quality at low bit-rates improves tremendously as a result.
Figure 5: Original signal (5a) and SBR reconstruction (5b)
SBR is seen to be capable of high-quality stereo at bit rates as low as 48Kbps. This technique can "extend" most standard algorithms and has already been deployed to extend MP3 and AAC to create new codecs called MP3Pro and AACplus.
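The flavor of the approach can be conveyed with a loose Python sketch: the decoder copies the shape of the low band into the high band and rescales it using a small amount of side information. Using a per-band energy envelope as that side information is a simplifying assumption here, as are all the function names and numbers; this is not the standardized SBR tool.

```python
import math

def rms(values):
    """Root-mean-square level of a list of spectral values."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def sbr_encode_envelope(spectrum, crossover, band_size):
    """Encoder side: describe the high band by one RMS level per sub-band."""
    high = spectrum[crossover:]
    assert len(high) % band_size == 0, "sketch assumes whole sub-bands"
    return [rms(high[i:i + band_size]) for i in range(0, len(high), band_size)]

def sbr_decode(low_band, envelope, band_size):
    """Decoder side: copy low-band values upward and rescale each sub-band."""
    high = []
    for b, target in enumerate(envelope):
        # Transposition: reuse a patch of the low band as this sub-band's shape.
        start = (b * band_size) % len(low_band)
        patch = [low_band[(start + k) % len(low_band)] for k in range(band_size)]
        level = rms(patch)
        gain = target / level if level > 0 else 0.0
        high.extend(v * gain for v in patch)
    return low_band + high

# Hypothetical magnitude spectrum: energy falls off toward high frequencies.
spectrum = [8.0, 6.0, 4.0, 3.0, 2.0, 1.5, 1.0, 0.5]
crossover, band_size = 4, 2
envelope = sbr_encode_envelope(spectrum, crossover, band_size)
rebuilt = sbr_decode(spectrum[:crossover], envelope, band_size)
print(envelope, rebuilt)
```

Only the four low-band values and two envelope numbers are transmitted, yet the decoder outputs a full eight-value spectrum whose high-band energy matches the original in each sub-band.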
Integer MDCT (IntMDCT)
The recent introduction of integer MDCT (IntMDCT) addresses a key demand of applications such as high-resolution audio and professional audio systems: lossless audio coding. Lossless coding of audio is like applying one of the popular "zip" file utilities to an audio signal. As opposed to the lossy approaches defined by other standards, lossless coding results in a reconstructed audio signal that is identical to the input digital audio. Traditional audio algorithms are hard-pressed to support lossless coding, mainly because they compute the MDCT in finite-precision floating-point arithmetic, so the transform cannot be inverted exactly. IntMDCT was developed by researchers at Fraunhofer IIS GmbH (hereafter abbreviated FhG), another pioneering company in audio coding. IntMDCT is an integer approximation of the MDCT, derived from the MDCT using a "lifting" scheme. IntMDCT is reversible, based on integer arithmetic, and inherits most of the attractive properties of the MDCT, including a good spectral representation of the audio signal and critical sampling. IntMDCT is thus well suited for both a standalone lossless audio coder and a combined scalable lossy + lossless audio coder. Advanced Audio Zip (AAZ) from the Institute of Infocom Research (I2R) uses IntMDCT to achieve low-bit-rate lossless coding. Integer MDCT is a big boon to audiophiles, enabling high compression with the unmatched fidelity of lossless coding. Applications that mimic WinZip (perhaps "AudioZip?") are likely to become popular in the near future.
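The building block of a lifting scheme is easy to demonstrate. A plane rotation can be split into three "lifting" steps, each of which adds a rounded multiple of one value to the other; because every step can be undone exactly by subtracting the same rounded quantity, the whole rotation maps integers to integers reversibly. The Python sketch below shows a single such rotation with an arbitrarily chosen angle; it is not a full IntMDCT, only an illustration of why lifting gives perfect reconstruction.

```python
import math

def lifting_rotate(x, y, angle):
    """Integer approximation of a rotation by `angle`, using three lifting steps."""
    a = (math.cos(angle) - 1.0) / math.sin(angle)
    b = math.sin(angle)
    x += round(a * y)
    y += round(b * x)
    x += round(a * y)
    return x, y

def lifting_unrotate(x, y, angle):
    """Exact inverse: undo the three lifting steps in reverse order."""
    a = (math.cos(angle) - 1.0) / math.sin(angle)
    b = math.sin(angle)
    x -= round(a * y)
    y -= round(b * x)
    x -= round(a * y)
    return x, y

angle = math.pi / 5  # arbitrary example angle
x, y = 100, -37
rx, ry = lifting_rotate(x, y, angle)
# Floating-point rotation for comparison; the integer result stays close to it.
fx = x * math.cos(angle) - y * math.sin(angle)
fy = x * math.sin(angle) + y * math.cos(angle)
print((rx, ry), (fx, fy), lifting_unrotate(rx, ry, angle))
```

The integer outputs track the floating-point rotation to within rounding error, and the inverse recovers the original integer pair exactly.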
Parametric audio coding
Parametric audio coding is a technique that was first successfully used in speech coding, but is now applied to broader audio coding as well. In the audio coding arena, this technology is superior to standard transform-based codecs, especially at very low bit rates (less than 24Kbps). In the parametric approach, the audio signal is separated into its transient, sinusoid, and noise components. Next, each component is represented by parameters that drive a model for the signal, rather than by coding the actual signal itself as in the standard approach. The MPEG4v2-HILN standard is based on parametric coding. Two enhancements to this model have subsequently been proposed to improve the overall performance of HILN: improved noise modeling and a parametric stereo extension. The improved noise modeling technique uses a temporal noise envelope rather than the noise gain approach used previously. The parametric stereo extension enables support for stereo via stereo cues. With these extensions, parametric coding delivers reasonable-quality audio at between 6 and 24Kbps. Another advantage of parametric coding is that audio manipulation (for example, time-scale modification, pitch shifting, and so forth) can easily be incorporated in a parametric representation.
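A small Python sketch shows why a parametric representation is so compact and so easy to manipulate. It covers only the sinusoid component (transients and noise are left out), and the parameter values are made up for illustration.

```python
import math

def synthesize(params, num_samples, sample_rate_hz):
    """Rebuild a signal from (frequency_hz, amplitude, phase_rad) triples."""
    return [
        sum(amp * math.sin(2 * math.pi * freq * n / sample_rate_hz + phase)
            for freq, amp, phase in params)
        for n in range(num_samples)
    ]

# Hypothetical model of one frame: two sinusoids, six numbers in total.
params = [(440.0, 0.8, 0.0), (880.0, 0.3, math.pi / 2)]
frame = synthesize(params, num_samples=1024, sample_rate_hz=48_000)

# Time-scale modification: same parameters, twice as many output samples.
stretched = synthesize(params, num_samples=2048, sample_rate_hz=48_000)

# Pitch shifting: scale every frequency, keep the duration.
shifted_params = [(freq * 2.0, amp, phase) for freq, amp, phase in params]
shifted = synthesize(shifted_params, num_samples=1024, sample_rate_hz=48_000)
```

The decoder regenerates 1,024 samples from six parameters, and stretching the sound or changing its pitch is just a matter of synthesizing with a different length or with scaled frequencies.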
Binaural cue coding (BCC)
Binaural cue coding (BCC) enables parametric representation of spatial audio, delivering multichannel output (with an arbitrary number of channels) from a single audio channel plus some side information. It uses parameters such as interchannel time differences, level differences, and correlation to support surround sound (such as 5.1) at very low bit rates. MP3Surround technology from FhG uses this technique to add surround-sound capabilities to the MP3 algorithm at the cost of a few additional bits. The same technique can be used as parametric stereo when there are just two channels.
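The two-channel case can be sketched in Python with a single cue. The example below transmits a mono downmix plus one interchannel level difference for the whole frame; working on the full band with one cue only, and ignoring time differences and correlation, are simplifying assumptions, as are the function names.

```python
import math

def bcc_encode(left, right):
    """Downmix two channels to mono and measure their level difference in dB."""
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    energy_left = sum(l * l for l in left)
    energy_right = sum(r * r for r in right)
    # Assumes neither channel is silent in this frame.
    level_diff_db = 10 * math.log10(energy_left / energy_right)
    return mono, level_diff_db

def bcc_decode(mono, level_diff_db):
    """Spread the mono signal over two channels so the level difference is restored."""
    ratio = 10 ** (level_diff_db / 20)  # left/right amplitude ratio
    gain_right = 2 / (1 + ratio)        # chosen so the gains average to 1
    gain_left = ratio * gain_right
    return [gain_left * m for m in mono], [gain_right * m for m in mono]

# Hypothetical source panned toward the left channel.
source = [math.sin(2 * math.pi * n / 32) for n in range(64)]
left = [0.8 * s for s in source]
right = [0.4 * s for s in source]

mono, level_diff_db = bcc_encode(left, right)
left_out, right_out = bcc_decode(mono, level_diff_db)
```

For a single source that is simply panned between the channels, one channel of audio plus one number is enough to restore both channels; real material needs cues per frequency band and per time frame.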
Recent MPEG4 audio
To keep pace with new technologies, MPEG4 has added amendments to existing standards. MPEG4 is currently the last standard planned in the core audio coding area. Future standards (such as MPEG7 and MPEG21) will continue to use the audio codecs standardized in MPEG4 but address advanced system-layer requirements such as audio representation, layering, etc. Figure 6 summarizes ongoing work in the MPEG audio committee.
Figure 6: Recent and future
The first amendment (AMD1) defines the High Efficiency profile (HE-AAC) version 1 by combining the SBR technique with AAC. The same combination is also known as AACplus, which is a trademark of Coding Technologies. In the MPEG world, this codec is referred to as HE-AAC. HE-AAC is becoming accepted within many standardization bodies (such as 3GPP, DRM, DVD Forum, and ISMA). A few commercial CD rippers (such as Nero) support audio CD ripping in the HE-AAC format.
The second amendment (AMD2) improves the performance of parametric audio coding (HILN) with better noise coding and parametric stereo. This high-quality parametric audio coding results in reasonable quality audio at 6 to 21Kbps.
The third amendment (AMD3) defines usage of MPEG1 and MPEG2 audio in the MPEG4 system layer. This amendment is expected to become an international standard in 2005.
AMD4 and AMD5 deal with lossless audio coding. AMD4 defines audio lossless coding (ALS), which is based on linear predictive coding followed by noiseless coding. ALS draws on models proposed by Nippon Telegraph and Telephone Corporation Cyber Space Labs, Real Networks, and the Technical University of Berlin. This amendment is expected to become an international standard during 2005. Other parts of AMD4 define the use of the parametric stereo toolset with HE-AAC. The resulting algorithm is known as HE-AAC v2 (also called AAC++ or Enhanced AACplus) and gives good-quality stereo at below 48Kbps. Enhanced AACplus was adopted by 3GPP release 6 as an optional audio codec for MMS and streaming applications.
AMD5 defines scalable lossless coding (SLS) based on proposals by FhG, Microsoft, and I2R. It uses AAC technology to encode the base layer at 128Kbps. The base layer is lossy but perceptually transparent. AMD5 is backward compatible with standard AAC. The enhancement layer uses IntMDCT to achieve lossless compression at 600 to 700Kbps for two-channel audio. AMD5 is expected to become an international standard in 2005.
AMD6 supports lossless coding of the output and input of sigma delta modulated (SDM) based A/D and D/A converters. SDM technology uses over-sampling of signals to provide linearity over a large dynamic range of audio signals (such as 24 bits). AMD6 supports handling of over-sampled signals without unnecessary conversion to PCM and vice versa. This is known as Direct Stream Transfer (DST). It is based on a proposal submitted by Philips. AMD6 is expected to become an international standard in 2005.
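For readers unfamiliar with sigma delta modulation, the Python sketch below shows a first-order modulator turning an input into a 1-bit stream whose local average follows the signal. It is a textbook-style illustration under simple assumptions (first order, input within -1 to 1); it does not show the DST lossless coder itself.

```python
def sigma_delta_modulate(samples):
    """First-order modulator: map inputs in [-1, 1] to a stream of +1/-1 values."""
    integrator = 0.0
    feedback = 0.0
    bits = []
    for x in samples:
        # Accumulate the difference between the input and the previous output.
        integrator += x - feedback
        feedback = 1.0 if integrator >= 0 else -1.0
        bits.append(feedback)
    return bits

# A constant input of 0.5, heavily oversampled.
constant = [0.5] * 64
bits = sigma_delta_modulate(constant)
average = sum(bits) / len(bits)
print(bits[:8], average)
```

Each output sample carries only one bit, but averaging the stream over many samples recovers the input level, which is why these converters rely on a high over-sampling rate.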
In addition to the above amendments, the MPEG4 committee has issued a call for proposals for spatial audio coding (SAC). This will be an extension to AAC for multichannel audio, along similar lines to SBR. FhG/Agere, Coding Technologies/Philips, Dolby, and Panasonic have submitted proposals in response to the call. The proposals are based on BCC. The final model will be a combination of the proposals from FhG/Agere and Coding Technologies/Philips. The reference software model for this standard will be available in the near future.
A number of audio codecs have been developed outside the MPEG standardization framework and some of these codecs have become popular in the market. Here is an overview of some of those codecs.
As a part of the Bluetooth standard for short-distance wireless applications, Philips developed a sub-band codec (SBC) that is optimized for the low delay and complexity demanded by wireless handheld applications. SBC is accepted as a mandatory audio codec in the Bluetooth standard for the audio profile. It's gaining traction at the expense of MPEG4 AAC LD.
Voiceage Inc. developed an extension of AMR-WB (Adaptive Multi Rate, Wideband) known as AMR-WB+. This combines the standard GSM AMR-WB speech codec with a technique known as TCX (transform-coded excitation) to ensure good-quality audio/speech coding at 14 and 24Kbps. It has been adopted by 3GPP Release 6 as an optional audio codec for MMS and streaming applications. This algorithm will compete in the market with Enhanced AACplus (HE-AAC v2) and high-quality parametric audio from the MPEG portfolio.
China is also developing its own standard for audio-video called AVS (Audio Visual Standard). This standard reached the final draft stage at the end of December 2004. The details will be known later after the standard becomes available in English.
Proprietary audio codecs are prominent especially in two market segments:
- Streaming and portable multimedia—Windows Media Audio (WMA) from Microsoft and Real Audio (RA) from Real Networks.
- Professional and high-performance audio—Audio codecs from Dolby Labs and DTS.
Dolby Labs and DTS dominate the professional audio space. Dolby has recently introduced new codecs such as Dolby-E, E-AC3, and others.
Recently Sony introduced its ATRAC3Plus to support lower bit-rate audio (less than 64Kbps). This is a successor to the ATRAC3 (Adaptive TRansform Acoustic Coding) standard, which is widely used in the MiniDisc and Memory Stick storage devices found in Walkman players and digital still cameras.
Although researchers are pursuing many new directions, clear trends are emerging. AAC and its variations (HE-AAC, Surround, and parametric extensions) will become the foundations for future improvements and will slowly replace legacy MP3 in usage as well as content. Lossless audio coding and its applications will grab prominent mindshare thanks to the fidelity advantages that they offer. Proprietary codecs will continue to hold market share in certain applications such as streaming media, portable multimedia, and professional audio.
In spite of our ability to understand technological trends, it's difficult to predict market acceptance of codecs. Commercial success or failure depends not only on the technology but also on issues such as licensing terms, hardware compatibility, upgradeability, the cost of moving to new technology, support for digital rights management, and more. Nevertheless, we can safely conclude that the near future is bright for budding researchers in audio coding.
Mihir Mody is a technical lead with the Multimedia Codecs group at Texas Instruments in Bangalore, India. You can reach him at Mihir@ti.com.