Meridian Lossless Packing (MLP) in a Nutshell

Audiophiles have long desired a higher-resolution format to take the place of CDs. "16-bit just doesn't cut it," they said, "We want more." Many also want music that will take advantage of the high-quality multi-channel speaker systems they bought to listen to movie soundtracks. Unfortunately, six or more channels of high-resolution music take up way too much bandwidth to fit comfortably even on a DVD. Using lossy compression algorithms (like MPEG or DTS), which discard information, is one solution, but to the purist, throwing away any of the music is a sin. Luckily, the creators of DVD-A were able to put together a format that allows for multi-channel, high-resolution audio, with reasonable playing times and peak data rates, using standard DVD technology, and no data loss at all (lossless).

How is it done? With MLP, which stands for Meridian Lossless Packing. We're going to take a look at how MLP is able to pack audio so tightly without losing a single bit of the original high-resolution recording, and examine its many nifty features and benefits. Some of this will get rather technical, but Secrets' readers love this kind of thing, so stick around - there's a lot of cool technology behind MLP.

Compression: Lossless vs. Lossy

Intuitively, many audio and videophiles think of data compression as a bad thing. Compressing video data excessively via MPEG leads to visible artifacts when the data are decompressed into the video we see (that is why such a set of algorithms is called a codec, short for coder/decoder or compressor/decompressor), particularly when dealing with rapid motion. Compressing audio via perceptual codecs, such as MP3, especially at higher compression ratios, robs music of subtle detail, leaving behind a hollow shadow of what might have been. Even DTS and AC-3, some of the most advanced multi-channel compression formats, sacrifice fidelity in order to fit six (or more) channels of hour-long audio onto the same disc that must store relatively high-resolution video for the same amount of time.

A lossy codec (like MPEG, DTS, or AC-3) compresses content such that the result, when decompressed, is not exactly the same as the original. If everything goes well, it is almost (but not quite) indistinguishable from the original, but it isn't bit-for-bit the same. A lossless codec compresses the data without losing any of it when it is decompressed. The result, when decompressed, is exactly the same as the original, with no compromises. The ZIP format, used by PKZip, WinZip, and Stuffit on PCs, is an example of a lossless compression scheme, though one that is generic, and not optimized for any specific kind of file. MLP is able to get higher compression than ZIP in almost all cases, because it is optimized for one kind of data: audio.

Lossy compression encoders may accept high-bit data, such as 24 bits, and they output 24 bits as well. The catch is that the 24-bit output isn't identical to the 24-bit input. The nature of the codec means they never will be, no matter how sophisticated the reconstruction algorithms used in decoding. Even though the output buffers used by the decompression algorithms are 24-bit, that doesn't mean that the effective resolution of the output is 24-bit. Most lossy compression algorithms throw away the least significant bits ("LSB") of the input, because (in theory) they represent detail that is impossible to hear, or at least difficult to hear. In the end, a "24-bit" lossy compression scheme may give results that are effectively only 18-bit, or 16-bit, or even lower. No matter how carefully the bits to throw away are chosen, there is always the possibility that audible information that is important to the music is being lost.

For example, here is a 24-bit binary number:

101101011010101110110011

In this example, the least significant bits are the 0011 on the far right. If a lossy codec decides to discard those four bits, the value that comes out after compression and decompression carries only 20 bits of real information, as shown below.

10110101101010111011

Although 0011 are the least significant bits, they are not necessarily insignificant: musical information was encoded in them.
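To see the difference in miniature, here is a tiny Python sketch. It uses the standard-library zlib module as a stand-in for a generic lossless coder; it illustrates the principle only, not MLP itself:

    import zlib

    sample = 0b101101011010101110110011        # the 24-bit value from the example above

    # "Lossy": zero out the four least significant bits -- the original can't be recovered
    lossy = (sample >> 4) << 4
    assert lossy != sample                     # the musical detail in those bits is gone

    # Lossless: a generic compressor round-trips the bytes exactly
    raw = sample.to_bytes(3, "big")
    assert zlib.decompress(zlib.compress(raw)) == raw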

With a lossless encoding/decoding method like MLP, the output of the decoder is equal to the input of the encoder, bit for bit. If they weren't equal, it wouldn't be lossless! You can be sure that whatever the original recording engineers recorded is exactly what is reproduced.

To illustrate the difference for yourself, try the following:

  • Start with a high-resolution bitmap file (BMP or TIFF formats are examples).

  • Encode the file with a lossless compression codec by archiving it to WinZip.

  • Encode the file with a lossy compression codec by saving it as a JPEG file with moderate compression.

  • Decode using the lossless codec by extracting the bitmap from the WinZip archive using a different name.

  • Decode using the lossy codec by opening the JPEG file and saving it to a new name as a bitmap file.

You'll note, in general, that the JPEG file is smaller than the WinZip archive. The decoded files, however, are the same size. Now, compare the two decoded bitmaps to the original bitmap, side by side. The JPEG-derived bitmap will look slightly different from the original, while the unzipped bitmap will be identical. So, you see the compromises inherent in aggressive lossy compression.
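If you would rather script the experiment, here is a rough sketch of the same steps in Python. It assumes the Pillow imaging library is installed and uses a hypothetical source file named photo.bmp:

    import os
    import zipfile
    from PIL import Image

    # Lossless: put the bitmap in a ZIP archive and pull it back out
    with zipfile.ZipFile("photo.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write("photo.bmp")
    with zipfile.ZipFile("photo.zip") as zf:
        zf.extract("photo.bmp", path="unzipped")

    # Lossy: save a JPEG with moderate compression
    Image.open("photo.bmp").save("photo.jpg", quality=75)

    original = Image.open("photo.bmp").convert("RGB").tobytes()
    unzipped = Image.open(os.path.join("unzipped", "photo.bmp")).convert("RGB").tobytes()
    rejpeged = Image.open("photo.jpg").convert("RGB").tobytes()

    print("ZIP size :", os.path.getsize("photo.zip"))
    print("JPEG size:", os.path.getsize("photo.jpg"))
    print("ZIP round trip identical? ", unzipped == original)    # True
    print("JPEG round trip identical?", rejpeged == original)    # almost always False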

Why Compress?

Given that DVD-A doesn't have any video on the disc to suck up bandwidth, why would you need to compress the audio at all? The short answer is that if the data were uncompressed, a DVD wouldn't be able to hold much audio, and the data rate would go over the limits of the hardware.

The data rate for six channels of uncompressed 24-bit / 96 kHz sampling is 13.18 megabits (Mb) per second. A single-layer DVD holds 4.37 GB (not 4.7 GB, as is commonly reported – see section 7.2 of the DVD FAQ by Jim Taylor for more on this), and a dual-layer holds 7.95 GB, so at this rate you could only store about 45 minutes of audio on a single-layer DVD and 82 minutes on a dual-layer disc. Also, the DVD format has a maximum transfer rate of 9.6 Mb per second, so quite apart from the short playing time, the transfer-rate limit alone makes six channels of uncompressed 24-bit / 96 kHz audio impossible.
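For readers who like to check the arithmetic, here is a quick back-of-the-envelope calculation in Python (treating megabits and gigabytes as binary units, which is how the 13.18 Mb/s and 4.37 GB figures work out):

    # Uncompressed 6-channel 24-bit / 96 kHz, in binary megabits (1 Mb = 2**20 bits)
    channels, bits, rate = 6, 24, 96_000
    data_rate_mb = channels * bits * rate / 2**20          # ~13.18 Mb/s
    single_layer_gb, dual_layer_gb = 4.37, 7.95

    def minutes(capacity_gb, rate_mb):
        return capacity_gb * 2**30 * 8 / 2**20 / rate_mb / 60

    print(f"{data_rate_mb:.2f} Mb/s")                       # 13.18
    print(f"{minutes(single_layer_gb, data_rate_mb):.0f} min on a single layer")  # ~45
    print(f"{minutes(dual_layer_gb, data_rate_mb):.0f} min on a dual layer")      # ~82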

In contrast, MLP can keep the peak data rate at or below 9.6 Mb per second, and generally keeps the average data rate well below that. This allows longer playing times and/or higher-resolution recording.

How do they do it?

Here's where it gets complicated, but don't let that hold you back. It's not that bad.

The basic techniques used by MLP are:

  • Bit Shifting - avoids wasting bits on unused dynamic range.

  • Matrixing - puts the audio common to multiple channels into one channel.

  • Prediction Filters - predict the next chunk of audio from the audio that came before.

  • FIFO Buffer - smooths the instantaneous data rate.

  • Entropy Coding - compresses the final data as tightly as possible.

Bit Shifting

MLP continuously varies the number of bits per sample so it uses only the number of bits that are necessary. In contrast, uncompressed PCM stores all bits, even if most of them are unused (bunches of zeros) most of the time.

In PCM (Pulse Code Modulation), there are a fixed number of bits stored for every sample. It might be 16, 18, 24, or some other number, depending on the recording format, but the number remains unchanged for the whole recording time. During silent sections, all or most of those bits are zero. Maybe they're only zero for a second or so, but that's thousands and thousands of samples that are all zeros or extremely low numbers. The MLP encoder recognizes that it could switch to 4-bits (for example) for that section. It stores a special flag that says "switching to 4-bits", then stores a long run of 4-bit values. When the music picks up again, perhaps it decides that the new section will require 22-bits. The encoder stores a new flag, saying, "switching to 22-bits", then starts storing 22-bit samples. Only when the music has a large dynamic range does it need to switch to full 24-bit storage.

On the decoding end, the 4-bit values in our example are converted back to full 24-bit values by adding the appropriate zero bits. Again, because it is a lossless codec, MLP only uses this technique when the data contain a lot of zeros to begin with.
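Here is a toy Python sketch of the block-wise idea described above. It is a simplified illustration, not MLP's actual bitstream format: each block of samples is stored with only as many bits as its loudest sample needs, along with a small "switching to N bits" flag, and the decoder re-expands everything to full width:

    def bits_needed(x: int) -> int:
        """Bits for a signed sample (magnitude plus sign bit), minimum 1."""
        return max(1, abs(x).bit_length() + 1)

    def pack_blocks(samples, block=8):
        blocks = []
        for i in range(0, len(samples), block):
            chunk = samples[i:i + block]
            width = max(bits_needed(s) for s in chunk)   # the per-block "N bits" flag
            blocks.append((width, chunk))
        return blocks

    def unpack_blocks(blocks):
        # The decoder re-expands every sample to the full 24-bit range.
        return [s for _, chunk in blocks for s in chunk]

    quiet_then_loud = [0, 1, -2, 1, 0, 0, -1, 2, 300_000, -1_200_000, 7_654_321, -8_000_000]
    packed = pack_blocks(quiet_then_loud, block=4)
    print([w for w, _ in packed])                        # [3, 3, 24] -- few bits when quiet
    assert unpack_blocks(packed) == quiet_then_loud      # lossless round trip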

Matrixing Channel Information into Substreams to Reduce Redundant Data

In a multi-channel audio mix, often there is a significant amount of similar audio on multiple channels. Audio is rarely panned hard to a single channel, and when it is, it's generally just for a short time. When the same sound is coming out of all the speakers, it doesn't make sense to compress it separately for each individual channel and use 6 times the bandwidth. MLP compresses all the common elements from all the channels just once. A simple way of thinking about it is to imagine that MLP creates a "combo" channel containing the sum of everything common to all channels. Then for each additional channel, it only needs to store the differences from the common channel.

The advantage of this strategy is that while the "combo" channel is complicated and requires lots of bits to compress, the "difference" channels use only a few bits most of the time, because they only have to store data for how they vary from the main channel. If they don't vary at all from the combo channel, they use almost no bits. If they vary just slightly, they'll still use few bits. Only if they vary drastically from the combo channel will a difference channel require the full bandwidth, and typically this doesn't happen continuously, but rather for short intervals here and there.

For example, let's just look at a two-channel mix. In the compression stage, the encoder analyzes the two channels and puts together a "common" channel that contains essentially the sum of the two channels. Then it puts together a "difference" channel that contains the difference between the original Right and Left stereo channels. At decompression time, it can reconstruct the original Right and Left channels by inverting the mathematics used to create the matrixed channels in the first place. It's just simple math.

In the above scenario, since compressed channel #2 is a difference between the two original channels, whenever the same sound is playing at the same volume on both channels, the difference channel gets zeros. And as we've seen above, zeros compress well. Even if the sound is playing at a slightly different volume on the two channels, the difference channel will still use fewer bits than either of the original channels. It's only when the sounds coming from the two original left and right channels have no relationship to each other at all that the difference channel will use the same number of bits as the original channels. And that rarely happens in normal music.
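Here is a minimal Python sketch of the two-channel sum/difference idea. MLP's actual matrixing uses weighted coefficients chosen by the encoder, but the principle is the same:

    def encode(left, right):
        combo = [l + r for l, r in zip(left, right)]     # the "combo" (sum) channel
        diff  = [l - r for l, r in zip(left, right)]     # mostly zeros when L is close to R
        return combo, diff

    def decode(combo, diff):
        left  = [(c + d) // 2 for c, d in zip(combo, diff)]
        right = [(c - d) // 2 for c, d in zip(combo, diff)]
        return left, right

    left  = [100, 200, 305, -50]
    right = [100, 200, 300, -50]                         # nearly identical to left
    combo, diff = encode(left, right)
    print(diff)                                          # [0, 0, 5, 0] -- compresses well
    assert decode(combo, diff) == (left, right)          # bit-for-bit reconstruction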

In 6-channel music, the same basic idea is used, but the sums and differences are more complicated. In fact, it's possible for each finished compressed channel to be a weighted sum of proportions of every other channel, though in practice one channel will get the bulk of the common data, and other channels will be largely differences from the main channel.

One useful feature of MLP is that the audio data can be divided into substreams, each of which can contain multiple channels, and can build off of other substreams. A substream is a portion of the data that is easy for a decoder to extract separately. For example, it is possible (and encouraged) for a producer of DVD-A discs to put all the data for the 2-channel mix into substream 0. Then substream 1 can just contain four "difference" channels that enable the decoder to reconstruct the full 6-channel mix. So, if the player only has 2-channel output, it can just decode substream 0, which contains everything necessary for the 2-channel mix. But if the player has full 6-channel output, it decodes both substream 0 and substream 1, and gets the full 6 channels.

In the above scenario, the 2-channel mix in substream 0 is not the left and right channels from the 6 channel mix. It's a mixdown of all 6 channels into a special optimized 2-channel mix. It's essentially like feeding the 6 final channels into a 6-input, 2-output mixer, where the mixing engineer can adjust level and phase on the fly to produce just the mix desired. It does need to be just combinations of what's already on the original 6 channels - the mixer can't add extra effects or new sounds - but the specific mix is all under the engineer's control. Compare that to DTS and Dolby Digital, which create the 2-channel mixes on the fly during decoding, with absolutely no input from the recording engineer. With MLP, the control is in the hands of the people making the music.

Amazingly enough, just given the 2-channel mix, the mixdown coefficients (the "levels" the engineer used to make the 2-channel mixdown), and 4 more difference channels, the original 6-channel mix can be extracted, including the original untouched left and right channels. So, only 6 total "channels" of information are stored on the disc, yet from that information a full 6-channel mix and a 2-channel mixdown of those channels can be extracted. Very clever indeed.
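To see why this works, here is a conceptual sketch using floating-point math and NumPy. The first two rows of the matrix are hypothetical 2-channel mixdown coefficients of the kind an engineer might choose; because the whole 6-by-6 matrix is invertible, the decoder can recover all six original channels from what is stored. Real MLP uses exact integer matrixing so the reconstruction is truly lossless; this is only the idea in miniature:

    import numpy as np

    rng = np.random.default_rng(0)
    six_ch = rng.standard_normal((6, 1000))              # L, R, C, LFE, Ls, Rs samples

    M = np.array([
        [1.0, 0.0, 0.707, 0.5, 0.707, 0.0],              # downmix Left  (engineer's choice)
        [0.0, 1.0, 0.707, 0.5, 0.0,   0.707],            # downmix Right (engineer's choice)
        [0.0, 0.0, 1.0,   0.0, 0.0,   0.0],              # four more rows (simple pass-
        [0.0, 0.0, 0.0,   1.0, 0.0,   0.0],              # throughs here; real encoders use
        [0.0, 0.0, 0.0,   0.0, 1.0,   0.0],              # difference-style rows) chosen so
        [0.0, 0.0, 0.0,   0.0, 0.0,   1.0],              # the matrix stays invertible
    ])

    stored = M @ six_ch                                  # the six "channels" on the disc
    two_ch_player = stored[:2]                           # substream 0: just play these
    six_ch_player = np.linalg.inv(M) @ stored            # substreams 0 + 1: invert the matrix

    assert np.allclose(six_ch_player, six_ch)            # original 6 channels recovered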

In addition, this substream approach allows for simpler, cheaper 2-channel decoders. For example, you could have a 2-channel DVD-A walkman that would use a simpler chipset, because it only needs to extract substream 0 and decode the 2 channels there. It can completely ignore the other 4 channels, and doesn't need any buffer space for them or processing power to decode them. Again, with Dolby Digital and DTS, the decoder has to extract and decode all 6 channels before it can even begin to downmix the 2-channel version, so a 2-channel player needs just as complicated a decoder, and just as much buffer memory, as a 6-channel player.

Prediction Filters

This is the heart of the MLP codec, and what makes it so much spiffier for compressing music. The gist of it is this: music is not random. Given a certain chunk of audio, it is possible to make useful predictions about what kind of audio will come next. It's not necessary to be perfectly accurate. (And it's not possible - if you could always predict what sounds were coming next, you wouldn't need to listen to music. You could just listen to the first note, and the rest of the piece would be obvious.) The idea is that some prediction is closer than no prediction at all, which allows the MLP algorithm to store just the difference between the real music and the prediction.

Here's an example, grossly simplified: musical notes are generally held for some amount of time. They don't just instantly appear in the music and instantly disappear. They have attack, sustain, and decay. In the attack phase, the volume is rising. In the sustain phase, the volume remains roughly constant. In the decay phase, the volume is falling. So if the prediction algorithm just predicts that the volume will change in the next millisecond by the same amount that it did in the last millisecond, it's going to be right, or close, a lot of the time. It's going to be really, really wrong when the note changes from attack to sustain, or sustain to decay, but those are short instants of time. For the rest of the note's duration, the prediction is quite close.

Since the prediction algorithm is completely known in advance, and shared by the encoder and the decoder, the encoder just knows that given the preceding music, the decoder is going to predict X (where X is the next sequence of bits). Since it knows the decoder will predict X, it doesn't need to store X, just the difference between the real music and X. As long as the prediction is fairly close much of the time, the differences will be small, and fit into a smaller number of bits than the raw data by itself. And as we saw before, fewer significant bits allow the encoder to store fewer bits in the data stream. Presto - compression!
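Here is a minimal Python sketch of the idea, using a deliberately crude predictor: each sample is assumed to continue the trend of the previous two, which is the "same change as last time" rule from the note example above. For a smooth signal the residuals are far smaller than the samples themselves, so they fit in far fewer bits:

    import math

    # A 440 Hz tone sampled at 48 kHz, amplitude 10,000
    signal = [round(10_000 * math.sin(2 * math.pi * 440 * n / 48_000)) for n in range(200)]

    def encode(x):
        res = x[:2]                                      # first two samples sent as-is
        res += [x[n] - (2 * x[n - 1] - x[n - 2]) for n in range(2, len(x))]
        return res

    def decode(res):
        x = res[:2]
        for n in range(2, len(res)):
            x.append(res[n] + 2 * x[n - 1] - x[n - 2])
        return x

    residuals = encode(signal)
    print(max(abs(s) for s in signal), max(abs(r) for r in residuals[2:]))  # 10000 vs. a few dozen
    assert decode(residuals) == signal                   # lossless, as always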

In addition, the encoder stores special coefficients for the prediction algorithm in the bitstream, so the coming predictions will be closer to the real music. In effect, the encoder is storing things like "this next section has sharp attacks and decays, so adjust the predictions accordingly." This makes the predictions even better, which means the differences take up fewer bits, and the whole package takes less space, even taking into account that the special coefficients have to be stored as well.

In practice, the MLP encoder is obviously not pulling out individual notes. What it does is break the sound down into individual frequencies and run predictions on each major frequency. When you see a real-time spectrum plot of a particular piece of music, you see certain strong frequencies that rise and fall and move around, and a bunch of relatively random noise at a lower level. MLP pulls out the major frequencies and makes predictions for each of them individually, then separates whatever is left (essentially the noise) and compresses it separately.

FIFO Buffer to Smooth the Instantaneous Data Rate

There's a fundamental problem with lossless compression: it's impossible to force the compression ratio to a specific amount. The encoder applies all the algorithms, and depending on how complex the music is, it gets some level of compression, and that's it. If the music is sufficiently complex, the compression ratio may be low, and it's just impossible to get more compression out of it. Luckily, in practice, no real audio signal is that complex all the time. But it can be complex enough for short periods of time that there is too much output data in too short a time for a DVD player to handle. Remember that the maximum data rate DVD players are designed for is 9.6 Mb/sec. Any more than that, and the DVD player just fails miserably, and the music cuts out. Since nobody wants that, some method needs to be used to make sure the data rate never goes above the maximum level, even when the music is complex enough to peak here and there above 9.6 Mb/sec.

The answer is a FIFO, or First In First Out buffer. It works just like it sounds - the data that come in first are the data that go out first. The useful thing about the FIFO buffer is that while the data can't be read off the DVD at more than 9.6 Mb/sec, data can be read out of the FIFO buffer at much higher speeds, because the buffer is all in RAM (Random Access Memory) on the player. The DVD player is constantly reading ahead, filling up the FIFO buffer at 9.6 Mb/sec or less. When a peak in the data rate happens, all the data the MLP decoder needs are already in the buffer, and can be read out quickly so the music doesn't cut out. The player then refills the buffer slowly, getting ready for the next peak. It's not unlike the shock buffers on modern CD walkmans, that buffer up music so they can keep playing if the player is bumped and the CD skips.
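A toy simulation makes the idea concrete. The numbers below are made up: the disc delivers at most 9.6 Mb per time slice, some compressed frames momentarily need more than that, and the FIFO absorbs the peaks as long as the long-run average fits:

    MAX_READ = 9.6                                       # Mb per tick read off the disc
    BUFFER_CAP = 32.0                                    # Mb of player RAM (hypothetical)

    # Hypothetical per-tick compressed frame sizes in Mb: mostly modest, a few peaks
    frames = [6.0, 7.5, 5.0, 11.0, 12.5, 6.0, 5.5, 10.5, 4.0, 6.0]

    buffer_mb = 8.0                                      # read-ahead already buffered
    for need in frames:
        buffer_mb = min(buffer_mb + MAX_READ, BUFFER_CAP)    # player keeps reading ahead
        if buffer_mb < need:
            raise RuntimeError("buffer underrun -- the music would cut out")
        buffer_mb -= need                                # decoder drains this frame
        print(f"frame needs {need:4.1f} Mb, buffer now {buffer_mb:5.1f} Mb")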

Obviously it's theoretically possible for a complex and sustained sequence of audio to peak above 9.6 Mb/sec so often and for so long that the DVD player can't fill the buffer fast enough. But in practice, that just doesn't happen with real material. It's possible to create special test signals that are impossible to compress with MLP and stay below the limits, but such tests would be highly unrealistic, wouldn't resemble real music or even real test and demonstration signals, and would serve no earthly purpose. In addition, if a real piece of music overflowed the buffer, it would be noticed in the encoding stage, and the mastering engineers would have many different options for dealing with it, and could take the steps necessary to reduce the data rate for that one section while keeping the sound as pristine as possible. Note that here the recording engineer makes the choice of what to do with the music, not the compression algorithm, and again this is a low-probability case.

Entropy Coding

Entropy coding is a fancy way of saying standard, generic lossless compression, the kind WinZip and other compression algorithms use. This is the final compression, used to try to get a few extra percent when all the other algorithms have done their best.

MLP uses several different types of entropy coding. Let's take a look at one kind: Huffman coding. This compression technique takes the most common patterns found in a type of file, in this case a music file (or rather a music file that has already gone through all the previous steps above) and replaces them with smaller, simpler codes.

Here's a simple example: a significant percentage of English text consists of common words like "a," "an," "the," etc. Let's say we decide that we'll take the most common words and replace them with the code sequences /0, /1, /2, etc. We can also drop the space before and after each coded word, because it's implied. This means we can compress "The rain in Spain falls mainly on the plain" into "/0rain/1Spain falls mainly/2/0plain" where

/0 = "the"
/1 = "in"
/2 = "on"

This compresses a 43-character sentence into 35 characters, a savings of roughly 19%. As long as the person reading the sentence knows the code, they can convert it back to the original sentence and read it. With more complicated schemes, we could shrink it even further.
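For the curious, here is the same word-substitution scheme spelled out as a small Python script (a hypothetical dictionary coder for illustration, not MLP's actual entropy coder); it reproduces the 43-to-35-character result above:

    codes = {"the": "/0", "in": "/1", "on": "/2"}

    def compress(text):
        tokens = []
        for word in text.split(" "):
            key = word.lower()
            tokens.append((codes[key], False) if key in codes else (word, True))
        # Spaces around coded words are implied, so only join two literal words with a space.
        result, prev_literal = "", False
        for token, literal in tokens:
            if literal and prev_literal:
                result += " "
            result += token
            prev_literal = literal
        return result

    sentence = "The rain in Spain falls mainly on the plain"
    packed = compress(sentence)
    print(packed, len(sentence), len(packed))   # /0rain/1Spain falls mainly/2/0plain 43 35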

Huffman coding is just a more sophisticated version of the above, where a code book is created consisting of the most common sequences of bits found in the music stream. The compressor substitutes the codes for the original sequences, and the decompressor does the reverse, substituting the sequences from the code list for the codes it finds in the bitstream.

MLP uses several different forms of entropy coding besides Huffman. For example, most music has a fairly standard distribution of values - the bell curve we're familiar with from statistics and probability. In such cases, the encoder can use Rice coding, which maps the most common values (the center of the bell curve) to short code sequences, and the least common values (the "tails" of the bell curve) to long code sequences. There is no need for a "code book" as in Huffman coding, because the map of code sequences to original values is entirely mathematical, and the code book is implied (in other words, you don't have to list a definition that 2 + 2 = 4 when showing a formula, because it is already defined by the system of mathematics).
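Here is a minimal sketch of Rice coding for non-negative values (real encoders first fold signed residuals into non-negative numbers). The quotient goes out in unary, the remainder in k binary bits, so common small values get short codes:

    def rice_encode(value: int, k: int) -> str:
        q, r = value >> k, value & ((1 << k) - 1)
        return "1" * q + "0" + format(r, f"0{k}b")       # unary quotient, then k-bit remainder

    def rice_decode(bits: str, k: int) -> int:
        q = bits.index("0")                              # count the leading 1s
        r = int(bits[q + 1:q + 1 + k], 2)
        return (q << k) | r

    k = 2
    for v in [0, 1, 3, 5, 20]:
        code = rice_encode(v, k)
        assert rice_decode(code, k) == v
        print(v, "->", code)                             # 0 -> 000 ... 20 -> 11111000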

In some cases, where the original data are essentially random (which would happen if there was a lot of white noise, like an explosion or cymbal crash), the data are left uncompressed, because there are not enough common or repeated sample values to compress.

Other Benefits

Flexible Metadata

Metadata are extra data that aren't audio: track names, pictures, or anything else that isn't the actual audio signal. MLP has a flexible and extensible architecture for metadata, so new kinds of information can be embedded in the bitstream without causing problems for older decoders. New decoders will read the new metadata, while old decoders will simply skip over it.

For example, Chesky demonstrated at CES a recording that used the subwoofer and center channel as left and right side "height" channels, using speakers placed up high to give a sense of the spaciousness of the recording venue. Potentially, the metadata could tell the decoder to treat a certain channel as a height channel instead of its normal designation (such as the center channel). If your stereo system had a decoder that recognized that metadata, and you had the appropriate speakers, it would automatically route that channel to the right place.

However, any new metadata would need to be standardized in some way to be useful. It's no good to put the data in the stream if there aren't any decoders that can use it. Similarly, it's no good to put the data in the stream if every different piece of mastering software uses a slightly different metadata ID and layout.

Still, it does make it easier to encode new types of data without having to change the format. This means if some new and exciting idea in audio becomes a reality (like Chesky's height channels), it will be possible to make new DVD-A decoders that can handle the new stuff without having to create a new format, or make the new discs incompatible with the old discs.

More Control of the Compression Process for Producers and Engineers

A content producer, if so inclined, can selectively adjust the data rates for any channel (or all of them) as required by the content, while still maintaining lossless compression.

Here's an example: a producer decides that he doesn't want to record content above 24 kHz. Most content above that frequency (if not all) is noise, and 24 kHz is beyond most humans' hearing capability. Given this, he can selectively low-pass filter the rear channels (or all channels), resulting in greater compression through the MLP process while still maintaining the higher sampling rate.

Or, if the producer decides that he doesn't really need 24-bits for rear channels, the MLP encoder can dither that information down to 23-bits, or 22-bits, still maintaining the benefits of higher resolution (and much of the benefit of the 24-bit source resolution), trading off a slightly higher noise floor for space.
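As a hedged illustration of that trade-off, here is a sketch of requantizing 24-bit samples to 22 bits with triangular (TPDF) dither. The samples stay in 24-bit words, but their bottom two bits become zero, which the bit-shifting stage can then pack away essentially for free:

    import random

    def requantize(samples_24bit, keep_bits=22, seed=1):
        rng = random.Random(seed)
        drop = 24 - keep_bits
        step = 1 << drop
        out = []
        for s in samples_24bit:
            # TPDF dither: sum of two uniform values, roughly +/- one new quantization step
            dither = rng.randint(0, step - 1) + rng.randint(0, step - 1) - (step - 1)
            q = ((s + dither + step // 2) >> drop) << drop   # round to the coarser grid
            out.append(q)
        return out          # still 24-bit words, but the low bits are now all zero

    print(requantize([120_000, 1234, -5678, 0]))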

In contrast, most lossy compression algorithms just let you dial up the bit rate you need. If the audio doesn't sound so great at that bit rate, the producer has little choice other than to crank up the bit rate. With MLP, the producer can choose exactly what they want to do with the music to get the bit rate to a level they can accept.

Also, reducing the effective bit rate of the original music will always lower the bit rate of the compressed content with MLP. This isn't so with lossy schemes. For example, the producer might trim off the high-frequency content with a low-pass filter, and recompress with a lossy algorithm, only to find that nothing is saved because the lossy algorithm was already throwing away the high frequencies.

Robust

The MLP data stream contains restart information 200 to 1,000 times per second, depending on the data, so that any error in transmission can be recovered from quickly. In other words, you won't hear a long burst of noise (or silence) simply because the decoder lost its lock on the incoming signal.

Relatively Simple Decoding

MLP is deliberately asymmetric, which means that it's easier to decode than encode. Most of the work is done in the compression stage: analyzing the music, choosing good prediction filters and coefficients, figuring out which entropy coding scheme will produce the most benefit, etc. On the decoding end, a much simpler chipset can be used, which makes players cheaper and easier to implement.

Cascadable

Because the decoded output is bit-for-bit identical to the encoder input, a recording engineer can master the content, encode it and package it up. On the far side, his customer can decode the data, apply a final round of mastering tweaks to that content, and encode it again for final production while not losing any information due to the extra encoding and decoding process. This is the hallmark of lossless compression and something that is very much NOT the case with lossy codecs such as MPEG, DTS, or AC-3.

Summary

MLP is not only an efficient compression scheme for multi-channel audio, but also a flexible and dynamic format with room to grow and change as the industry changes. It combines multiple layers of compression with a wide array of clever innovations. While the future is not at all clear for high-resolution audio formats like DVD-A, we at Secrets certainly are impressed with the technical innovation present in MLP, and we predict that it will be around for a good long time.
