视频编码简介

来源：互联网发布：牛客网算法视频百度云编辑：程序博客网时间：2024/06/06 01:53

原文出处：http://diveintomark.org/tag/give是系列文章，当前只列出一部分

1 container formats

You may think of video files as “AVI files” or “MP4 files.” In reality, “AVI” and “MP4″ are just container formats. Just like a ZIP file can contain any sort of file within it, video container formats only define how to store things within them, not what kinds of data are stored. (It’s a little more complicated than that, because not all video streams are compatible with all container formats, but never mind that for now.) A video file usually contains multiple tracks — a video track (without audio), one or more audio tracks (without video), one or more subtitle/caption tracks, and so forth. Tracks are usually interrelated; an audio track contains markers within it to help synchronize the audio with the video, and a subtitle track contains time codes marking when each phrase should be displayed. Individual tracks can have metadata, such as the aspect ratio of a video track, or the language of an audio or subtitle track. Containers can also have metadata, such as the title of the video itself, cover art for the video, episode numbers (for television shows), and so on.

There are lots of video container formats. Some of the most popular include

MPEG 4, usually with an .mp4 extension. The MPEG 4 container is based on Apple’s older QuickTime container (.mov). Movie trailers on Apple’s website still use the older QuickTime container, but movies that you rent from iTunes are delivered in an MPEG 4 container, although Apple likes to use the .m4v file extension instead of .mp4.
Advanced Systems Format, usually with an .asf or .wmv extension. ASF was invented by and is primarily used by Microsoft in their Windows Media Player and their handheld Zune media device.
Flash Video, usually with an .flv extension. Flash Video is, unsurprisingly, used by Adobe Flash. Prior to Flash 9.0.60.184 (a.k.a. Flash Player 9 Update 3), this was the only container format that Flash supported. More recent versions of Flash also support the MPEG 4 container.
Matroska, usually with an .mkv extension. Matroska is an open standard, unencumbered by any known patents, and there are open source reference implementations for doing everything you might want to do with MKV files. It is primarily used by Free Software-loving hippies and, oddly enough, pirates who rip high-definition movies and television shows and release them in The Scene.
Ogg, usually with an .ogv extension. Like Matroska, Ogg is an open standard, open-source-friendly, and unencumbered by any known patents. Like Matroska, it is used primarily by Free Software-loving hippies, but that may begin to change once Mozilla ships their Firefox 3.1 browser. Firefox 3.1 will support — natively, without platform-specific plugins — the Ogg container format, the Ogg video codec (Theora), and the Ogg audio codec (Vorbis). On the desktop, Ogg is supported out-of-the-box by all major Linux distributions, and you can use it on Mac and Windows by installing the QuickTime components or DirectShow filters, respectively.
Audio Video Interleave, usually with an .avi extension. Also known as “the crappy, obsolete format that just won’t die.” The AVI container format was invented by Microsoft in a simpler time when the fact that computers could play video at all was considered pretty amazing. It does not officially support many of the features of more recent container formats; it does not officially support any sort of video metadata; it does not even officially support most of the modern video and audio codecs in use today. Over time, various companies have tried to extend it in generally incompatible ways to support this or that, and it is still the default container format for popular encoders such as MEncoder.

This is the briefest of overviews; there are literally dozens of other container formats in use today. DV, NUT, 3GP… the list goes on and on. And there are many others littering the landscape of failed or obsolete technologies, from companies that tried to corner the nascent digital video market by defining their own formats. Wikipedia has a good roundup of video container formats.

2 lossy video codecs

The most important consideration in video encoding is choosing a video codec. A future article will talk about how to pick the one that’s right for you, but for now I just want to introduce the concept and describe the playing field. (This information is likely to go out of date quickly; future readers, be aware that this was written in December 2008.)

When you talk about “watching a video,” you’re probably talking about a combination of one video stream, one audio stream, and possibly some subtitles or captions. But you probably don’t have two different files; you just have “the video.” Maybe it’s an AVI file, or an MP4 file. These are just container formats, like a ZIP file that contains multiple kinds of files within it. The container format defines how to store the video and audio streams in a single file (and subtitles too, if any).

When you “watch a video,” your video player is doing several things at once:

Interpreting the container format to find out which video and audio tracks are available, and how they are stored within the file so that it can find the data it needs to decode next
Decoding the video stream and displaying a series of images on the screen
Decoding the audio stream and sending the sound to your speakers
Possibly decoding the subtitle stream as well, and showing and hiding phrases at the appropriate times while playing the video

A video codec is an algorithm by which a video stream is encoded, i.e. it specifies how to do #2 above. Your video player decodes the video stream according to the video codec, then displays a series of images, or “frames,” on the screen. Most modern video codecs use all sorts of tricks to minimize the amount of information required to display one frame after the next. For example, instead of storing each individual frame (like a screenshot), they will only store the differences between frames. Most videos don’t actually change all that much from one frame to the next, so this allows for high compression rates, which results in smaller file sizes. (There are many, many other complicated tricks too, which I’ll dive into in a future article.)

There are lossy and lossless video codecs; today’s article will only deal with lossy codecs. A lossy video codec means that information is being irretrievably lost during encoding. Like copying an audio cassette tape, you’re losing information about the source video, and degrading the quality, every time you encode. Instead of the “hiss” of an audio cassette, a re-re-re-encoded video may look blocky, especially during scenes with a lot of motion. (Actually, this can happen even if you encode straight from the original source, if you choose a poor video codec or pass it the wrong set of parameters.) On the bright side, lossy video codecs can offer amazing compression rates, and many offer ways to “cheat” and smooth over that blockiness during playback, to make the loss less noticeable to the human eye.

There are tons of video codecs. Today I’ll discuss five modern lossy video codecs: MPEG-4 ASP, H.264, VC-1, Theora, and Dirac.

MPEG-4 ASP

a.k.a. “MPEG-4 Advanced Simple Profile.” MPEG-4 ASP was developed by the MPEG group and standardized in 2001. You may have heard of DivX, Xvid, or 3ivx; these are all competing implementations of the MPEG-4 ASP standard. Xvid is open source; DivX and 3ivx are closed source. The company behind DivX has had some mainstream success in branding “DivX” as synonymous with “MPEG-4 ASP.” For example, this “DivX-certified” DVD player can actually play most MPEG-4 ASP videos in an AVI container, even if they were created with a competing encoder. (To confuse things even further, the company behind DivX has now created their own container format.)

MPEG-4 ASP is patent-encumbered; licensing is brokered through the MPEG LA group. MPEG-4 ASP video can be embedded in most popular container formats, including AVI, MP4, and MKV.

H.264

a.k.a. “MPEG-4 part 10,” a.k.a. “MPEG-4 AVC,” a.k.a. “MPEG-4 Advanced Video Coding.” H.264 was also developed by the MPEG group and standardized in 2003. It aims to provide a single codec for low-bandwidth, low-CPU devices (cell phones); high-bandwidth, high-CPU devices (modern desktop computers); and everything in between. To accomplish this, the H.264 standard is split into “profiles,” which each define a set of optional features that trade complexity for file size. Higher profiles use more optional features, offer better visual quality at smaller file sizes, take longer to encode, and require more CPU power to decode in real-time.

To give you a rough idea of the range of profiles, Apple’s iPhone supports Baseline profile, the AppleTV set-top box supports Baseline and Main profiles, and Adobe Flash on a desktop PC supports Baseline, Main, and High profiles. YouTube (owned by Google, my employer) now uses H.264 to encode high-definition videos, playable through Adobe Flash; Youtube also provides H.264-encoded video to mobile devices, including Apple’s iPhone and phones running Google’s Android mobile operating system. Also, H.264 is one of the video codecs mandated by the Blu-Ray specification; Blu-Ray discs that use it generally use the High profile.

Most non-PC devices that play H.264 video (including iPhones and standalone Blu-Ray players) actually do the decoding on a dedicated chip, since their main CPUs are nowhere near powerful enough to decode the video in real-time. Recent high-end desktop graphics cards also support decoding H.264 in hardware. There are a number of competing H.264 encoders, including the open source x264 library. The H.264 standard is patent-encumbered; licensing is brokered through the MPEG LA group. H.264 video can be embedded in most popular container formats, including MP4 (used primarily by Apple’s iTunes Store) and MKV (used primarily by video pirates).

VC-1

VC-1 evolved from Microsoft’s WMV9 codec and was standardized in 2006. It is primarily used and promoted by Microsoft for high-definition video, although, like H.264, it has a range of profiles to trade complexity for file size. Also like H.264, it is mandated by the Blu-Ray specification, and all Blu-Ray players are required to be able to decode it. The VC-1 codec is patent-encumbered, with licensing brokered through the MPEG LA group.

Wikipedia has a brief technical comparison of VC-1 and H.264; Microsoft has their own comparison; Multimedia.cx has a pretty Venn diagram outlining the similarities and differences. Multimedia.cx also discusses the technical features of VC-1. I also found this history of VC-1 and H.264 to be interesting (as well as this rebuttal).

VC-1 is designed to be container-independent, although it is most often embedded in an ASF container. An open source decoder for VC-1 video was a 2006 Google Summer of Code project, and the resulting code was added to the multi-faceted ffmpeg library.

Theora

Theora evolved from the VP3 codec and has subsequently been developed by the Xiph.org Foundation. Theora is a royalty-free codec and is not encumbered by any known patents other than the original VP3 patents, which have been irrevocably licensed royalty-free. Although the standard has been “frozen” since 2004, the Theora project (which includes an open source reference encoder and decoder) only hit 1.0 in November 2008.

Theora video can be embedded in any container format, although it is most often seen in an Ogg container. All major Linux distributions support Theora out-of-the-box, and Mozilla Firefox 3.1 will include native support for Theora video in an Ogg container. And by “native”, I mean “available on all platforms without platform-specific plugins.” You can also play Theora video on Windows or on Mac OS X after installing Xiph.org’s open source decoder software.

The reference encoder included in Theora 1.0 is widely criticized for being slow and poor quality, but Theora 1.1 will include a new encoder that takes better advantage of Theora’s features, while staying backward-compatible with current decoders. (Info: 1, 2, 3, 4, 5, source code.)

Dirac

Dirac was developed by the BBC to provide a royalty-free alternative to H.264 and VC-1 that the BBC could use to stream high-definition television content in Great Britain. Like H.264, Dirac aims to provide a single codec for the full spectrum of very low- and very high-bandwidth streaming. Dirac is not encumbered by any known patents, and there are two open source implementations, dirac-research (the BBC’s reference implementation) and Schroedinger (optimized for speed).

The Dirac standard was only finalized in 2008, so there is very little mainstream use yet, although the BBC did use it internally during the 2008 Olympics. Dirac-encoded video tracks can be embedded in several popular container formats, including MP4, Ogg, MKV, and AVI. VLC 0.9.2 (released in September 2008) can play Dirac-encoded video within an Ogg or MP4 container.

And on and on…

Of course, this is only scratching the surface of all the available video codecs. Video encoding goes way back, but my focus in this series is on the present and near-future, not the past. If you like, you can read about MPEG-2 (used in DVDs), MPEG-1 (used in Video CDs), older versions of Microsoft’s WMV family, Sorenson, Indeo, and Cinepak.

3 lossy audio codecs

Unless you’re going to stick to films made before 1927 or so, you’re going to want an audio track. A future article will talk about how to pick the audio codec that’s right for you, but for now I just want to introduce the concept and describe the playing field. (This information is likely to go out of date quickly; future readers, be aware that this was written in December 2008.)

Like video codecs, audio codecs are algorithms by which an audio stream is encoded. Like video codecs, there are lossy and lossless audio codecs. Today’s article will only deal with lossy audio codecs. Actually, it’s even narrower than that, because there are different categories of lossy audio codecs. Audio is used in many places where video is not (telephony, for example), and there is an entire category of audio codecs optimized for encoding speech. You wouldn’t rip a music CD with these codecs, because the result would sound like a 4-year-old singing into a speakerphone. But you would use them in an Asterisk PBX, because bandwidth is precious, and these codecs can compress human speech into a fraction of the size of general-purpose codecs.

And that’s all I have to say about speech-optimized audio codecs. Onward…

As I mentioned in part 2: lossy video codecs, when you “watch a video,” your player software is doing several things at once:

Interpreting the container format
Decoding the video stream
Decoding the audio stream and sending the sound to your speakers
Possibly decoding the subtitle stream as well. (Tomorrow’s article will be all about subtitle formats! I can hardly wait!)

The audio codec specifies how to do #3 — decoding the audio stream and turning it into digital waveforms that your speakers then turn into sound. As with video codecs, there are all sorts of tricks to minimize the amount of information stored in the audio stream. And since we’re talking about lossy audio codecs, information is being lost during the recording → encoding → decoding → listening lifecycle. Different audio codecs throw away different things, but they all have the same purpose: to trick your ears into not noticing the parts that are missing.

One concept that audio has that video does not is channels. We’re sending sound to your speakers, right? Well, how many speakers do you have? If you’re sitting at your computer, you may only have two: one on the left and one on the right. My desktop has three: left, right, and one more on the floor. So-called “surround sound” systems can have six or more speakers, strategically placed around the room. Each speaker is fed a particular channel of the original recording. The theory is that you can sit in the middle of the six speakers, literally surrounded by six separate channels of sound, and your brain synthesizes them and feels like you’re in the middle of the action. Does it work? A multi-billion-dollar industry seems to think so.

Most general-purpose audio codecs can handle two channels of sound. During recording, the sound is split into left and right channels; during encoding, both channels are stored in the same audio stream; during decoding, both channels are decoded and each is sent to the appropriate speaker. Some audio codecs can handle more than two channels, and they keep track of which channel is which and so your player can send the right sound to the right speaker.

There are lots of audio codecs. Did I say there were lots of video codecs? Forget that. There are a metric fuck-ton of audio codecs. These are the ones you need to know about:

MPEG-1 Audio Layer 3
Advanced Audio Coding
Windows Media Audio
Vorbis
Dolby Digital
Digital Theater System

MPEG-1 Audio Layer 3

…colloquially known as “MP3.” If you haven’t heard of MP3s, I don’t know what to do with you. Walmart sells portable music players and calls them “MP3 players.” Walmart. Anyway…

MP3s can contain up to 2 channels of sound. They can be encoded at different bitrates: 64 kbps, 128 kbps, 192 kbps, and a variety of others from 32 to 320. Higher bitrates mean larger file sizes and better quality audio, although the ratio of audio quality to bitrate is not linear. (128 kbs sounds more than twice as good as 64 kbs, but 256 kbs doesn’t sound twice as good as 128 kbs.) Furthermore, the MP3 format allows for variable bitrate encoding, which means that some parts of the encoded stream are compressed more than others. For example, silence between notes can be encoded at a very low bitrate, then the bitrate can spike up a moment later when multiple instruments start playing a complex chord. MP3s can also be encoded with a constant bitrate, which, unsurprisingly, is called constant bitrate encoding.

The MP3 standard doesn’t define exactly how to encode MP3s (although it does define exactly how to decode them); different encoders use different psychoacoustic models that produce wildly different results, but are all decodable by the same players. The open source LAME project is the best free encoder, and arguably the best encoder period for all but the lowest bitrates.

The MP3 format was standardized in 1991 and is patent-encumbered, which explains why Linux ~~sucks~~ can’t play MP3 files out of the box. Pretty much every portable music player supports standalone MP3 files, and MP3 audio streams can be embedded in any video container. Adobe Flash can play both standalone MP3 files and MP3 audio streams within an MP4 video container.

Advanced Audio Coding

…affectionately known as “AAC.” Standardized in 1997, it lurched into prominence when Apple chose it as their default format for the iTunes Store. Originally, all AAC files “bought” from the iTunes Store were encrypted with Apple’s proprietary DRM scheme, called FairPlay. Many songs in the iTunes Store are now available as unprotected AAC files, which Apple calls “iTunes Plus” because it sounds so much better than calling everything else “iTunes Minus.” The AAC format is patent-encumbered; licensing rates are available online.

AAC was designed to provide better sound quality than MP3 at the same bitrate, and it can encode audio at any bitrate. (MP3 is limited to a fixed number of bitrates, with an upper bound of 320 kbs.) AAC can encode up to 48 channels of sound, although in practice no one does that. The AAC format also differs from MP3 in defining multiple profiles, in much the same way as H.264, and for the same reasons. The “low-complexity” profile is designed to be playable in real-time on devices with limited CPU power, while higher profiles offer better sound quality at the same bitrate at the expense of slower encoding and decoding.

All current Apple products, including iPods, AppleTV, and QuickTime support certain profiles of AAC in standalone audio files and in audio streams in an MP4 video container. Adobe Flash supports all profiles of AAC in MP4, as do the open source mplayer and VLC video players. For encoding, the FAAC library is the open source option; support for it is a compile-time option in mencoder and ffmpeg. (I’ll dive into all the different encoding tools in a future article.)

Windows Media Audio

…a.k.a. “WMA.” As you might guess from the name, Windows Media Audio was developed by Microsoft. The acronym “WMA” has historically referred to many different things: a lossless audio codec (“WMA Lossless”), a speech-optimized codec (“WMA Voice”), and several different lossy audio codecs (“WMA 1″, “WMA 2″, “WMA 7″, “WMA 8″, “WMA 9″, and “WMA Pro”). It is also (incorrectly) used to refer to the Advanced Systems Format, because WMA-encoded audio streams are usually embedded in an ASF container. Roughly speaking, the lossy audio codecs (WMA 1-9) compete with MP3 and low-complexity AAC; WMA Lossless competes with Apple Lossless and FLAC; WMA Pro competes with high-complexity AAC, Vorbis, AC-3, and DTS.

All the different codecs under the “WMA” brand are playable with Windows Media Player, which comes pre-installed on desktops and laptops running Microsoft Windows XP and Vista. Portable devices like the Zune and the ironically named “PlaysForSure” devices can play WMA 1-9; stores that allow you to “purchase” WMA files generally encrypt them with a Microsoft-proprietary DRM scheme. The open source ffmpeg project can play WMA 1-9, and Flip4Mac offers a commercial QuickTime component to encode and decode WMA audio on Mac OS X.

WMA 1-9 support up to 2 channels of sound; WMA Pro supports up to 8 channels of sound. All WMA formats are patent-encumbered; licensing information is available from Microsoft.

Vorbis

…known to many as “Ogg Vorbis,” although for some reason that pisses off both Ogg and Vorbis advocates. (Technically, “Ogg” is a container format, and Vorbis audio streams can be embedded in other containers.) Vorbis is not encumbered by any known patents and is therefore supported out-of-the-box by all major Linux distributions and by portable devices running the open source Rockbox firmware. Mozilla Firefox 3.1 will support Vorbis audio files in an Ogg container, or Ogg videos with a Vorbis audio track. Android mobile phones can also play standalone Vorbis audio files. Vorbis audio streams are usually embedded in an Ogg container, but they can also be embedded in an MP4 or MKV container (or, with some hacking, in AVI).

There are open source Vorbis encoders and decoders, including OggConvert (encoder), ffmpeg (decoder), aoTuV (encoder), and libvorbis (decoder). There are also QuickTime components for Mac OS X and DirectShow filters for Windows.

Vorbis supports an arbitrary number of sound channels.

Dolby Digital

…a.k.a. “AC-3.” AC-3 was developed by Dolby Laboratories. AC-3 is most well-known for being a mandatory format in the DVD standard; all DVD players must be able to decode AC-3 audio streams. It is also mandatory for Blu-Ray players, and many digital TV broadcasts send AC-3 audio streams as well. AC-3 supports up to 6 channels of sound and bitrates of up to 640 kbps, although its most popular application — audio on DVDs — is officially limited to 448 kbps. (Blu-Ray discs may use the maximum 640 kbps.)

There are open source encoders and decoders for AC-3, including liba52 (decoding), AC3Filter (decoding), and Aften (encoding). ffmpeg has a compile-time option to include liba52, which will allow all ffmpeg-based players and plugin chains (like GStreamer) to play AC-3 audio streams. However, the AC-3 format is patent-encumbered; licensing is brokered by Dolby Laboratories.

AC-3 is rarely seen in standalone audio files; it is designed to be embedded in a video container. Other than DVDs and Blu-Ray discs (which use a video container format I haven’t talked about yet), you can embed AC-3 audio streams in MKV, AVI, and — just standardized earlier this year — in MP4 files (discussion). Apple’s AppleTV set-top box is the only hardware device I know of that supports AC-3 in MP4; you can encode AppleTV-compatible AC3-in-MP4 videos with HandBrake, or manually insert AC-3 audio into existing MP4 files with this Windows-only fork of mp4creator.

Digital Theater System

…a.k.a. “DTS.” As you might guess from the name, DTS is designed for real-life movie theaters. Like WMA, “DTS” is a brand name for a family of different audio formats. The “core” DTS format supports up to six channels; later extensions like DTS-HD support up to eight channels. There is also DTS-HD Master Audio, a lossless variant by the same company. Core DTS is designed for high bitrates (up to 1536 kbps, which is virtually indistinguishable from being there in the first place). DTS-HD Master Audio bitrates can go even higher, although at some point even audiophiles will wonder why they should bother.

Core DTS was not originally part of the DVD specification, so early DVD players did not support it. Most recent DVD players support natively decoding core DTS audio or passing the audio stream through to an external speaker system which decodes it, but relatively few DVDs include a DTS stream due to size constraints. Core DTS is a mandatory part of the Blu-Ray specification, and many Blu-Ray discs include a DTS audio track — sometimes the exact same stream that was originally played in the movie theater. (DTS-HD Master Audio is an optional part of the Blu-Ray specification, but few Blu-Ray discs include it due to — you guessed it — size constraints.)

DTS is patent-encumbered; licensing is brokered by DTS, Inc.

And so forth and so on

As with everything else in this series, this article barely scratches the surface. (Really!) If you like, you can read about other audio codecs: ATRAC, Musepack, MP2, RealAudio, AMR, ADPCM, and so forth and so on. Wikipedia has a comparison of common audio codecs, HydrogenAudio has lots of technical details, and wiki.multimedia.cx is always your friend too.

4 captioning

The first thing you need to know about captions and subtitles is that captions and subtitles are different. The second thing you need to know about captions and subtitles is that you can safely ignore the differences unless you’re creating your own from scratch. I’m going to use the terms interchangeably throughout this article, which will probably drive you crazy if you happen to know and care about the difference.

Historically, captioning has been driven by the needs of deaf and hearing impaired consumers, and captioning technology has been designed around the technical quirks of broadcast television. In the United States, so-called “closed captions” are embedded into a part of the NTSC video source (“Line 21″) that is normally outside the viewing area on televisions. In Europe, they use a completely different system that is embeddable in the PAL video source. Over time, each new medium (VHS, DVD, and now online digital video) has dealt a blow to the accessibility gains of the previous medium. For example:

PAL VHS tapes did not have enough bandwidth to store closed captions at all.
DVDs have the technical capability, but producers often manage to screw it up anyway; e.g. DVDs of low-budget television shows are often released without the closed captions that accompanied the original broadcast.
HDMI cables drop “Line 21″ closed captions altogether. If you play an NTSC DVD on an HDTV over HDMI, you’ll never see the closed captions, even if the DVD has them.

And accessible online video is just fucking hopeless. (And no, it won’t change unless new regulation forces it to change. When it comes to captioning, Joe Clark has been right longer than many of you have been alive.)

So even in broadcast television, captioning technology was fractured by different broadcast technologies in different countries. Digital video had the capability of unifying the technologies and learning from their mistakes. Of course, exactly the opposite happened. Early caption formats split along company lines; each major video software platform (RealPlayer, QuickTime, Windows Media, Adobe Flash) implemented captioning in their own way, with levels of adoption ranging from nil to zilch. At the same time, an entire subculture developed around “fan-subbing,” i.e. using captioning technology to provide translations of foreign language videos. For example, non-Japanese-speaking consumers wanted to watch Japanese anime films, so amateur translators stepped up to publish their own English captions that could be overlaid onto the original film. In the 1980s, fansubbers would actually take VHS tapes and overlay the English captions onto a new tape, which they would then (illegally) distribute. Nowadays, translators can simply publish their work on the Internet as a standalone file. English-speaking consumers can have their DVDs shipped directly from Japan, and they use software players that can overlay standalone English caption files while playing their Japanese-only DVDs. The legality of distributing these unofficial translations (even separately, in the form of standalone caption files) has been disputed in recent years, but the fansubbing community persists.

Technically, there is a lot of variation in captioning formats. At their core, captions are a combination of text to display, start and end times to display it, information about where to position the text on a screen, fonts, styling, alignment, and so on. Some captions roll up from the bottom of the screen, others simply appear and disappear at the appropriate time. Some caption formats mandate where each caption should be placed and how it should be styled; others merely suggest position and styling; others leave all display attributes entirely up to the player. Almost every conceivable combination of these variables has been tried. Some forms of media try multiple combinations at once. DVDs, for example, can have two entirely distinct forms of captioning — closed captioning (as used in NTSC broadcast television) embedded in the video stream, and one or more subtitle tracks. DVD subtitle tracks are used for many different things, including subtitles (just the words being spoken, in the same language as the audio), captions for the hearing impaired (which include extra notations of background noises and such), translations into other languages, and director’s commentary. Oh, and they’re stored on the DVD as images, not text, so the end user has no control over fonts or font size.

Beyond DVDs, most caption formats store the captions as text, which inevitably raises the issue of character encoding. Some caption formats explicitly specify the character encoding, others only allow UTF-8, others don’t specify any encoding at all. On the player side, most players respect the character encoding if present (but may only support specific encodings); in its absence, some players assume UTF-8, some guess the encoding, and some allow the user to override the encoding. Obviously standalone caption files can be in any format, but if you want to embed your captions as a track within a video container, your choices are limited to the caption formats that the video container supports.

And remember when I said that there were a metric fuck-ton of audio codecs? Forget that. There are an imperial fuck-ton of caption formats (i.e. multiply by 9/5 and add 32). Here is a partial list of caption formats, taken from the list of formats supported by Subtitle Workshop, which I used to caption my short-lived video podcast series:

Adobe Encore DVD, Advanced SubStation Alpha, AQTitle, Captions 32, Captions DAT, Captions DAT Text, Captions Inc., Cheetah, CPC-600, DKS Subtitle Format, DVD Junior, DVD Studio Pro, DVD Subtitle System, DVDSubtitle, FAB Subtitler, IAuthor Script, Inscriber CG, JACOSub 2.7+, Karaoke Lyrics LRC, Karaoke Lyrics VKT, KoalaPlayer, MacSUB, MicroDVD, MPlayer, MPlayer2, MPSub, OVR Script, Panimator, Philips SVCD Designer, Phoenix Japanimation Society, Pinnacle Impression, PowerDivX, PowerPixel, QuickTime Text, RealTime, SAMI Captioning, Sasami Script, SBT, Sofni, Softitler RTF, SonicDVD Creator, Sonic Scenarist, Spruce DVDMaestro, Spruce Subtitle File, Stream SubText Player, Stream SubText Script, SubCreator 1.x, SubRip, SubSonic, SubStation Alpha, SubViewer 1.0, SubViewer 2.0, TMPlayer, Turbo Titler, Ulead DVD Workshop 2.0, ViPlay Subtitle File, ZeroG.

Which of these formats are important? The answer will depend on whom you ask, and more specifically, how you’re planning to distribute your video. This series is primarily focused on videos delivered as files to be played on PCs or other computing devices, so my choices here will reflect that. These are some of the most well-supported caption formats:

SubRip
SubStation Alpha
MPEG-4 Timed Text
SAMI
SMIL

SubRip

SubRip is the AVI of caption formats, in the sense that its basic functionality is supported everywhere but various people have tried to extend it in mostly incompatible ways and the result is a huge mess. As a standalone file, SubRip captions are most commonly seen with a .srt extension. SubRip is a text-based format which can include font, size, and position information, as well as a limited set of HTML formatting tags, although most of these features are poorly supported. Its “official” specification is a doom9 forum post from 2004. Most players assume that .srt files are encoded in Windows-1252 (what Windows programs frequently call “ANSI”), although some can detect and switch to UTF-8 encoding automatically.

Because .srt files are so often published separately from the video files they describe, the most common use case is to put your .srt file in the same directory as your video file and give them the same name (up to the file extensions). But it is also possible to embed SubRip captions directly into AVI files with AVI-Mux GUI, into MKV files with mkvmerge, and into MP4 files with MP4Box.

You can play SubRip captions in Windows Media Player or other DirectShow-based video players after installing VSFilter; in QuickTime after installing Perian; on Linux, both mplayer and VLC support it natively.

SubStation Alpha

SubStation Alpha and its successor, Advanced SubStation Alpha, are the preferred caption formats of the fansubbing community. As standalone files, they are commonly seen with .ssa or .ass extensions. They have a spec longer than three paragraphs. They are actually miniature scripting languages. A .ass file contains a series of commands to control position, scrolling, animation, font, size, scaling, letter spacing, borders, text outline, text shadow, alignment, and so on; and a series of time-coded events for displaying text given the current styling parameters. It has support for multiple character encodings.

The playing requirements for SubStation Alpha captions are almost identical to SubRip. The same plugins are required for Windows and Mac OS X. On Linux, mplayer prides itself on having the most complete SSA/ASS implementation.

MPEG-4 Timed Text

a.k.a. “MPEG-4 Part 17,” a.k.a. ISO 14496-17, MPEG-4 Timed Text (hereafter “MP4TT”) is the one and only caption format for the MP4 container. It is not a file format; it is only defined in terms of a track within an MP4 container. As such, it can not be embedded in any other video container, and it can not exist as a separate file. (Note: the last sentence was a lie; the MPEG-4 Timed Text format is really the 3GPP Timed Text format, and it can very much be embedded in a 3GPP container. What I meant to say is that the format can not be embedded in any of the other popular video container formats like AVI, MKV, or OGG. I could go on about the subtle differences between MPEG-4 Timed Text in an MP4 container and 3GPP Timed Text in a 3GPP container, but it would just make you cry, and besides, technical accuracy is for pussies.)

MP4TT defines detailed information on text positioning, fonts, styles, scrolling, and text justification. These details are encoded into the track at authoring time, and can not be changed by the end user’s video player. The most readable description of its features is actually the documentation for GPAC, an open source implementation of much of the MPEG-4 specification (including MP4TT). Since MP4TT doesn’t define a text-based serialization, GPAC invented one for their own use; since their format is designed to capture all the possible information in an MP4TT track, it turns out to be an easy way to read about all of MP4TT’s features.

MP4Box, part of the GPAC project, can take an .srt file and convert it into a MPEG-4 Timed Text track and embed it in an existing MP4 file. It can also reverse the process — extract a Timed Text track from an MP4 file and output a .srt file.

On Mac OS X, QuickTime supports MP4TT tracks within an MP4 container, but only if you rename the file from .mp4 to .3gp or .m4v. I shit you not. (On the plus side, changing the file extension will allow you to sync compatible video to an iPod or iPhone, which will actually display the captions. Still not kidding.) On Windows, any DirectShow-based video player (such as Windows Media Player or Media Player Classic) supports MP4TT tracks once you install Haali Media Splitter. On Linux, VLC has supported MP4TT tracks for several years.

SAMI

SAMI was Microsoft’s first attempt to create a captioning format for PC video files (as opposed to broadcast television or DVDs). As such, it is natively supported by Microsoft video players, including Windows Media Player, without the need for third-party plugins. It has a specification on MSDN. It is a text-based format that supports a large subset of HTML formatting tags. SAMI captions are almost always embedded in an ASF container, along with Windows Media video and Windows Media audio.

Don’t use SAMI for new projects; it has been superceded by SMIL. For historical purposes, you may enjoy reading about creating SAMI captions and embedding them in an ASF container, as long as you promise to never, ever try it at home.

SMIL

SMIL (Synchronized Multimedia Integration Language) is not actually a captioning format. It is “an XML-based language that allows authors to write interactive multimedia presentations.” It also happens to have a timing and synchronization module that can, in theory, be used to display text on a series of moving pictures. That is to say, if you think of SMIL as a way to provide captions for a video, you’re doing it wrong. You need to invert your thinking — your video and your captions are each merely components of a SMIL presentation. SMIL captions are not embedded into a video container; the video and its captions are referenced from a SMIL document.

SMIL is a W3C standard; the most recent revision, SMIL 3.0, was just published in December 2008. If you printed out the SMIL 3.0 specification on US-Letter-sized paper, it would weigh in at 395 pages. So don’t do that.

QuickTime supports a subset of SMIL 1.0. WebAIM provides a nice tutorial on using SMIL to add captions to a QuickTime movie.