Notes on the MP4 specification, ISO/IEC 14496 part 12

Source: Internet. Editor: 程序博客网. Time: 2024/05/29 13:24

ISO/IEC 14496 part 12 download: http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
Related references:
An introductory document on the MP4 file format and the basic boxes: http://www.52rd.com/Blog/wqyuwss/559/1/
Complete List of all known MP4 / QuickTime 'ftyp' designations: http://www.ftyps.com/
Definitions of all box types (FourCC): http://www.mp4ra.org/codecs.html
Adding PCM audio support for MP4 to ffmpeg: http://blog.csdn.net/garefield/article/details/45113363
H264: the MP4 format, and extracting the H264 SPS, PPS and bitstream from an MP4 file: http://www.cnblogs.com/skyseraph/archive/2012/04/01/2429384.html

“stsz” BOX（Sample Size Box）

Box definition:

Box Type: ‘stsz’, ‘stz2’
Container: Sample Table Box (‘stbl’)
Mandatory: Yes
Quantity: Exactly one variant must be present

This box records the total number of samples in the track and the size in bytes of each sample. Its syntax is:

aligned(8) class SampleSizeBox extends FullBox(‘stsz’, version = 0, 0) {
    unsigned int(32) sample_size;
    unsigned int(32) sample_count;
    if (sample_size==0) {
        for (i=1; i <= sample_count; i++) {
            unsigned int(32) entry_size;
        }
    }
}

Fields: sample_size (32 bits) and sample_count (32 bits), optionally followed by a table of sample sizes. When all samples have the same size, sample_size alone expresses it. Otherwise sample_size is set to 0 and a table of sample_count entries follows, recording the size of each sample.
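
As a rough illustration of the two cases, the sketch below reads the body of an ‘stsz’ box and always returns one size per sample. The function name parse_stsz is hypothetical, and the code assumes the caller has already stripped the 8-byte box header (size and type), so the payload starts with the 4 bytes of version and flags.

```python
import struct

def parse_stsz(payload: bytes):
    """Return a list with one size per sample from an 'stsz' box body.

    Assumption: payload starts at the FullBox version/flags word,
    i.e. the 8-byte size/type box header has already been removed.
    """
    _version_flags, sample_size, sample_count = struct.unpack_from(">III", payload, 0)
    if sample_size != 0:
        # Every sample has the same size, so no table is stored.
        return [sample_size] * sample_count
    # sample_size == 0: a table of sample_count 32-bit entries follows.
    return list(struct.unpack_from(f">{sample_count}I", payload, 12))
```

All integers in the file are big-endian, which is why the format strings start with ">".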

“stsc” BOX（Sample To Chunk Box）

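Box definition:

Box Type: ‘stsc’
Container: Sample Table Box (‘stbl’)
Mandatory: Yes
Quantity: Exactly one

My personal understanding: since a single file has to hold both audio and video, the audio and video data are usually interleaved, i.e. a run of video data, then a run of audio data, then video data again, and so on. To make audio and video frames easier to look up, one run of video data usually holds one complete GOP (the data from one I-frame up to the next I-frame).

The standard describes it as follows: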

Samples within the media data are grouped into chunks. Chunks can be of different sizes, and the samples
within a chunk can have different sizes. This table can be used to find the chunk that contains a sample, its
position, and the associated sample description.

The table is compactly coded. Each entry gives the index of the first chunk of a run of chunks with the same
characteristics. By subtracting one entry here from the previous one, you can compute how many chunks are
in this run. You can convert this to a sample count by multiplying by the appropriate samples-per-chunk.

Syntax:

aligned(8) class SampleToChunkBox extends FullBox(‘stsc’, version = 0, 0) {
    unsigned int(32) entry_count;
    for (i=1; i <= entry_count; i++) {
        unsigned int(32) first_chunk;
        unsigned int(32) samples_per_chunk;
        unsigned int(32) sample_description_index;
    }
}

Note that the chunk entries are stored in compressed form: if the next chunk has the same samples_per_chunk and sample_description_index as the previous chunk, only the earlier entry is stored (think of run-length encoding). The fields are explained as follows:

version is an integer that specifies the version of this box

entry_count is an integer that gives the number of entries in the following table

first_chunk is an integer that gives the index of the first chunk in this run of chunks that share the
same samples-per-chunk and sample-description-index; the index of the first chunk in a track has the
value 1 (the first_chunk field in the first record of this box has the value 1, identifying that the first
sample maps to the first chunk).

samples_per_chunk is an integer that gives the number of samples in each of these chunks

sample_description_index is an integer that gives the index of the sample entry that describes the
samples in this chunk. The index ranges from 1 to the number of sample entries in the Sample
Description Box
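
To make the run-length idea concrete, here is a minimal sketch that expands the compact table into one samples-per-chunk value for every chunk. The function name and the tuple layout are assumptions for illustration. The total chunk count is not stored in ‘stsc’ itself; the sketch assumes the caller takes it from the entry_count of the chunk offset box.

```python
def expand_stsc(stsc_entries, chunk_count):
    """Expand run-length 'stsc' entries into a per-chunk sample count list.

    stsc_entries: list of (first_chunk, samples_per_chunk,
                  sample_description_index) tuples, first_chunk is 1-based.
    chunk_count:  total number of chunks in the track (assumed to come
                  from the 'stco' / 'co64' entry_count).
    """
    counts = []
    for i, (first_chunk, per_chunk, _desc_index) in enumerate(stsc_entries):
        if i + 1 < len(stsc_entries):
            # The run ends just before the next entry's first_chunk.
            next_first = stsc_entries[i + 1][0]
        else:
            # The last run extends to the final chunk of the track.
            next_first = chunk_count + 1
        counts.extend([per_chunk] * (next_first - first_chunk))
    return counts
```

For example, the entries (1, 3, 1), (3, 1, 1), (5, 2, 1) with six chunks mean: chunks 1 and 2 hold three samples each, chunks 3 and 4 hold one each, and chunks 5 and 6 hold two each.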

“stco” BOX（Chunk Offset Box）

Box definition:

Box Type: ‘stco’, ‘co64’
Container: Sample Table Box (‘stbl’)
Mandatory: Yes
Quantity: Exactly one variant must be present

Records the offset of each chunk's frame data, i.e. where the chunk is stored in the file.

Syntax:

aligned(8) class ChunkOffsetBox extends FullBox(‘stco’, version = 0, 0) {
    unsigned int(32) entry_count;
    for (i=1; i <= entry_count; i++) {
        unsigned int(32) chunk_offset;
    }
}

The fields are fairly simple; the standard describes them as:

version is an integer that specifies the version of this box
entry_count is an integer that gives the number of entries in the following table
chunk_offset is a 32 or 64 bit (co64 box) integer that gives the offset of the start of a chunk into its containing
media file.
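
Chunk offsets only locate the start of each chunk. To find an individual sample, the samples inside a chunk are assumed to be stored back to back, so a sample's position is the chunk offset plus the sizes of the samples before it in the same chunk. The sketch below shows that calculation; the function name and its three list arguments are hypothetical, standing for values already read from ‘stco’, ‘stsc’ (expanded to one count per chunk) and ‘stsz’.

```python
def sample_offsets(chunk_offsets, samples_in_chunk, sample_sizes):
    """Compute the absolute file offset of every sample in a track.

    chunk_offsets:    one file offset per chunk (from 'stco' or 'co64').
    samples_in_chunk: number of samples in each chunk, same length as
                      chunk_offsets (derived from 'stsc').
    sample_sizes:     size in bytes of every sample, in order (from 'stsz').
    """
    offsets = []
    sample_index = 0
    for chunk_offset, count in zip(chunk_offsets, samples_in_chunk):
        position = chunk_offset
        for size in sample_sizes[sample_index:sample_index + count]:
            offsets.append(position)
            position += size  # the next sample starts right after this one
        sample_index += count
    return offsets
```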

“stts” BOX（Decoding Time to Sample Box）

Box definition:

Box Type: ‘stts’
Container: Sample Table Box (‘stbl’)
Mandatory: Yes
Quantity: Exactly one

Mainly used to compute the DTS: DT(n+1) = DT(n) + STTS(n). A track with a fixed frame rate usually has just one entry.

Syntax:

aligned(8) class TimeToSampleBox extends FullBox(’stts’, version = 0, 0) {
    unsigned int(32) entry_count;
    int i;
    for (i=0; i < entry_count; i++) {
        unsigned int(32) sample_count;
        unsigned int(32) sample_delta;
    }
}

Field descriptions:
entry_count - is an integer that gives the number of entries in the following table.
sample_count - is an integer that counts the number of consecutive samples that have the given
duration.
sample_delta - is an integer that gives the delta of these samples in the time-scale of the media.
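
The recurrence DT(n+1) = DT(n) + STTS(n) can be sketched as below. The function name is hypothetical, and the sketch assumes the first sample has decoding time 0. The resulting values are in units of the media timescale, so dividing by the timescale gives seconds.

```python
def decode_times(stts_entries):
    """Expand 'stts' entries into one decoding time per sample.

    stts_entries: list of (sample_count, sample_delta) tuples.
    Returns times in media timescale units, starting at 0 (assumption).
    """
    times = []
    current = 0
    for sample_count, sample_delta in stts_entries:
        for _ in range(sample_count):
            times.append(current)
            current += sample_delta  # DT(n+1) = DT(n) + delta
    return times
```

With a timescale of 90000 and a single entry (250, 3600), every sample lasts 3600 / 90000 = 0.04 s, which corresponds to 25 frames per second.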

“stsd” BOX（Sample Description Box）

Box definition:

Box Types: ‘stsd’
Container: Sample Table Box (‘stbl’)
Mandatory: Yes
Quantity: Exactly one

Mainly describes the coding information of the track: for audio, the channel count, sample rate and bit depth; for video, the image width and height; plus the codec tag ID and codec-private data such as the H264 SPS, PPS and profile ID, or the AAC profile ID.

Syntax:

class VisualSampleEntry(codingname) extends SampleEntry (codingname){
    unsigned int(16) pre_defined = 0;
    const unsigned int(16) reserved = 0;
    unsigned int(32)[3] pre_defined = 0;
    unsigned int(16) width;
    unsigned int(16) height;
    template unsigned int(32) horizresolution = 0x00480000; // 72 dpi
    template unsigned int(32) vertresolution = 0x00480000; // 72 dpi
    const unsigned int(32) reserved = 0;
    template unsigned int(16) frame_count = 1;
    string[32] compressorname;
    template unsigned int(16) depth = 0x0018;
    int(16) pre_defined = -1;
    // other boxes from derived specifications
    CleanApertureBox clap; // optional
    PixelAspectRatioBox pasp; // optional
}

// Audio Sequences
class AudioSampleEntry(codingname) extends SampleEntry (codingname){
    const unsigned int(32)[2] reserved = 0;
    template unsigned int(16) channelcount = 2;
    template unsigned int(16) samplesize = 16;
    unsigned int(16) pre_defined = 0;
    const unsigned int(16) reserved = 0 ;
    template unsigned int(32) samplerate = { default samplerate of media}<<16;
}

aligned(8) class SampleDescriptionBox (unsigned int(32) handler_type) extends FullBox('stsd', 0, 0){
    int i ;
    unsigned int(32) entry_count;
    for (i = 1 ; i <= entry_count ; i++){
        switch (handler_type){
            case ‘soun’: // for audio tracks
                AudioSampleEntry();
                break;
            case ‘vide’: // for video tracks
                VisualSampleEntry();
                break;
            case ‘hint’: // Hint track
                HintSampleEntry();
                break;
            case ‘meta’: // Metadata track
                MetadataSampleEntry();
                break;
        }
    }
}
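
As a small illustration of the AudioSampleEntry layout, the sketch below unpacks its own 20 bytes of fields. The function name is hypothetical, and the code assumes the caller has already skipped the box header and the fields inherited from SampleEntry, which are not shown in the syntax above. The sample rate is stored shifted left by 16 bits, so it is shifted back to recover the value in Hz.

```python
import struct

def parse_audio_entry_fields(body: bytes):
    """Unpack the AudioSampleEntry-specific fields (20 bytes).

    Assumption: body starts right after the inherited SampleEntry fields.
    Layout: 8 reserved bytes, channelcount, samplesize, pre_defined,
    reserved (16 bits each), then samplerate as a 32-bit value << 16.
    """
    channelcount, samplesize, _pre_defined, _reserved, samplerate = struct.unpack_from(
        ">8xHHHHI", body, 0
    )
    return {
        "channelcount": channelcount,
        "samplesize": samplesize,
        "samplerate": samplerate >> 16,  # keep the integer part only
    }
```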

“stbl” BOX（Sample Table Box）

Definition:

Box Type: ‘stbl’
Container: Media Information Box (‘minf’)
Mandatory: Yes
Quantity: Exactly one

This is a container box.

“minf” BOX（Media Information Box）

Definition:

Box Type: ‘minf’
Container: Media Box (‘mdia’)
Mandatory: Yes
Quantity: Exactly one

This box contains all the objects that declare characteristic information of the media in the track. It is a container box.

“mdhd” BOX（Media Header Box）

Definition:

Box Type: ‘mdhd’
Container: Media Box (‘mdia’)
Mandatory: Yes
Quantity: Exactly one

The media header declares overall information that is media-independent, and relevant to characteristics of
the media in a track.

Syntax:

aligned(8) class MediaHeaderBox extends FullBox(‘mdhd’, version, 0) {
    if (version==1) {
        unsigned int(64) creation_time;
        unsigned int(64) modification_time;
        unsigned int(32) timescale;
        unsigned int(64) duration;
    } else { // version==0
        unsigned int(32) creation_time;
        unsigned int(32) modification_time;
        unsigned int(32) timescale;
        unsigned int(32) duration;
    }
    bit(1) pad = 0;
    unsigned int(5)[3] language; // ISO-639-2/T language code
    unsigned int(16) pre_defined = 0;
}

Semantics:

version is an integer that specifies the version of this box (0 or 1)

creation_time is an integer that declares the creation time of the media in this track (in seconds since
midnight, Jan. 1, 1904, in UTC time)

modification_time is an integer that declares the most recent time the media in this track was
modified (in seconds since midnight, Jan. 1, 1904, in UTC time)

timescale is an integer that specifies the time-scale for this media; this is the number of time units that
pass in one second. For example, a time coordinate system that measures time in sixtieths of a
second has a time scale of 60.

duration is an integer that declares the duration of this media (in the scale of the timescale). If the
duration cannot be determined then duration is set to all 1s.

language declares the language code for this media. See ISO 639-2/T for the set of three character
codes. Each character is packed as the difference between its ASCII value and 0x60. Since the code
is confined to being three lower-case letters, these values are strictly positive.
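
Two of these fields need a small conversion before they are readable: the packed language code and the timestamps counted from 1904. The sketch below shows one way to do both. The names unpack_language, mp4_time and MP4_EPOCH are hypothetical; the 16-bit input to unpack_language is assumed to be the pad bit followed by the three 5-bit values, read as one big-endian integer.

```python
from datetime import datetime, timedelta, timezone

# Midnight, Jan. 1, 1904, UTC: the epoch used by creation_time
# and modification_time.
MP4_EPOCH = datetime(1904, 1, 1, tzinfo=timezone.utc)

def unpack_language(packed: int) -> str:
    """Decode the 16-bit pad + 3 x 5-bit language field into 3 letters."""
    # Each 5-bit value is the ASCII code minus 0x60.
    return "".join(chr(((packed >> shift) & 0x1F) + 0x60) for shift in (10, 5, 0))

def mp4_time(seconds: int) -> datetime:
    """Convert seconds since the 1904 epoch into a UTC datetime."""
    return MP4_EPOCH + timedelta(seconds=seconds)
```

For instance, the value 0x55C4 decodes to "und" (undetermined), and a duration of 90000 with a timescale of 30000 is 90000 / 30000 = 3 seconds.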

“tkhd” BOX（Track Header Box）

Definition:

Box Type: ‘tkhd’
Container: Track Box (‘trak’)
Mandatory: Yes
Quantity: Exactly one

Syntax:

aligned(8) class TrackHeaderBox extends FullBox(‘tkhd’, version, flags){
    if (version==1) {
        unsigned int(64) creation_time;
        unsigned int(64) modification_time;
        unsigned int(32) track_ID;
        const unsigned int(32) reserved = 0;
        unsigned int(64) duration;
    } else { // version==0
        unsigned int(32) creation_time;
        unsigned int(32) modification_time;
        unsigned int(32) track_ID;
        const unsigned int(32) reserved = 0;
        unsigned int(32) duration;
    }
    const unsigned int(32)[2] reserved = 0;
    template int(16) layer = 0;
    template int(16) alternate_group = 0;
    template int(16) volume = {if track_is_audio 0x0100 else 0};
    const unsigned int(16) reserved = 0;
    template int(32)[9] matrix= { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 }; // unity matrix
    unsigned int(32) width;
    unsigned int(32) height;
}

When the duration is greater than INT32_MAX, version = 1, so the 64-bit time and duration fields are used.

version is an integer that specifies the version of this box (0 or 1 in this specification)
flags is a 24-bit integer with flags; the following values are defined:
Track_enabled: Indicates that the track is enabled. Flag value is 0x000001. A disabled track (the low
bit is zero) is treated as if it were not present.
Track_in_movie: Indicates that the track is used in the presentation. Flag value is 0x000002.
Track_in_preview: Indicates that the track is used when previewing the presentation. Flag value is
0x000004.
creation_time is an integer that declares the creation time of this track (in seconds since midnight,
Jan. 1, 1904, in UTC time)
modification_time is an integer that declares the most recent time the track was modified (in
seconds since midnight, Jan. 1, 1904, in UTC time)
track_ID is an integer that uniquely identifies this track over the entire life-time of this presentation.
Track IDs are never re-used and cannot be zero.
duration is an integer that indicates the duration of this track (in the timescale indicated in the Movie
Header Box). The value of this field is equal to the sum of the durations of all of the track’s edits. If
there is no edit list, then the duration is the sum of the sample durations, converted into the timescale
in the Movie Header Box. If the duration of this track cannot be determined then duration is set to all
1s.
layer specifies the front-to-back ordering of video tracks; tracks with lower numbers are closer to the
viewer. 0 is the normal value, and -1 would be in front of track 0, and so on.
alternate_group is an integer that specifies a group or collection of tracks. If this field is 0 there is no
information on possible relations to other tracks. If this field is not 0, it should be the same for tracks
that contain alternate data for one another and different for tracks belonging to different such groups.
Only one track within an alternate group should be played or streamed at any one time, and must be
distinguishable from other tracks in the group via attributes such as bitrate, codec, language, packet
size etc. A group may have only one member.
volume is a fixed 8.8 value specifying the track's relative audio volume. Full volume is 1.0 (0x0100) and
is the normal value. Its value is irrelevant for a purely visual track. Tracks may be composed by
combining them according to their volume, and then using the overall Movie Header Box volume
setting; or more complex audio composition (e.g. MPEG-4 BIFS) may be used.
matrix provides a transformation matrix for the video; (u,v,w) are restricted here to (0,0,1), hex
(0,0,0x40000000).
width and height specify the track's visual presentation size as fixed-point 16.16 values. These need
not be the same as the pixel dimensions of the images, which is documented in the sample
description(s); all images in the sequence are scaled to this size, before any overall transformation of
the track represented by the matrix. The pixel dimensions of the images are the default values.
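
The flags and the fixed-point width and height are easy to misread in a hex dump, so here is a minimal decoding sketch. The function name and the constant names are hypothetical; the flag values are the ones listed above, and width and height are treated as 16.16 fixed-point numbers, i.e. the raw integer divided by 65536.

```python
TRACK_ENABLED = 0x000001
TRACK_IN_MOVIE = 0x000002
TRACK_IN_PREVIEW = 0x000004

def describe_tkhd(flags: int, width: int, height: int):
    """Interpret tkhd flags and 16.16 fixed-point presentation size."""
    return {
        "enabled": bool(flags & TRACK_ENABLED),
        "in_movie": bool(flags & TRACK_IN_MOVIE),
        "in_preview": bool(flags & TRACK_IN_PREVIEW),
        "width": width / 65536,    # 16.16 fixed point to float
        "height": height / 65536,
    }
```

A raw width of 0x07800000 is therefore 1920.0, and a flags value of 0x000003 means the track is enabled and used in the presentation but not in the preview.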

“mvhd” BOX（Movie Header Box）

Definition:

Box Type: ‘mvhd’
Container: Movie Box (‘moov’)
Mandatory: Yes
Quantity: Exactly one

This box defines overall information which is media-independent, and relevant to the entire presentation
considered as a whole.

Syntax:

aligned(8) class MovieHeaderBox extends FullBox(‘mvhd’, version, 0) {
    if (version==1) {
        unsigned int(64) creation_time;
        unsigned int(64) modification_time;
        unsigned int(32) timescale;
        unsigned int(64) duration;
    } else { // version==0
        unsigned int(32) creation_time;
        unsigned int(32) modification_time;
        unsigned int(32) timescale;
        unsigned int(32) duration;
    }
    template int(32) rate = 0x00010000; // typically 1.0
    template int(16) volume = 0x0100; // typically, full volume
    const bit(16) reserved = 0;
    const unsigned int(32)[2] reserved = 0;
    template int(32)[9] matrix = { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 }; // Unity matrix
    bit(32)[6] pre_defined = 0;
    unsigned int(32) next_track_ID;
}

When duration >= MAX_INT, version = 1. The duration and timescale information in this header is taken from the corresponding information of the longest track in the current file.

version is an integer that specifies the version of this box (0 or 1 in this specification)
creation_time is an integer that declares the creation time of the presentation (in seconds since
midnight, Jan. 1, 1904, in UTC time)
modification_time is an integer that declares the most recent time the presentation was modified (in
seconds since midnight, Jan. 1, 1904, in UTC time)
timescale is an integer that specifies the time-scale for the entire presentation; this is the number of
time units that pass in one second. For example, a time coordinate system that measures time in
sixtieths of a second has a time scale of 60.
duration is an integer that declares length of the presentation (in the indicated timescale). This property
is derived from the presentation’s tracks: the value of this field corresponds to the duration of the
longest track in the presentation. If the duration cannot be determined then duration is set to all 1s.
rate is a fixed point 16.16 number that indicates the preferred rate to play the presentation; 1.0
(0x00010000) is normal forward playback
volume is a fixed point 8.8 number that indicates the preferred playback volume. 1.0 (0x0100) is full
volume.
matrix provides a transformation matrix for the video; (u,v,w) are restricted here to (0,0,1), hex values
(0,0,0x40000000).
next_track_ID is a non-zero integer that indicates a value to use for the track ID of the next track to be
added to this presentation. Zero is not a valid track ID value. The value of next_track_ID shall be
larger than the largest track-ID in use. If this value is equal to all 1s (32-bit maxint), and a new media
track is to be added, then a search must be made in the file for an unused track identifier.
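
Putting the syntax and semantics together, the following sketch parses an ‘mvhd’ body for both versions. The function name parse_mvhd and the returned dictionary keys are hypothetical, and the payload is assumed to start at the version byte, with the 8-byte box header already removed. It does not special-case a duration of all 1s.

```python
import struct

def parse_mvhd(payload: bytes):
    """Parse an 'mvhd' box body that starts at the version byte (assumption)."""
    version = payload[0]
    if version == 1:
        # 64-bit times and duration, 32-bit timescale.
        _ctime, _mtime, timescale, duration = struct.unpack_from(">QQIQ", payload, 4)
        offset = 4 + 28
    else:
        _ctime, _mtime, timescale, duration = struct.unpack_from(">IIII", payload, 4)
        offset = 4 + 16
    rate, volume = struct.unpack_from(">ih", payload, offset)
    # Skip rate(4) + volume(2) + reserved(2 + 8) + matrix(36) + pre_defined(24).
    (next_track_id,) = struct.unpack_from(">I", payload, offset + 76)
    return {
        "version": version,
        "timescale": timescale,
        "duration": duration,
        "seconds": duration / timescale,
        "rate": rate / 65536,    # 16.16 fixed point
        "volume": volume / 256,  # 8.8 fixed point
        "next_track_ID": next_track_id,
    }
```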

0 0