
http://www.f265.org/f265/static/txt/h265_companion.html


H.265 Companion

Purpose and organization of this document

This document contains human-readable information about the more complex parts of the H.265 specification. It is intended to be read side-by-side with the specification text to make it easier to understand. The document assumes significant knowledge about video encoding. This is not a tutorial.

This document is divided into three parts. The first part describes the general concepts. The second contains a short description of some bitstream elements. The last part follows the specification clause-by-clause and summarizes what it is saying as clearly as possible.

You may want to start by reading "Overview of the High Efficiency Video Coding (HEVC) Standard". It's a well-written article. The other papers referenced throughout this document are also good.

Current revision

Last updated: 27 June 2013 (specification draft 10, version 34).
Previous update: 22 November 2012 (specification draft 9).
Undocumented features: SVC, SEI, VUI.
Contact: laurent.birtz@vantrix.com
              francois.caron@vantrix.com

Notation conventions (IMPORTANT)

In the H.265 specification text:

  • Name all in lower case with underscores (like_this):
    • This is a bitstream element.
    • In bold: bits are written at this location in the bitstream.
    • Not in bold: reference to an element declared earlier. Nothing is written (the element was previously read).
  • Name in camel case (LikeThis and likeThis):
    • This is a variable used to hold a value.
    • Lower case first letter (likeThis): local variable.
    • Upper case first letter (LikeThis): global variable.

In this document text:

  • The notation is always informal and the meaning depends on the context.
  • Class-like notation (a.b):
    • Refer to object 'b' somehow related to object 'a'.
    • Example: refIdx.Frame means "the frame associated to the reference index".

A short rant about the specification

This document should not exist. A specification is written for humans. Since readers far outnumber writers, it follows that a specification should be both precise and readable to avoid wasting people's time.

The authors seem to write for an obfuscation contest. They apply the reversed DRY principle: Do Repeat Yourself (even HM is copy-pasted at 75%). They embrace delocality of reference: information is dispersed all over the place. Further, they use circular definitions such as "A is an object contained in B, B contains objects of type A".

Part I.
General concepts

A reader familiar with video compression may quickly skim through this part.

Image

An image (frame, picture) is a rectangular region containing pixels (pels, samples) to compress. The term "natural image" refers to an image exhibiting natural behavior with regards to lighting. In other words, neighbouring pixels in a natural image should exhibit strong similarities, unless they belong to different "objects" (foreground/background).

There are three components per image: luma (Y), chroma blue (Cb, U), and chroma red (Cr, V). The Y component is associated to gray levels, which enables the human visual system (HVS) to distinguish shapes. The latter two components (Cb and Cr) represent a transition away from gray towards blue and red.

The selection of (Y,Cb,Cr) triplets over the primary (Red,Green,Blue) triplets captured by the HVS has to do with the fact that we have more rods than cones. This means we are more sensitive to gray levels, and less sensitive to colours. Taking advantage of this property, it is possible to subsample (keep fewer) chrominance values without a perceptible loss of information. If we were to stay in the RGB space, it wouldn't be possible to subsample values. What colour would we prefer over the other two?

Image compression uses 4:4:4, 4:2:2, 4:2:0, or 4:0:0 subsampling. The former specifies full resolution (all chrominance samples are kept), while the latter indicates a monochrome image (no chrominance samples are kept). 4:2:2 and 4:2:0 indicate that a subset of chrominance values are kept. 4:2:2 specifies half horizontal resolution and full vertical resolution. 4:2:0 specifies half resolution in both directions. The subsampling patterns are presented below.

         4:4:4            4:2:2             4:2:0             4:0:0
    +--+--+--+--+    +--+--+--+--+     +--+--+--+--+     +--+--+--+--+
    |Y |Y |Y |Y |    |Y |Y |Y |Y |     |Y |Y |Y |Y |     |Y |Y |Y |Y |
    |Y |Y |Y |Y |    |Y |Y |Y |Y |     |Y |Y |Y |Y |     |Y |Y |Y |Y |
    |Y |Y |Y |Y |    |Y |Y |Y |Y |     |Y |Y |Y |Y |     |Y |Y |Y |Y |
    |Y |Y |Y |Y |    |Y |Y |Y |Y |     |Y |Y |Y |Y |     |Y |Y |Y |Y |
    +--+--+--+--+    +--+--+--+--+     +--+--+--+--+     +--+--+--+--+
    |Cb|Cb|Cb|Cb|    |Cb|  |Cb|  |     |Cb|  |Cb|  |     |  |  |  |  |
    |Cb|Cb|Cb|Cb|    |Cb|  |Cb|  |     |  |  |  |  |     |  |  |  |  |
    |Cb|Cb|Cb|Cb|    |Cb|  |Cb|  |     |Cb|  |Cb|  |     |  |  |  |  |
    |Cb|Cb|Cb|Cb|    |Cb|  |Cb|  |     |  |  |  |  |     |  |  |  |  |
    +--+--+--+--+    +--+--+--+--+     +--+--+--+--+     +--+--+--+--+
    |Cr|Cr|Cr|Cr|    |Cr|  |Cr|  |     |Cr|  |Cr|  |     |  |  |  |  |
    |Cr|Cr|Cr|Cr|    |Cr|  |Cr|  |     |  |  |  |  |     |  |  |  |  |
    |Cr|Cr|Cr|Cr|    |Cr|  |Cr|  |     |Cr|  |Cr|  |     |  |  |  |  |
    |Cr|Cr|Cr|Cr|    |Cr|  |Cr|  |     |  |  |  |  |     |  |  |  |  |
    +--+--+--+--+    +--+--+--+--+     +--+--+--+--+     +--+--+--+--+
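
For illustration, here is a minimal C sketch (the helper name and layout are ours, not specification elements) that computes the chroma plane dimensions implied by each subsampling mode:

    #include <stdio.h>
    #include <string.h>

    /* Chroma plane dimensions implied by each subsampling mode
     * (a hypothetical helper, not from the specification). */
    static void chroma_dims(const char *mode, int luma_w, int luma_h,
                            int *w, int *h)
    {
        if      (!strcmp(mode, "4:4:4")) { *w = luma_w;     *h = luma_h;     }
        else if (!strcmp(mode, "4:2:2")) { *w = luma_w / 2; *h = luma_h;     }
        else if (!strcmp(mode, "4:2:0")) { *w = luma_w / 2; *h = luma_h / 2; }
        else                             { *w = 0;          *h = 0;          } /* 4:0:0 */
    }

    int main(void)
    {
        int w, h;
        chroma_dims("4:2:0", 1920, 1080, &w, &h);
        printf("4:2:0 chroma plane: %dx%d\n", w, h);  /* 960x540 */
        return 0;
    }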

When 4:4:4 sampling is used, each component may be coded independently as if we were dealing with three different images (separate colour planes). This means that the samples of a component will be processed before the next component is coded. The alternative is to interleave the components; block-by-block, the Y, Cb, then Cr values are coded. This is how the components are processed when using 4:2:2 and 4:2:0 subsampling (H.265 restricts the use of separate colour planes to 4:4:4 sampling).

The bit depth determines how many bits are used to store each pixel. The range varies from 8 to 14. As the image resolution gets bigger (beyond HD), more precision is required for each sample to better capture the subtle changes. The bit depth for luma pixels may differ from the bit depth of chroma pixels.

Image compression

Image compression combines lossless and lossy coding. Each square block of pixels is first predicted using spatiotemporal information. The prediction is then subtracted from the pixels (lossless operation), and the resulting residual samples are transformed (lossless operation) and quantized (lossy operation). Graphically, the process looks like this.

                      Transform 1D (lossless)

     Residual  Transform   Coeffs      Coeffs     Inverse      Residual
     [32 21] * [1  1]  =  [53 11]     [53 11] * [1/2  1/2]  =  [32 21]
     [ 1  3]   [1 -1]     [ 4 -2]     [ 4 -2]   [1/2 -1/2]     [ 1  3]

                Transform 1D + quantization (lossy)

          Coeffs   Quant  Coeffs'    Coeffs'     Dequant
          [53 11] * 1/4 = [13 2]     [13 2] * 4 = [52 8]
          [ 4 -2]         [1  0]     [1  0]       [4  0]

               Dequant    Inverse      Residual'
               [52 8] * [1/2  1/2]  =  [30 22]
               [4  0]   [1/2 -1/2]     [2   2]

                   Quantized coefficient scan

          Coeffs'  Scan order     Bitstream non-zero coefficients.
          [13 2]     [0 2]     =>         [13, 1, 2]
          [1  0]     [1 3]

          Note: this is a simplified view. In reality, the transform is 2D
          (transform 1D + transpose + transform 1D) and a bias is added
          before the quantization division (2 in the example).

Two transformation methods are supported: the discrete cosine transform (DCT) and the discrete sine transform (DST). Both transforms result in a square block of coefficients that are more suitable for quantization than raw pixels, as they "compact" energy in a few meaningful coefficients. Then, the quantization reduces the size of those coefficients so that they take fewer bits to encode. Typically, most quantized coefficients are zero and the non-zero coefficients are regrouped in one portion of the block. The quantized coefficients are scanned in an order that regroups the non-zero coefficients together and the resulting array is sent to the decoder.
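
The toy example in the figure above can be written out in C. The sketch below reproduces the figure exactly: a 1D butterfly transform, truncating quantization by 4 (no bias, as in the figure), dequantization and inverse transform. It is didactic only, not the actual HEVC transform (which uses larger integer DCT/DST matrices):

    #include <stdio.h>

    int main(void)
    {
        int res[2][2] = { {32, 21}, {1, 3} };  /* residual rows */
        int coef[2][2], quant[2][2], deq[2][2], rec[2][2];

        for (int i = 0; i < 2; i++) {             /* forward transform, per row */
            coef[i][0] = res[i][0] + res[i][1];   /* [1  1] */
            coef[i][1] = res[i][0] - res[i][1];   /* [1 -1] */
        }
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++) {
                quant[i][j] = coef[i][j] / 4;  /* lossy: 53->13, 11->2, -2->0 */
                deq[i][j]   = quant[i][j] * 4;
            }
        for (int i = 0; i < 2; i++) {             /* inverse transform */
            rec[i][0] = (deq[i][0] + deq[i][1]) / 2;
            rec[i][1] = (deq[i][0] - deq[i][1]) / 2;
        }
        printf("reconstructed: [%d %d; %d %d]\n",
               rec[0][0], rec[0][1], rec[1][0], rec[1][1]);  /* [30 22; 2 2] */
        return 0;
    }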

There are two pixel prediction methods: intra and inter. In intra prediction, the pixels of a block are predicted from the pixels of the neighbouring blocks (spatial prediction). In inter prediction, the pixels of a block are predicted using the pixels of a previously encoded frame (a reference frame) at or near the same location (temporal prediction). A motion vector indicates the location of the block in the frame previously encoded.

All the information transmitted in the bitstream (residual, motion vectors, etc.) is further compressed by entropy coding, which eliminates the remaining redundancies in the data (CABAC is the only method used in H.265, unlike H.264, which offered a choice between CAVLC and CABAC).

The decoder performs the steps described above in reverse order. It parses the bitstream, predicts the pixel values (this mimics the encoding process), dequantizes and transforms the extracted coefficients to obtain the residual, and then adds the residual to the predicted pixel values. The resulting image is called the reconstructed image prior to deblocking filter. Note that the deblocking filter may be deactivated. The encoder also performs these steps (except for the parsing process), to obtain the same reconstructed image. This ensures that the inter predicted samples the decoder obtains are identical to the ones obtained by the encoder.

There are two filters applied on the reconstructed image. The deblocking filter smoothes out the image to remove the artifacts caused by the block-based transform. Then, the sample adaptive offset (SAO) filter corrects small errors in the deblocked image.

Coding tree blocks and coding blocks

To manage complexity, each image is split into small chunks called coding tree blocks (CTB). All CTBs are square and have the same size. The CTB size (64x64, 32x32, 16x16 or 8x8) is specified in the active sequence parameter set (SPS). CTBs replace the fixed size 16x16 macroblocks used in previous standards. The optimal size of the CTBs is tightly coupled to the image resolution.

    "The support of larger CTBs than in previous standards is particularly      beneficial when encoding high-resolution video content."                                          - Overview of the HEVC standard

It is a safe assumption to consider that the size of the CTBs will not change during the encoding process. To properly change the size of the CTBs, the encoder needs to send a new SPS, and then send an instantaneous decoding refresh (IDR) picture to flush all the pictures currently held in the decoded picture buffer. This situation is more likely to occur in a decoder that can choose from multiple incoming streams (a set-top box is a good example).

The CTB is split into square coding blocks (CB) of size 64x64, 32x32, 16x16, 8x8 recursively with a quadtree (depth-first). The result is called the partitioning. A smaller CB has more overhead but it allows greater control over how the pixels are encoded. A CB is either all intra or all inter. The maximum and minimum CB sizes are specified in the bitstream. The maximum CB size is also implicitly the CTB size.

         +---------+
         |abee|kkkk|           Example partitioning.
         |cdee|kkkk|
         |ffgh|kkkk|  Each different letter is a coding block (CB).
         |ffij|kkkk|
         |----+----|  The CTB is encoded depth-first [01] recursively.
         |llll|mmnn|                                 [23]
         |llll|mmnn|
         |llll|oopq|
         |llll|oors|
         +---------+
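
A minimal sketch of this depth-first traversal (split_flag() is a hypothetical callback standing in for the parsed or decided split flag; the names are ours):

    #include <stdio.h>

    typedef int (*split_fn)(int x, int y, int size);

    /* Depth-first quadtree traversal of a CTB, mirroring the
     * [01]/[23] sub-block order shown above. */
    static void code_cb(int x, int y, int size, int min_size, split_fn split_flag)
    {
        if (size > min_size && split_flag(x, y, size)) {
            int h = size / 2;
            code_cb(x,     y,     h, min_size, split_flag);  /* 0: top-left */
            code_cb(x + h, y,     h, min_size, split_flag);  /* 1: top-right */
            code_cb(x,     y + h, h, min_size, split_flag);  /* 2: bottom-left */
            code_cb(x + h, y + h, h, min_size, split_flag);  /* 3: bottom-right */
        } else {
            printf("code CB (%d,%d) %dx%d\n", x, y, size, size);
        }
    }

    static int split_above_16(int x, int y, int size)
    {
        (void)x; (void)y;
        return size > 16;  /* toy policy: split everything down to 16x16 */
    }

    int main(void)
    {
        code_cb(0, 0, 64, 8, split_above_16);  /* 64x64 CTB, 8x8 minimum CB */
        return 0;
    }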

The CTBs need not be aligned on the right and bottom image boundaries. For instance, consider an image 100 pixels wide and a 64x64 CTB size. Then, the total CTB width is 2*64 = 128 and there are 128-100=28 pixels that are outside the right boundary.

                    100                    Boundary
          0     64   |  128                   |
          +------+---|--+              +-----+|+-----+
          |      |   |  |              | Full|||Empty|
          |      |   |  |              +-----+|+-----+
          +------+---|--+              |   Par|tial  |
                     |                 |      |      |
                                       +------|------+

A block is full if all its pixels are inside the image, empty if all its pixels are outside, else it is a partial block. It is mandatory to split partial or empty CBs recursively up to the minimum CB size. In the example, the second CTB may contain a full 32x32 CB to the left (64 + 32 = 96 < 100), necessarily followed by four 8x8 CBs to the right (assuming a minimum CB size of 8x8). One of those four 8x8 CBs would be partial and the others would be empty. Empty CBs are not included in the bitstream.

                                104
                   64       96   |     128
                   +--------+---|------+     f: full block
                   |f f f f | p | e e e|     p: partial block
                   ....                      e: empty block
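
In code, the classification is a simple bounds check. A sketch (the names are ours):

    #include <stdio.h>

    /* Classify a block against the real image boundary, following the
     * full/partial/empty terms defined above. */
    enum block_kind { FULL, PARTIAL, EMPTY };

    static enum block_kind classify(int x, int y, int size, int img_w, int img_h)
    {
        if (x >= img_w || y >= img_h)
            return EMPTY;                            /* no pixel inside */
        if (x + size <= img_w && y + size <= img_h)
            return FULL;                             /* every pixel inside */
        return PARTIAL;
    }

    int main(void)
    {
        /* The 100-pixel-wide example: 8x8 CBs at x = 96 and x = 104. */
        printf("%d %d\n", classify(96, 0, 8, 100, 64),   /* PARTIAL (1) */
                          classify(104, 0, 8, 100, 64)); /* EMPTY (2) */
        return 0;
    }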

The size of an image is sent in luma samples. That value must be divisible by the smallest CB size. The syntax elements pic_width_in_luma_samples and pic_height_in_luma_samples are sent in the SPS. If the real (displayed) image is not aligned on the minimum CB boundary, it is possible to specify a "crop" offset to discard the unwanted pixels. In the example, the image size in the bitstream is 13*8=104 and the crop offset is 104-100=4. Technically, there is a crop offset for each of the four edges and its value can be as large as desired.

When the real image size is not aligned on the minimum CB boundary, the image is padded by replicating the rightmost/bottommost pixel of the image up to the minimum CB boundary. The encoder then encodes the partial CB as any other CB. Technically, the specification doesn't require this specific padding method; it merely requires that the padding be done somehow, e.g. by setting the unwanted pixels to grey. However, inter prediction may refer to pixels located outside the reference frame and in that case, the padding-by-replication behavior is assumed.
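
A sketch of padding-by-replication for an 8-bit plane (assuming the buffer is already allocated at the coded size; the function name is ours):

    #include <string.h>

    /* Pad a plane by replicating the rightmost column and the bottom
     * row up to the coded size. The plane is row-major with 'stride'
     * bytes per row. */
    static void pad_replicate(unsigned char *plane, int stride,
                              int real_w, int real_h,
                              int coded_w, int coded_h)
    {
        for (int y = 0; y < real_h; y++)        /* extend each row rightward */
            memset(plane + y * stride + real_w,
                   plane[y * stride + real_w - 1], coded_w - real_w);
        for (int y = real_h; y < coded_h; y++)  /* replicate the bottom row */
            memcpy(plane + y * stride, plane + (real_h - 1) * stride, coded_w);
    }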

The encoding of a partial CB is generally inefficient because the padding disrupts inter prediction (no block in the reference frame matches the padded block, which causes expensive intra prediction). Presumably this is one reason why the specification enforces CB splitting rather than requiring the image to be padded up to the CTB boundary. A crude encoder can ignore the splitting issue altogether by aligning the image to the CTB boundary and by using a large crop offset.

The specification makes a distinction between a coding block (CB) and a coding unit (CU) (same for prediction block/unit, transform block/unit). The CB refers to the pixels of one image component (e.g. luma) and the CU refers to the pixels of all three image components together. This document never makes the distinction, as the same process (prediction, transform, quantization) applies to the Y, Cb, and Cr components.

Prediction and transform blocks

A CB can be split differently for prediction and transformation. The prediction split determines the size of the blocks for intra or inter prediction. This split is non-recursive (no sub split). The transform split determines the size of the blocks for transforming. This split is recursive up to 3 levels deep. Both splits apply to both luma and chroma. The different split shapes are as follows.

       unsplit   horizontal  vertical     both   <= Split line location.
        UN          H2         V2         HV    <= This document name.
       2Nx2N       2NxN       Nx2N        NxN   <= Specification name.
       +--+        +--+       +--+       +--+
       |aa|        |aa|       |ab|       |ab|      H1: 1/4 of the block.
       |aa|        |bb|       |ab|       |cd|      H2: 1/2 of the block.
       +--+        +--+       +--+       +--+      H3: 3/4 of the block.

         H1         H3         V1         V3   <= Asymmetric motion partitions
       2NxnU      2NxnD      nLx2N      nRx2N     (AMP).
       +----+     +----+     +----+     +----+
       |aaaa|     |aaaa|     |abbb|     |aaab|
       |bbbb|     |aaaa|     |abbb|     |aaab|
       |bbbb|     |aaaa|     |abbb|     |aaab|
       |bbbb|     |bbbb|     |abbb|     |aaab|
       +----+     +----+     +----+     +----+

The prediction split has the following restrictions:

  • Intra prediction allows only square splits (UN and HV).
  • Blocks smaller than 4x4 are not allowed.
  • Inter prediction cannot use 4x4 blocks (but 8x4 and 4x8 are allowed).
  • The HV split (inter or intra) is not allowed unless the CB has the minimum size. For example, if the minimum CB size is 16x16, then a 16x16 CB can use the HV split. If the minimum CB size is 8x8, then a 16x16 CB with HV must be converted into four 8x8 CBs with UN. Given the 4x4 inter restriction, it follows that if the minimum CB size is 8x8, the inter HV split cannot be used anywhere.
  • The AMP partitions cannot be used if AMP is disabled or if the CB has the minimum size.

The transform split has the following restrictions:

  • Square splits only (UN and HV).
  • The transform sizes are 32x32, 16x16, 8x8 and 4x4 (no 64x64).
  • The maximum and minimum transform sizes are specified in the bitstream.
  • An intra TB cannot be larger than the intra PB at this location. For example, an 8x8 intra PB may use the 8x8 or 4x4 transform, but a 4x4 intra PB requires the use of the 4x4 transform.

The chroma residual is encoded alongside the luma residual. For 4:2:0, the chroma transform size is half the luma transform size (for example, 8x8 Y, 4x4 U, 4x4 V). In the case of 4:2:0 and the 4x4 luma transform block, since 2x2 chroma blocks are not allowed, the first three 4x4 luma residual blocks are encoded without a chroma residual block and the fourth 4x4 luma residual block is encoded alongside a 4x4 chroma residual block.

A CB is either all intra or all inter, but a CTB may contain a mixture of both. A CB depends on its reconstructed CB neighbours for intra prediction.

Tiles and slices

An image is also partitioned into tiles and slices. At the minimum, an image contains one tile and one slice. Tiles are rectangular. Slices flow left-to-right, top-to-bottom. A slice or tile contains an integer number of CTBs, i.e. CTBs do not cross slice or tile boundaries. Either a tile fully contains multiple slices or a slice fully contains multiple tiles. Both cases can happen at different locations in the same image. Example:

        0     1
     +-----------+   There are four tiles (0..3) and five slices (A..E).
     |AAAAA|DDDDD|   Tile 0 fully contains slices A, B, C.
     |AAABB|DDDDD|   Tile 1 fully contains slice D (and vice-versa).
     |BBBBC|DDDDD|   Slice E fully contains tiles 2 and 3.
     |CCCCC|DDDDD|
     +-----+-----+
     |EEEEE|EEEEE|
     |EEEEE|EEEEE|
     |EEEEE|EEEEE|
     |EEEEE|EEEEE|
     +-----+-----+
        2     3

Notice slices A, B, and C. Unlike tiles, slices do not have to be rectangular. Slices are CTB containers. An encoder may choose to limit the number of CTBs each slice may carry, or it could opt to limit the number of bytes it carries. In the second case, the slice will still carry an integer number of CTBs.

The following setup is illegal because slice A is not fully contained in a tile and it does not contain both tiles fully.

                               0   1
                            +---+---+
                            |AAA|ABB|
                            +---+---+
                                 ^
                                 Slice A crosses the tile boundary.

This restriction also applies to slice segments, described below.

Tiles and slices are encoded in raster scan (RS) order. The raster scan is defined as left-to-right, top-to-bottom. Example:

             2D image      Image scanned in raster scan
             +----+
             |0123|   ===>   [0123 4567 89AB CDEF]
             |4567|
             |89AB|
             |CDEF|
             +----+

The CTBs are encoded in tile scan order (TS, z-scan, depth scan) that is a hierarchical raster scan. Example:

            0  1      Image with four tiles, each containing four CTBs.
         +--+--+      The 16 CTBs are labeled 0..F.
         |01|23|
         |45|67|      The tiles are scanned in raster scan order.
         +--+--+      The CTBs in each tile are also scanned in that order.
         |89|AB|
         |CD|EF|        Tiles in TS: [ 0    1    2    3  ]
         +--+--+         CTBs in TS: [0145 2367 89CD ABEF]
            2  3
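
The tile scan is easy to express as four nested loops. This sketch prints the raster-scan CTB addresses in tile-scan order and reproduces the example above; it is not the specification's CtbAddrTsToRs derivation, which also handles unevenly sized tiles:

    #include <stdio.h>

    int main(void)
    {
        int pic_w = 4, pic_h = 4;   /* picture size in CTBs */
        int tile_w = 2, tile_h = 2; /* tile size in CTBs */

        for (int ty = 0; ty < pic_h; ty += tile_h)      /* tiles, raster order */
            for (int tx = 0; tx < pic_w; tx += tile_w)
                for (int y = ty; y < ty + tile_h; y++)  /* CTBs, raster order */
                    for (int x = tx; x < tx + tile_w; x++)
                        printf("%X", y * pic_w + x);    /* raster-scan address */
        printf("\n");  /* prints 0145236789CDABEF */
        return 0;
    }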

The tiles can be spaced evenly in the image. Alternatively, the size of each tile row and tile column can be specified explicitly. The layout of slices is arbitrary.

Tiles are used for parallelization purposes. A tile does not depend on the content of other tiles during encoding, which allows the tiles to be encoded and decoded in parallel with threads.

Slices are used for error resilience and packetization purposes. A slice does not depend on the content of other slices during encoding/decoding. If a slice is lost, the other slices can still be decoded and displayed correctly. Each slice is identified by the number of the first CTB it contains in raster scan.

A slice may be split into multiple small segments that fit in packets (e.g. smaller than the network MTU size). Each segment is stored in a network abstraction layer unit (NAL), which represents the content of the packet on the network. The first segment is an independent segment. The following segments are dependent segments. An independent segment contains a full slice header (SH). A dependent segment contains a minimal slice header. The full/minimal slice header contains the information needed to decode the slice/slice segment. In general, a segment cannot be correctly decoded if a previous segment is lost, due to missing dependencies (prediction info, CABAC state, etc.).

     "Basically, dependent slices provide fragmentation of regular      slices into multiple NAL units, to provide reduced end-to-end      delay by allowing a part of a regular slice to be sent out      before the encoding of the entire regular slice is finished."                         - Overview of HEVC High-Level Syntax and                           Reference Picture Management

Slices can also be used for parallelization since each slice is encoded/decoded independently, but in that case the encoding is less efficient due to the non-rectangular shape of slices (some spatial correlation is lost). Moreover, parallelization becomes even more difficult when slices are limited by an MTU size, as the start of the next slice depends on where the previous slice finished.

The entropy state (CABAC) is reset at slice and tile boundaries to ensure that the tiles and slices are independently decodable. Additionally, when a slice segment spans multiple tiles, the slice header specifies where the tiles start in the bitstream so that the tiles can be parsed in parallel. Those starting locations are called entry points and they represent offsets from the byte that follows the slice header after bitstream escaping. The first tile has no entry point since its offset is known to be zero. There is an entry point for each remaining tile, encoded as the offset from the previous offset (0 if no previous offset). Note that WPP affects entry points (see below).

Wavefront parallel processing (WPP)

WPP is another method that increases encoding/decoding parallelism. The CTBs are encoded in parallel as soon as their spatial dependencies are available.

   Spatial dependencies of            WPP CTB encoding. Same
      the current CTB                 number => in parallel
      +---+---+---+                        0123456
      |NW | N | NE|                        2345678
      +---+---+---+                        456789A     19 ticks parallel
      |W  |Cur|                            6789ABC     49 ticks sequential
      +---+---+                            89ABCDE
                                           ABCDEFG
                                           CDEFGHI

      CABAC context import
      +-+-+-+-+-+
      |A|B|X|X|X ...  CTB A resets the CABAC context (first CTB in image).
      +-+-+-+-+-+
      |C|D|X|X|X ...  CTB C imports the CABAC context after B was encoded.
      +-+-+-+-+-+
      |E|F|X|X|X ...  CTB E imports the CABAC context after D was encoded.
      +-+-+-+-+-+

When WPP is enabled, the CABAC context of the current row is imported after encoding the second CTB of the previous row. This removes the entropy coding dependency on the rest of the previous row. Indeed, a CTB may only be encoded after the top-right CTB has been encoded so a CTB row can be encoded when at least two CTBs have been encoded in the previous row. By importing the CABAC context of a row at the moment when parallel encoding can start at this row, the impact on the entropy coding efficiency is minimal with respect to the usual left-to-right, top-to-bottom entropy coding.
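
The two-CTB lag gives a simple formula for the earliest moment each CTB can start: with one CTB per tick per thread, CTB (x, y) can start at tick x + 2*y. A sketch reproducing the 7x7 diagram above (the digit string extends the hex digits past F, as the diagram does):

    #include <stdio.h>

    int main(void)
    {
        const char *dig = "0123456789ABCDEFGHIJ";
        int w = 7, h = 7;

        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++)
                putchar(dig[x + 2 * y]);  /* earliest start tick of (x,y) */
            putchar('\n');
        }
        printf("parallel: %d ticks, sequential: %d ticks\n",
               (w - 1) + 2 * (h - 1) + 1, w * h);  /* 19 vs 49 */
        return 0;
    }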

WPP affects the number of entry points in a slice, as follows.

  • If only tiles are used:
    • There is an entry point for each tile spanned by the slice segment, except for the first.
  • If only WPP is used:
    • There is an entry point for each CTB row spanned by the slice segment, except for the first.
    • If a slice spans only one CTB row, the slice may start and end anywhere in that CTB row (not necessarily at the beginning or the end of the row).
    • If a slice spans multiple CTB rows, the slice must start at the beginning of a CTB row but it may end anywhere. Presumably this is done so that a slice can be processed in one chunk without further synchronization, i.e. to avoid the following situation:
      • The second row of the slice is ready to be decoded because the slice contains an entry point for that row.
      • The first row of the slice is not ready to be decoded due to the dependency on the CABAC state of the previous slice.
    • The same applies for slice segments (replace "slice" by "slice segment" in the statements above). Presumably this is done so that segments can be parsed in one chunk.
  • If both tiles and WPP are used:
    • The behavior for both tiles and WPP is combined.
    • There is one entry point for each CTB row of each spanned tile.
    • The same restrictions apply.
    • Remember that a slice or slice segment may only span multiple tiles fully.

Although they are conceptually compatible, tiles and WPP currently cannot be used at the same time.

                  "For design simplicity, WPP is not allowed                    to be used in combination with tiles."                                   - Overview of the HEVC standard

A full CABAC reset is done when a tile/segment contains only one CTB column.

As far as the specification is concerned, WPP only affects the CABAC contexts and the entry points. There is no impact on other mechanisms such as intra prediction. The lag of two CTBs between successive rows ensures that the required prediction information is available.

Picture order count (POC)

The frames of a video need not be encoded in their display order. The frames are sent in the coding order, to prevent the decoder from having to buffer coded pictures until future pictures are received. The decoder can immediately process the coded pictures, but will have to reorder them for display. Example:

     Display order   Encoding order with one B frame
     0 1 2 3 4 5 6          0 2 1 4 3 6 5

                I    P    P     P
                0<---2<---4<----6            I: intra frame.
                 \   /\   /\   /             P: predicted frame.
                  \ /  \ /  \ /              B: bi-predicted frame.
                   1    3    5
                   B    B    B

Each picture has a picture order count (POC), which identifies its display order. Typically the POC is 0 for the first frame, 1 for the second, etc. However, the POCs need not be contiguous, they just need to form a monotonically increasing sequence. The POC of the first frame of the bitstream is inferred to be 0.

The POC of the frames is reset to 0 when an IDR frame is encoded. The IDR frame marks the beginning of a new video sequence, i.e. it assumes that no previous frames have been encoded. This is useful for seeking in the video.

Since POC values can grow large, they are encoded using a sliding window, as follows:

     Most significant bits (MSB) | Least significant bits (LSB) (4)
                            000 | 1100
                            000 | 1101    The LSB size is configurable.
                            000 | 1110    4..16 bits can be used to
                            000 | 1111    signal the LSBs.
                            001 | 0000

The slice header contains the LSB part of the POC. The decoder increments the MSB when it detects a rollover in the LSB value (see clause 8.3.1 for details). This scheme works as long as not too many frames are lost in a row, which would cause the decoder to get confused about the value of the MSB. The specification requires that at least 4 bits be used for the LSB, and up to 16 bits can be used. Using more bits reduces the probability that the decoder becomes confused about the value of the MSB.
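
A C rendering of the MSB derivation (see clause 8.3.1 for the normative version; prev_poc_lsb/prev_poc_msb come from the previous reference picture in decoding order, and max_poc_lsb is 2 to the power of the number of LSB bits):

    static int derive_poc(int poc_lsb, int prev_poc_lsb, int prev_poc_msb,
                          int max_poc_lsb)
    {
        int poc_msb;
        if (poc_lsb < prev_poc_lsb && prev_poc_lsb - poc_lsb >= max_poc_lsb / 2)
            poc_msb = prev_poc_msb + max_poc_lsb;  /* LSB rolled over */
        else if (poc_lsb > prev_poc_lsb && poc_lsb - prev_poc_lsb > max_poc_lsb / 2)
            poc_msb = prev_poc_msb - max_poc_lsb;  /* LSB rolled backward */
        else
            poc_msb = prev_poc_msb;
        return poc_msb + poc_lsb;
        /* Example from the figure: 4 LSB bits (max_poc_lsb 16), previous
         * POC 15 (MSB 0, LSB 1111), new LSB 0000 => MSB 16, POC 16. */
    }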

The specification requires that the POC fits in signed 32 bits. The POC of a frame can be negative, e.g. to display a frame before an IDR frame (which has POC 0). Furthermore, the difference in POCs between two frames in the decoded picture buffer (see below) must fit in signed 16 bits.

Slice types

A block can be encoded with intra prediction, inter single prediction or inter bi-prediction.

     Inter single prediction            Inter bi-prediction
      +----+      +----+          +----+      +----+      +----+
      |Ref0| >>>> |Cur |          |Ref0| >>>> |Cur | <<<< |Ref1|
      +----+      +----+          +----+      +----+      +----+

With inter single prediction, the pixels of the current block are predicted from a block of pixels located in one reference frame. The location is identified by a motion vector. The reference frame is identified by its POC number (this is a simplification, see reference frame lists).

With inter bi-prediction, the pixels of the current block are predicted by the weighted average of two blocks of pixels located in two reference frames. Each of the two reference blocks is identified by a motion vector and a reference frame POC. By default, each reference block contributes 50% of the value of the predicted pixels. In some cases this improves the prediction accuracy. The encoder can also transmit explicit weights for each motion vector.

Typically, when bi-prediction is used, one reference frame is a past frame (a frame displayed before the current frame) and the other is a future frame (a frame displayed after the current frame). However, the specification imposes no restrictions on the reference frames used for bi-prediction. For instance, bi-prediction can be used with two past frames or with the same reference frame and two different motion vectors.

In the bitstream, the choice between intra prediction, inter single prediction, and inter bi-prediction is encoded for every prediction block. This overhead adds up. When a prediction mode is not needed for an entire slice, it can be specified so in the slice header. An I slice cannot use inter prediction. A P slice cannot use inter bi-prediction. A B slice can use all prediction modes. Note that a frame can contain multiple slices of different types.

Decoded picture buffer (DPB) and reference picture set (RPS)

The decoded picture buffer (DPB) contains the frame being decoded and the frames used for reference (i.e. those used for inter prediction). The size of the DPB is linked to the level specified at the beginning of the bitstream so that the decoder can allocate memory for it. The size of the DPB guarantees that 6 pictures can be stored at all times (5 reference pictures plus the current picture). Up to 16 pictures can be stored depending on the image size (more than that actually, but this restriction limits the number of reference frames that can be used).

The DPB is managed just after the slice header of the current frame is parsed. In other words, this happens before the first pixel of the frame is decoded (unlike H.264). If the DPB is full, at least one frame needs to vacate the DPB to make room for the current frame.

The DPB management is done such that the state of the DPB can be obtained even if some previous frames are lost due to transmission errors. The specification uses the reference picture set (RPS) for that purpose. The RPS of the current frame specifies explicitly the POC of every reference frame that should be part of the DPB of the current frame (excluding the current frame itself). The decoder removes the reference frames that are not part of the RPS from the DPB. Since the RPS is self-contained in the slice header, the loss of a previous frame doesn't affect the RPS content and this helps to limit the propagation of errors.

The RPS also specifies, for each reference frame, whether the reference frame is used for reference by the current frame. In other words, one can specify that a reference frame might be used to decode a future frame although it is not actually needed to decode the current frame.

For additional information, see "Overview of HEVC High-Level Syntax and Reference Picture Management".

Reference frame lists

A reference frame list is an ordered list of frames used for reference by the current slice, along with their weighted prediction data. Each slice contains two reference frame lists, list 0 (L0) and list 1 (L1). An I slice uses no list, a P slice uses L0 and a B slice uses both L0 and L1.

Typically, L0 contains past frames and L1 contains future frames. However, the specification imposes no restriction on the content of a reference frame list. For example, a reference frame may appear in both L0 and L1 multiple times.

A reference index is an index in a reference frame list. It represents the data of the reference frame at that position in the list. For example, index 0 represents the data of the first reference frame in the list.

A block encoded by inter single prediction uses a reference frame in L0 or L1. A block encoded by inter bi-prediction uses one reference frame from L0 and one reference frame from L1. It is not possible to use two reference frames from L0 (or L1).

In the bitstream, the encoding of an inter predicted block is as follows:

  • 1 bit specifies whether the merge mode is used (see below).
  • 1 bit specifies whether single or bi-prediction is used (if not P slice and not merge mode).
  • If single prediction:
    • If B slice: 1 bit specifies the list used (L0 or L1).
    • If P slice: L0 is assumed.
    • Encode the reference index (if there is more than one reference frame in the list) with the motion vector data.
  • If bi-prediction:
    • Encode the reference index in the L0 list (if there is more than one reference frame in L0) with the L0 motion vector data.
    • Same for L1.

Each reference index has some associated weighted prediction data that affect the pixels of the reference frame (see weighted prediction below). Emphasis: the weighted prediction data is associated to the reference INDEX, not to the reference FRAME itself. A frame can appear both in its weighted and non-weighted version in the same reference list.

The content of L0 and L1 is specified in the slice header. The specification provides default values that the encoder can override.

Weighted prediction

Weighted prediction is used to scale the pixels of a reference frame. The formula for single prediction is as follows.

     ScaledPixel = ClipPix(((RefPixel * Numerator) >> Denominator) + Offset)

The formula for bi-prediction is similar.
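
In C, for 8-bit pixels, the formula reads as follows (a sketch; the specification also adds a rounding term, omitted here as in the formula above):

    /* The single-prediction weighting formula, 8-bit pixels. */
    static int clip_pix(int v)
    {
        return v < 0 ? 0 : (v > 255 ? 255 : v);
    }

    static int weight_pixel(int ref_pixel, int numerator, int denominator,
                            int offset)
    {
        return clip_pix(((ref_pixel * numerator) >> denominator) + offset);
    }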

Weighted prediction can be used to scale the pixels of a fading scene. It can also be used for temporal bi-prediction, where the weight given to each reference frame is based on the reference frame POC distance to the current frame.

Parameter Sets (VPS, SPS, PPS)

A parameter set (PS) contains a bunch of parameters that control the encoding of the current frame. The video parameter set (VPS) is currently unused. It is needed only for scalable video coding (SVC). A sequence parameter set (SPS) contains parameters that may change at IDR frames. The picture parameter set (PPS) contains parameters that may change every frame.

There can be 16 VPS, 16 SPS and 64 PPS. A PS is said to be active if the current frame uses its parameters. Only one VPS, SPS, PPS is active at a time. Each PS is stored in one NAL unit. The slice header of a frame specifies the identifier of a PPS, which specifies the identifier of a SPS, which specifies the identifier of a VPS.

Motion vector and reference index encoding

An inter prediction block has one or two associated reference indices and motion vectors (single prediction or bi-prediction). Encoding the reference indices and motion vectors explicitly has a high overhead so the specification defines compression tools for both.

The merge_flag (the first bit of the prediction block) determines whether the merge mode is used. When the merge mode is used, the decoder generates a list of candidates for the inter information. Each candidate specifies which reference lists are used (L0, L1 or both) and the motion vector and reference index in each list. The bitstream contains an index in the candidate list (merge_idx) that identifies the actual candidate used. The size of the candidate list is specified in the slice header. The candidate list includes some spatial candidates and possibly a temporal candidate. The neighbour blocks used to generate the spatial candidates are shown below.

         B2|    |B1|B0
         --+-------+--      A0 is the block to the south-west.
           |       |        A1 is the block to the west, downward.
           |Current|        B0 is the block to the north-east.
           | Block |        B1 is the block to the north, rightmost.
         --|       |        B2 is the block to the north-west.
         A1|       |
         --+-------+        Note that a neighbour block may cover more than one
         A0|                position when it is larger than the current block.

If the merge mode is not used, the reference list usage and the reference indices are encoded explicitly. For each motion vector, a list containing two predicted motion vector (PMV) candidates is generated. One candidate is predicted from the top neighbour and the other is predicted from the left neighbour. The bitstream contains a flag that identifies the actual PMV. The motion vector is then encoded as the residual from the predicted motion vector (MV - PMV) for the X,Y components. The most common residual values, 0 and 1, are encoded compactly.

Quantization parameter (QP) encoding

The QP determines the factor by which the coefficients are scaled down to reduce their size (the quantization factors are positive fractions smaller than 1). For 8-bit encoding, the QP is within the range [0, 51]. Increasing the QP by 1 decreases the bitrate by ~12% (+6 QP == half the bitrate).
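
A sketch of the QP scale: the quantization step roughly follows Qstep = 2^((QP-4)/6), so it doubles every +6 QP (QP 4 corresponds to a step of 1.0). The ~12% figure above is an empirical rule of thumb, not a formula from the specification:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        for (int qp = 0; qp <= 51; qp += 6)
            printf("QP %2d -> Qstep %.2f\n", qp, pow(2.0, (qp - 4) / 6.0));
        return 0;
    }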

The QP can be constant for the whole video or it can vary per slice or per block. When the QP varies per block, things get complicated. The specification divides the image in square blocks called quantization groups (QG). The PPS specifies the QG block size, which must be smaller or equal to the CTB size. The specification allows at most one QP change per QG and at most one QP change per CB (both constraints must be respected simultaneously).

The decoder needs the QP when it decodes a transform block that has non-zero coefficients. The QP of the current transform block is predicted from its left and top neighbours (this is a simplification). The encoder can specify an offset to correct this prediction if the current transform block is the first transform block that has non-zero coefficients in the QG and in the CB.

Scaling lists

Scaling lists are used to assign an individual quantization weight to each coefficient of a transform block. For example, the encoder can specify that during quantization the first coefficient of the transform block (the DC coefficient) is multiplied by 0.2 and the other coefficients (the AC coefficients) are multiplied by 0.1. The weights can be adapted to the video properties (with two-pass encoding) to increase quality.

The specification defines a scaling list for each transform size (4x4, 8x8, 16x16, 32x32), image component (Y, U, V) and prediction mode (intra, inter). In 4:2:0 there are no 32x32 chroma transforms, so in that case there are 3*3*2 + 1*1*2 = 20 scaling lists.

During dequantization, the values in a scaling list are scaled by the QP. The values double each time the QP increases by 6 (quantization applies the inverse scaling).

The specification defines two behaviors for scaling lists. If the scaling lists are disabled (scaling_list_enable_flag = 0 in SPS), then all the coefficients have the same weight for all the transform sizes (flat scaling lists).

Otherwise, the specification defines default values for each scaling list (those default scaling lists are NOT flat). The encoder can override the default values of each scaling list by specifying the values explicitly in the PPS or SPS with a compression scheme to reduce the size. The encoder can also specify that one scaling list has the same values as another scaling list with the same transform size. For example, the encoder can specify that the 8x8 luma inter scaling list is the same as the 8x8 luma intra scaling list.

In the bitstream, a scaling list for a 4x4 transform has 4*4=16 entries and a scaling list for an 8x8 transform has 8*8=64 entries. For the 16x16 and 32x32 transforms, specifying all the values explicitly is impractical because there are a large number of coefficients. Instead, the 16x16 and 32x32 scaling lists have 8*8=64 values and each value is shared by several coefficients: the X and Y coefficient coordinates are divided by 2 (16x16) or by 4 (32x32) to get the position in the 8x8 list, so each value covers a 2x2 or 4x4 group of coefficients.
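
A sketch of the lookup (the function name is ours):

    /* Scaling list value for the coefficient at (x, y) in a transform
     * of size 2^log2_size. The 16x16 and 32x32 lists store 8x8 = 64
     * values, each shared by a 2x2 or 4x4 group of coefficients. */
    static int scaling_value(const int *list, int log2_size, int x, int y)
    {
        if (log2_size <= 3)                   /* 4x4 (16 entries), 8x8 (64) */
            return list[(y << log2_size) + x];
        int shift = log2_size - 3;            /* 1 for 16x16, 2 for 32x32 */
        return list[((y >> shift) << 3) + (x >> shift)];
    }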

Bypassing prediction, quantization and transformation

The prediction, quantization and transformation steps can be bypassed at the coding or transform block level.

At the coding block level:

  • If pcm_flag (pulse code modulation) is true, the pixels are transmitted as-is or at a lower bit depth without prediction, quantization or transformation.
  • If cu_transquant_bypass_flag is true, the residual values are not quantized or transformed (this can be used for lossless coding).
  • If rqt_root_cbf is false, the residual values are zero.

At the transform block level:

  • For 4x4 blocks, if transform_skip_flag is true, the residual values are quantized but not transformed.

Deblocking filter

The deblocking filter (DF) removes the blocking artifacts caused by different encoding modes. For example, if two adjacent blocks are predicted from two different frames, then the predicted pixels of both blocks may differ significantly. The residual will correct some of this difference but a visible discontinuity may remain. The DF removes the discontinuity by smoothing out the pixels near the edge of both blocks. Up to four pixels on each side of the edge may be used for that purpose.

The DF smoothes the edges at prediction and transform block boundaries. In other words, if one draws the contour of every prediction block and transform block on the frame, the resulting lines are the filtered edges. However, for performance reasons, only the edges that lie on an 8-pixel boundary are filtered.

All the vertical edges of the frame are filtered first, then all the horizontal edges are filtered. This order is important since the result of the vertical filtering affects the horizontal filtering. Otherwise, the horizontal filtering is almost the same as the vertical filtering. The edges in one direction can be processed in parallel because the filtering only refers to four pixels on each side of an edge (no pixel overlap). The vertical and horizontal filtering can also be done on a CTB-by-CTB basis, so long as care is taken to filter the vertical edges before the horizontal edges.

The DF contains much logic to determine if and how much each edge should be filtered. For that purpose, each edge is considered to be four pixels long. The filtering process has three steps.

Firstly, the blocks on each side of the edge are considered to determine the boundary strength (BS). The BS controls the strength of the filtering. If one of the blocks is intra, the BS is set to 2 (stronger). Otherwise, if the inter blocks are predicted from different frames, or if the motion vectors of the two blocks are distant from each other, or if one of the blocks has a non-zero residual, the BS is set to 1 (weaker). Otherwise, the BS is set to 0 and no filtering occurs for the edge.
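A sketch of this first step (the inputs are precomputed booleans standing in for the conditions named above; the function name is ours):

    /* Boundary strength (BS) derivation, simplified from the text above. */
    static int boundary_strength(int p_is_intra, int q_is_intra,
                                 int different_refs, int mv_distant,
                                 int has_nonzero_residual)
    {
        if (p_is_intra || q_is_intra)
            return 2;   /* stronger filtering */
        if (different_refs || mv_distant || has_nonzero_residual)
            return 1;   /* weaker filtering */
        return 0;       /* no filtering for this edge */
    }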

Secondly, if the BS is non-zero, the pixels of the two blocks are considered to determine whether filtering is needed for the edge. The DF assumes the pixels on each side of the edge form a line (in increasing or decreasing values) and it measures the size of the discontinuity gap of the line at the edge. Consider the following picture.

                            |       9   Line of pixels
                            |     8     to the right.
                            |   7
                            | 6
                            ^
                            |
                            | Discontinuity at the edge (3).
                            |
                            v
                                        Left right
      Line of pixels    3  |   <=========  0123|6789   <= first row
      to the left.    2    |      Plot     XXXX|XXXX
                    1      |               XXXX|XXXX
                  0        |               YYYY|YYYY   <= fourth row

If the gap is small, there is likely a blocking artifact and the edge filtering proceeds. Otherwise, there is likely a discontinuity in the source frame itself and no filtering occurs. For performance reasons, the DF only considers the pixels of the first and fourth rows of pixels of the edge for this decision. The decision determines whether the edge is filtered, not which rows of the edge are filtered.

Finally, if the edge needs filtering, the DF determines the filtering strength for each row of pixels in the edge. The DF first checks if strong filtering is needed by analyzing the pixels of the first and fourth rows. If strong filtering is used, three pixels in each row are filtered. Otherwise, each row is analyzed independently to determine the number of pixels filtered (0, 1, or 2).

The previous discussion applies to luma edges. The chroma edges are deblocked if the luma BS is 2, using an algorithm similar to the one used for luma.

The article "HEVC Deblocking Filter" (published by IEEE) contains more thorough explanations on the design (mind the discrepancies with the current specification).

Sample adaptive offset

The sample adaptive offset (SAO) filter adds an offset to every pixel of the reconstructed frame based on the SAO parameters of the CTB that contains the pixel. Each CTB contains its own set of SAO parameters for each image component. The filter processes each pixel of each CTB independently in each image component.

The filter operates in a mode chosen for the CTB. In edge mode, the filter compares the current pixel with its two immediate neighbours in a direction chosen for the CTB.

    Horizontal   Vertical   Diagonal NW   Diagonal NE
                    N          N               N
       NCN          C           C             C          C is the current pixel
                    N            N           N           N is a neighbour pixel

For each neighbour pixel, the current pixel can be greater, equal or lower than the neighbour pixel. This leads to the following situations.

                               Legend
         >: current pixel greater than neighbour
         =: current pixel equal to neighbour
         <: current pixel lower than neighbour
         "= =" means the current pixel is equal to both neighbours
               The order is immaterial

          Two neighbour comparisons   Situation
                      = =               flat
                      > <               slope
                      < =              edge <=
                      > =              edge >=
                      < <              minimum
                      > >              maximum

The offset added to the current pixel depends on the situation. If the situation is "flat" or "slope", the offset is 0. Otherwise, the SAO parameters specify the offsets for the situations "edge <=", "edge >=", "minimum" and "maximum" (there are thus 4 offsets in the SAO parameters).

In band mode, the filter divides the current pixel by 8 (for 8-bit pixels) to obtain the band of the pixel. There are 256/8 = 32 possible bands. A set of four contiguous bands is chosen for the CTB (e.g. the bands 4..7). The offset added to the current pixel depends on the band of the pixel. If the band of the pixel is not within the chosen set, the offset is 0. Otherwise, the SAO parameters specify the offset associated to each band.
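
Both modes boil down to a small offset lookup per pixel. A sketch for 8-bit pixels, using the situation names from the table above (the names and layout are ours):

    /* Edge mode: classify the current pixel against its two
     * neighbours along the chosen direction. */
    static int sao_edge_offset(int cur, int n0, int n1, const int off[4])
    {
        if (cur < n0 && cur < n1)   return off[0];   /* minimum */
        if ((cur < n0 && cur == n1) ||
            (cur == n0 && cur < n1)) return off[1];  /* edge <= */
        if ((cur > n0 && cur == n1) ||
            (cur == n0 && cur > n1)) return off[2];  /* edge >= */
        if (cur > n0 && cur > n1)   return off[3];   /* maximum */
        return 0;                                    /* flat or slope */
    }

    /* Band mode: pick the offset of the pixel's band, if the band is
     * one of the four chosen for the CTB. */
    static int sao_band_offset(int cur, int first_band, const int off[4])
    {
        int band = cur >> 3;   /* 256/8 = 32 bands for 8-bit pixels */
        if (band < first_band || band > first_band + 3)
            return 0;
        return off[band - first_band];
    }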

Coefficient encoding

This section summarizes the encoding of the coefficients of a transform block. The article "High Throughput CABAC Entropy Coding in HEVC" (published by IEEE) contains a better explanation. The present text merely describes how the encoding proceeds, not why.

The coefficients of a transform block (TB) are encoded in sub blocks (SB) of 4x4=16 coefficients. Example:

      4x4 transform   8x8 transform   16x16 transform   4x4 sub block
         [SB]          [SB SB]        [SB SB SB SB]       [CCCC]
                       [SB SB]        [SB SB SB SB]       [CCCC] C=coefficient
                                      [SB SB SB SB]       [CCCC]
                                      [SB SB SB SB]       [CCCC]

The SBs and the coefficients in a SB are scanned in an order that depends on the prediction block type. There are three scanning orders, as follows:

         Up-right          Horizontal           Vertical
         [0259]             [0123]              [048C]
         [148C]             [4567]              [159D]
         [37BE]             [89AB]              [26AE]
         [6ADF]             [CDEF]              [37BF]

         This example shows the orders for a 4x4 block.
         The same pattern applies to other block sizes.

The up-right order is the general order. The horizontal and vertical orders are used for some intra predicted transform blocks when the angular prediction is mostly vertical or horizontal.

To clarify: the SBs of a TB are encoded in an order that depends on the prediction block type, and the coefficients within a SB are encoded in the same order. Recall that for intra prediction the transform block is smaller or equal to the prediction block, so there is no ambiguity.

The encoding proceeds in three steps.

  1. Identify the position of the last non-zero coefficient in the TB. Example:
          8x8 TB in up-right order

          [0259] [025 ]        <= The coefficient at position '5' in SB 2 is
     SB 0 [148C] [14  ] SB 2      the last non-zero coefficient of the TB. The
          [37BE] [3   ]           offset from the CB origin is (6, 0).
          [6ADF] [    ]

          [0259] [    ]
     SB 1 [148C] [    ] SB 3
          [37BE] [    ]
          [6ADF] [    ]

    The (X,Y) position of the last non-zero coefficient in the TB is encoded first to skip the encoding of the trailing zero coefficients.

  2. Identify the non-empty SBs:

    A SB is empty if all its coefficients are zero. The SB that contains the last non-zero coefficient of the TB is known not to be empty. The first SB of the TB is assumed not to be empty. For all the SBs between the first and the last SB of the TB in scan order, the bitstream contains a bit to indicate whether the SB is empty. In the example above, there is a bit with value 1 for SB 1 and no bits for SB 0, 2, 3.

  3. Encode the values of the coefficients in each non-empty SB:

    All the coefficients of a SB are processed in reverse scan order (last-to-first) in the following text.

    For each coefficient in the SB, a significance flag indicates whether the coefficient is non-zero (this is a simplification, there are inferences in some cases).

    For each non-zero coefficient in the SB, a "greater than 1" flag indicates whether the coefficient is greater than 1 in absolute value. Up to 8 such flags may be present in the SB (i.e. from the ninth non-zero coefficient onward, no flag is sent).

    If at least one coefficient is greater than 1, a "greater than 2" flag is transmitted for the last such coefficient to indicate whether that coefficient is greater than 2 in absolute value.

    The sign of each non-zero coefficient is transmitted with a flag.

    For each coefficient in the SB whose absolute value is still unknown, the delta from the inferred minimum value is transmitted. For example, if a coefficient has been marked greater than 2 with the flags above and its value is 5, then the delta transmitted is 5 - 3 = 2.

    There is an optional optimization to infer the sign of the first coefficient when there are many non-zero coefficients (this saves up to one bit per SB). The absolute values of the coefficients are added up. If the total is odd, the first coefficient is negative, otherwise it is positive. For example, if the absolute coefficient values are 5, 2, 7, 1, the total 5+2+7+1 = 15 is odd so the first coefficient has a negative value.

    This optimization works because the encoder can often change a coefficient value for another with little impact on the image distortion. For example, suppose an unquantized coefficient has value 10 and the quantization divides a coefficient by 4. 10/4 = 2 (rounded down), 10/4 = 3 (rounded up). On average it makes no difference if 2 or 3 is transmitted since 2*4=8 and 3*4=12 have the same distance (distortion) from 10.
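
A sketch of the parity rule for the sign inference (the function name is ours; the encoder must pick coefficient values whose absolute sum has the parity matching the sign it wants to hide):

    static int infer_first_sign(const int *abs_coeffs, int count)
    {
        int sum = 0;
        for (int i = 0; i < count; i++)
            sum += abs_coeffs[i];
        return (sum & 1) ? -1 : +1;  /* odd => negative, as described above */
    }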

Entropy coding

Entropy coding is the process of writing the syntax elements (e.g. part_mode) in the bitstream. There are two steps for each element. First, the element is binarised by converting the value of the element into a sequence of bits called 'bins'. The binarisation of an element depends on its type and also on the values of other elements. For example, the value of part_mode can be binarised as 1, 01, 001, etc. The binarisation differs depending on whether inter prediction and asymmetric motion partitions are allowed. The binarisation process is important, as the decoder uses the binarisation scheme to know when to stop reading bits. In the above example, a unary representation is used (a 1 indicates the end of the codeword). Thus, the decoder knows to stop reading when a 1 is found.
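
A sketch of the unary case from the part_mode example above (the decoder reads bins until it sees a 1; real binarisations are per-element and often truncated, so this is illustrative only):

    /* Decode a unary codeword: "1" => 0, "01" => 1, "001" => 2, ... */
    static int read_unary(const int *bins)
    {
        int value = 0;
        while (bins[value] == 0)  /* each 0 extends the codeword */
            value++;
        return value;             /* index of the terminating 1 */
    }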

The second step is the binary arithmetic coding (BAC) of the bins of the element. Binary arithmetic coding is outside the scope of the document. The following text describes how the BAC is implemented in the specification but it does not cover the underlying theory.

There is a probability context associated to each bin of the syntax element. The context specifies the predicted value of the bin (0 or 1) and the probability that the bin has this value. The BAC uses the context information to convert the sequence of bins into a shorter sequence of bits. Bogus example:

    Bin value:        1   0   1   1   0
    Predicted value:  1   0   0   1   1    =====>   output   101
    Probability:     75% 57% 62% 97% 52%    BAC

The probabilities in the contexts are initialized to known values when the encoding starts. Then, each time a bin is processed, the probability in the context is adjusted. If the bin has the predicted value, the probability increases slightly. Otherwise, the probability decreases by a larger amount. Over time the probability in the context stabilizes around the local probability observed in the video. If the probability falls under 50%, the predicted value (the most probable symbol) is inverted to keep the probability above 50%.
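
The following sketch illustrates the adaptation idea only. It is NOT the actual CABAC engine, which uses 64 integer probability states updated through small lookup tables rather than floating point; the update constants below are arbitrary:

    typedef struct { int mps; double prob; } bin_context;

    static void update_context(bin_context *ctx, int bin)
    {
        if (bin == ctx->mps)
            ctx->prob += 0.05 * (1.0 - ctx->prob);  /* small increase */
        else
            ctx->prob -= 0.20 * ctx->prob;          /* larger decrease */
        if (ctx->prob < 0.5) {                      /* keep prob >= 50%: */
            ctx->mps = !ctx->mps;                   /* invert the MPS */
            ctx->prob = 1.0 - ctx->prob;
        }
    }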

Some bins do not have a probability context. Instead, those bins are encoded directly into the bitstream to increase performance. For example, the values 0 and 1 for the sign of a coefficient are equiprobable, so the sign bin is encoded assuming a probability of 50%.

Frame types

The specification defines frame types to address the following use cases:

  • Seeking in a video stream.
  • Switching between video streams encoded at different bitrates on the fly.
  • Preventing the POC from growing indefinitely.

The specification distinguishes between key frames (called random access points or RAP) and non-key frames. A key frame contains only I slices so it can be decoded independently of the other frames in the stream. A non-key frame has a dependency on the frames present in its DPB. The decoder starts decoding at a key frame in the video stream.

The specification allows much flexibility for decoupling the display order from the encoding order. The following sequences are valid:

                           Legend

     Kx represents frame 'x' encoded as a key frame.
     Px represents frame 'x' encoded with P slices.
     Bx represents frame 'x' encoded with B slices.
     |  represents a dependency boundary. The frames to the left of the
        boundary do not depend on the frames to the right.
     GOP = group of pictures, i.e. a related sequence of frames.

   Display order            Encode order

   0 1 2 3 4 5 6  ===>  K0 P3 B1 B2 P5 B4|K6  (typical closed GOP).
   0 1 2 3 4 5 6  ===>  K0 P3 B1 B2 K6 B4 B5  (typical open GOP).
   0 1 2 3 4 5 6  ===>  K2 P1 P0 P3|K6 P4 B5  (with leading pictures).

As the example shows, the key frames can be reordered as any other frame. There are a few restrictions. Each non-key frame is associated to the key frame that precedes it in encode order. If a non-key frame precedes its associated key frame in display order, it is a leading frame, otherwise it is a trailing frame. The leading frames must be encoded before the trailing frames. Also, a non-key frame must be displayed after the key frame that precedes (in display order) its associated key frame and before the key frame that follows its associated key frame. Examples of illegal sequences:

  Display order    Encode order

    0 1 2    ===>   K1 P2 P0  (illegal because P2 is displayed after P0).
    0 1 2    ===>   K0 P2 K1  (illegal because P2 is displayed after K1).

There are other restrictions (most are listed in clause 7.4.1.2). Rule of thumb: if it doesn't make sense, it's not allowed.

The leading frames are subdivided into two groups. The decodable leading (DL) frames can be displayed after a seek. The skipped leading (SL) frames cannot be decoded after a seek and get discarded. Example:

                Display order      Encode order

                 0 1 2 3   ===>   K0 K3 B1 P2

          0    1    2===>3    P2 is predicted from K3 (DL).
           <===|========>     B1 is predicted from K0 and K3 (SL).

In the example, P2 is a DL frame because its reference frame K3 is available when the decoder seeks to its associated key frame K3. B1 is a SL frame because one of its reference frames (K0) is missing when the decoder seeks to its associated key frame K3. After seeking, the decoder silently discards the SL frames and displays the DL frames or the key frame itself.

A streaming server can switch between video streams on the fly to accommodate changing bandwidth requirements. This can be done in several ways. In all cases, the server identifies a key frame in the destination stream and sends this key frame instead of the current frame of the source stream. The server may change the type of the key frame to BLA (broken link access) to explicitly signal the switch. Upon receiving the BLA frame, the client flushes its DPB, discards any SL frame associated to the BLA frame and continues decoding the destination stream normally.

The three key frame types supported by the specification are IDR (instantaneous decoding refresh), CRA (clean random access) and BLA (broken link access). The decoder flushes its DPB and resets the POC to 0 when it decodes an IDR frame. The decoder processes a CRA frame as a normal frame if it is not seeking and if the CRA frame is not the first frame of the stream (e.g. truncated video file), otherwise the decoder processes the CRA frame as an IDR and discards the SL frames. The decoder processes a BLA frame as an IDR and discards the SL frames.

A stream need not contain an IDR frame, but an IDR frame must be used to reset the POC after 2 billion frames have been encoded. This makes sure that the POC can always be stored in a signed 32-bit value.

Part II.
Description of the bitstream elements

Some bitstream elements are described along with the concerned specification clauses to regroup the information.

Important variables used by the specification

  • cIdx: image component index (0 luma, 1 Cb, 2 Cr).
  • log2CbSize: log2 of the coding block size (3, 4, 5, or 6).
  • log2TrafoSize: log2 of the transform block size (2, 3, 4, or 5).
  • nCSL/nCSC: size of the coding block in pixels for luma/chroma.
  • nCS: alias to nCSL.
  • nS: alias to nCSL/nCSC, depending on the image component.
  • nT: size of the current transform block in pixels in the current image component.
  • predSamplesL/C: luma/chroma predicted pixels.
  • nPbW/nPbH: width/height of the current predicted block.
  • (xC,yC): X,Y location of the coding block from the source plane origin.
  • (xB,yB): X,Y location of the current block from the coding block origin. This is (0,0) for the first block.
  • (xP,yP): X,Y location of the current block from the source plane origin. In other words, this is (xC+xB, yC+yB). Used for inter.
  • (xT, yT): equivalent to (xP,yP) for the transform tree.
  • partIdx: partition index, i.e. the number of the block being predicted in the CB (0 for the first block of the CB).
  • trafoDepth: transform split level from the CB size (0 if not split, 1 if split in 4 blocks, 2 if split again in 4 blocks, etc.).
  • IntraSplitFlag: true if the partitioning mode is HV for intra.
  • intraPredMode: luma intra prediction mode for the current block. Add 'C' for chroma.
  • mvL0/mvL1: luma motion vector for each list.
  • mvCL0/mvCL1: chroma motion vector for each list.
  • mvdL0/mvdL1: luma motion vector difference (from the predicted MV) for each list and each component.
  • refIdxL0/refIdxL1: reference index for each list.
  • predFlagL0/predFlagL1: true if the L0/L1 list is used.
  • PredMode: prediction mode: 0 (inter), 1 (intra).
  • MvL0[X][Y], RefIdxL0[X][Y], PredFlagL0[X][Y]:
    • The variables above mapped to each X,Y pixel from the CB origin.
  • MvdL0[X][Y], IntraPredMode[X][Y]:
    • The variables above mapped to each X,Y pixel from the source plane origin.
  • mvpLX: predicted motion vector for list X.
  • mvpListLX: prediction motion vector list for list X.
  • currPic: the current picture (the specification isn't more specific).
  • RefPicList0/RefPicList1: content of the reference picture list.
  • ListX: alias to RefPicListX, with X being 0 or 1.
  • refPicLX with X as L/Cb/Cr: the array of pixels in a reference picture.
  • MaxNumMergeCand: maximum number of merge candidates.
  • Inter block neighbours:
    • Spatial neighbours: A0, A1, B0, B1, B2.
    • Temporal neighbours: Col.
    • The neighbour name (or 'N' for a generic neighbour) is appended to the variable names above to refer to a variable in those blocks. For example, refIdxL0A0 or refIdxL0N.
    • availableN: true if this neighbour is available according to 6.4.2 (inside the frame, etc.).
    • availableFlagN: true if this neighbour is available for prediction according to more rules than just 6.4.2.

7.2: Specification of syntax functions and descriptors

Syntax units:

  • ae(v): CABAC-style bins.
  • b(8): one byte.
  • f(n): n bits.
  • u(n): unsigned integer with n bits. Same as f(n).
  • se(v): signed Exp-Golomb code.
  • ue(v): unsigned Exp-Golomb code.

7.4.3.1: Video parameter set RBSP semantics

  • video_parameter_set_id: value between 0 and 15 used by other elements to refer to a particular VPS.
  • vps_max_sub_layers_minus1: number of temporal sublayers (between 0 and 6).

    Note that temporal sublayers are used to achieve temporal scalability (scalable video coding).

  • vps_max_dec_pic_buffering: size of the DPB (in units of picture storage) actually used.

    This information makes it possible for the decoder to allocate just enough memory, rather than allocating the maximum amount of memory required for the target level.

  • vps_max_num_reorder_pics
    • Maximum number of frames that precede a frame in decode order but follow it in display order.
    • In other words, the number of frames to buffer for continuous playback.
  • vps_max_latency_increase_plus1
    • When present and not equal to 0, increases the maximum number of frames that precede a frame in decoding order but follow it in display order.
      VpsMaxLatencyPictures = vps_max_num_reorder_pics +
                              vps_max_latency_increase_plus1 - 1

7.4.3.2: Sequence parameter set RBSP semantics

  • sps_video_parameter_set_id: identify the active VPS.
  • sps_seq_parameter_set_id: identify the SPS. This is the value used by a PPS to refer to an SPS.
  • chroma_format_idc:
    • 0: 4:0:0 (monochrome)
    • 1: 4:2:0
    • 2: 4:2:2
    • 3: 4:4:4
  • pic_width_in_luma_samples / pic_height_in_luma_samples:
    • Frame width and height in luma pixels. Must be a multiple of the minimum CB size.
  • conformance_window_flag: true if cropping is enabled.
  • conf_win_*_offset: pixel cropping offsets in terms of luma units.
  • bit_depth_luma_minus8: luma bit depth.
    • Max bit depth is 14 bits.
    • bit_depth_luma_minus8 in the range 0..6.
  • bit_depth_chroma_minus8:
    • Same as above for chroma.
    • Cannot be greater than the luma bit depth.
  • pcm_sample_bit_depth_luma_minus1:
    • Bit depth of the luma PCM samples.
    • Cannot be greater than the luma bit depth.
  • pcm_sample_bit_depth_chroma_minus1:
    • Same as above for chroma.
  • log2_max_pic_order_cnt_lsb_minus4:
    • Size of the POC LSB window (1 << (log2_max_pic_order_cnt_lsb_minus4+4)).
  • sps_max_dec_pic_buffering & co:
    • Direct overlap with the VPS. FIXME.
  • log2_min_luma_coding_block_size_minus3:
    • Log2 of the minimum CB size (8x8 or greater).
  • log2_diff_max_min_luma_coding_block_size:
    • Log2 of the maximum CB size (64x64 or lower), expressed as the delta from the minimum. The derived variables below are computed as in the sketch after this list.
    • MinCbLog2SizeY: log2 of the minimum CB size.
    • CtbLog2SizeY: log2 of the maximum CB size, i.e. log2 of the CTB size.
    • MinCbSizeY: minimum CB size, i.e. 1 << MinCbLog2SizeY.
    • CtbSizeY: CTB luma size, i.e. 1 << CtbLog2SizeY.
    • CtbWidthC/CtbHeightC: size of the chroma CTBs.
    • PicWidthInMinCbsY: picture width in terms of minimum CB blocks.
    • PicHeightInMinCbsY: same as above, but for height.
    • PicWidthInCtbsY/PicHeightInCtbsY: same in terms of CTB blocks.
    • PicSizeInCtbsY: number of CTBs in the picture, i.e. PicWidthInCtbsY * PicHeightInCtbsY.
    • PicSizeInSamplesY: number of luma pixels in the picture.
  • log2_min_transform_block_size_minus2:
    • Log2MinTrafoSize: log2 of the minimum luma transform block size.
    • Log2MinTrafoSize must be less than MinCbLog2SizeY.
  • log2_diff_max_min_transform_block_size:
    • Log2 of the maximum luma transform block size (32x32 or lower), expressed as the delta from the minimum.
  • log2_min_pcm_luma_coding_block_size_minus3 / log2_diff_max_min_pcm_luma_coding_block_size:
    • Minimum/maximum PCM CB size.
    • The minimum size of a PCM CB is 8x8.
    • The maximum size of a PCM CB is 32x32.
  • max_transform_hierarchy_depth_inter / max_transform_hierarchy_depth_intra:
    • Maximum depth difference between the CB size and the transform block size for inter/intra.
    • For example, maximum depth 2 implies that a 64x64 CB must have transform blocks larger or equal to 16x16.
  • scaling_list_enable_flag:
    • True if scaling matrices are used by the dequantization process.
  • sps_scaling_list_data_present_flag:
    • True if scaling matrix data follow in the SPS. When false, the default matrices are used.
  • amp_enabled_flag: true if the H1, H3, V1, V3 partitions can be used.
  • sample_adaptive_offset_enabled_flag: true if SAO is used.
  • pcm_enabled_flag: true if pcm_flag is present in the bitstream.
  • pcm_loop_filter_disable_flag: disable the loop filters (deblocking and SAO) for PCM blocks.
  • num_short_term_ref_pic_sets:
    • Number of short-term reference picture sets (ST-RPS) specified in the SPS.
    • There can be another ST-RPS in the SH.
  • long_term_ref_pics_present_flag:
    • True if long-term (LT) reference frames are used.
  • num_long_term_ref_pics_sps:
    • Number of LT "presets" defined in the SPS.
  • lt_ref_pic_poc_lsb_sps/used_by_curr_pic_lt_sps_flag:
    • See reference picture sets.
  • sps_temporal_mvp_enable_flag:
    • True if temporal motion vectors can be enabled in the SH.
  • strong_intra_smoothing_enable_flag:
    • Smooth the neighbour pixels for the 32x32 intra prediction.
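
A minimal sketch of the size derivations above, using the spec variable names (the sps_t struct is an invention for the example):

    /* Hypothetical container for the relevant SPS elements. */
    typedef struct {
        int log2_min_luma_coding_block_size_minus3;
        int log2_diff_max_min_luma_coding_block_size;
        int pic_width_in_luma_samples;
        int pic_height_in_luma_samples;
    } sps_t;

    static int derive_pic_size_in_ctbs(const sps_t *sps)
    {
        int MinCbLog2SizeY = sps->log2_min_luma_coding_block_size_minus3 + 3;
        int CtbLog2SizeY   = MinCbLog2SizeY +
                             sps->log2_diff_max_min_luma_coding_block_size;
        int CtbSizeY       = 1 << CtbLog2SizeY;

        /* The picture dimensions are multiples of MinCbSizeY, but not
         * necessarily of CtbSizeY, hence the rounding up. */
        int PicWidthInCtbsY  = (sps->pic_width_in_luma_samples + CtbSizeY - 1)
                               >> CtbLog2SizeY;
        int PicHeightInCtbsY = (sps->pic_height_in_luma_samples + CtbSizeY - 1)
                               >> CtbLog2SizeY;
        return PicWidthInCtbsY * PicHeightInCtbsY;   /* PicSizeInCtbsY */
    }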

7.4.3.3: Picture parameter set RBSP semantics

  • pps_pic_parameter_set_id: identify the PPS, so that slices can refer to it.
  • pps_seq_parameter_set_id: reference to an SPS.
  • dependent_slice_segments_enabled_flag: true if dependent slices can be used.
  • output_flag_present_flag: true if the corresponding flag is present in the SH.
  • sign_data_hiding_flag: true to enable coefficient sign encoding optimization.
  • cabac_init_present_flag: true if the CABAC initialization indicator is specified explicitly.
  • num_ref_idx_l0/l1_default_active_minus1: number of L0/L1 reference frames if not overridden in the SH.
  • init_qp_minus26: initial QP of the slice.
  • constrained_intra_pred_flag: true if the pixels from inter blocks cannot be used as intra neighbours (prediction purposes).
  • transform_skip_enabled_flag: true if transform_skip_flag is present in the bitstream.
  • cu_qp_delta_enabled_flag: true if the QP can vary per block.
    • This means that diff_cu_qp_delta_depth is present in the PPS.
    • This also means that cu_qp_delta_abs may be present.
  • diff_cu_qp_delta_depth: size of a quantization group.
  • pps_cb/cr_qp_offset: offset between luma and chroma QPs.
  • pps_slice_chroma_qp_offsets_present_flag: true if the QP chroma offsets can be overridden at slice level.
  • weighted_(bi)pred_flag: true if weighted prediction is applied to P/B slices.
  • transquant_bypass_enable_flag (PPS): true if cu_transquant_bypass_flag is present in the bitstream.
  • tiles_enabled_flag: true if tiles are used.
  • entropy_coding_sync_enabled_flag: true if WPP is used.
  • num_tile_columns/rows_minus1: number of tile columns/rows.
  • uniform_spacing_flag: true if the tile columns and rows are spaced evenly.
  • column_width_minus1/row_height_minus1: size of the columns/rows for non-uniform tile spacing.
  • loop_filter_across_tiles_enabled_flag: true to deblock across tile boundaries.
  • pps_loop_filter_across_slices_enabled_flag: true to deblock across slice boundaries.
  • deblocking_filter_control_present_flag: true if the deblocking parameters are adjusted in the PPS or SH.
  • deblocking_filter_override_enabled_flag: true if the deblocking parameters are adjusted in the SH.
  • pps_disable_deblocking_filter_flag:
    • True to disable the deblocking filter.
    • Can be overridden by the SH.
  • pps_beta_offset_div2/pps_tc_offset_div2: deblocking filter parameters.
  • pps_scaling_list_data_present_flag: true if scaling matrices are used in PPS. Scaling matrix data follows.
  • lists_modification_present_flag: true if the SH may modify the reference picture lists.
  • log2_parallel_merge_level_minus2: speed optimization for merging candidate list generation. See below.

7.4.5: Scaling list data semantics

The scaling list is used in the transform and scaling process. Remember that the maximum transform size is 32x32. The same limit applies to the scaling list. The following are variables used along with the syntax elements.

  • sizeId: identifier for the transform size (0 -> 4x4, 3 -> 32x32).
  • matrixId: an identifier derived using the sizeId, the prediction mode (intra or inter), and the colour component. See Table 7-4 for the correct mappings.
  • ScalingList[sizeId][matrixId][i]:
    • Scaling value for each position i in the scaling list for the transform size SizeID and the use case matrixId.
    • The values are in up-right scan order (scanIdx 0).
  • ScalingFactor[sizeId][matrixId][x][y]:
    • Scaling value for each coefficient (x,y) in the transform block of size sizeId and the use case matrixId.
    • There is a one-to-one mapping between ScalingList and ScalingFactor for the 4x4/8x8 transforms. For the 16x16/32x32 transforms, each value in ScalingList is used by 4/16 coefficients in ScalingFactor.
  • scaling_list_pred_mode_flag:
    • True(1) if the scaling list values are specified explicitly.
    • False(0) if the default scaling values are used (see Tables 7-5 and 7-6).
    • The name has the opposite meaning of the usage (bug).
  • scaling_list_pred_matrix_id_delta:
    • This value is only used when scaling_list_pred_mode_flag == 0.
    • If this value is 0, the scaling list uses the default values.
    • Otherwise, the scaling list inherits the values of another scaling list:
                refMatrixId = matrixId - scaling_list_pred_matrix_id_delta.
  • scaling_list_dc_coef_minus8:
    • Scaling value of the first coefficient (DC) of a 16x16/32x32 transform block. In other words, the DC value overrides the value given by ScalingList[sizeID][matrixId][0] to the very first coefficient of the transform block.
    • Not present for 4x4/8x8 transforms.
    • When not present in the stream, use 8.
  • scaling_list_delta_coef:
    • The values in a scaling list are unsigned bytes. The value 0 is illegal.
    • scaling_list_delta_coef is the delta between the current scaling value and the previous scaling value. For instance, if the previous value is 15 and the current value is 17, then the delta is 2.
    • The delta must be within [-128, 127]. The specification assumes the values 0 and 255 are contiguous (rollover):
               nextCoef = (nextCoef + scaling_list_delta_coef + 256) % 256.
    • For the very first scaling_list_delta_coef value, the previous value is assumed to be 8 for the 4x4/8x8 transforms and the DC value for the 16x16/32x32 transforms.
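
A sketch of the reconstruction loop implied by the semantics above. The deltas are assumed to be decoded already; 'start' is 8 for the 4x4/8x8 lists and the DC value for the 16x16/32x32 lists.

    /* Rebuild one explicitly coded scaling list from its deltas, given
     * in up-right scan order. */
    static void rebuild_scaling_list(const int *deltas, int coef_num,
                                     int start, unsigned char *list)
    {
        int next_coef = start;
        for (int i = 0; i < coef_num; i++) {
            /* The values 0 and 255 are contiguous (rollover). */
            next_coef = (next_coef + deltas[i] + 256) % 256;
            list[i] = (unsigned char)next_coef;
        }
    }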

7.4.7.1: General slice segment header semantics

  • first_slice_segment_in_pic_flag:
    • Must be set to 1 for the first slice segment of the current frame.
    • When there are multiple slices associated to the current picture, then the slice that carries the first CTB has this flag set to 1. All other slices have this flag set to 0.
  • no_output_of_prior_pics_flag: see the specification, this is complex.

    In summary, the value of this flag affects what to do with the pictures currently held in the decoded picture buffer when the current picture is an IDR or a BLA.

  • slice_pic_parameter_set_id: identify the PPS to activate.
  • dependent_slice_segment_flag:
    • True if this segment is an extension of the previous segment in the slice.
    • The dependent segment shares the SH, CABAC state and neighbour availability of the previous segment.
  • slice_segment_address:
    • Specify the number of the first CTB of the segment in picture raster scan order.
    • CtbAddrInRS is set to slice_segment_address initially. It is updated after each CTB encoded.
    • CtbAddrInTS is set to slice_segment_address mapped to tile scan order initially. It is incremented after each CTB encoded.
    • If the segment is an independent segment:
      • BaseSliceAddrRS = slice_segment_address.
    • Else:
      • BaseSliceAddrRS = slice_segment_address of the last independent segment.
  • slice_reserved_flag: reserved for future use by the ITU-T | ISO/IEC.
  • slice_type:
    • 0: B
    • 1: P
    • 2: I
  • pic_output_flag: true if the picture is displayed.

    Like no_output_of_prior_pics_flag, this value affects the decoded picture buffer management. See Annex C of the spec for more details.

  • colour_plane_id: when the colour planes are separate, this value identifies the image component to which the slice belongs.
  • slice_pic_order_cnt_lsb:
    • LSB value of the POC of the current frame.
    • PicOrderCntMsb contains the MSB value of the POC of the current frame.
    • PicOrderCntVal contains the POC of the current frame (see the derivation sketch after this list).
  • short_term_ref_pic_set_sps_flag:
    • True if the ST-RPS is selected from the SPS.
    • If false, the ST-RPS is defined in the SH.
  • short_term_ref_pic_set_idx: index of the ST-RPS selected in the SPS.
  • num_long_term_sps: number of LT presets selected from the SPS. Each preset is an LSB value.
  • num_long_term_pics: number of LT LSB values specified in the SH (described later).
  • lt_idx_sps:
    • Index of the selected LT preset in the SPS.
    • The LSBs specified by those indices must be in increasing order.
  • poc_lsb_lt, used_by_curr_pic_lt_flag, delta_poc_msb_present_flag, delta_poc_msb_cycle_lt: See reference picture sets.
  • slice_temporal_mvp_enable_flag: true if temporal motion vectors are used.
  • slice_sao_luma/chroma_flag: true if SAO is enabled for luma/chroma.
  • num_ref_idx_active_override_flag: true if the L0 and L1 number of reference frames is overridden.
  • num_ref_idx_l0/l1_active_minus1: number of L0/L1 reference frames (max 15).
  • mvd_l1_zero_flag: if true, when inter bi-prediction is used, the L1 motion vector is assumed to be equal to the predicted motion vector.
    • MvdL1[x0][y0][0] = MvdL1[x0][y0][1] = 0; (no residual)
  • cabac_init_flag: CABAC initialization indicator.
  • collocated_from_l0_flag: true if the collocated partition is in list 0 for temporal prediction, otherwise, use list 1.
  • collocated_ref_idx: reference index associated to the collocated partition.
  • five_minus_max_num_merge_cand: determines the number of candidates in merge mode (MaxNumMergeCand = 5 - five_minus_max_num_merge_cand).
  • slice_qp_delta: offset added to init_qp_minus26.
  • slice_cb/cr_qp_offset: offset added to pps_cb/cr_qp_offset.
  • deblocking_filter_override_flag,
    slice_disable_deblocking_filter_flag,
    beta_offset_div2,
    tc_offset_div2,
    slice_loop_filter_across_slices_enabled_flag: same semantics as in PPS, override the PPS values when present.
  • num_entry_point_offsets: number of entry points.
  • offset_len_minus1: number of bits used to represent the entry point offsets.
  • entry_point_offset:
    • Offset from the previous entry point (first is 0).
    • Notice that alignment_bit_equal_to_one is present before the start of the data segment. This means that the first 2 bytes of the data segment will not be escaped regardless of the content of the SH. Hence, the slice segments can be written in the bitstream and escaped prior to writing the slice header.
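
The POC derivation mentioned above (clause 8.3.1) boils down to the window logic below. A minimal sketch, assuming prev_poc_lsb and prev_poc_msb have been saved from the previous picture as 8.3.1 requires:

    /* Derive PicOrderCntVal from the POC LSB of the current slice.
     * max_poc_lsb = 1 << (log2_max_pic_order_cnt_lsb_minus4 + 4). */
    static int derive_poc(int poc_lsb, int prev_poc_lsb,
                          int prev_poc_msb, int max_poc_lsb)
    {
        int poc_msb;
        if (poc_lsb < prev_poc_lsb &&
            prev_poc_lsb - poc_lsb >= max_poc_lsb / 2)
            poc_msb = prev_poc_msb + max_poc_lsb;   /* LSB wrapped forward */
        else if (poc_lsb > prev_poc_lsb &&
                 poc_lsb - prev_poc_lsb > max_poc_lsb / 2)
            poc_msb = prev_poc_msb - max_poc_lsb;   /* LSB wrapped backward */
        else
            poc_msb = prev_poc_msb;
        return poc_msb + poc_lsb;                   /* PicOrderCntVal */
    }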

7.4.7.3: Weighted prediction parameters semantics

  • luma_log2_weight_denom: log2 of the luma scaling denominator.
  • delta_chroma_log2_weight_denom: offset added to luma_log2_weight_denom to obtain the chroma scaling denominator.
  • luma_weight_l0_flag/chroma_weight_l0_flag: true if explicit luma/chroma weights are present.
  • delta_luma_weight_l0/delta_chroma_weight_l0:
    • Offset added to the luma/chroma denominator to obtain the numerator.
    • Must be within [-128, 127].
    • Example:
      • luma_log2_weight_denom = 3
      • delta_chroma_log2_weight_denom = 1
      • delta_chroma_weight_l0 = 7
      • LumaDenom = 1 << 3 = 8.
      • ChromaDenom = 1 << (3 + 1) = 16.
      • ChromaNum = 16 + 7 = 23.
  • luma_offset_l0:
    • Luma offset added to the pixel after the multiplication by Num/Denom.
    • Must be within [-128, 127].
  • delta_chroma_offset_l0:
    • The chroma offset is computed differently than luma.
    • PredictedOff = ((128*ChromaNum) >> Log2ChromaDenom) - 128.
    • In other words, PredictedOff = 128*(Num/Denom - 1.0).
    • ChromaOffset = ClipToSignedByte(delta_chroma_offset_l0 - PredictedOff).
    • Must be within [-512, 511].
  • The same applies for L1.

Restrictions:
There can be at most 24 weight flags equal to 1 (counting 2 for the chroma flags) in the combined L0+L1 list. Put another way, at most 8 fully weighted reference frames.
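
A sketch of the resulting uni-directional luma weighting for one sample. The 8-bit depth and the simple clip are assumptions; see the weighted sample prediction process in clause 8.5.3 for the exact formula, including the case where log2_denom is 0.

    /* weight = (1 << log2_denom) + delta_luma_weight_l0,
     * offset = luma_offset_l0, log2_denom >= 1 assumed. */
    static int weight_sample(int pred, int log2_denom, int weight, int offset)
    {
        int out = ((pred * weight + (1 << (log2_denom - 1))) >> log2_denom)
                  + offset;
        if (out < 0)   out = 0;     /* clip to the 8-bit sample range */
        if (out > 255) out = 255;
        return out;
    }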

7.4.8: Short-term reference picture set syntax

Several short-term (ST) RPS can be defined in the SPS, and one ST-RPS can be defined in the SH. Each ST-RPS is identified by an index (stRpsIdx). The indices are assigned in the order in which the ST-RPS are defined in the SPS (the first ST-RPS has index 0). The SPS can define num_short_term_ref_pic_sets ST-RPS that the SH can refer to. This is useful when a regular picture pattern is used. For instance, a slice header may simply tell the decoder to use the 3rd RPS predefined in the SPS instead of signaling what to do with each of the pictures in the DPB the encoder wants to keep.

The ST-RPS defined in the SH (if any) uses the last index (num_short_term_ref_pic_sets). Each new picture overwrites the RPS signaled at this index. A frame either selects one ST-RPS amongst those defined in the SPS or builds the RPS directly in its SH.

  • stRpsIdx is set to the index of the selected ST-RPS.
  • inter_ref_pic_set_prediction_flag: can only be present if the ST-RPS is not the first defined by the SPS.
    • When true, the ST-RPS being defined uses a previously defined ST-RPS in the SPS and brings modifications to it (inter encoded).
    • When false, the ST-RPS being defined is self-contained (intra encoded).

    When true, the following fields are also present:

    • delta_idx_minus1: (only present when dealing with the ST-RPS in the slice header)
      • It represents the offset from the current index to the previously decoded ST-RPS in the active SPS that will serve as a reference for the inter RPS coding.
      • RefRpsIdx is the variable containing the index of the ST-RPS used for prediction.
        • If the current ST-RPS is being defined in the SPS, RefRpsIdx is set to stRpsIdx - 1 (inter RPS signaling in the SPS is based on the previous RPS set).
        • Otherwise (current RPS is being defined in the SH), RefRpsIdx is set to stRpsIdx - (delta_idx_minus1 + 1), that is, the index of some ST-RPS set defined in the SPS.
    • delta_rps_sign/abs_delta_rps_minus1:
      deltaRps = (1 - 2 * delta_rps_sign) * (abs_delta_rps_minus1 + 1)

      deltaRps represents the offset to add to the delta_poc_s0/s1_minus1 values found in the reference RPS set.

    • used_by_curr_pic_flag:
      • True if the reference frame is used for reference by the current frame. The value thus goes in one of the "Curr" lists (see section 8.3.2 for more info). If false, the reference frame may be discarded, depending on the value of "use_delta_flag".
      • Note that the same semantics apply for LT frames, with different variable names (used_by_curr_pic_lt_flag).
    • use_delta_flag: if the reference frame is not used for reference by the current frame, this value specifies if the reference frame lingers in the DPB. When true, this says that the value is to be kept in the ST "Foll" list. When false, this says that the value will be set as "unused for reference" when the current RPS set is signaled in the slice header.
  • num_negative_pics/num_positive_pics: Number of frames having a POC lower/higher than the current frame in the RPS.
    NumDeltaPocs = num_negative_pics + num_positive_pics.
  • delta_poc_s0/s1_minus1: Used to compute the POC offset of the ST frame with respect to the current frame.
    • The offset is encoded as the delta from the previous value encoded (see 8.3.2 for more precision).
    • DeltaPocS0/S1[I][X] contains the POC offset of ST frame X in the ST-RPS with index I.
  • used_by_curr_pic_s0/1_flag:
    • Same semantics as used_by_curr_pic_flag for the frame.
    • UsedByCurrPicS0/S1[I][X] stores the flag values the way DeltaPocS0/S1 stores the POC offsets.

Consider this simplified example for inter RPS signaling. Here, we assume that the used_by_curr_pic_flag is always true. In a more complex example, the lists may differ, as not all the values are kept.

                DeltaPocS0[RefRpsIdx] = [ -1, -3 ]
                DeltaPocS1[RefRpsIdx] = [ 1, 2, 4 ]
                deltaRps = -2 (i.e. we shift the ST frames left by two)

                "DeltaBeforeShift" = [ -3, -1, 1, 2, 4 ]
                "DeltaAfterShift"  = [ -5, -3, -1, 0, 2 ]

                The current frame cannot be part of its own reference list,
                so we remove entry 0.

                "DeltaActual" = [ -5, -3, -1, 2 ]

                DeltaPocS0[stRpsIdx] = [ -1, -3, -5 ]
                DeltaPocS1[stRpsIdx] = [ 2 ]

The ST-RPS in the SPS are "presets" to reduce the encoding size of the ST-RPS needed by the current frame. There can be as many as 64 presets in each SPS. The decoder should always allocate one extra RPS structure, as each slice can define its own RPS structure on top of the "presets" already defined.
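
A simplified C sketch of the shift illustrated in the example above, assuming every used_by_curr_pic_flag and use_delta_flag is 1 (the actual derivation in 7.4.8 also walks the per-entry flags and may keep the reference frame itself):

    /* 'ref' holds the POC offsets of the reference RPS (S0 and S1
     * merged, ascending), e.g. { -3, -1, 1, 2, 4 }. Entries that land
     * on 0 are dropped: the current frame cannot reference itself. */
    static void shift_rps(const int *ref, int ref_count, int delta_rps,
                          int *s0, int *s0_count, int *s1, int *s1_count)
    {
        *s0_count = *s1_count = 0;
        /* S0 is ordered closest-to-the-current-frame first. */
        for (int i = ref_count - 1; i >= 0; i--)
            if (ref[i] + delta_rps < 0)
                s0[(*s0_count)++] = ref[i] + delta_rps;
        for (int i = 0; i < ref_count; i++)
            if (ref[i] + delta_rps > 0)
                s1[(*s1_count)++] = ref[i] + delta_rps;
    }

With ref = { -3, -1, 1, 2, 4 } and delta_rps = -2, this produces S0 = { -1, -3, -5 } and S1 = { 2 }, matching the example.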

7.4.9.1: General slice segment data semantics

Note the alignment requirements for tile and WPP row boundaries.

  • end_of_slice_segment_flag: true if this is the last CTB of the segment.
  • end_of_sub_stream_one_bit: hardcoded to 1.

7.4.9.3: Sample adaptive offset semantics

  • sao_merge_left/up_flag:

    True to import the SAO parameters from the left/up CTB. A flag is not present if the left/up CTB is not in the same slice or tile. The up flag is not present if the left flag is true.

  • sao_type_idx_luma/chroma:
    • SAO filter type (0: disabled, 1: band, 2: edge). The chroma components share the same filter type and situation neighbour direction. The other SAO parameters are specific to each image component.

    Note that both Cb and Cr use the same mode, as only one sao_type_idx_chroma is transmitted.

  • sao_offset_abs/sign:
    • Sign (0: positive) and absolute value of the offset for each band or situation.
    • The offset signs are hardcoded in edge mode:
            Positive sign      Positive sign      Negative sign      Negative sign

               a     b            a                  a--c                 c
                \ ^ /              \ ^                  |\               /|\
                 \|/                \|                  v \             / v \
                  c                  c--b                  b           a     b

              Category 1         Category 2         Category 3        Category 4
               (valley)       (concave corner)    (convex corner)       (peak)
    • BitDepthShift = bitDepth - Min(bitDepth, 10).
    • SaoOffsetVal[0] = 0.
    • SaoOffsetVal[1..4] = offset 'i' << BitDepthShift.
  • sao_band_position: offset of the first band in the chosen set (band 0 has offset 0).
  • sao_eo_class_luma/chroma: situation neighbour direction.
        1D 0-degree      1D 90-degree      1D 135-degree      1D 45-degree
        edge offset      edge offset        edge offset       edge offset

         +-+-+-+          +-+-+-+           +-+-+-+           +-+-+-+
         | | | |          | |a| |           |a| | |           | | |a|
         +-+-+-+          +-+-+-+           +-+-+-+           +-+-+-+
         |a|c|b|          | |c| |           | |c| |           | |c| |
         +-+-+-+          +-+-+-+           +-+-+-+           +-+-+-+
         | | | |          | |b| |           | | |b|           |b| | |
         +-+-+-+          +-+-+-+           +-+-+-+           +-+-+-+
        EO Class 0       EO Class 1        EO Class 2        EO Class 3
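
A sketch of the edge category computation for one pixel, where 'a' and 'b' are the two neighbours along the selected EO class and 'c' is the current pixel (this mirrors the edgeIdx derivation of the SAO filtering process):

    static int sign3(int v) { return v > 0 ? 1 : (v < 0 ? -1 : 0); }

    /* Return the edge category: 1 = valley, 2 = concave corner,
     * 3 = convex corner, 4 = peak, 0 = none (no offset applied). */
    static int sao_edge_category(int a, int b, int c)
    {
        switch (2 + sign3(c - a) + sign3(c - b)) {
        case 0:  return 1;   /* c below both neighbours */
        case 1:  return 2;   /* c below one, equal to the other */
        case 3:  return 3;   /* c above one, equal to the other */
        case 4:  return 4;   /* c above both neighbours */
        default: return 0;   /* flat or monotonic: no offset */
        }
    }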

7.4.9.4: Coding quadtree semantics

A quadtree node is empty if it covers zero pixels, i.e. near the right and bottom picture boundaries. The top-most node is never empty. Empty quadtree nodes are not present in the bitstream.

  • split_cu_flag:
    • Indicate whether the current block is split into four smaller blocks.
    • The value is present in the current node, unless
      1. the minimum CB size has been reached (inferred to be 0), or else
      2. the node is missing pixels (inferred to be 1).

7.4.9.5: Coding unit semantics

  • cu_transquant_bypass_flag:
    • If true, assume the coefficients of the whole CB are dequantized and untransformed residual values (like PCM, but for the residual values instead of the source pixels).
    • cu_transquant_bypass_flag is present even if cu_skip_flag or pcm_flag is true, or rqt_root_cbf is false. Presumably the overhead is tolerated to remove conditions from the bitstream.
  • cu_skip_flag:
    • If true, use inter prediction with the UN prediction split in merge mode and assume the residual is zero for the whole CB.
    • Not present for I slices.
  • pred_mode_flag:
    • 0: inter
    • 1: intra
  • part_mode: partitioning mode for intra/inter (see spec table 7-10).
  • pcm_flag:
    • True to send the source pixel values directly for the whole CB.
    • Only present if PCM is enabled, the split mode is UN and the CB matches the PCM block size constraints. Inferred to be 0 otherwise.
    • Some bits with value 0 (pcm_alignment_zero_bit) must be sent after this flag until the bitstream is aligned on a byte boundary. The PCM data follows.
  • prev_intra_luma_pred_flag:
    • This value is present for each prediction block (1 or 4).
    • If true, choose the luma intra mode of the current block from a predicted list with 3 entries.
  • mpm_idx: if prev_intra_luma_pred_flag is true, this is the index in the predicted list.
  • rem_intra_luma_pred_mode: if prev_intra_luma_pred_flag is false, this is the index in the list of intra modes with the 3 predicted entries removed.
  • intra_chroma_pred_mode: chroma intra mode of the CB. In 4:2:0 it is the same for all chroma blocks, regardless of the prediction split mode.
  • rqt_root_cbf:
    • If true, assume there are residual values in the current CB.
    • Inferred to be 1 when not present.
    • Not present for intra (including PCM) or if setting this flag to true would amount to setting skip_flag to true (UN prediction split in merge mode with no residual).

7.4.9.6: Prediction unit semantics

  • mvp_l0/l1_flag: index in the predicted motion vector candidate list.
  • merge_flag: true if the merge mode is used.
  • merge_idx: index in the candidate list in merge mode when there is more than one candidate in the list.
  • inter_pred_idc: reference lists used:
    • 0: L0
    • 1: L1
    • 2: L0 and L1 (bi-prediction)
  • ref_idx_l0/l1: reference index in the reference list when there is more than one candidate in the list.

7.4.9.8: Transform tree semantics

  • split_transform_flag: True if the transform block is split into four smaller transform blocks. When split_transform_flag is absent, its value is derived as follows:
    • Inferred to be 1 if the current size is larger than the maximum size.
    • Inferred to be 1 if the prediction block is intra and the prediction split is HV (the transform block cannot be larger than the intra prediction block).
    • Inferred to be 0 if the current size is equal to the minimum size or if the current split depth is equal to the maximum split depth.

    The spec uses the split_transform_flag to derive the value of the variable interSplitFlag. The derivation process is as follows:

    • If max_transform_hierarchy_depth_inter is 0, the CU is inter coded, the partitioning is not unsplit and trafoDepth is 0, interSplitFlag is set to 1.

      Concretely, this means that sub-CB transforms are required when inter splits are used and max_transform_hierarchy_depth_inter is set to 0. In other words, when the inter coded block has been split for prediction purposes and the transform blocks would otherwise be the same size as the coding blocks, bypass the restriction and subdivide the transform block.

    • Otherwise, interSplitFlag is set to 0.
  • cbf_luma/cbf_cb/cbf_cr: true if the coefficients of the current transform block are non-zero (more precisely, if at least one coefficient is non-zero).
    • The luma flags are encoded differently than the chroma flags.
    • cbf_c* are present in the top node of the transform tree, which is 8x8 or greater.
    • cbf_c* are present in each node under the parent node, unless:
      1. the parent node specified that the chroma coefficients were zero, or
      2. the chroma transform size is less than 4x4 (in 4:2:0).

      In case "a", cbf_c* is inferred to be 0. In case "b", cbf_c* is inferred to be cbf_c* of its parent node. This is a kludge to consider the four 2x2 blocks as one 4x4 block.

    • cbf_luma is present at the leaf nodes of the transform tree, i.e. for the nodes that represent an unsplit transform block, except for one case:
      • If the transform block has the same size as the coding block, the prediction is inter and there are no chroma coefficients, then cbf_luma is inferred to be 1, since in that case cbf_luma == 0 would be equivalent to rqt_root_cbf == 0.

7.4.9.9: Motion vector difference semantics

  • abs_mvd_greater0_flag: false if the residual value is zero.
  • abs_mvd_greater1_flag: true if the absolute residual value is greater than 1 (assuming greater0_flag is true).
  • abs_mvd_minus2: absolute residual value when it is greater than 1 (minus 2 to account for greater0_flag and greater1_flag).
  • mvd_sign_flag: sign of the residual value when it is non-zero.

7.4.9.10: Transform unit semantics

Mind the hack used to avoid 2x2 chroma blocks.

  • cu_qp_delta_abs/cu_qp_delta_sign: offset with the predicted QP.
    • CuQpDelta contains the QP offset of the current QG/CB:
      • Set to 0 when a QG boundary is crossed outside a CB.
      • Set to cu_qp_delta_abs with sign cu_qp_delta_sign when those values are present in the bitstream.
    • IsCuQpDeltaCoded tracks whether cu_qp_delta_abs has been specified in the current QG and CB:
      • Set to 0 when a QG boundary is crossed outside a CB.
      • Set to 1 when cu_qp_delta_abs is present in the bitstream.

7.4.9.11: Residual coding semantics

This section reconstructs the values of the coefficients, so it is complex. The bitstream variables are described first, then the control flow is paraphrased.

  • transform_skip_flag:
    • True if the coefficients of the current 4x4 transform block are quantized untransformed residual values.
    • Not present if cu_transquant_bypass_flag is true, the TB is not 4x4 or the feature is disabled.
  • last_sig_coeff_x/y_prefix/suffix:
    • (X, Y) location from the TB origin of the last non-zero coefficient in scan order.
    • If last_sig_coeff_x_prefix <= 3:
      • LastSignificantCoeffX = last_sig_coeff_x_prefix.
    • Else:
      • LastSignificantCoeffX =
        (1 << ((last_sig_coeff_x_prefix >> 1) - 1)) * (2 + (last_sig_coeff_x_prefix & 1)) + last_sig_coeff_x_suffix.
    • Same for Y.
    • If scanIdx == 2 (vertical scan), X is swapped with Y.
  • coded_sub_block_flag:
    • True if the SB contains at least one non-zero coefficient.
    • Inferred to be 1 for the first SB and the SB that contains the last non-zero coefficient.
  • sig_coeff_flag:
    • True if the coefficient is non-zero.
    • Inferred to be 0 for the coefficients after the last non-zero coefficient.
    • Inferred to be 1 for the last non-zero coefficient.
    • Inferred to be 1 for the first coefficient if coded_sub_block_flag of the SB is present and equal to 1 and the other coefficients are zero (one coefficient has to be non-zero).
  • coeff_abs_level_greater1/2_flag: true if the absolute coefficient value is greater than 1/2.
  • coeff_sign_flag: 0: coefficient is positive, 1: negative.
  • coeff_abs_level_remaining:
    • Delta from the minimum inferred absolute value.
    • The coefficient must fit in signed 16-bit.

Control flow explanations follow.

  • Determine the index in scan order of the SB that contains the last non-zero coefficient (lastSubBlock), and the index in scan order of the last non-zero coefficient in that SB (lastScanPos):
    • lastScanPos = 16.
    • lastSubBlock = highest SB index in the TB, e.g. 3 for an 8x8 TB.
    • Loop:
      • If lastScanPos == 0 (pass to the next SB in reverse order):
        • lastScanPos = 16.
        • lastSubBlock--.
      • lastScanPos-- (pass to the next coefficient in reverse order).
      • (xS, yS): position of the current SB in the TB (in units of 4x4 blocks).
      • (xC, yC): position of the current coefficient in the TB.
      • If (xC, yC) is the position of the last non-zero coefficient, break.
  • Iterate on every SB 'i' in reverse order (last-to-first), starting from lastSubBlock:
    • Determine if the SB is empty:
      • inferSigCoeffFlag = 0.
        Track whether the first coefficient is inferred to be non-zero.
      • If the SB is not the first or the last SB:
        • coded_sub_block_flag: 0 if the SB is empty.
        • inferSigCoeffFlag = 1.
    • Determine whether the coefficients are non-zero:
      • In general there are 16 coefficients to process. For the last SB, only the coefficients before the last non-zero coefficients are processed.
      • Iterate on every coefficient 'n', last-to-first:
        • If (the SB is not empty) && (this is not the first coefficient or the value of the flag for the first coefficient cannot be inferred):
          • sig_coeff_flag: true if the coefficient is non-zero.
          • If the coefficient is non-zero:
            • inferSigCoeffFlag = 0.
              Infer the value of the first coefficient if coded_sub_block_flag is present and true and the last 15 coefficients are zero.
    • Determine whether the coefficients are greater than 1/2:
      • firstSigScanPos = 16.
        Track the position of the first non-zero coefficient.
      • lastSigScanPos = -1.
        Track the position of the last non-zero coefficient.
      • numGreater1Flag = 0.
        Track the number of "greater than 1" flags (max 8).
      • lastGreater1ScanPos = -1.
        Track the position of the coefficient with the "greater than 2" flag, if any.
      • Iterate on every coefficient 'n', last-to-first:
        • If the coefficient is non-zero:
          • If less than 8 "greater than 1" flags are present:
            • coeff_abs_level_greater1_flag: true if the coefficient is greater than 1 in absolute value.
            • numGreater1Flag++.
            • If this is the first "greater than 1" flag processed with value 1:
              • lastGreater1ScanPos = n;
          • If the coefficient is the last non-zero coefficient of the SB:
            • lastSigScanPos = n.
          • firstSigScanPos = n.
      • If at least one coefficient is greater than 1:
        • coeff_abs_level_greater2_flag: true if the last coefficient greater than 1 is also greater than 2.
    • Determine the signs of the coefficients:
      • signHidden = (lastSigScanPos - firstSigScanPos) > 3 && !cu_transquant_bypass_flag:
        • Infer the sign of the first coefficient if there seems to be many non-zero coefficients and there is no bypass.
      • Iterate on every coefficient 'n', last-to-first:
        • If the coefficient is non-zero and its sign is not hidden:
          • coeff_sign_flag: 0 if the coefficient is positive.
    • Determine the values of the coefficients:
      • numSigCoeff = 0.
        Track the number of non-zero coefficients processed.
      • sumAbsLevel = 0.
        Sum of the absolute values of the coefficients.
      • Iterate on every coefficient 'n', last-to-first:
        • If the coefficient is non-zero:
          • baseLevel = 1 + coeff_abs_level_greater1_flag + coeff_abs_level_greater2_flag.
            This is the inferred minimum value of the coefficient.
          • If the coefficient value has not been inferred from coeff_abs_level_greater1/2_flag (i.e. those flags are true or not present):
            • coeff_abs_level_remaining: offset from baseLevel.
          • TransCoeffLevel[...] = sign*(baseLevel + coeff_abs_level_remaining).
          • If the sign of the first coefficient is hidden:
            • sumAbsLevel += |TransCoeffLevel[...]|.
            • If this is the first coefficient && sumAbsLevel is odd:
              • TransCoeffLevel[...] *= -1.
                The first coefficient is now negative.
          • numSigCoeff++.
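
As a compact recap of the last step, here is a sketch of the value reconstruction for one non-zero coefficient. read_remaining() stands in for the coeff_abs_level_remaining decoder, and has_remaining reflects the condition above (the greater1/greater2 flags do not pin down the value).

    /* gt1/gt2 are the decoded coeff_abs_level_greater1/2_flag values
     * (0 when absent); sign is 1 when coeff_sign_flag marks a negative
     * coefficient. */
    static int reconstruct_coeff(int gt1, int gt2, int sign,
                                 int has_remaining,
                                 int (*read_remaining)(void))
    {
        int base_level = 1 + gt1 + gt2;   /* inferred minimum value */
        int value = base_level + (has_remaining ? read_remaining() : 0);
        return sign ? -value : value;
    }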

Part III. Specification clauses

6.4.1: Derivation process for z-scan order block availability

Summary: determine whether a neighbour block is available.

  • The neighbour is available if it is inside the frame, within the same slice and tile and the neighbour is encoded before the current block.
  • Note: z-scan == depth scan == traverse tiles, CTBs, CBs, blocks
    left-to-right, top-to-bottom, depth first.

6.4.2: Derivation process for prediction block availability

Summary: determine whether a neighbour block is available for inter prediction, sorta. Subclause 6.4.2 is invoked from the derivation process for motion vector components.

  • The semantics are messy.
  • The spec considers some PB split constraints here and other PB split constraints elsewhere.
  • Pseudo-code:
    • Do 6.4.1, return false if unavailable.
    • If the neighbour is intra, return false.
    • If the PB split is [01]:
                         [23]
      • If block 1 is the current block, then block 2 is unavailable.

6.5.1: Coding tree block raster and tile scanning conversion process

Summary: specify a bunch of variables used elsewhere.

  • colWidth[X]: specify the width in CTBs of tile column X.
  • rowHeight[X]: specify the height in CTBs of tile row X.
  • ColumnWidthInLumaSamples[X]/RowHeightInLumaSamples[X]: same as colWidth[X]/ rowHeight[X] but in terms of pixels.
  • colBd[X]: specify the horizontal index of the first CTB in column X (offset on the horizontal axis of the 2D CTB grid).
  • rowBd[X]: specify the vertical index of the first CTB in row X.
  • CtbAddrRStoTS[X]: map a CTB number in raster scan order to a CTB number in tile scan order. Example:
                          Raster scan      Tile scan with 4 tiles

                            [0123]               [0145]
                            [4567]               [2367]
                            [89AB]               [89CD]
                            [CDEF]               [ABEF]
  • CtbAddrTStoRS[X]: inverse of the map above.
  • TileId[X]: map a CTB number to a tile number.

6.5.2: Z-scan order array initialization process

Summary: map the number of a "minimum block" in raster scan to a number in z-scan.

  • The minimum block is the size of the smallest luma transform block allowed, e.g. 4x4.
  • Output is MinTbAddrZS[X][Y].

6.5.3-5: Scan order arrays

Summary: specify the order in which array elements are scanned.

    Up-right          Horizontal           Vertical

    [0259]             [0123]              [048C]
    [148C]             [4567]              [159D]
    [37BE]             [89AB]              [26AE]
    [6ADF]             [CDEF]              [37BF]
  • ScanOrder[log2BlockSize][scanIdx][sPos][sComp] is the scan order for various block sizes:
    • log2BlockSize: log2 of the block size (1x1 => 0, 4x4 => 2, etc).
    • scanIdx: 0 for up-right, 1 for horizontal, 2 for vertical.
    • sPos: coefficient index.
    • sComp: 0 for horizontal offset, 1 for vertical offset. In other words, the (X,Y) location of the coefficient in the matrix.
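
The up-right order can be generated with the loop below, a transcription of the derivation in clause 6.5.3:

    /* Generate the up-right (diagonal) scan for a blk_size x blk_size
     * block. order[i][0] is the X (column) offset of scan position i,
     * order[i][1] the Y (row) offset. */
    static void diag_scan(int blk_size, int order[][2])
    {
        int i = 0, x = 0, y = 0;
        while (i < blk_size * blk_size) {
            while (y >= 0) {
                if (x < blk_size && y < blk_size) {
                    order[i][0] = x;
                    order[i][1] = y;
                    i++;
                }
                y--;        /* walk up-right along the diagonal */
                x++;
            }
            y = x;          /* start the next diagonal at (0, d) */
            x = 0;
        }
    }

For blk_size = 4 this yields exactly the "Up-right" matrix shown above.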

8.3.2: Decoding process for reference picture set

As a complement to this text, please refer to "Overview of HEVC High-Level Syntax and Reference Picture Management".

The basic idea of the RPS is to "signal what you want to keep in the DPB". By default, pictures in the decoded picture buffer (DPB) are marked as "unused for reference" (a picture's death sentence). It is up to the RPS to change the status of the pictures that must stay in the DPB.

By signaling RPS information in every slice, packet loss does not prevent the decoder from correctly maintaining the status of the reference pictures in the DPB. The explicit signaling of reference picture management in H.265 was designed as a basic error robustness tool to help detect the loss of entire pictures. If the current picture refers to a previously decoded picture no longer in the DPB, the decoder reports or conceals the error.

The specification makes the RPS a lot more complex than it should be. It's easy to get lost in the details and miss the big picture. Basically, the RPS is the list of POCs of the frames in the DPB, with the minor exception of long-term reference frames optionally identified by the POC LSB value instead of the POC.

The RPS is fundamentally flawed as far as robustness is concerned. The specification and various papers carefully tiptoe around this issue. Quote from the specification:

The RPS is an absolute description of the reference pictures used in the decoding process of the current and future coded pictures. The RPS signalling is explicit in the sense that all reference pictures included in the RPS are listed explicitly.

What is not mentioned is that the "absolute" description of the reference frames is based on the POC, which by design is a relative method to identify frames. The POC is split into its MSB and LSB parts. Each rollover of the LSB value (a cycle) implicitly increments the MSB. If the decoder loses more than half a cycle of frames, then it gets confused about the current MSB value. There's no recovery until the next IDR is successfully received.

The IDR frames themselves cause problems since they reset the POC values. Consider the following sequence:

        IDR0 P1 IDR2 P3

If the decoder loses P1 and IDR2, then the RPS of P3 matches IDR0 as if it were IDR2, and the loss goes undetected. So, frequent IDR frames accelerate the recovery after a total desynchronization but make the RPS less robust. The identification of long-term frames by their POC LSB value is another robustness hole.

The concept of RPS sets is also dubious. The RPS sets in the SPS each describe one DPB buffer state. The slice header can use a set as-is or "shift" the POC offsets of a set. This only works well when the frame structure is fixed. Typically, an encoder uses a flexible structure when it uses B frames, and there's no need for something as complex as an RPS set to describe a mostly contiguous list of frames when only P frames are used.

The specification screwed the pooch on that one.

The bitstream refers to the POCs of the DPB frames in three ways:

  • POC offsets from the current frame, aka the short-term list.
  • "Far" POC offsets from the current frame, aka the long-term MSB list.
  • LSB values of the frames, aka the long-term LSB list.

The short-term list can refer to past and future frames (in display order). The long-term lists can only refer to past frames.

The list names used by the specification are misleading. The so-called short-term list can refer to 2^16 past frames, irrespective of how many bits there are in the LSB. In contrast, the long-term lists can refer to 2^24 past frames. Thus, in a practical encoder, it's possible to use the short-term list to refer to arbitrarily distant frames and forget about long-term frames entirely.

Example using 3 bits for the LSB (the specification requires at least 4 bits in practice):

POC | 00 01 02 03 04 05 06 07 | 08 09 10 11 12 13 14 15 | 16 17 18 19 20 21 22 23 |
LSB |  0  1  2  3  4  5  6  7 |  0  1  2  3  4  5  6  7 |  0  1  2  3  4  5  6  7 |
MSB |           0             |           1             |           2             |
          E              F  G                a  b     H    c     *     d

                              '*' is the current frame.
                              Letters represent a frame in the RPS.

We arbitrarily choose to refer to the RPS frames as follows.

        Short-term list:     [ a, b, c, d ].
        Long-term MSB list:  [ F, G, H ].
        Long-term LSB list:  [ E ].

The specification splits the short-term list in two: negative frames (those displayed before the current frame) and positive frames (those displayed after the current frame). In each list the frames are ordered by the distance from the current frame.

        Negative ST list: [ c, b, a ].
        Positive ST list: [ d ].

The bitstream encodes the delta of the offsets between the frames, minus 1, starting from the position of the current frame.

        Negative ST offsets: [ 2, 5, 6 ].
        Negative ST deltas:  [ 1, 2, 0 ].
        Positive ST offsets: [ 2 ].
        Positive ST deltas:  [ 1 ].

The long-term MSB list specifies both the MSB and LSB values of the frames. The MSB values are delta-coded like in the short-term list, without the "minus 1". The MSB offset 0 is legal, even for the first value of the list. The LSB values are encoded in fixed length. The LSB values are not ordered.

        LT MSB list: [ H(Delta 1, LSB 7), G(Delta 1, LSB 7), F(Delta 0, LSB 6) ].

The long-term LSB list specifies only the LSB value of the frames. The LSB values are not ordered. A frame can only appear in the LT LSB list if its LSB value isn't shared by any other frame in the DPB, because the identification would be ambiguous.

        LT LSB list: [ E(LSB 1) ].

For each frame in the RPS, a flag indicates if the frame is used in the reference list of the current frame, as opposed to just lingering in the DPB for future usage. The specification requires that a frame appears at most once in the lists above.

The decoder tracks the POC and status (short-term, long-term) of each frame in its DPB. When the current frame is fully decoded, it becomes part of the DPB and its status is set to "short-term". When the decoder receives the next frame, it reconciles the frame RPS and its DPB, as follows.

To simplify, suppose we create a new DPB buffer to store the frames to be kept. First, any frame in the old DPB whose POC matches a POC value specified in the LT MSB list, or any frame whose POC LSB value matches a LSB value specified in the LT LSB list, is marked "long-term" and moved to the new DPB. Then, any frame in the old DPB whose POC matches a POC value specified in the ST list and which is currently marked "short-term" is moved to the new DPB. The remaining frames in the old DPB are discarded.

The order of the operations matters. To paraphrase the algorithm above, the long-term frames are identified and marked as "long-term" regardless of their previous status. This step implicitly performs the promotion of short-term frames to long-term frames. The promotion to long-term is a one-way trip. Thus the frames that were previously marked "long-term" that are not matched in the first step are discarded because the short-term list may not refer to them. Then the remaining short-term frames are matched to the short-term list.
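
A sketch of the reconciliation, under the simplifications of the paraphrase above (frames matched by their full POC only; the LT LSB matching and the loss handling are omitted):

    enum { UNUSED, SHORT_TERM, LONG_TERM };

    typedef struct { int poc; int status; } dpb_frame;

    /* 'lt' and 'st' hold the POCs listed in the long-term and short-term
     * parts of the RPS. Frames left UNUSED are discarded afterwards. */
    static void reconcile_dpb(dpb_frame *dpb, int n,
                              const int *lt, int lt_n,
                              const int *st, int st_n)
    {
        for (int i = 0; i < n; i++) {
            int was_short_term = (dpb[i].status == SHORT_TERM);
            dpb[i].status = UNUSED;

            /* Long-term matching first: it may promote a short-term
             * frame, and the promotion is a one-way trip. */
            for (int j = 0; j < lt_n; j++)
                if (dpb[i].poc == lt[j])
                    dpb[i].status = LONG_TERM;

            /* Short-term matching only applies to frames that were
             * still marked short-term. */
            if (dpb[i].status == UNUSED && was_short_term)
                for (int j = 0; j < st_n; j++)
                    if (dpb[i].poc == st[j])
                        dpb[i].status = SHORT_TERM;
        }
    }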


The specification formally uses 5 lists to specify the RPS semantics described above. The lists are presented here for reference.

  • PocStCurrBefore: pictures both decoded and displayed prior to the current picture that may be used for inter picture prediction.
  • PocStCurrAfter: pictures decoded prior to the current picture, but displayed afterwards that may be used for inter picture prediction.
  • PocStFoll: pictures decoded prior to the current picture that are not used for inter picture prediction in the current picture, but that may be used for inter picture prediction in following pictures.
  • PocLtCurr: pictures marked as long-term reference pictures that may be used for inter picture prediction by the current picture.
  • PocLtFoll: pictures marked for long-term reference that are not used for inter picture prediction by the current picture, but that may be used for inter picture prediction in following pictures.

To better grasp the role of each list, consider the following example. To keep things simple, long-term references aren't used, but the same logic applies. The numerals inside the boxes indicate the coding order. POC values start at 0 (IDR0) and increase by 1 from left to right. The POC value is reset to 0 at every IDR picture.

                                                               |
  +--------------------------+  +------------------------+     |
  |                          |  |                        |     |
  |                          v  |                        v     |
+----+        +----+        +----+        +----+        +----+ | +----+
|IDR0|------->| B2 |<-------| P1 |------->| B6 |<-------| P5 | | |IDR9| ...
+----+        +----+        +----+        +----+        +----+ | +----+
      \      /      \      /      \      /      \      /       |
       +----+        +----+        +----+        +----+        |
       | B3 |        | B4 |        | B7 |        | B8 |        |
       +----+        +----+        +----+        +----+        |

The state of the DPB before each picture is encoded is described in the table below. Pictures in the DPB appear in ascending POC values. Furthermore, the content of each of the five lists described above is also shown.

  Picture | DPB             | PocStCurrBefore | PocStCurrAfter | PocStFoll | PocLtCurr | PocLtFoll
  --------+-----------------+-----------------+----------------+-----------+-----------+----------
  IDR0    | empty           | empty           | empty          | empty     | empty     | empty
  P1      | {IDR0}          | {IDR0}          | empty          | empty     | empty     | empty
  B2      | {IDR0,P1}       | {IDR0}          | {P1}           | empty     | empty     | empty
  B3      | {IDR0,B2,P1}    | {IDR0}          | {B2}           | {P1}      | empty     | empty
  B4      | {IDR0,B3,B2,P1} | {B2}            | {P1}           | empty     | empty     | empty
  P5      | {B2,B4,P1}      | {P1}            | empty          | empty     | empty     | empty
  B6      | {P1,P5}         | {P1}            | {P5}           | empty     | empty     | empty
  B7      | {P1,B6,P5}      | {P1}            | {B6}           | {P5}      | empty     | empty
  B8      | {P1,B7,B6,P5}   | {B6}            | {P5}           | empty     | empty     | empty
  IDR9    | {B6,B8,P5}      | empty           | empty          | empty     | empty     | empty

RPS signaling is present in every slice header, with the exception of IDR pictures. This ensures that the DPB is correctly flushed. Note that I pictures may provide RPS info, since they do not prevent future pictures from referencing pictures that precede them.

8.3.4: Decoding process for reference picture lists construction

Resume: build the final reference picture lists. Note that everything said about L0 applies to L1 as well, except when noted.

In "Rate-Distortion Optimized Reference Picture Management for High Efficiency Video Coding", the authors mention the use of CABAC to code the reference picture lists indices limit the usefulness of reordering the pictures in the reference picture lists.

        "As we know, coding of the reference picture index in HEVC         employs CABAC, which can adapt to the order of the reference         pictures in the reference picture lists. No matter how the         reference pictures are placed, entropy coder can always          allocate the most efficient bin string to the most useful         reference picture. Thus, changing the order of reference          pictures in reference picture list will not make much         difference to the distortion and the bitrate."
  • RefPicList0/1: final L0/L1 lists.
  • RefPicListTemp0/1: default lists used to build RefPicList0/1.
  • ref_pic_list_modification_flag_l0/l1: false to accept the default list as-is.
  • list_entry_l0/l1[]: array of indices in the default list. See below.
  • NumPocTotalCurr: number of frames used for reference by the current frame.
  • NbActiveL0 (a variable used in this text only):
    • Number of reference indices in L0 (num_ref_idx_l0_active_minus1+1).
    • NbActiveL0 may be greater than NumPocTotalCurr, e.g. when weighted prediction is used.
  • Build RefPicListTemp0:
    • RefPicListTemp0.Size = Max(NbActiveL0, NumPocTotalCurr).
    • RefPicListTemp0 contains, in order, the ST frames that precede the current frame in display order (those closest to the current frame first), the ST frames that follow the current frame in display order (those closest to the current frame first) and the LT frames. Call those frames the "frame pattern". The frame pattern is clipped/replicated as needed to fill up the default list.
  • Build RefPicListTemp1:
    • Same as above, except that the frames that follow the current frame in display order are added before the frames that precede the current frame in display order.
  • Build RefPicList0:
    • If ref_pic_list_modification_flag_l0/l1 is false, RefPicList0/1 is identical to RefPicListTemp0/1.
    • If ref_pic_list_modification_flag_l0/l1 is true, the values in RefPicListTemp0/1 are reordered according to the list_entry_l0/l1 array:
           RefPicList0[i] = RefPicListTemp0[list_entry_l0[i]]

      Note that list_entry_l0/l1[i] is expressed with the minimum number of bits required to represent NumPocTotalCurr-1 since the duplicated frames in RefPicListTemp0/1 are irrelevant.
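
A minimal C sketch of the whole L0 construction, assuming the three RPS lists are already available as plain arrays and that at least one reference frame exists (the function and parameter names are ours, not the spec's):

    /* st_before/st_after/lt hold the frames of PocStCurrBefore,
     * PocStCurrAfter and PocLtCurr, in RPS order. */
    void build_ref_pic_list0(const int *st_before, int nb_before,
                             const int *st_after, int nb_after,
                             const int *lt, int nb_lt,
                             int nb_active,   /* num_ref_idx_l0_active_minus1+1 */
                             int modif_flag,  /* ref_pic_list_modification_flag_l0 */
                             const int *list_entry, int *list0)
    {
        int temp[32], nb = 0;
        int total = nb_before + nb_after + nb_lt;   /* NumPocTotalCurr */
        int size = nb_active > total ? nb_active : total;

        /* Replicate the "frame pattern" until the default list is full. */
        while (nb < size) {
            for (int i = 0; i < nb_before && nb < size; i++) temp[nb++] = st_before[i];
            for (int i = 0; i < nb_after  && nb < size; i++) temp[nb++] = st_after[i];
            for (int i = 0; i < nb_lt     && nb < size; i++) temp[nb++] = lt[i];
        }

        /* Accept the default list as-is, or reorder it with list_entry_l0[]. */
        for (int i = 0; i < nb_active; i++)
            list0[i] = modif_flag ? temp[list_entry[i]] : temp[i];
    }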

8.4.1: General decoding process for coding units coded in intra prediction mode

Resume: predict the pixels as intra then decode them. The resulting values are samples before the deblocking filter is applied.

  1. Get the QP (see 8.6.1).
  2. Process luma:
    • If PCM: use the scaled PCM pixels.
               Scaled PCM pel = PCM pel << (BitDepth - PcmBitDepth)
                                              ^
                                              |
                                  The "working" bit depth has
                                  to be greater or equal to
                                  the raw coding bit depth.
    • Else:
      • Set the number of luma blocks (1 (unsplit, i.e. 2Nx2N) or 4 (split, i.e. NxN)) according to IntraSplitFlag (false = unsplit, true = split).

        Note on intra coding: the size of the prediction block (PB) covers exactly one signalled partition in the bitstream. Therefore, a 64x64 intra PB is possible. The splitting rule indicates that only the smallest size CBs allowed in the stream (see the SPS syntax elements) can be split into 4 PBs. This means that a 32x32 CB using intra coding cannot be split into 4 16x16 PBs when 16x16 CBs are allowed. In this example, the partition size would be 32x32, and a 32x32 PB would cover that partition.

        Moreover, partitions can be split into more than one transform block (TB). The splitting rules for PBs and TBs are independent. For instance, 64x64 TBs don't exist: a 64x64 PB leads to at least 4 32x32 TBs. This is important to understand. Although 64x64 PBs exist, intra prediction will not predict the entire region as one big chunk. Rather, the prediction will be split into 4 32x32 regions (remember, 64x64 transforms don't exist). For the sake of clarity, we will refer to these regions as intra blocks (IBs).

        When a 64x64 intra PB is signalled, the encoder is essentially telling the decoder to use the same prediction mode for the 4 IBs it carries. The decoder first predicts the samples for the first TB, recovers the residual values, reverses the transform, then reconstructs the samples. These samples are required to predict the remaining IBs. Following the z-scan order, the decoder will predict, decode, and reconstruct the samples so they are available for the next blocks. Therefore, 64x64 intra PBs exist, but the actual maximum size of an intra predicted block is 32x32.

      • For each block:
        • Get the luma prediction mode (see 8.4.2).
        • Decode the pixels (see 8.4.4.1).
  3. Process chroma:
    • If PCM: use the PCM pixels as above, but with chroma bit depth for scaling.
    • Else:
      • There is only one chroma block in 4:2:0 (IntraSplitFlag is unused).
      • Get the chroma prediction mode (see 8.4.3).
      • Decode the pixels for each component (see 8.4.4.1).

8.4.2: Derivation process for luma intra prediction mode

Resume: get the luma intra prediction mode.

               "Due to the large number of intra prediction                modes, H.264/AVC-like mode coding approach                based on a single most probable mode was                not found effective in HEVC. Instead, HEVC                defines three most probable modes for each                PU based on the modes for the neighbouring                PUs."                              - Intra Coding of the HEVC Standard
  Intra prediction modes    Associated names
  0                         Planar
  1                         DC
  2..34                     Angular
  1. Get the prediction mode candIntraPredModeN of the left and top neighbours:
    • The left neighbour (-1, 0) is called A.
    • The top neighbour (0, -1) is called B.
    • For each neighbour N (A and B):
      1. Determine the availability according to 6.4.1.
      2. If N is unavailable, or inter, or PCM or above the current CTB, then candIntraPredModeN is set to DC.
      3. Else (the neighbour uses intra), candIntraPredModeN uses the same intra mode as the neighbouring candidate.
  2. Build the list of mode candidates candModeList (holds 3 candidates):
    • If candIntraPredModeA == candIntraPredModeB (duplicate mode):
      • If candIntraPredModeA is Planar or DC: candModeList = [ Planar, DC, Vertical(26) ].
      • Else:
        • candModeList[0] = candIntraPredModeA.
        • candModeList[1] = 2 + (candIntraPredModeA + 29) % 32.
        • candModeList[2] = 2 + (candIntraPredModeA - 1) % 32.

          In other words, the neighbour mode and its adjacent angular modes.

    • Else:
      • candModeList[0] = candIntraPredModeA.
      • candModeList[1] = candIntraPredModeB.
      • candModeList[2] = the first mode in [Planar, DC, Vertical(26)] that is not a duplicate of the two previous entries.
  3. Get the prediction mode IntraPredMode:
    • If prev_intra_luma_pred_flag, then IntraPredMode = candModeList[mpm_idx], where mpm_idx is read from the bitstream.

      Basically, this means that the intra prediction mode used by the current block is one of the three predicted modes. The mpm_idx simply tells us which mode to use.

    • Else:
      • IntraPredMode = rem_intra_luma_pred_mode.
      • Skip the entries that could have been selected by mpm_idx:
        1. Sort candModeList by increasing values.
        2. For (i = 0; i < 3; i++): If IntraPredMode >= candModeList[i]: IntraPredMode++.

      When the intra prediction mode is not one of the predicted modes (i.e. it is one of the 32 remaining modes), the value is signaled using 5 bits (the modes outside the predicted ones are found to be uniformly distributed). However, a signaled value of 2, for instance, does not necessarily map to mode 2. The meaning depends on the content of candModeList. The last "for loop" does just that. Suppose that candModeList holds the values 2, 4, and 8. If IntraPredMode is 0, then it actually means 0, as all the values in the list are greater than 0. If IntraPredMode is 3, then it means 5: it is first bumped past 2, and the result (4) is bumped again past 4. If IntraPredMode is 7, then it means 10, as it gets bumped past 2, 4 and 8. In general, 3 is added to any signaled value of 6 or more, since by that point all three candidates have been skipped.

      Sorting is actually required here because of the way the values are "bumped". The idea is that the resulting value cannot belong to candModeList. Without sorting the values in candModeList, this cannot be guaranteed. For instance, consider that candModeList holds [4, 2, 6], and that IntraPredMode is 3. Since IntraPredMode is less than 4, don't increment IntraPredMode, and move to the next candidate. Since IntraPredMode is greater than 2, increment IntraPredMode, and move to the last candidate. Since IntraPredMode (now 4) is less than 6, don't increment IntraPredMode. The final result is that IntraPredMode is 4, a value in candModeList, which is the wrong result. If the list had been sorted, the result would have been 5, the correct result.
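
A minimal C sketch of the derivation, using the mode numbers from the table above (the function names are ours; decode_rem_mode expects the sorted candidate list):

    #define PLANAR 0
    #define DC     1
    #define VER    26

    void build_cand_mode_list(int cand_a, int cand_b, int list[3])
    {
        if (cand_a == cand_b) {
            if (cand_a < 2) {                      /* Planar or DC */
                list[0] = PLANAR; list[1] = DC; list[2] = VER;
            } else {                               /* angular: A and its two
                                                      adjacent angular modes */
                list[0] = cand_a;
                list[1] = 2 + (cand_a + 29) % 32;
                list[2] = 2 + (cand_a - 1) % 32;
            }
        } else {
            list[0] = cand_a;
            list[1] = cand_b;
            /* First of Planar, DC, Vertical that is not a duplicate. */
            if (list[0] != PLANAR && list[1] != PLANAR)  list[2] = PLANAR;
            else if (list[0] != DC && list[1] != DC)     list[2] = DC;
            else                                         list[2] = VER;
        }
    }

    /* Map the signalled remainder to the actual mode (list must be sorted
     * in increasing order, as explained above). */
    int decode_rem_mode(int rem, const int sorted_list[3])
    {
        for (int i = 0; i < 3; i++)
            if (rem >= sorted_list[i]) rem++;
        return rem;
    }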

8.4.3: Derivation process for chroma intra prediction mode

Resume: determine the chroma prediction mode.

  • IntraPredModeC depends on IntraPredMode, using intra_chroma_pred_mode as a filter.
  • See the spec for details, this is simple.

8.4.4.1: General decoding process for intra blocks

Resume: predict and reconstruct (prior to the deblocking filter) the block for the current image component.

  • Determine the value of splitFlag:
    • If luma: splitFlag = split_transform_flag at the current depth in the bitstream.
    • Else if chroma && split_transform_flag at the current depth in the bitstream is equal to 1 && chroma block size == 4x4 (no 2x2 blocks allowed): splitFlag = 1.
    • Else: splitFlag = 0.
  • If splitFlag is 1, call 8.4.4.1 recursively for each of the four blocks (divide the current block size by 2).
  • Else:
    • Predict the intra pixels of the block (see 8.4.4.2.1).
    • Do the dequantization and inverse transformation (see 8.6.2).
    • Reconstruct the intra pixels of the block (see 8.6.5).

8.4.4.2.1: General intra sample prediction

Resume: generate the intra prediction.

The figure below is adapted from "Intra Coding of the HEVC Standard".

   |         |        |        |     |        |          |     |        |
 --+---------+--------+--------+-----+--------+----------+-----+--------+--
   | R(0,0)  | R(1,0) | R(2,0) | ... | R(N,0) | R(N+1,0) | ... | R(2N,0)|
 --+---------+--------+--------+-----+--------+----------+-----+--------+--
   | R(0,1)  | P(1,1) | P(2,1) | ... | P(N,1) |
   | R(0,2)  | P(1,2) | P(2,2) | ... | P(N,2) |
   |   .     |    .   |    .   |     |        |
   |   .     |    .   |        |  .  |        |
   |   .     |    .   |        |     |    .   |
   | R(0,N)  | P(1,N) |   ...  |     | P(N,N) |
 --+---------+--------+--------+-----+--------+--
   | R(0,N+1)|
   |    .    |
   |    .    |
   |    .    |
   |         |
   | R(0,2N) |
 --+---------+--
   |         |

The R(x,y) samples denote reference samples from neighbouring blocks, while the P(x,y) samples denote the predicted samples this subclause returns. The use of 1-based indexing in P is for illustrative purposes.

  • For each neighbour pixel R(x,y):
    • If the block containing R(x,y) is not available according to 6.4.1 or (the block is not intra and constrained_intra_pred_flag == 1):
      • Mark the neighbour pixel as unavailable.
  • If any neighbour pixels R(x,y) are unavailable, replace them (see 8.4.4.2.2).
  • If the prediction is luma, filter the neighbour pixels according to 8.4.4.2.3.
  • Do the prediction according to IntraPredMode:
    1. When IntraPredMode == 0, use Planar (see 8.4.4.2.4).
    2. When IntraPredMode == 1, use DC (see 8.4.4.2.5).
    3. When 2 <= IntraPredMode <= 34, use Angular(IntraPredMode) (see 8.4.4.2.6).

8.4.4.2.2: Reference sample substitution process for intra sample prediction

Resume: replace the unavailable intra neighbour pixels. For clarity, refer to the figure in the previous section. Note that the R(x,y) samples refer to the p[x][y] samples in the spec, when x and/or y are negative.

  • If all the R(x,y) pixels are missing, assign 1 << (bitDepth - 1) to all of them, and return.
  • If at least one pixel is available:
    1. If R(0,2N) is unavailable, search for the first available neighbouring sample R(x,y) and copy its value to R(0,2N). The search starts from R(0,2N) and moves towards R(0,0), then continues towards R(2N,0).
    2. Start at R(0,2N) and make your way to R(0,0). R(0,2N) is sure to hold a value because of the previous step. Every time an unavailable sample R(0,y) is found, assign it the value used at R(0,y+1).
    3. Start at R(0,0) and make your way to R(2N,0). R(0,0) is sure to hold a value because of the previous step. Every time an unavailable sample R(x,0) is found, assign it the value used at R(x-1,0).
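
A minimal C sketch of the substitution, assuming the 4N+1 neighbours are stored in a single array ordered from R(0,2N) up the left column to R(0,0), then along the top row to R(2N,0), which matches the scan order of the three steps (the layout is ours):

    void substitute_refs(int *ref, const int *avail, int n, int bit_depth)
    {
        int count = 4 * n + 1;   /* R(0,2N)..R(0,0)..R(2N,0) */

        /* All missing: use mid-gray and return. */
        int any = 0;
        for (int i = 0; i < count; i++) any |= avail[i];
        if (!any) {
            for (int i = 0; i < count; i++) ref[i] = 1 << (bit_depth - 1);
            return;
        }

        /* Step 1: seed ref[0] (= R(0,2N)) with the first available sample
         * found along the scan order. */
        if (!avail[0]) {
            int i = 1;
            while (!avail[i]) i++;
            ref[0] = ref[i];
        }

        /* Steps 2 and 3: propagate the previous sample into every hole. */
        for (int i = 1; i < count; i++)
            if (!avail[i]) ref[i] = ref[i - 1];
    }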

8.4.4.2.3: Filtering process of neighbouring samples

Resume: filter the neighbour pixels R(x,y).

Essentially, the filter smooths out the variations in the neighbour pixels. Refer to the spec for details, the filter is too complex to be fully specified here.

8.4.4.2.4: Specification of Intra_Planar (0) prediction mode

Resume: do bilinear interpolation between the edges.

Because the blocks to the right and below the current one are not yet coded, the neighbouring samples R(N+1,0) and R(0,N+1) are repeated along the rightmost column and bottom row, respectively. The prediction samples P(x,y) are obtained as follows:

P(x,y) = (dL*R(0,y) + dR*R(N+1,0) +
          dT*R(x,0) + dB*R(0,N+1) + PB size) >> (log2(PB size) + 1)

where dL is the distance between P(x,y) and the rightmost column, dR is the distance between P(x,y) and R(0,y), dT is the distance between P(x,y) and the neighbouring bottom row, and dB is the distance between P(x,y) and R(x,0). Essentially, given a transform block size, dR is the complement of dL, and dT is the complement of dB. The block size is added to the sum of the four weighted samples for rounding purposes. The right shift by log2(PB size) + 1 averages the weighted samples into the final value assigned to P(x,y).
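
A minimal C sketch of the interpolation, assuming left[y] holds R(0,y+1), top[x] holds R(x+1,0), tr is R(N+1,0) and bl is R(0,N+1) (the array layout is ours):

    void intra_planar(int *pred, const int *left, const int *top,
                      int tr, int bl, int n, int log2n)
    {
        for (int y = 0; y < n; y++)
            for (int x = 0; x < n; x++) {
                int dl = n - 1 - x;   /* distance to the rightmost column */
                int dr = x + 1;       /* distance to R(0,y)               */
                int dt = n - 1 - y;   /* distance to the bottom row       */
                int db = y + 1;       /* distance to R(x,0)               */
                pred[y * n + x] = (dl * left[y] + dr * tr +
                                   dt * top[x] + db * bl + n) >> (log2n + 1);
            }
    }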

8.4.4.2.5: Specification of Intra_DC (1) prediction mode

Resume: set the average pixel value of the neighbours for the whole block.

            "[...] DC and angular prediction modes may introduce             discontinuities along block boundaries. To remedy             this problem, the first prediction row and column             are filtered in the case of DC prediction with a             two-tap finite impulse response filter (corner             sample with a three-tap filter)."                           - Intra Coding of the HEVC Standard
  1. dcSum = sum of the neighbour samples R(x,y) excluding R(0,0).
  2. dcVal = (dcSum + PB size) >> (log2(PB size)+1).
  3. If luma && PB size < 32: (boundary smoothing)
    • Smooth out the pixels next to the neighbour pixels, as follows:

                        top
                 +----+       A, i.e. P(1,1), is smoothed from top and left.
                l|ABBB|       B, i.e. P(2..N,1), are smoothed from top.
                e|CXXX|       C, i.e. P(1,2..N), are smoothed from left.
                f|CXXX|       X are not smoothed.
                t|CXXX|
                 +----+
    • P(1,1) = (R(0,1) + 2*dcVal + R(1,0) + 2) >> 2. (three-tap filter)
    • P(2..N,1) = (R(2..N,0) + 3*dcVal + 2) >> 2. (two-tap filter)
    • P(1,2..N) = (R(0,2..N) + 3*dcVal + 2) >> 2. (two-tap filter)
    • P(2..N,2..N) = dcVal.
  4. Else: P(x,y) = dcVal for the entire prediction block.
               "As the prediction of chroma components tends to be             very smooth, the benefits of the boundary smoothing            would be limited. Thus, in order to avoid extra            processing with marginal quality improvements, the            prediction boundary smoothing is only applied to            luma component."                      - Intra Coding of the HEVC Standard

8.4.4.2.6: Specification of Intra_Angular (2..34) prediction mode

Resume: interpolate the neighbour pixels in one direction.

The spec uses the array ref[], which initially holds a copy of either R(0,0) to R(N,0), or R(0,0) to R(0,N).

  1. If predModeIntra is equal to or greater than 18, the array ref[] refers to the neighbouring samples R(0,0) to R(N,0).
  2. If the prediction angle is less than 0 (see Table 8-4 in the spec, which maps the prediction mode to the prediction angle), the reference sample array ref[] may need to be extended using a projection:
                     ref[x] = R(0,-1+((x*invAngle + 128)>>8))
                   x = [-1 ... (TB size * intraPredAngle)>>5]

    Figure 3 in "Intra Coding of the HEVC Standard" clearly illustrates what the above projection does.

  3. If no projection is needed, R(N+1,0) to R(2N,0) is appended to ref[].
                 B range
            <----------->      X and the lines show the prediction directions.
                               There are four regions: A, B, C, D.
          A&C range
          /------>            A (up-left) is symmetric with C (left-up).
       ^  |+-----+-----+      B (up-right) is symmetric with D (left-down).
       |  ||\\A|B/|
       |  ||C\\|/ |            Case B:
      D|  ||--X  |              ref[] contains all the pixels to the top, and
       |  ||D/   |              some pixels to the top-right (B range). There
      r|  ||/    |              are more pixels from top-right as the angle
      a|  v+-----+              increases. There are no pixels from top-right
      n|   |                    for angle 0 (pure vertical).
      g|   |
      e|   |                  Case A:
       |   |                     ref[] contains all the pixels to the top, and
       |   |                     some pixels from the left (A&C range). The
       v   +                     left pixels are projected to the top line
                                 using invAngle.
  4. Derive the prediction values P(x,y) using the following:
    • iIdx = ((y + 1) * intraPredAngle) >> 5
    • iFact = ((y + 1) * intraPredAngle) & 31
    • If iFact > 0
      P(x,y) = ((32 - iFact) * ref[x+iIdx+1] + iFact * ref[x+iIdx+2] + 16) >> 5

      Here, iFact acts as a weighting factor in the two-tap filter.

    • If iFact == 0 (the projection is perfectly horizontal or vertical)
      P(x,y) = ref[x+iIdx+1]
    • Apply clipping to the predicted values of the first column (only when intraPredMode == 26, luma is being coded and the block size is less than 32)
      P(x,y) = clip(R(1,0) + ((R(0,y+1) - R(0,0)) >> 1))

      where the clip function makes sure the result is in the range supported by the current luma bit depth.

  5. If predModeIntra is less than 18 (but greater or equal to 2), everything above applies, but the ref[] array is populated with R(0,0) to R(0,2N). Negative indices will also need a projection. Additionally, switch x with y in the equations above. For instance, ref[x+iIdx+1] becomes ref[y+iIdx+1].
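
A minimal C sketch of the interpolation loop for the vertical family (modes 18..34), assuming ref points at the R(0,0) entry of a buffer already prepared by steps 1-3, so that negative indices are valid when the angle is negative:

    void intra_angular_vertical(int *pred, const int *ref, int angle, int n)
    {
        for (int y = 0; y < n; y++) {
            int idx  = ((y + 1) * angle) >> 5;   /* integer step  */
            int fact = ((y + 1) * angle) & 31;   /* 1/32 fraction */
            for (int x = 0; x < n; x++)
                if (fact)   /* two-tap filter between adjacent references */
                    pred[y * n + x] = ((32 - fact) * ref[x + idx + 1] +
                                       fact * ref[x + idx + 2] + 16) >> 5;
                else        /* the direction falls exactly on a sample */
                    pred[y * n + x] = ref[x + idx + 1];
        }
    }

For the horizontal family (modes 2..17), swap x and y as described in step 5.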

8.5.1: General decoding process for coding units coded in inter prediction mode

Resume: predict the pixels as inter then decode them (same as intra).

8.5.2: Inter prediction process

Resume: split the CB in blocks according to the partitioning mode and decode each block according to 8.5.3 (next section).

The numerals below indicate the partition indices (partIdx) associated to the various block types.

   2Nx2N          Nx2N           2NxN           2NxnU          2NxnD          nLx2N          nRx2N          NxN
 +----+----+    +----+----+    +----+----+    +----+----+    +----+----+    +--+------+    +------+--+    +----+----+
 |0        |    |0   |1   |    |0        |    |0        |    |0        |    |0 |1     |    |0     |1 |    |0   |1   |
 |         |    |    |    |    |         |    +---------+    |         |    |  |      |    |      |  |    |    |    |
 +         +    +    |    +    +---------+    |1        |    |         |    +  |      +    +      |  +    +----+----+
 |         |    |    |    |    |1        |    |         |    +---------+    |  |      |    |      |  |    |2   |3   |
 |         |    |    |    |    |         |    |         |    |1        |    |  |      |    |      |  |    |    |    |
 +----+----+    +----+----+    +----+----+    +----+----+    +----+----+    +--+------+    +------+--+    +----+----+

8.5.3: Decoding process for prediction units in inter prediction mode

Resume: predict the MV/RefIdx (8.5.3.2) and do motion compensation (8.5.3.3).

8.5.3.2: Derivation process for motion vector components and reference indices

Resume: predict the MV/RefIdx of the current block.

  • If the MV prediction mode is merge, do merge mode (8.5.3.2.1) and return.
  • Check the value of inter_pred_idc (L0, L1, Bi):
    • Set predFlagLX accordingly.
    • Set refIdxLX accordingly:
      • (-1) if the reference list is unused.
      • Bitstream value otherwise.
  • If a list is used:
    • Predict the MV for that list (8.5.3.2.5).
    • Add the motion vector difference (bitstream value).
    • Clip it to 16 bits.
  • If chroma is used, derive motion vectors according to 8.5.3.2.9.

8.5.3.2.1: Derivation process for luma motion vectors for merge mode

Resume: predict the MV/RefIdx in merge mode of each reference list.

  • singleMCLFlag = (log2_parallel_merge_level_minus2 != 0 and CB size is 8x8).
    If true, the merging list is obtained as if the CB was not split in blocks (xP set to xC -- see below).
  • Generate a list of candidates and select one, as follows.
    1. Process the 5 spatial neighbour blocks (8.5.3.2.2).
    2. Set the refIdx of the temporal block 'Col' to 0 for each list.
    3. Get the MV/RefIdx of the temporal block 'Col' in each list (8.5.3.2.7):
      • availableFlagCol = 1 if the L0 or L1 reference index is available.
      • predFlagLXCol = 1 if the LX reference index is available.
      • In other words, for each list, use reference index 0 if a temporal motion vector can be predicted for that list and reference index.
    4. Build the merging candidate list mergeCandList:
      • In order and if available: A1, B1, B0, A0, B2, Col.
    5. Set numMergeCand and numOrigMergeCand to the size of mergeCandList.
    6. For B slices, add B candidates (8.5.3.2.3).
      • The number of B candidates is stored in numCombMergeCand and numMergeCand is updated accordingly.
      • The B candidates, if any, are named combCandK, with K in [0, numCombMergeCand - 1].
    7. Pad mergeCandList with null motion vectors until numMergeCand >= MaxNumMergeCand (8.5.3.2.4). The variables are numZeroMergeCand and zeroCandM with M in [0, numZeroMergeCand - 1] as above.
    8. Select the candidate mergeCandList[merge_idx] and assign its values to mvLX, refIdxLX, predFlagLX.
    9. If the block is 8x4 or 4x8 and the chosen candidate uses both L0 and L1, set refIdxL1 = -1 and predFlagL1 = 0 since bi-prediction is not allowed for those block sizes.

8.5.3.2.2: Derivation process for spatial merging candidates

Resume: add the spatial neighbour blocks A0, A1, B0, B1, B2.

  • If the current block is adjacent to the top boundary of the CTB:
    • REMOVED FROM DRAFT 9. Kept for reference in case they reintroduce.
    • For the neighbours B0, B1, B2:
      • Set their X coordinate as follows: X = ((X>>3)<<3) + ((X>>3)&1)*7.
      • This is used to reduce the memory requirements by "converting" neighbour 4x8 blocks into 8x8 blocks. Example:
      00  04  08  12  16  20  24  28  32
      | A | B | C | D | E | F | G | H |  <== 4x8 neighbour blocks.
      +---+---+---+---+---+---+---+---+  <== CTB boundary.
      | A | A | D | D | E | E | H | H |  <== Neighbours as seen from the
                                             current CTB.

  • Process each neighbour N (A0, A1, B0, B1, B2):
    • Set the neighbour availability according to 6.4.2 (availableN).
    • If singleMCLFlag == 0 and partIdx == 1 (the current block is second block of the CB):
      • If the partitioning mode is one of H1, H2, H3:
        • availableB1 = 0 (because B1 is the first block of the CB).
      • If the partitioning mode is one of V1, V2, V3:
        • availableA1 = 0 (because A1 is the first block of the CB).
      • This is done because it doesn't make sense to use the first block for prediction since the partitioning mode implies that both blocks use a different MV or RefIdx.
      • The singleMCLFlag condition is needed because both blocks share the same merging list if the flag is true.
    • availableFlagN = availableN.
    • Handle log2_parallel_merge_level_minus2:
      • Let PML = log2_parallel_merge_level_minus2+2. This is the parallel merge level.
        • The level represents a block size: 4x4 (0), 8x8 (1), and so on.
      • If (xP>>PML, yP>>PML) == (xN>>PML, yN>>PML): availableFlagN = 0.
      • In other words, if the current block and the neighbour share the same "aligned" block of size (1<<PML)x(1<<PML), the neighbour is ignored. Example: if log2_parallel_merge_level_minus2 >= 1, block D below ignores the neighbours A, B, C.
                 E
               +--+
               |AB|
               |CD|  Block A does not ignore the neighbour above it (E) since that
               +--+  neighbour is not in the same (aligned) 8x8 block.
      • Presumably this is done so that A, B, C, D can be decoded in parallel.
        • Note that all spatial neighbours may become unavailable in this way.
  • For the following pairs of blocks, if the blocks are duplicates, mark the first as unavailable (in availableFlagN). A block is a duplicate if it contains the same RefIdx and MV as the other block:
    • (A0, A1).
    • (B1, A1).
    • (B2, A1).
    • (B0, B1).
    • (B2, B1).
  • Note that the logic above is not sufficient to eliminate all duplicates (e.g. A1=0, B1=1, B0=0).
    • This is done to reduce the computations required.
  • If A0, A1, B0, B1 are available:
    • availableFlagB2 = 0 (max four neighbours).

8.5.3.2.3: Derivation process for combined bi-predictive merging candidates

Resume: combine the L0 RefMv of some candidates with the L1 RefMv of other candidates (RefMv = reference index + motion vector).

  • Return if there are less than 2 candidates or the maximum number of merge candidates has already been generated.
  • Try to combine each candidate with every other candidate:
    • NbCombo = numOrigMergeCand*(numOrigMergeCand-1).
  • For (combIdx = 0; combIdx < NbCombo && numMergeCand != MaxNumMergeCand; combIdx++):
    • LookupTable[4*3] = [ (0,1), (1,0), (0,2), (2,0), (1,2), (2,1), (0,3), (3,0), (1,3), (3,1), (2,3), (3,2) ]
    • comboPair = LookupTable[combIdx].
    • l0Cand = mergeCandList[comboPair[0]].
    • l1Cand = mergeCandList[comboPair[1]].
    • If l0Cand uses list 0 && l1Cand uses list 1 && (l0Cand.RefIdxL0.Frame != l1Cand.RefIdxL1.Frame || l0Cand.Mv != l1Cand.Mv):
      • Add candidate combCandK to mergeCandList:
        • combCandK.RefMvL0 = l0Cand.RefMvL0.
        • combCandK.RefMvL1 = l1Cand.RefMvL1.

8.5.3.2.4: Derivation process for zero motion vector merging candidates

Resume: pad mergeCandList with (0,0) MVs and different reference indices.

  • If P slice: numRefIdx = L0.Size.
  • If B slice: numRefIdx = Min(L0.Size, L1.Size).
  • For (i = 0; numMergeCand < MaxNumMergeCand; i++, numMergeCand++):
    • zeroCandM.MvL0 = zeroCandM.MvL1 = (0,0).
    • refIdx = (i < numRefIdx) ? i : 0.
    • zeroCandM.RefIdxL0 = refIdx.
    • zeroCandM.RefIdxL1 = (is P slice) ? -1 : refIdx.

8.5.3.2.5: Derivation process for luma motion vector prediction

Resume: predict the motion vector for refIdxLX in list X of the current block.

  • Add the following candidates to mvpListLX, in order, until there are two candidates in the list:
    • Spatial neighbour A, if available (8.5.3.2.6).
    • Spatial neighbour B, if available and not a duplicate (8.5.3.2.6).
    • Temporal neighbour Col, if available (8.5.3.2.7).
    • (0,0).
    • (0,0).
  • mvpLX = mvpListLX[mvp_lX_flag].

8.5.3.2.6: Derivation process for motion vector predictor candidates

Resume: get the predicted motion vector of the spatial neighbours A and B.

  • refIdxLX: reference index of the current block, i.e. the index in list X for which a motion vector is predicted.
  • mvLXA/mvLXB: output motion vectors.
  • Notation:
    • RefType is the reference frame type (short-term or long-term) of the reference frame associated to a reference index.
    • LY is the complement of LX (list X), as follows:
      • LX == 0 => LY = 1.
      • LX == 1 => LY = 0.
  • Although the motion vector is predicted for refIdxLX in list X, LY (the complement list) is accessed in the neighbour blocks.
  • The MV for neighbour A is predicted, then the MV for neighbour B is predicted based on the information found in A, as follows.
  • Process neighbour A:
    1. AvailableFlagLXA = 0.
    2. Get the availability of A0 and A1 (6.4.2).
    3. isScaledFlagLX = AvailableA0 || AvailableA1 (see below).
    4. For (k = 0; k < 2 && !AvailableFlagLXA; k++) (first to match, if any):
      • Let Ak be the neighbour A0 or A1.
      • If AvailableAk && Ak.PredMode (the neighbour is available and inter):
        • If Ak.PredFlagLX && Ak.RefIdxLX == refIdxLX (the neighbour uses the same list and the neighbour index is refIdx):
          • AvailableFlagLXA = 1.
          • mvLXA = Ak.MvLX.
        • Else if Ak.PredFlagLY && Ak.RefIdxLY.Frame == refIdxLX.Frame (the neighbour uses the complement list and the neighbour index refers to the same frame referred to by refIdxLX):
          • AvailableFlagLXA = 1.
          • mvLXA = Ak.MvLY.
    5. For (k = 0; k < 2 && !AvailableFlagLXA; k++) (first to match, if any):
      • If AvailableAk && Ak.PredMode (the neighbour is available and inter):
        • If Ak.PredFlagLX && Ak.RefIdxLX.RefType == refIdxLX.RefType (the neighbour uses the same list and both the current and the neighbour indices refer to the same reference frame type):
          • AvailableFlagLXA = 1.
          • mvLXA = Ak.MvLX.
          • refIdxA = Ak.RefIdxLX (see below).
        • Else if Ak.PredFlagLY && Ak.RefIdxLY.RefType == refIdxLX.RefType (the neighbour uses the complement list and both the current and the neighbour indices refer to the same reference frame type):
          • AvailableFlagLXA = 1.
          • mvLXA = Ak.MvLY.
          • refIdxA = Ak.RefIdxLY.
        • If AvailableFlagLXA && refIdxLX.RefType == ShortTerm (the match was for short-term frames):
          • Map (scale) the motion vector of the neighbour reference frame to the current reference frame. Example:
                        -2     -1      0 <= POC
                         F      E      C      C: current frame.
                         |      |      |
                         |      |<==E==|      E: reference frame used by current block.
                         |<=====|===F==|      F: reference frame used by neighbour block.
                         |      |      |      Distance scale factor (DSF): E/F = 1/2.
          • td = ClipToSignedByte(PicOrderCntVal - refIdxA.POC).
          • tb = ClipToSignedByte(PicOrderCntVal - refIdxLX.POC).
          • tx = (16384 + (|td|>>1)) / td.
          • distScaleFactor = Clip3(-4096, 4095, (tb*tx + 32)>>6).
          • tmp = distScaleFactor*mvLXA (for each MV component).
          • mvLXA = ClipMV(SignOf(tmp) * ((|tmp| + 127)>>8)).
          • Notes:
            • tb/td is the theoretical distance scale factor (a C sketch of this scaling follows at the end of this subclause).
            • Since divisions are costly to compute, the division is emulated by shift operations and a precomputed table (the tx table).
            • The signs of tb and td are independent. The motion vector is reversed when the signs are opposite.
  • Process neighbour B:
    • Apply steps 1), 2), 4) above to B0, B1, B2 (with k < 3 in the for loop).
    • If !isScaledFlagLX (A0 and A1 are not available):
      • If availableFlagLXB (B is available):
        • mvLXA = mvLXB (assign the B neighbour data to A).
        • availableFlagLXA = 1.
        • availableFlagLXB = 0 (try to generate another MV for B).
      • Apply step 5) to B0, B1, B2.
    • Else:
      • Void, presumably to cut down on the processing time (use what was already found, if any).
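
A minimal C sketch of the short-term MV scaling of step 5 (the clipping helper is spelled out inline; HM replaces the division by a table lookup):

    #include <stdlib.h>   /* abs() */

    static int clip3(int lo, int hi, int v) { return v < lo ? lo : v > hi ? hi : v; }

    /* The two POC differences are computed from the current picture
     * (PicOrderCntVal), as in the td/tb formulas above. */
    void scale_mv(int mv[2], int cur_poc, int cur_ref_poc, int neigh_ref_poc)
    {
        int td = clip3(-128, 127, cur_poc - neigh_ref_poc);
        int tb = clip3(-128, 127, cur_poc - cur_ref_poc);
        int tx = (16384 + (abs(td) >> 1)) / td;        /* emulated 1/td */
        int dsf = clip3(-4096, 4095, (tb * tx + 32) >> 6);

        for (int i = 0; i < 2; i++) {                  /* both components */
            int tmp = dsf * mv[i];
            mv[i] = clip3(-32768, 32767,
                          (tmp >= 0 ? 1 : -1) * ((abs(tmp) + 127) >> 8));
        }
    }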

8.5.3.2.7: Derivation process for temporal luma motion vector prediction

Resume: get the predicted motion vector of the temporal neighbour Col.

  • Return if slice_temporal_mvp_enable_flag is zero (no temporal prediction).
  • Determine which reference frame is the temporal frame:
    • B slices use RefPicList1 when collocated_from_l0_flag is false, and RefPicList0 when collocated_from_l0_flag is true.
    • P slices always use RefPicList0.
    • colPic = the selected temporal frame (given by collocated_ref_idx).
  • Check if the colocated bottom right MV is available:
    • colPb is the variable containing the collocated block location in the temporal frame. The collocated block can be in two locations with respect to the current block:
      +----------+                    Top:          (xPb,yPb)
      | +------+ |                    Middle:       (xPb+nPbW>>1, yPb+nPbH>>1).
      | |Middle| | <== Current block  Bottom-right: (xPb+nPbW, yPb+nPbH).
      | +------+ |
      +----------+------------+
                 |Bottom-right|
                 +------------+
    • If the bottom right location is outside the current CTB vertically or outside the picture, colPb is set to "unavailable".
                "Additionally, to tackle unavailability of the MV            derived from position C0 [Bottom-right] and to            enable CTU-aligned motion data compression, storage,            and fetch, the position used to derive the TMVP            [Temporal Motion Vector Predictor] is kept inside            the row of processed CTUs."                    -Block Merging for Quadtree-Based Partitioning in HEVC
    • Else, the bottom right location (aligned to the 16x16 motion storage grid) is used to fetch the colocated MV at that location (8.5.3.2.8).
  • If the bottom-right candidate is "unavailable", fetch the colocated MV (8.5.3.2.8) using the middle of the current location to derive the temporal candidate.
    • Note that the location is aligned to the 16x16 grid, i.e. ((xPb+(nPbW>>1))>>4)<<4 and ((yPb+(nPbH>>1))>>4)<<4.

8.5.3.2.8: Derivation process for collocated motion vectors

Resume: get the predicted motion vector of the temporal neighbour Col.

  • Return if colPb is intra (no colocated prediction).
  • Determine the reference list and reference index associated to the motion vector in the collocated block:
    • The collocated block can contain one or two motion vectors (single prediction or bi-prediction). The reference list used by the collocated block may differ from the reference list of the current block.
    • If the collocated block is single-predicted (either only List0 or only List1):
      • listCol = list used by the collocated block.
      • refIdxCol = reference index in listCol.
      • mvCol = motion vector associated to listCol in the collocated block.
    • Else (the collocated block is bi-predicted):
      • If the POC of all reference frames in the current slice (L0+L1) is less than or equal to PicOrderCntVal:
        • Choose the list associated to refIdxLX. In other words, if the motion vector is predicted for an index in L0, choose the collocated MV from L0. Update listCol, refIdxCol, and mvCol.
      • Else:
        • If collocated_from_l0_flag is true, choose L1, otherwise choose L0.
        • In other words, choose list collocated_from_l0_flag.
    • If refIdxCol.RefType != refIdxLX.RefType (not the same reference frame type, i.e. one is short-term and the other is long-term): colocated MV is "unavailable". Return.
    • Scale the temporal motion vector:
      • colPocDiff = DiffPicOrderCnt(colPic, refPicListCol[refIdxCol])
      • currPocDiff = DiffPicOrderCnt(currPic, refPicListX[refIdxLX])
      • If refIdxLX.RefType == "long-term" of POC differences are identical:
        • mvLXCol = mvCol.
      • Else:
        • The scaling is the same as the scaling done in 8.5.3.2.6, step 5.

8.5.3.2.9: Derivation process for chroma motion vectors

Resume: use the luma MV for chroma MVs.

8.5.3.3: Decoding process for inter prediction samples

Resume: get the reference lists and motion vectors and perform MC.

  • Get the reference of each luma list (8.5.3.3.2).
  • Get the reference of each chroma (8.5.3.3.3).
  • Get the prediction for each colour component (8.5.3.3.4).

8.5.3.3.2: Reference picture selection process

Resume: return the reference frame corresponding to refIdxLX.

8.5.3.3.3: Fractional sample interpolation process

Resume: perform MC.

  • xIntL = LumaMv.x>>2 (luma integer motion vector).
  • xFracL = LumaMv.x&3 (luma quarterpel component).
  • The same applies for the Y coordinate and chroma.
  • Do MC (8.5.3.3.3.2 luma, 8.5.3.3.3.3 chroma).

8.5.3.3.3.2: Luma sample interpolation process

Resume: perform the quarterpel luma interpolation.

  • The reference plane is padded to infinity with the border pixels.
  • Interpolate using the equations in the spec. It's pretty clear.

8.5.3.3.3.3: Chroma sample interpolation process

Resume: same as above.

8.5.3.3.4: Weighted sample prediction process

Resume: perform weighted prediction according to weighted_pred_flag and weighted_bipred_flag for each image component.

  • Note: weighted prediction happens after quarterpel interpolation.
  • If weightedPredFlag = 0, do default weighted sample prediction (8.5.3.3.4.2).
  • Else, do explicit weighted sample prediction (8.5.3.3.4.3).

8.5.3.3.4.2: Default weighted sample prediction process

Resume: perform weighted prediction with default weights.

  • The spec is pretty clear on that topic.
  • If single-predicted:
    • Shift and clip the pixel value.
  • Else:
    • Add both pixel values, then shift and clip the result.

8.5.3.3.4.3: Weighted sample prediction process

Resume: perform weighted prediction with explicit weights.

  • The spec is pretty clear on that topic.
  • If single-predicted:
    • ClipPix(((Pix*num + bias) >> denom) + off).
  • Else:
    • Bias = (off0 + off1 + 1) << denom.
    • ClipPix((Pix0*num0 + Pix1*num1 + Bias) >> (denom+1)).
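
A minimal C sketch of both formulas for one 8-bit sample (clip_pix and the parameter names follow the text above, not the spec's syntax elements):

    static int clip_pix(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

    int wp_uni(int pix, int num, int denom, int off)
    {
        int bias = denom ? 1 << (denom - 1) : 0;   /* rounding term */
        return clip_pix(((pix * num + bias) >> denom) + off);
    }

    int wp_bi(int pix0, int num0, int off0,
              int pix1, int num1, int off1, int denom)
    {
        int bias = (off0 + off1 + 1) << denom;     /* offsets folded in */
        return clip_pix((pix0 * num0 + pix1 * num1 + bias) >> (denom + 1));
    }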

8.5.4: Decoding process for the residual signal of coding units coded in inter prediction mode

Resume: do the dequantization and inverse transformation for luma and chroma in the CB.

  • Nothing to do if rqt_root_cbf = 0 or skip_flag = 1 (all zero).
  • Else, decode luma (8.5.4.2) and chroma (8.5.4.3).

8.5.4.2: Decoding process for luma residual blocks

Resume: do the dequantization and inverse transformation for luma.

  • Split in sub blocks recursively according to split_transform_flag.
  • For each leaf block:
    • Do inverse transformation/dequantization for the luma block (8.6.2).

8.5.4.3: Decoding process for chroma residual blocks

Resume: same as above for chroma.

  • The only difference is that a 4x4 chroma block is not split in 2x2 blocks.

8.6: Scaling, transformation and array construction process prior to deblocking filter process

Before digging into the details of the spec, a description of the encoding and decoding steps is presented. Special attention is paid to the bit precision required at the different steps of the process.

During encoding, lossless predictive coding is used as a way to minimize the amount of information that will be transformed and quantized. This implies that sample differences are sent to the transform process (the transform and quantization process can be bypassed, however). If the actual samples are coded with BD bits, then sample differences require one additional bit. As an example, consider the case where each sample is stored with 8 bits. The range [0, 255] (or [0,(2^BD)-1]) encloses all possible values each sample may take. Subtracting two values in the range expands the range to [-255,255], a range that requires one additional bit. So we know that the input to the first transform (the transform is in fact decomposed into two 1D transforms applied first to the rows, then to the columns) uses BD+1 bits of precision (sign bit included).

The transform is defined as the sum of N multiplications involving predefined coefficients in the range [-90,90]. However, the average coefficient used in the process is less or equal to 64. This lets us assume that the multiplications require six additional bits to store the intermediate results. To that, log2(N) additional bits will be required to store the transform coefficients because of the summation. We know that the transform sizes are either 4, 8, 16 or 32. Summing the products two-by-two involves log2(N) operations. An example for N = 8 is depicted below:

                            Operations      Precision
                      +-----------------+--------------+
                      |        +        | BD+1+6+1+1+1 |
                      +-----------------+--------------+
                      |    +       +    |  BD+1+6+1+1  |
                      +-----------------+--------------+
                      |  +   +   +   +  |   BD+1+6+1   |
                      +-----------------+--------------+
                      | a b c d e f g h |    BD+1+6    |
                      +-----------------+--------------+

From the table above, we see that the first transform increases the storage required for each value by 9 bits in this example (6 + log2(N) bits in general). From our observations, the HM then right shifts these values by log2(N)+BD-9 bits, where N is the transform size, and BD is the initial bit depth. This operation shrinks the BD+7+log2(N) bit samples to 16-bit samples (BD+7+log2(N) - (log2(N)+BD-9) = 7 + 9 = 16). So essentially, the right shift is used to factor out the bit depth from the remainder of the transform process.

Next, the same transform is applied to the scaled outputs of the first transform, but this time in column-wise fashion. The same log2(N)+6 bits of precision are again added to the input coefficients, but the reference software applies a matching log2(N)+6 right shift to the intermediate values to make sure that 16-bit values are sent to the quantization process.

The transformed values are then quantized, which can be summarized as applying an integer division (truncating the remainder). When the values are rescaled, they are expected to be finer or coarser approximations of the original values. The precision depends on the quantization step size used. As a rule of thumb, the quantized values aren't expected to require more precision than the original ones.

During the decoding process, the quantized values are first rescaled. Each sample is multiplied by the denominator used during quantization. As a precaution, the output of the rescaling is clipped to the signed 16-bit range (i.e. [-32768,32767]). From our observations, clipping here is not required if it is known a priori that the values obtained during the forward transform and quantization process were not altered. More on that in section 8.6.4.

The clipped values are then sent to the first inverse transform (the inverse transform is also decomposed in two steps). Intuitively, since the forward transform adds log2(N)+6 bits, and that the inverse transform uses the same matrix, we can conclude that the inverse transform adds the same amount of bits. Thus, we would expect that, in similar fashion to the encoding process, the outputs of the first inverse transform also need to be right shifted by log2(N)+6. However, the spec indicates that the outputs need to be right shifted by 7. But why?

It is assumed that both transform matrices are orthogonal. Numerically, that is not the case. The coefficients were altered to allow integer arithmetic, and orthogonality was lost. Sullivan states:

              "The elements of the core transform matrices were                derived by approximating scaled DCT basis functions,                under considerations such as limiting the necessary                dynamic range for transform computation and maximizing                the precision and closeness to orthogonality when the                matrix entries are specified as integer values."                                     - Overview of the HEVC Standard

This assumption of orthogonality means that multiplying the transform matrix by its transpose should yield the identity matrix multiplied by a gain factor. An example works best here. Let T be the transform matrix, and let T' be its transpose. Also let I be the identity matrix. For the 4x4 transform case we have:

         +---+---+---+---+     +---+---+---+---+     +---+---+---+---+
         | 64| 64| 64| 64|     | 64| 83| 64| 36|     | 1 | 0 | 0 | 0 |
         +---+---+---+---+     +---+---+---+---+     +---+---+---+---+
         | 83| 36|-36|-83|     | 64| 36|-64|-83|     | 0 | 1 | 0 | 0 |
     T = +---+---+---+---+ T'= +---+---+---+---+ I = +---+---+---+---+
         | 64|-64|-64| 64|     | 64|-36|-64| 83|     | 0 | 0 | 1 | 0 |
         +---+---+---+---+     +---+---+---+---+     +---+---+---+---+
         | 36|-83| 83|-36|     | 64|-83| 64|-36|     | 0 | 0 | 0 | 1 |
         +---+---+---+---+     +---+---+---+---+     +---+---+---+---+

Using matrix algebra, the encoding process can be expressed as T*I*T'. To this, we need to add the right shift operations applied by the encoder. So we have (T*((I*T')/2))/256. Both denominators are derived from the bit depth (8). The intermediate (X), and final (Y) matrices are:

                 +---+---+---+---+     +---+---+---+---+
                 | 32| 32| 32| 32|     | 31| 0 | 0 | 0 |
                 +---+---+---+---+     +---+---+---+---+
                 | 41| 18|-18|-41|     | 0 | 31| 0 | 0 |
             X = +---+---+---+---+ Y = +---+---+---+---+
                 | 32|-32|-32| 32|     | 0 | 0 | 32| 0 |
                 +---+---+---+---+     +---+---+---+---+
                 | 18|-41| 41|-18|     | 0 | 0 | 0 | 31|
                 +---+---+---+---+     +---+---+---+---+

The first step of the reverse process is to apply the inverse transform to Y's columns. Because the transform is lossless (in theory), we are expecting to return to matrix X. The output of the transform, before the right shift is applied, is shown below.

                      +-----+-----+-----+-----+
                      | 2048| 2048| 2048| 2048|
                      +-----+-----+-----+-----+
                      | 2573| 1116|-1116|-2573|
                  Z = +-----+-----+-----+-----+
                      | 2048|-2048|-2048| 2048|
                      +-----+-----+-----+-----+
                      | 1116|-2573| 2573|-1116|
                      +-----+-----+-----+-----+

We can see that a factor of 64 separates X from Z. Now if we were to right shift the values of Z by log2(N)+6, the scaled results would be 4 times too small (since N = 4 in our example). Our observations indicate that the results would in fact be N times too small, as a factor of 8 was found in the 8x8 case, 16 in the 16x16 case, and 32 in the 32x32 case. Therefore, the log2(N) factors cancel each other out. The right shift should be 6.

We know for a fact that the input to the first inverse transform uses 16 bits. We also know that the log2(N) bits introduced by the summation naturally cancel out. The reason the spec applies a right shift by 7 is for numerical stability. 6 bits of added precision are required because of the multiplications in the transform. Afterwards, 64 is added to the transform output before the right shift is applied. This is why the right shift is 7, and not 6. In practice, however, we have found that a right shift by 6 is sufficient, and would not cause overflow in the second transform.

8.6.1: Derivation process for quantization parameters

Resume: get the luma and chroma QPs of the current CB.

  • QP'Y: luma QP of the current CB (output value).
  • QP'Cb/Cr: chroma QPs of the current CB (output values).
  • QPY: intermediate luma QP value.
  • QpBdOffsetY = 6 * bit_depth_luma_minus8.
  • QpBdOffsetC = 6 * bit_depth_chroma_minus8.
  • QpBdOffsetY is the luma QP range extension at higher bit depths. QP'Y must be within [0, 51 + QpBdOffsetY]. For 8-bit pixels, this is [0, 51].
  • SliceQPY = (pic_init_qp_minus26+26) + slice_qp_delta.
  • Log2MinCuQpDeltaSize = log2 of QG block size = Log2CtbSizeY - diff_cu_qp_delta_depth.
  • (xQG, yQG): location of the quantization group that covers the current CB from the source plane origin (the current QG). If the size of the current CB is larger than the QG size, the top-left QG covering the CB is chosen.

    The standard imposes that the QG size be less than or equal to the CTB size. The same restriction applies to the CB size (i.e. CB <= CTB). This leaves us with three cases to handle here: 1) QG and CB are the same size, 2) QG is bigger than CB (multiple CBs belong to the current QG), and 3) QG is smaller than CB (multiple QGs belong to the current CB). The spec uses the following two equations to derive the QG coordinates (xQG,yQG) from the CB coordinates (xCB,yCB).

             xQG = xCB - (xCB & ((1 << Log2MinCuQpDeltaSize) - 1))
             yQG = yCB - (yCB & ((1 << Log2MinCuQpDeltaSize) - 1))

    The two equations above handle all three cases. For instance, let the CTBs be 64x64, the QGs be 32x32 (Log2MinCuQpDeltaSize = 5), and the current CB be 8x8. The current CTB is located at (0,0), the four QGs in the current CTB are located at (0,0), (32,0), (0,32) and (32,32), and the current CB is located at (16,48).

       0     32     64
    0 +-----+--+--+           A: top-left luma sample of the current CTB
      |A    |     |           B: top-left luma sample of the current QG
      |     |     |           C: top-left luma sample of the current CB
   32 +--+--+--+--+
      |B |  |     |
   48 +--+--+     |
      |  |C |     |
      +--+--+--+--+
   64

    With the equations above, we get:

       xQG = 16 - (16 & ((1 << 5) - 1)) = 16 - (16 & 31) = 16 - 16 = 0
       yQG = 48 - (48 & ((1 << 5) - 1)) = 48 - (48 & 31) = 48 - 16 = 32

    On the other hand, if we had a CB of 32x32, and the QG size had been set to 32x32 or less (CTB is still 64x64), then (xQG,yQG) would always be identical to (xCB,yCB). Here both xCB and yCB are multiples of 32, and Log2MinCuQpDeltaSize is at most 5. Any multiple of 32 ANDed with 31 (i.e. (1 << Log2MinCuQpDeltaSize) - 1) or a smaller mask will always be 0, as only bit 5 and the bits above it can be set for multiples of 32.

  • There is one QPY value associated to each CB and QG. If a CB is larger than the QG, all QGs covered by the CB inherit its QPY value. If the QG is larger than the CB, the QG inherits the QPY value of the last CB in encoding order.
  • When the QG is larger than the CB, the CBs in the QG do not necessarily share the same QPY value. The QPY value of a CB is affected by the value of CuQpDelta when the CB is parsed. The CuQpDelta value changes at most once per QG, in the first CB that contains non-zero coefficients. Thus, if the first CB in the QG contains only zero coefficients and the second CB contains non-zero coefficients, the CuQpDelta value may change for the second CB and both CBs (the first and the second) may have different QPY values.
  1. Get the QP of the previous QG (qPY_PREV):
    • If the current QG is the first QG of a slice, a tile or a WPP CTB row:
      • qPY_PREV = SliceQPY.
    • Else:
      • qPY_PREV = QPY of the last QG decoded.
      • Do not confuse QPY with QP'Y here.
  2. Get the QP of the left CB (qPY_A):
    • The left CB is at (xQG-1, yQG) and its availability is determined using 6.4.1.
    • If the left CB is unavailable or in a different CTB than the one containing the current CB:
      • qPY_A = qPY_PREV.
      • Presumably this is done to reduce the memory requirements.
    • Else:
      • qPY_A = QPY of the CB at (xQG-1, yQG).
      • The left CB is not necessarily the spatial neighbour of the current CB or the last CB in encoding order of the left QG. Presumably this is done to avoid intra-QG dependencies while retaining as much spatial prediction as possible.
  3. Get the QP of the top CB (qPY_B):
    • Same as above for the top CB at (xQG, yQG-1).
  4. Get the predicted QP of the current CB (qPY_PRED):
    • qPY_PRED = (qPY_A + qPY_B + 1) >> 1.
    • In other words, the average of the two neighbour QPs.
  5. Get the intermediate luma QP (QPY):
    • QPY = ((qPY_PRED + CuQpDelta + 52 + 2*QpBdOffsetY) % (52 + QpBdOffsetY)) - QpBdOffsetY.
    • In 8-bit, this simplifies to QPY = (qPY_PRED + CuQpDelta + 52) % 52.
    • The modulo operation makes QP values 0 and 51 contiguous. For example, if qPY_PRED = 51 and CuQpDelta = 1, then QPY = 0.
  6. Get the intermediate chroma QPs (qPCb and qPCr):
    • qPiCb = Clip3(-QpBdOffsetC, 57, QPY + pic_cb_qp_offset + slice_cb_qp_offset).
    • Map qPiCb to qPCb using Table 8-9 (see spec.).
    • Same for qPCr.
  7. Get the final QPs (QP'Y, QP'Cb, QP'Cr):
    • QP'Y = QPY + QpBdOffsetY.
    • QP'Cb = qPCb + QpBdOffsetC.
    • QP'Cr = qPCr + QpBdOffsetC.
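
A minimal C sketch of the luma QP prediction for 8-bit video (QpBdOffsetY = 0), assuming qp_a and qp_b already contain the fallback value qPY_PREV where the left/top CB is unavailable or lies in another CTB:

    int derive_qp_y(int qp_a, int qp_b, int cu_qp_delta)
    {
        int qp_pred = (qp_a + qp_b + 1) >> 1;       /* average of neighbours */
        return (qp_pred + cu_qp_delta + 52) % 52;   /* wrap around [0, 51]   */
    }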

8.6.2: Scaling and transformation process

Resume: perform dequantization and inverse transformation.

  • TransCoeffLevel[x][y]: raw coefficient values from the bitstream (input).
  • r[x][y]: dequantized and inverse-transformed coefficients (output).
  • If cu_transquant_bypass_flag (no transformation and no quantization):
    • r[x][y] = TransCoeffLevel[x][y].
  • Else:
    1. Get the QP (see 8.6.1).
    2. Dequantize the coefficients (see 8.6.3) with TransCoeffLevel[x][y] as input and d[x][y] as output.
    3. If transform_skip_flag (no transformation):
      • r[x][y] = (d[x][y] << 7).
      • Scale for compatibility with the transform case.
    4. Else:
      • Inverse transformation (8.6.4) with d[x][y] as input and r[x][y] as output.
    5. bdShift = 20 - BitDepth.
    6. r[x][y] = (r[x][y] + (1 << (bdShift - 1))) >> bdShift.
      • Discard the low bits.

8.6.3: Scaling process for transform coefficients

Resume: dequantize the coefficients.

  • TransCoeffLevel[x][y] are the quantized coefficients (input).
  • d[x][y] are the dequantized coefficients (output).
  • bdShift = BitDepth + Log2(nT) - 5.

    bdShift is a scaling shift factor used to make sure that the output values fit in the range defined by the current bit depth.

  • levelScale[k] = {40,45,51,57,64,72} (k = qP % 6 scaling).
  • If scaling_list_enable_flag = 0 (use a flat scaling value):
    • m[x][y] = 16.
  • Else (use the scaling list value, see 7.4.5):
    • m[x][y] = ScalingFactor[sizeId][matrixId][x][y].
  1. Val1 = TransCoeffLevel[x][y] * m[x][y] * (levelScale[qP%6] << (qP/6)).
    • Scale by the scaling list value and the QP.
  2. Val2 = (Val1 + (1<<(bdShift - 1))) >> bdShift.
    • Discard the low bits.
  3. d[x][y] = ClipToSigned16Bits(Val2).
    • IMPORTANT See text in next section. Clipping is not required in the encoder.
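
A minimal C sketch of the three steps for one coefficient (a 64-bit intermediate avoids overflow before the shift):

    static const int level_scale[6] = { 40, 45, 51, 57, 64, 72 };

    int dequant_coeff(int level, int m, int qp, int bit_depth, int log2_size)
    {
        int bd_shift = bit_depth + log2_size - 5;
        long long v = (long long)level * m * (level_scale[qp % 6] << (qp / 6));
        v = (v + (1LL << (bd_shift - 1))) >> bd_shift;             /* round */
        return v < -32768 ? -32768 : v > 32767 ? 32767 : (int)v;  /* clip  */
    }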

8.6.4: Transformation process for scaled transform coefficients:

Resume: perform the 2D inverse transform.

  • d[x][y] is the input.
  • e[x][y] are intermediate values (after the first transform).
  • g[x][y] are clipped intermediate values (in between transforms).
  • r[x][y] is the output.
  1. Derive trType (the transform type: DST or DCT).
    • Intra 4x4 luma transform blocks use the DST (trType = 1).
    • All other transform block types use the DCT (trType = 0).
  2. Apply the 1D transform on the columns (8.6.4.2) with d[x][y] and trType as the input, and e[x][y] as output.
  3. g[x][y] = ClipToSigned16Bits((e[x][y] + 64) >> 7).
    • IMPORTANT See text below.
  4. Apply the 1D transform on the rows (8.6.4.2) with g[x][y] and trType as input, and r[x][y] as output.
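
A minimal C skeleton of the four steps (transform_1d_cols and transform_1d_rows are hypothetical names standing in for the 1D transform of 8.6.4.2):

    void transform_1d_cols(const int *in, int *out, int n, int tr_type);
    void transform_1d_rows(const int *in, int *out, int n, int tr_type);

    void inverse_transform_2d(const int *d, int *r, int n, int tr_type)
    {
        int e[32 * 32], g[32 * 32];

        transform_1d_cols(d, e, n, tr_type);       /* step 2: columns        */
        for (int i = 0; i < n * n; i++) {          /* step 3: shift and clip */
            int v = (e[i] + 64) >> 7;
            g[i] = v < -32768 ? -32768 : v > 32767 ? 32767 : v;
        }
        transform_1d_rows(g, r, n, tr_type);       /* step 4: rows           */
    }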

To clip, or not to clip?

The spec (Draft 10) specifies that the output of the first transform needs to be clipped in the signed 16-bit range before the inverse transform is applied to the rows (step 3 above).

Assuming the inverse transform is applied to any NxN array populated by random samples, the clipping makes sense, as multiplying 16-bit values with 6-bit coefficients, then adding them all together, requires log2(N)+22 bits of storage. This means that we need between 24 and 27 bits of storage.

The above logic assumes that the NxN array of random samples is not obtained by applying the forward transform. Either that, or the resulting coefficients, before or after the quantization, were altered.

A 4x4 array was randomly populated with sample differences of -255, 0, or 255. For all possible combinations (3^16), the forward transform was applied, then the first step of the inverse transform was applied following the steps described in the spec, with the exception of the clipping process. The values were then checked to make sure no values were outside the [-32768,32767] range.

Our experiment assumes lossless coding (no quantization). The results indicate that no values can lie outside the signed 16-bit range. Furthermore, the largest absolute coefficient observed was 16321 for the DCT, and 15470 for the DST. This also serves as proof that applying a right shift by 6 in between inverse transforms would work.

The same clipping question is valid for the inverse quantization process (dequant). Subclause 8.6.3 specifies that the rescaled coefficients are to be clipped in the signed 16-bit range before they are sent to the inverse transform process. An encoder may skip this clipping step if the final values are not altered before they are entropy coded.

When checking for possible overflow in the inverse transform, we observed that the greatest absolute coefficient after the second transform was 30600. Using this value as a reference point, we checked that an overflow was not possible for coefficients in the range [-32640,32640] using every QP value between 0 and 51. The simulations indicate that the worst case is 32647, which is still in the signed 16-bit range. Therefore, we can assume that clipping is not required when it is known that the quantized coefficients come from an unaltered set of transform coefficients.

8.6.4.2: Transformation process

Resume: perform the 1D inverse transform.

The standard uses the x[] array to denote the transformed coefficients (before the reverse transform), and the y[] array to denote the residual samples (after the reverse transform). Moreover, the standard uses column-first ordering (e.g. x[0][1] is the first element on the second line, and x[1][0] is the second element on the first line), which is opposite to how the C language works. An example is given for the DST, but the same logic applies for the DCT.

  1. If trType == 1: multiply the coefficients by the 4x4 DST matrix.

    This transform can be described with the following matrix multiplication:

        +---+---+---+---+   +---+---+---+---+   +---+---+---+---+
        | A | B | C | D |   | a | b | c | d |   | 29| 55| 74| 84|
        +---+---+---+---+   +---+---+---+---+   +---+---+---+---+
        | E | F | G | H |   | e | f | g | h |   | 74| 74|  0|-74|
        +---+---+---+---+ = +---+---+---+---+ * +---+---+---+---+
        | I | J | K | L |   | i | j | k | l |   | 84|-29|-74| 55|
        +---+---+---+---+   +---+---+---+---+   +---+---+---+---+
        | M | N | O | P |   | m | n | o | p |   | 55|-84| 74|-29|
        +---+---+---+---+   +---+---+---+---+   +---+---+---+---+

          A = 29 * a + 74 * b + 84 * c + 55 * d
          B = 55 * a + 74 * b - 29 * c - 84 * d
          ...
          E = 29 * e + 74 * f + 84 * g + 55 * h
          ...
  2. Else (DCT):
    • Multiply the coefficients by the DCT matrix that corresponds to the transform size.
    • The specification provides the 32x32 DCT matrix directly. The 16x16 DCT matrix is obtained by taking every other row and the first 16 columns of the 32x32 DCT matrix.
    • The same principle applies for the other matrices. For instance, the 4x4 DCT matrix looks like this:
      +---+---+---+---+
      | 64| 64| 64| 64| (the first 4 columns on the first row)
      +---+---+---+---+
      | 83| 36|-36|-83| (the first 4 columns on the 8-th row)
      +---+---+---+---+
      | 64|-64|-64| 64| (the first 4 columns on the 16-th row)
      +---+---+---+---+
      | 36|-83| 83|-36| (the first 4 columns on the 24-th row)
      +---+---+---+---+

      And the 8x8 matrix looks like this:

      +---+---+---+---+---+---+---+---+
      | 64| 64| 64| 64| 64| 64| 64| 64| (the first 8 columns of the first row)
      +---+---+---+---+---+---+---+---+
      | 89| 75| 50| 18|-18|-50|-75|-89| (the first 8 columns of the 4-th row)
      +---+---+---+---+---+---+---+---+
      | 83| 36|-36|-83|-83|-36| 36| 83| (the first 8 columns of the 8-th row)
      +---+---+---+---+---+---+---+---+
      | 75|-18|-89|-50| 50| 89| 18|-75| (the first 8 columns of the 12-th row)
      +---+---+---+---+---+---+---+---+
      | 64|-64|-64| 64| 64|-64|-64| 64| (the first 8 columns of the 16-th row)
      +---+---+---+---+---+---+---+---+
      | 50|-89| 18| 75|-75|-18| 89|-50| (the first 8 columns of the 20-th row)
      +---+---+---+---+---+---+---+---+
      | 36|-83| 83|-36|-36| 83|-83| 36| (the first 8 columns of the 24-th row)
      +---+---+---+---+---+---+---+---+
      | 18|-50| 75|-89| 89|-75| 50|-18| (the first 8 columns of the 28-th row)
      +---+---+---+---+---+---+---+---+
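
To tie the DST example above to code, here is a small C sketch of the inverse 4x4 DST pass, using plain C row-major order rather than the standard's column-first convention (all names are ours):

    /* 4x4 DST matrix from the example above. */
    static const int dst4[4][4] = {
        { 29,  55,  74,  84 },
        { 74,  74,   0, -74 },
        { 84, -29, -74,  55 },
        { 55, -84,  74, -29 },
    };

    /* One 1D pass: each output row (A..D, E..H, ...) is the corresponding
     * input row (a..d, e..h, ...) multiplied by the matrix. */
    static void idst4_1d(const int in[4][4], int out[4][4])
    {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                int sum = 0;
                for (int k = 0; k < 4; k++)
                    sum += in[i][k] * dst4[k][j];   /* A = 29*a + 74*b + 84*c + 55*d */
                out[i][j] = sum;
            }
    }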

8.6.5: Picture construction process prior to in-loop filter process

Resume: copy the decoded pixels to the reconstructed frame.

8.7.1: General (In-loop filter process)

Resume: perform DF and SAO if enabled.

8.7.2.1: General (Deblocking filter process)

Resume: filter the edges vertically then horizontally.

As a complement to this text, refer to "HEVC Deblocking Filter" for more details.

The "current" block associated to an edge is the block to the right or below the edge when considering values such as slice_disable_deblocking_filter_flag.

The specification processes the edges in one direction CB by CB, presumably because the authors found it easier that way. The spec notes that the filtering process can be done in both directions in parallel, as long as the dependencies are accounted for before the processing of the horizontal edges starts.

This text assumes the vertical case. The horizontal case is symmetric. For each CB:

  1. filterLeftCbEdgeFlag tracks whether the left edges of the CB can be filtered. The flag is true unless:
    • The CB edge is on a frame boundary.
    • The CB edge is on a tile/slice boundary and loop_filter_across_tiles_enabled_flag/ slice_loop_filter_across_slices_enabled_flag is false for the current CB. The values for those flags in the block on the other side of the edge are irrelevant.
  2. Reset all values in verEdgeFlags to 0.
    • verEdgeFlags is a matrix containing a flag for every pixel in the current CB. This binary map is a convenient way to identify the filtered edges in the CB.
  3. Update verEdgeFlags according to the transform block boundaries (8.7.2.2).
  4. Update verEdgeFlags according to the prediction block boundaries (8.7.2.3).
  5. Set the boundary strength in verBS using verEdgeFlags (8.7.2.4).
    • verBS is a matrix containing the boundary strength for every pixel of the current CB (same layout as verEdgeFlags).
  6. Filter the edges using verBS (8.7.2.5.1 vertical, 8.7.2.5.2 horizontal).

8.7.2.2: Derivation process of transform block boundary

Resume: identify the transform block boundaries.

  1. Split in sub blocks recursively according to split_transform_flag.
  2. For each leaf block:
    1. If the left edge of the block lies on the left edge of the CB:
      • Set the verEdgeFlags of the left edge to filterLeftCbEdgeFlag.
    2. Otherwise:
      • Set the verEdgeFlags of the left edge to 1.
  3. No test is done to determine if an edge lies on an 8-pixel boundary.

8.7.2.3: Derivation process of prediction block boundary

Resume: identify the prediction block boundaries.

  • The specification merely covers the cases where the prediction block is smaller than the transform block.
  • Check PartMode and set the corresponding edges to 1 in verEdgeFlags.
  • No test is done to determine if an edge lies on an 8-pixel boundary.

8.7.2.4: Derivation process of boundary filtering strength

Resume: set the boundary strength of the edges.

NOTE that only luma samples/blocks are used in this subclause. In this text, reference to a block (e.g. coding block, transform block) implies it is a luma block.

The specification walks the edges in verEdgeFlags in steps of 8 pixels horizontally (xDi) and 4 pixels vertically (yDj). Those edges are four pixels long. In other words, the specification processes every vertical edge of the CB that lies on the 8x8 grid. The pixels p0 (xDi-1, yDj) and q0 (xDi, yDj) formally identify the current edge. Example:

            +--------+               ---+---
            |01234567|0 |             p0|q0
            |------->|1 | yDj          X|X Current
            |  xDi   |2 |              X|X  block
            |        |3 v              X|X
            +--------+               ---+---

For each vertical edge:

  1. If verEdgeFlags is 0 for the edge:
    • BS = 0, continue (i.e. pass to the next edge).
  2. Consider the two blocks that lie on each side of the edge.
  3. If a block is intra:
    • BS = 2, continue.
  4. If the edge is a transform block boundary and a block has a non-zero luma residual (chroma isn't checked):
    • BS = 1, continue.

    At this point the two blocks are necessarily inter.

  5. If one block is single predicted and the other is bi-predicted:
    • BS = 1, continue.

    At this point either the two blocks are both single-predicted or both bi-predicted. Consider the frame(s) used for reference in both blocks, irrespective of which list or reference index is used to refer to the frames in the two blocks.

  6. If the blocks do not use the same reference frames:
    • BS = 1, continue.

    At this point both blocks use the same reference frames.

  7. Two motion vectors Mv0 and Mv1 are said to be distant if
          |Mv0.X - Mv1.X| >= 4 or |Mv0.Y - Mv1.Y| >= 4.
  8. If the blocks are single-predicted:
    • If the motion vectors of the two blocks are distant (i.e. compare the motion vector of the first block with the motion vector of the second block):
      • BS = 1, continue.
  9. Else (bi-predicted):
    • If the two reference frames are different (recall that it is possible to do bi-prediction with only one reference frame):
      • For each reference frame, if the motion vectors for that reference frame (one from each block) are distant:
        • BS = 1, continue.
    • Else (bi-prediction on a single reference frame):
      • If the following condition is true (each MV pair evaluates to true if the MVs are distant), set BS to 1 and continue:
        • ((Block0.MVL0, Block1.MVL0) || (Block0.MVL1, Block1.MVL1)) &&
          ((Block0.MVL0, Block1.MVL1) || (Block0.MVL1, Block1.MVL0))
  10. At this point, set BS to 0.
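
To summarize the whole derivation, here is a hedged C sketch of steps 3 to 10; the block_t abstraction and all names are ours, and step 1 (verEdgeFlags == 0 implies BS = 0) is assumed handled by the caller:

    #include <stdlib.h>

    typedef struct {
        int is_intra;
        int num_refs;            /* 1: single predicted, 2: bi-predicted */
        const void *ref[2];      /* reference frames (compared by identity) */
        int mv[2][2];            /* motion vectors in quarter-pel units */
        int has_luma_residual;   /* non-zero luma residual */
    } block_t;

    static int mv_distant(const int a[2], const int b[2])
    {
        return abs(a[0] - b[0]) >= 4 || abs(a[1] - b[1]) >= 4;
    }

    static int boundary_strength(const block_t *p, const block_t *q, int is_tb_edge)
    {
        if (p->is_intra || q->is_intra) return 2;                       /* step 3 */
        if (is_tb_edge && (p->has_luma_residual || q->has_luma_residual))
            return 1;                                                   /* step 4 */
        if (p->num_refs != q->num_refs) return 1;                       /* step 5 */

        if (p->num_refs == 1)                                    /* steps 6 and 8 */
            return p->ref[0] != q->ref[0] ? 1 : mv_distant(p->mv[0], q->mv[0]);

        /* Bi-predicted: both blocks must use the same pair of frames, in
         * either list order (step 6). */
        int same = p->ref[0] == q->ref[0] && p->ref[1] == q->ref[1];
        int swap = p->ref[0] == q->ref[1] && p->ref[1] == q->ref[0];
        if (!same && !swap) return 1;

        if (p->ref[0] != p->ref[1]) {              /* step 9, two distinct frames */
            const int *q0 = same ? q->mv[0] : q->mv[1];
            const int *q1 = same ? q->mv[1] : q->mv[0];
            return mv_distant(p->mv[0], q0) || mv_distant(p->mv[1], q1);
        }
                                                   /* step 9, single frame */
        return (mv_distant(p->mv[0], q->mv[0]) || mv_distant(p->mv[1], q->mv[1])) &&
               (mv_distant(p->mv[0], q->mv[1]) || mv_distant(p->mv[1], q->mv[0]));
    }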

8.7.2.5.1: Vertical edge filtering process:

Resume: filter the vertical edges.

  • For each luma edge with BS > 0:
    1. Do the decision process (8.7.2.5.3).
    2. Do the luma edge filtering (8.7.2.5.4).
  • For each chroma component:
    • For each chroma edge with BS == 2 that lies on the 8x8 chroma grid (the chroma edges must be aligned on a 16-luma-pixel boundary):
      1. Do the chroma edge filtering (8.7.2.5.5).

        Note: the JCT-VC experts did not deign to explain the reason for the 8x8 chroma grid restriction. Technically there is no reason why the filtering couldn't be done on the 4x4 chroma grid since only one chroma pixel is modified on each side of the edge.

8.7.2.5.2: Horizontal edge filtering process:

Resume: same as above for the horizontal edges.

8.7.2.5.3: Decision process for luma block edges:

Resume: get the global filtering decision for the edge.

  • Block P (with pixels pXY) is the block to the left (vertical edge) or above (horizontal edge).
  • Block Q (with pixels qXY) is the block to the right (vertical edge) or below (horizontal edge).
  • bS is the boundary strength.
  • Output is Beta, tC, dE, dEp, dEq.
  1. Get the threshold parameters (Beta and tC):
    • qPL = (BlockP.QPY + BlockQ.QPY + 1) >> 1.

      Get the average of the QPs of the blocks.

    • Q1 = Clip3(0, 51, qPL + (beta_offset_div2 << 1)).

      Add the Beta QP offset.

    • Beta' = Table8_11_Beta[Q1].

      Beta depends on the QP.

    • Beta = Beta' * (1 << (BitDepthY-8)).

      Scale by the luma bit depth.

    • Q2 = Clip3(0, 53, qPL + 2*(bS - 1) + (tc_offset_div2 << 1)).

      Add the tC and the boundary strength QP offsets.

    • tC' = Table8_11_tC[Q2].
    • tC = tC' * (1 << (BitDepthY-8)).

      Same as above.

  2. Get the decisions (dE, dEp, dEq):
    • dE controls the edge filtering strength (2: strong, 1: weak, 0: none).
    • dEp/dEq determines the weak filtering strength for the block P/Q (1: 2 pixels, 0: 1 pixel).
    • The pixels on each side of the edge are labelled as follow:
                              ---------------+----------------                        p30 p20 p10 p00|q00 q10 q20 q30 <= first row                        p31 p21 p11 p01|q01 q11 q21 q31                        p32 p22 p12 p02|q02 q12 q22 q32                        p33 p23 p13 p03|q03 q13 q23 q33 <= fourth row                        ---------------+----------------                                            Current                                             block
    • Compute some values from the rows 0 and 3:
           dp0 = |p20 - 2*p10 + p00|.     dp3 = |p23 - 2*p13 + p03|.     dq0 = |q20 - 2*q10 + q00|.     dq3 = |q23 - 2*q13 + q03|.    dpq0 = dp0 + dq0.    dpq3 = dp3 + dq3.      dp = dp0 + dp3.      dq = dq0 + dq3.       d = dpq0 + dpq3.
    • If d >= Beta (busted threshold):
      • dE = dEp = dEq = 0.
    • Else:
      • dSam0 = decision for row 0 with dpq=2*dpq0 (8.7.2.5.6).
      • dSam3 = decision for row 3 with dpq=2*dpq3 (8.7.2.5.6).
      • dE = 1 + (dSam0 == 1 && dSam3 == 1).
      • dEp = dp < ((Beta + (Beta>>1)) >> 3).
      • dEq = dq < ((Beta + (Beta>>1)) >> 3).

8.7.2.5.4: Filtering process for luma block edges

Resume: filter and replace the luma edge pixels.

  1. If dE == 0, return (nothing to do).
  2. For each row k of pixels (k=0..3):
    1. Filter the row (8.7.2.5.7):
      • nDp and nDq are the number of pixels modified on each side of the edge.
    2. Put the modified pixels back into the reconstructed frame.

8.7.2.5.5: Filtering process for chroma block edges

Resume: filter and replace the chroma edge pixels.

  1. cQpPicOffset = pic_cb_qp_offset/pic_cr_qp_offset.
  2. Get the threshold parameter (tC):
    • qPi = ((BlockP.QPY + BlockQ.QPY + 1) >> 1) + cQpPicOffset.

      Add the chroma QP offset to the average of the QPs of the blocks.

    • QPC = Table8_9[qPi].
    • Q1 = Clip3(0, 53, QPC + 2*(bS - 1) + (tc_offset_div2 << 1)).
    • tC' = Table8_11_tC[Q1].
    • tC = tC' * (1 << (BitDepthC-8)).

      Same as the luma case.

  3. For each row k of pixels (k=0..3):
    1. Filter the row (8.7.2.5.8):
      • There is one modified pixel on each side of the edge.
    2. Put the modified pixels back into the reconstructed frame.

8.7.2.5.6: Decision process for a luma sample

Resume: return dpq < (Beta >> 2) &&
               |p3 - p0| + |q0 - q3| < (Beta >> 3) &&
               |p0 - q0| < ((5*tC + 1) >> 1).
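
A direct C transcription, under our own naming (p[0] and q[0] are the pixels adjacent to the edge, and dpq is 2*dpq0 or 2*dpq3 from 8.7.2.5.3):

    #include <stdlib.h>

    static int luma_sample_decision(const int p[4], const int q[4],
                                    int dpq, int beta, int tc)
    {
        return dpq < (beta >> 2) &&
               abs(p[3] - p[0]) + abs(q[0] - q[3]) < (beta >> 3) &&
               abs(p[0] - q[0]) < ((5 * tc + 1) >> 1);
    }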

8.7.2.5.7: Filtering process for a luma sample

Resume: filter a luma edge row.

  • If dE == 2 (strong filtering):
      p0' = Clip3(p0-2*tC, p0+2*tC, (p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3).
      p1' = Clip3(p1-2*tC, p1+2*tC, (p2 + p1 + p0 + q0 + 2) >> 2).
      p2' = Clip3(p2-2*tC, p2+2*tC, (2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3).
      q0' = Clip3(q0-2*tC, q0+2*tC, (p1 + 2*p0 + 2*q0 + 2*q1 + q2 + 4) >> 3).
      q1' = Clip3(q1-2*tC, q1+2*tC, (p0 + q0 + q1 + q2 + 2) >> 2).
      q2' = Clip3(q2-2*tC, q2+2*tC, (p0 + q0 + q1 + 3*q2 + 2*q3 + 4) >> 3).
      nDp = nDq = 3.
  • Else (weak filtering):
    • Delta1 = (9*(q0 - p0) - 3*(q1 - p1) + 8) >> 4.
    • If |Delta1| >= 10*tC (busted threshold):
      • nDp = nDq = 0.
    • Else:
          Delta2 = Clip3(-tC, tC, Delta1).
          p0' = ClipPix(p0 + Delta2).
          q0' = ClipPix(q0 - Delta2).
          nDp = nDq = 1 (one pixel filtered).
      • If dEp == 1:
              DeltaP = Clip3(-(tC>>1), tC>>1, (((p2+p0+1) >> 1) - p1 + Delta2) >> 1).
              p1' = ClipPix(p1 + DeltaP).
              nDp = 2 (2 pixels filtered).
      • Same thing for dEq (replace 'p' by 'q' and switch the sign of Delta2).
  • If (block P is PCM and pcm_loop_filter_disable_flag == 1) or (block P has cu_transquant_bypass_flag set):
    • nDp = 0.

      The pixels in block P are not replaced.

  • Same thing for block Q.

8.7.2.5.8: Filtering process for a chroma sample

Resume: filter a chroma edge row.

  Delta = Clip3(-tC, tC, (((q0 - p0) << 2) + p1 - q1 + 4) >> 3).
  p0' = ClipPix(p0 + Delta).
  q0' = ClipPix(q0 - Delta).

The restrictions for PCM and cu_transquant_bypass_flag apply as for the luma case.
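
A minimal C sketch of this filter, assuming 8-bit samples (clip_pix and the parameter layout are ours):

    static int clip_pix(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

    /* Filter one chroma row; p0/q0 are the pixels adjacent to the edge. */
    static void filter_chroma_row(int *p0, int *q0, int p1, int q1, int tc)
    {
        int delta = (((*q0 - *p0) << 2) + p1 - q1 + 4) >> 3;
        if (delta < -tc) delta = -tc;   /* Clip3(-tC, tC, delta) */
        if (delta >  tc) delta =  tc;
        *p0 = clip_pix(*p0 + delta);
        *q0 = clip_pix(*q0 - delta);
    }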

8.7.3.1: General (Sample adaptive offset process)

Resume: if SAO is enabled, process each CTB and each image component (8.7.3.2).

8.7.3.2: Coding tree block modification process

Resume: process the pixels of the current CTB and image component.

For more information about SAO, see "Sample Adaptive Offset in the HEVC Standard".

  • Loop over each pixel of the CTB.
  • Do not filter the current pixel if SAO is disabled for the CTB or if the condition ((pcm_loop_filter_disable_flag == 1 && pcm_flag == 1) || cu_transquant_bypass_flag == 1) is true for the block containing the pixel.
  • If edge mode:
    1. Get the two neighbour pixel positions according to the edge direction (sao_eo_class).
    2. Do not filter the current pixel if one of the following cases apply:
      • A neighbour is outside the frame.
      • The current pixel and a neighbour are not inside the same tile and loop_filter_across_tiles_enabled_flag == 0 for the current pixel.
      • The current pixel and a neighbour are not inside the same slice and slice_loop_filter_across_slices_enabled_flag == 0 for the pixel encoded last in tile scan order.
    3. edgeIdx is set to one of the following values:
      • 0 (< <).
      • 1 (< = || = <).
      • 2 (< > || = =).
      • 3 (> = || = >).
      • 4 (> >).

      Notice that the indices (edgeIdx) and the SAO categories are not ordered the same way. The categories are given by {1, 2, 0, 3, 4} (the first three values are reordered).

    4. NewPix = ClipPix(OldPix + SaoOffsetVal[edgeIdx]).

      Recall that SaoOffsetVal[0] == 0.

  • Else (band mode):
    1. bandShift = bitDepth - 5.
    2. bandTable[32] = {0} (initialize to zero).
    3. For k=0..3: bandTable[(sao_band_position + k) & 31] = k + 1.

      bandTable yields the index in SaoOffsetVal for each band. The band 31 is contiguous to the band 0 (rollover).

    4. NewPix = ClipPix(OldPix + SaoOffsetVal[bandTable[OldPix>>bandShift]]).
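
As an illustration of the two modes above, here is a minimal C sketch assuming 8-bit samples (bandShift = 3); all names are ours, and the edge sketch assumes the category remapping noted above is applied before indexing SaoOffsetVal:

    /* Edge mode: classify a pixel against its two neighbours, then remap
     * the raw index (2 + sign0 + sign1) to the SAO category order. */
    static const int edge_idx_map[5] = {1, 2, 0, 3, 4};

    static int sao_edge_category(int cur, int n0, int n1)
    {
        int s0 = (cur > n0) - (cur < n0);   /* sign of cur - n0: -1, 0, +1 */
        int s1 = (cur > n1) - (cur < n1);
        return edge_idx_map[2 + s0 + s1];
    }

    /* Band mode: build bandTable, then offset the pixel. sao_offset_val[0]
     * is 0 (no offset). */
    static int sao_band_pixel(int old_pix, int sao_band_position,
                              const int sao_offset_val[5])
    {
        int band_table[32] = {0};
        for (int k = 0; k < 4; k++)
            band_table[(sao_band_position + k) & 31] = k + 1;  /* rollover at 31 */
        int v = old_pix + sao_offset_val[band_table[old_pix >> 3]];
        return v < 0 ? 0 : v > 255 ? 255 : v;                  /* ClipPix, 8 bits */
    }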

Entropy coding notes

The specification uses the following terminology:

  • context-adaptive binary arithmetic coding (CABAC): entropy coding process.
  • Most probable symbol (MPS): predicted value of a bin.
  • Least probable symbol (LPS): complement value of the MPS.
  • bin: a single bit.
  • bins: a string of bits.
  • context variable: predicted value of a bin along with its probability.
  • context table: identifies the contexts associated to related syntax elements.
  • context index: index of a specific context in a context table.
  • state index: index used to lookup a probability in a precomputed table.

Each syntax element is associated to one context table. Several syntax elements may share the same context table. The context tables are independent of each other. Each bin of a syntax element is associated to one specific context in the context table. The association is represented by the context index of the bin, which is the offset of the context from the beginning of the table (e.g. the fourth context of the table). The context index of a bin depends on the position of the bin in the sequence of bins, the slice type and possibly on the values of other bins and syntax elements.

The CABAC state contains a context variable corresponding to each context in each context table. Initially, the MPS and the probability of a context variable are initialized to known values that depend on the slice type and the slice QP.

The specification keeps the contexts of different slice types separated. For example, the context table for 'part_mode' has 1 context for intra slices, 3 contexts for P slices and 3 contexts for B slices. The indices for those contexts are mapped sequentially:

        0:    I contexts 0.
        1..3: P contexts 0..2.
        4..6: B contexts 0..2.

The context separation by slice type is done for specification purposes. A real implementation uses only the contexts of the actual slice type.

If the syntax element 'cabac_init_flag' is true, a P slice initializes its contexts as if it was a B slice and vice-versa. An I slice is always initialized as an I slice.

Table 9-4 specifies the context index ranges corresponding to each slice type (initType). Tables 9-5 to 9-31 specify the initial MPS and probability for each context, marshalled in a single value (see clause 9.3.2.2).

Table 9-32 specifies the binarisation of each syntax element. ctxIdxOffset is the offset of the first context in each slice type, e.g. 1 for part_mode in a P slice. maxBinIdxCtx is described below.

Table 9-37 specifies the context index of each bin of a syntax element. The leftmost bin of a syntax element has bin index 0 in table 9-37. For each bin, the table contains a context index increment (ctxIdxInc). The context index of a bin is ctxIdxOffset + ctxIdxInc. For example, the second bin of part_mode has ctxIdxInc = 1, so its context index in a P slice is 1 + 1 = 2. If a bin has a bin index greater than maxBinIdxCtx, it is encoded the same way as the bin at bin index maxBinIdxCtx. 'na' means that no bin exists for the current index. 'bypass' means to encode the bin without a probability context.

9.3: CABAC parsing process for slice segment data

Resume: decode the current syntax element.

  1. Initialize the CABAC state if this is the beginning of a slice segment/tile/WPP row (9.3.2).
  2. Get the binarization data of the element (9.3.3).
  3. Decode each bin of the element (9.3.4).
  4. If the element is pcm_flag, reinitialize the BAC (9.3.2.5).
  5. Save the context variables if this is the end of a WPP row or slice segment (9.3.2.3).

9.3.2.1: General (Initialization process)

Resume: initialize the CABAC state.

  • If this is the beginning of a WPP row:
    • If the bottom-right pixel of the top-right CTB is available according to 6.4.1:
                        +--+--+
                        |  |  |
                        |  | X| <= bottom-right pixel of the second CTB
                        +--+--+    of the previous CTB row
                        |  |
                        |  |    <= first CTB of the current CTB row
                        +--+
      • Import the context variables of the second CTB of the previous CTB row after encoding that CTB.
    • Else:
      • Initialize the context variables according to 9.3.2.2.
  • Else if this is the beginning of a dependent slice segment:
    • Import the context variables of the previous slice segment.
  • Else:
    • Initialize the context variables according to 9.3.2.2.
  • Initialize the BAC (9.3.2.5).

9.3.2.2: Initialization process for context variables

Resume: parse the CABAC initialization value from a table.

  • pStateIdx: probability state index in a context.
  • valMPS: 1 if the predicted bin value in a context is 1.
  • initValue: the value being parsed in the initialization table.
  • The context probability is scaled with the slice QP:
    • slopeIdx = initValue >> 4.
    • offsetIdx = initValue & 15.
    • m = slopeIdx*5 - 45.
    • n = (offsetIdx << 3) - 16.
    • preCtxState = Clip3(1, 126, ((m * Clip3(0, 51, SliceQPY)) >> 4) + n).
    • valMPS = (preCtxState <= 63) ? 0 : 1.
    • pStateIdx = valMPS ? (preCtxState - 64) : (63 - preCtxState).
  • initType is derived as follows:
    • If slice_type == I: initType = 0
    • Else if slice_type == P: initType = cabac_init_flag ? 2 : 1.
    • Else: initType = cabac_init_flag ? 1 : 2.
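
A direct C transcription of the derivation (the function name is ours; initValue comes from tables 9-5 to 9-31):

    /* Derive valMPS and pStateIdx from the 8-bit initValue and the slice QP. */
    static void init_context(int init_value, int slice_qp,
                             int *p_state_idx, int *val_mps)
    {
        int slope_idx  = init_value >> 4;
        int offset_idx = init_value & 15;
        int m = slope_idx * 5 - 45;
        int n = (offset_idx << 3) - 16;
        int qp = slice_qp < 0 ? 0 : slice_qp > 51 ? 51 : slice_qp;  /* Clip3(0, 51, ...) */
        int pre = ((m * qp) >> 4) + n;
        if (pre < 1)   pre = 1;                                     /* Clip3(1, 126, ...) */
        if (pre > 126) pre = 126;
        *val_mps     = pre <= 63 ? 0 : 1;
        *p_state_idx = *val_mps ? pre - 64 : 63 - pre;
    }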

9.3.2.3: Storage process for context variables

Resume: save every context variable in a snapshot.

9.3.2.4: Synchronization process for context variables

Resume: import every context variable from a snapshot.

9.3.2.5: Initialization process for the arithmetic decoding engine

Resume: initialize the binary arithmetic (de)coder (BAC).

The range and offset variables are the usual interval division parameters of a BAC.

9.3.3: Binarization process

Resume: lookup table 9-32.

  • Process the bins:
    • Bypass-coded (9.3.3.2 through 9.3.3.4).

      Note that all these subclauses are clear enough in the spec.

    • Context-coded (9.3.3.5 through 9.3.3.9).

9.3.3.9: Binarization process for coeff_abs_level_remaining

Resume: binarise the absolute offset of the n-th coefficient.

  1. The output is a top prefix and possibly a top suffix.
  2. If this is the last coefficient (n == 15):
    • cLastAbsLevel = 0.

      Track the previous coefficient absolute value, if any.

    • cLastRiceParam = 0.

      Track the Rice parameter (see below).

  3. Else:
    • cLastAbsLevel = absolute value of coefficient 'n + 1'.
    • cLastRiceParam = cRiceParam of coefficient 'n + 1'.
  4. cAbsLevel = baseLevel + coeff_abs_level_remaining[n].
  5. Threshold = 3 * (1 << cLastRiceParam).
  6. cRiceParam = Min(cLastRiceParam + (cLastAbsLevel > Threshold), 4).

    If the previous coefficient busts the threshold, increment the Rice parameter up to 4.

  7. cMax = 4 << cRiceParam.
  8. Binarize the top prefix:
    • prefixVal = Min(cMax, coeff_abs_level_remaining).
    • Apply Truncated Rice binarization (9.3.3.2) on prefixVal with cMax and cRiceParam as inputs.
  9. Binarize the top suffix:
    • If top prefix == "1111":
      • suffixVal = coeff_abs_level_remaining - cMax.
      • Apply ExpGolomb binarization (9.3.3.3) on suffixVal with k = cRiceParam + 1.
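
The following C sketch summarizes steps 5 to 9 (names are ours; the TR and EGk binarizations themselves are left to 9.3.3.2 and 9.3.3.3):

    /* Rice parameter adaptation (steps 5-6); the inputs belong to the
     * previously binarized coefficient (both 0 when there is none). */
    static int update_rice_param(int c_last_abs_level, int c_last_rice_param)
    {
        int threshold = 3 * (1 << c_last_rice_param);
        int rice = c_last_rice_param + (c_last_abs_level > threshold);
        return rice < 4 ? rice : 4;   /* incremented up to a maximum of 4 */
    }

    /* Split coeff_abs_level_remaining into the TR prefix and, when the
     * prefix saturates ("1111"), the EG(rice + 1) suffix (steps 7-9). */
    static void split_remaining(int remaining, int rice,
                                int *prefix_val, int *has_suffix, int *suffix_val)
    {
        int c_max = 4 << rice;
        *prefix_val = remaining < c_max ? remaining : c_max;
        *has_suffix = remaining >= c_max;
        *suffix_val = *has_suffix ? remaining - c_max : 0;
    }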

9.3.4: Decoding process flow

Resume: decode the bins of a syntax element.

Loop while the current bin string does not match one of the bin strings produced by the binarization:

  1. Derive the appropriate context (9.3.4.2) for the current bin.
  2. Invoke the BAC process (9.3.4.3).
  3. Increment the binIdx.

9.3.4.2: Derivation process for ctxTable, ctxIdx and bypassFlag

Resume: lookup table 9-37.

9.3.4.3: Arithmetic decoding process

Resume: perform standard BAC processing.

The specification is clear enough if one understands BAC. A short review of BAC and CABAC is given here. Readers uncomfortable with BAC are invited to scour the internet for additional information on the subject. The second section introduces the differences between BAC and CABAC. The third and fourth sections are devoted to CABAC in the decoder and the encoder respectively.

BAC uses finite-precision real values in the range [0,1[ to represent symbols in binary form. Using multiple iterations, the initial range is recursively subdivided until all symbols in the message have been coded. To align with the terminology used in H.264, a symbol is a bin (a bit), and a message is a bin string. Once the recursive subdivision of the initial interval is done, the coding process searches for the shortest binary representation in that range.

An example is best suited to illustrate the process. As H.264 deals with binary strings, the alphabet consists of the symbols {0,1}. For the moment, consider that the probabilities of each symbol are fixed. CABAC relaxes this assumption. Let us assume that p(0) = 0.75, and p(1) = 0.25. This means that the symbol 0 occupies 75% of the range, no matter its width.

The width of the range (R) is given by R_max - R_min. Initially, R_min (the lower bound) is 0, and R_max (the upper bound) is 1 (R = 1). Coding the message 0010 would look like this:

R_max   1          3/4         9/16        9/16
       -+-      .  -+-      .  -+-........ -+-
        |     .     |     .     |           |
        |   .       |   .     1 |           |
        | .         | .         |           |
        +           +           +           +   === 135/256
        |           |           | .         |    ^
        |           |           |  .        |    |
        |           |           |   .       |    |
        |           |           |    .      |  final
     0  |         0 |           |     .   0 |
        |           |           |      .    |  range
        |           |           |       .   |    |
        |           |           |        .  |    |
        |           |           |         . |    v
       -+-........ -+-........ -+-         -+-  === 27/64
R_min   0           0           0         27/64

The first value (0) forces the use of the range [0,0.75[. That range is then subdivided according to the symbol probabilities. The second value (0) forces the use of the range [0,0.5625[. Again, that range is subdivided using the symbol probabilities. The third symbol (1) uses the range [0.421875,0.5625[. The last symbol (0) selects the final range [0.421875,0.52734375[. Finding that range is the first step of BAC. The second step is to find the shortest binary value that belongs to the range we have just found. There are numerous algorithms to do so. Here, we use a simple iterative process that stops when the value belongs to the range.

double lb = 0.421875;    // lower bound
double ub = 0.52734375;  // upper bound
double bin = 0;          // binary string
double frac = 0.5;       // first power of two (negative exponent)

while (bin < lb)
{
    if (bin + frac < ub)
    {
        bin += frac;
    }
    frac /= 2; // next (smaller) power of 2
}

The above algorithm outputs '1' (0.421875 < 0.5 < 0.52734375). This makes it possible to code a 4-bit binary string with a single bit.

The example above covers the basic idea of BAC. However, it is a little more complex in practice, as the end of the binary string needs to be signalled. Without such signalling, it wouldn't be possible to recover the encoded message, as the decoder wouldn't know when to stop. Using the same probabilities as above, the coded message 1 can be decoded as follows:

R_max   1          3/4         9/16        9/16      135/256    256/511
       -+-      .  -+-      .  -+-........ -+-      .  -+-      . -+-
        |     .     |     .     |           |     .     |     .    |
        |   .       |   .   0.5 | 1         |   .       |   .      |
        | .         | .         |           | .         | .        |
    3/4 +      9/16 +     27/64 +   135/256 +   256/511 +          +
        |           |           | .         |           |          |
        |           |           |  .        |           |          |
        |           |           |   .       |           |          |
        |           |           |    .      |           |          |
    0.5 | 0     0.5 | 0         |     . 0.5 | 0     0.5 | 0        |
        |           |           |      .    |           |          |
        |           |           |       .   |           |          |
        |           |           |        .  |           |          |
        |           |           |         . |           |          |
       -+-........ -+-........ -+-         -+-........ -+-....... -+-
R_min   0           0           0         27/64       27/64

The input bit string '1' actually represents 0.5. The initial range is first subdivided with p(0)=0.75 and p(1)=0.25. The range containing the value 0.5 is selected, and the bit associated to that range is the output. The selected range is then subdivided following the exact process used during the encoding process. At each step, a bit is returned, and the current range is again subdivided. This leads to the following decisions:

0.5 is in the range [0,0.75[, output 0, subdivide the range [0,0.75[.
0.5 is in the range [0,0.5625[, output 0, subdivide the range [0,0.5625[.
0.5 is in the range [0.421875,0.5625[, output 1, subdivide the range.
0.5 is in the range [0.421875,0.52734375[, output 0, subdivide the range.

Oh wait, shouldn't we stop? The original message is 0010, and that's what we have. To avoid this situation, CABAC, in H.264, works with states and offsets to include signalling into the design. More on that later.

For efficiency, the coding process can be combined with range selection. As the range is recursively subdivided, some of the bits can be eagerly outputted, as they become fixed. To do so, the encoder and the decoder simply need to keep track of the lower bound, the upper bound, and the range's width. Furthermore, numerical precision can be ensured with integers large enough to avoid rounding errors during subdivision. Outputting the fixed bits actually makes it possible to keep good precision. In H.264, the range is kept between 256 and 512 (see "Context-Based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard" by Detlev Marpe, Heiko Schwarz and Thomas Wiegand).

Additionally, we need to keep either the probability of the most probable symbol (MPS), or of the least probable symbol (LPS), so that the current interval may be subdivided into a low range, and a high range. As a convention, the low range is always associated to the MPS, rather than assigning it to 0 or 1. If fixed probabilities are used, then once assigned, the intervals will always refer to the same symbols.

When choosing the MPS, the lower bound does not change, and the upper bound is set to "lower bound + range * p(MPS)". On the other hand, when the LPS is selected, the upper bound does not change, and the lower bound is set to "lower bound + range * p(MPS)". The process is illustrated below. In both cases, the same multiplication applies. The range is then set to "upper bound - lower bound".

                   upper bound -+- ...........> -+- upper bound
                                |                |
                                |                +
                                |                |
                                |                |
  upper bound -+- <........... -+- ...........> -+- lower bound
               |                | \
               |                |  \
               +                |   \
               |                |    lower bound + Prob(MPS)
               |                |
               |                |
               |                |
               |                |
  lower bound -+- <........... -+- lower bound

After each interval subdivision, the most significant bit (leftmost bit) of the lower and upper bounds is checked. This is done to avoid overflow, and to make sure that the range has sufficient precision for the next subdivision. If all is well, we should observe that the leftmost bit of the lower bound is 0, and that the leftmost bit of the upper bound is 1 (again, this has to do with the fact that the range in H.264 is between 256 and 512). When the leftmost bits of both boundaries match, the bit is outputted according to the following:

When encoding the next bit:

  1. We adjust Low and High according to the probability and the bit value.
  2. There are three cases:
    • Case 0: the leftmost bit of Low and High is 0 and 0:
      • Low = 0l...
      • High = 0h...
      • We output bit 0, plus the outstanding bits.
      • We shift Low and High left by 1 and reprocess until l != h.
    • Case 1: the leftmost bit of Low and High is 1 and 1:
      • Same as above with output bit 1.
    • Case 2: the leftmost bit of Low and High is 0 and 1:
      • Low = 0l...
      • High = 1h...
      • We can't output a bit yet.
      • Case A: h == 1 || l == 0:
        • Low and High are far apart. No problemo.
      • Case B: h == 0 && l == 1:
        • Low and High are getting dangerously close (bit buffer overflow).
        • Add one outstanding bit.
        • Remove l and h.
        • Reprocess case 2.
  3. Example to understand the outstanding bits. Note, 3 outstanding bits are used in the example, but the actual number depends on the situation.
    • Actual situation (low and high are very close, overflow).
      • Low = 0[111]0 <= the outstanding bits are within the brackets.
      • High = 1[000]0
    • Remove the outstanding bits:
      • Low = 00
      • High = 10
    • Now suppose high is lowered by encoding the next symbol:
      • Low = 00
      • High = 01
    • Put back the outstanding bits:
      • Low = 0[111]0
      • High = 0[111]1
    • Output:
      • 0[111].

A conventional BAC selects a sequence of bits which identifies a particular location in an interval, as shown in the previous section. The CABAC algorithm follows the same principle. In addition to the range boundaries, and the range width (the difference between the upper and lower boundaries), both the encoder and the decoder need to consider the effect of choosing the MPS or the LPS.

A compromise between fast adaptation and accurate probability estimates was made during the design of the H.264 CABAC engine. To avoid asking the codec to do extensive bookkeeping to keep track of the occurrences of 0s and 1s, 64 states were created. These states describe the probability of the LPS, which ranges between 0.01875 and 0.5 inclusively. The idea here is that the selection of a symbol (0 or 1) makes that symbol more probable, and its counterpart less probable. For instance, if both 0 and 1 are equiprobable (e.g. p(0) = p(1)), and a 0 is selected, the new probabilities should no longer be equiprobable (e.g. p(0) > p(1)).

Because the MPS may toggle between 0 and 1, the codec has to keep track of the state, and the MPS. The MPS can only change when in the equiprobable state. Moreover, predefined tables for the LPS probabilities, as well as state transitions, are listed in the H.264 standard.

CABAC in an H.264 decoder is very simple. The MPS subdivides the interval in a low part and a high part. Using the width of the range, an offset indicating where the MPS ends, and where the LPS begins is retrieved from a precomputed table. If the offset is lower than the MPS, the decoder selects the low part of the interval, otherwise it selects the high part.

When the interval becomes too small (e.g. the interval is less than 256), the decoder doubles both the offset and the range to add precision (the ratio is preserved). The value of 256 is coupled to the quantization function used to compute the offset "(R >> 6) & 3". If R falls below 256, then the operation can no longer yield indices 2 and 3. The decoder also reads a bit and adds it to the offset to steer it toward the final interval location.

  • Offset: determine which portion of the interval is selected.
  • Range: length of the interval.

The MPS is selected when the offset is lower than the MPS value at the current range. This corresponds to the selection of the low interval.

When an MPS is selected (low interval):

  • Set the range to the value of the MPS.

When an LPS is selected (high interval):

  • Decrease the offset by the value of the MPS.
  • Set the range to the value of the LPS.

When the range falls below 256, the precision of the interval is increased until the range is greater than or equal to 256:

  • Double the offset and the range:
    • Notice that this does not change the ratio between the offset and the range (the offset is not allowed to overflow).
  • Read a single bit.
  • Add the bit to the offset.
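
A small C sketch of this renormalization, with read_bit as a hypothetical bitstream reader:

    extern int read_bit(void);   /* hypothetical bitstream reader */

    static void renorm_decoder(unsigned *range, unsigned *offset)
    {
        while (*range < 256) {
            *range <<= 1;                           /* double the range...      */
            *offset = (*offset << 1) | read_bit();  /* ...and the offset, read  */
        }                                           /* a bit to steer it toward */
    }                                               /* the final interval       */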

CABAC in an encoder is more complex. Like the decoder, the MPS subdivides the interval in a low part and a high part. If the MPS is selected, the encoder adjusts the high value, otherwise it adjusts the low value.

The initial size of the interval is logically 511.999 periodic. The range is expected to be greater than 256, and below 512 at all times. For efficiency, the CABAC encoder rounds that value to 510. This has to do with the fact that BAC (and CABAC) outputs a real decimal value in the range [0.0, 1.0[. Notice that the range excludes 1, since the binary string with all '1' only converges toward 1. Subtracting 1 from 511, we find 510, the value used by the H.264 implementation. This error has negligible effects. For unifying the code, the encoder operates one iteration "late". The bit outputted in the current iteration corresponds to the interval selection of the previous iteration, whose size is ~1024. For the very first iteration, the low part of the previous iteration is always selected, so the first bit outputted by the encoder is discarded.

When the current interval becomes too small (i.e. less than 256), the encoder has to adjust the low and the high values to add precision. If the high value is less than half of the length of the previous interval, the encoder outputs 0 to select the low part. Conversely, if the low value is more than half of the length of the previous interval, the encoder outputs 1 to select the high part.

Thus, the bit to output corresponds to the leftmost bit (value 512) of the register containing the low value. For simplicity we number the bits from left to right, so that the leftmost bit is bit 0. Once the value of bit 0 is known (i.e. it cannot change), it is removed from the low register to prevent the register from growing indefinitely. If bit 0 is 0, there is nothing to remove, otherwise 512 must be subtracted. Since the encoder infers the high value from the low value and the range, clearing bit 0 in the low register clears the corresponding bit in the nominal high register.

If the low value is below the half and the high value is above the half of the previous interval, no decision can be taken yet. Bits 0/1 of the low value are 0/1 and bits 0/1 of the high value are 1/0 since the low and high values are close to the half. To prevent the low and high values from growing indefinitely, the encoder removes bit 1 from the low value, whose value is known to be 1. When the value of bit 0 is known in a subsequent normalization operation, the following logic applies. Bit 0 (leftmost) of the MPS is always 0. Thus, if bit 0 is 1, then it necessarily changed value because a carry operation took place. If the encoder had not removed bit 1 earlier, its value would have carried over from 1 to 0 and the bit 0 would have become 1. So, the bits to output are 1 and 0. Conversely, if bit 0 is 0, then no carry operation took place and the bits to output are 0 and 1.

  • Low: reverse of the offset.
  • Range: length of the interval.

When an MPS is selected (low interval):

  • Set the range to the value of the MPS.

When an LPS is selected (high interval):

  • Increase the low by the value of the MPS.
  • Set the range to the value of the LPS.

When the range falls below 256, the precision of the interval is increased until the range is greater than or equal to 256:

  • If Low < 256:
    • Write bit 0 + outstanding.
  • Else if Low >= 512:
    • Write bit 1 + outstanding.
    • Low -= 512.
  • Else:
    • Low -= 256.
    • Outstanding++.
  • Double Low and the range.

Bit writing logic (for bit B):

  • If the first bit is being written:
    • Do not write it.
  • Else:
    • Write B.
  • Write the outstanding bits, whose value is !B.
  • Outstanding = 0.
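
To make the renormalization concrete, here is a hedged C sketch of the steps above (write_bit is a hypothetical bitstream writer; all other names are ours):

    extern void write_bit(int b);   /* hypothetical bitstream writer */

    static unsigned outstanding = 0;
    static int      first_bit   = 1;

    static void put_bit(int b)
    {
        if (first_bit)
            first_bit = 0;           /* the very first bit is discarded */
        else
            write_bit(b);
        while (outstanding > 0) {    /* the outstanding bits are !B */
            write_bit(!b);
            outstanding--;
        }
    }

    static void renorm_encoder(unsigned *low, unsigned *range)
    {
        while (*range < 256) {
            if (*low < 256) {
                put_bit(0);
            } else if (*low >= 512) {
                put_bit(1);
                *low -= 512;
            } else {
                *low -= 256;
                outstanding++;
            }
            *low   <<= 1;            /* double Low and the range */
            *range <<= 1;
        }
    }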

A.3: Profiles

Resume: constraints the encoder must obey for a profile:

  • RawCtuBits (Size of a YUV CTB in bits)
        CtbSizeY*CtbSizeY*BitDepthY + 2*(CtbWidthC*CtbHeightC)*BitDepthC.
  • Main profile:
    • 4:2:0 with 8-bit pixels.
    • No 8x8 CTBs.
    • WPP and tiles are mutually exclusive.
    • The minimum tile size is 256x64 unless there is only one tile.
    • Each CTB must take less than 5/3*RawCtuBits bits in the bitstream (a worked example follows this list).
  • Main 10 profile:
    • Same as main profile, but 8-bit, 9-bit or 10-bit pixels are allowed.
  • Main Still Picture profile:
    • Same as main profile, but only one frame may be encoded.
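
As a worked example of the RawCtuBits constraint: a 64x64 CTB in 4:2:0 with 8-bit pixels gives RawCtuBits = 64*64*8 + 2*(32*32)*8 = 49152 bits, so each coded CTB must fit in 5/3 * 49152 = 81920 bits.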

A.4.1: General tier and level limits

Resume: constraints all profiles must obey.

  • Table A-1:
    • MaxLumaPs: maximum size of a frame in luma pixels.
    • MaxCPB: size of the coded picture buffer (CPB) in kbits.
    • MaxSliceSegmentsPerPicture: maximum number of slice segments per frame.
    • MaxTileRows/MaxTileCols: maximum number of tile rows/columns.
  • Maximum frame size in luma pixels: PicSizeInSamplesY <= MaxLumaPs.
  • Minimum/maximum aspect ratio:
    • pic_width_in_luma_samples <= Sqrt(MaxLumaPs*8).
    • pic_height_in_luma_samples <= Sqrt(MaxLumaPs*8).
  • sps_max_dec_pic_buffering <= MaxDpbSize (see the spec for the computation).
  • If level >= 5: CTBs must be 64x64 or 32x32.
  • Up to 8 frames may be used for reference.
  • num_tile_rows/columns_minus1 < MaxTileRows/Cols.

A.4.2: Profile-specific level limits for the Main and Main 10 profiles

Resume: additional constraints the main profiles must obey.

  • fR = 1/300: assumed minimum frame duration (i.e. maximum 300 frames/sec).
  • NumBytesInNALunit: size of an encoded frame in the bitstream.
  • Table A-2:
    • MaxLumaSR: maximum number of luma pixels processed per second (SR = sample rate).
    • MaxBr: maximum bitrate in kbits/sec.
    • MinCR: minimum frame compression ratio, i.e. the ratio between the YUV frame size and the encoded frame size in the bitstream.

The following assumes a constant frame rate (CFR) and a corresponding frame duration (FD=1/CFR).

  • Maximum number of luma pixels processed per second:
    • FD >= PicSizeInSamplesY/MaxLumaSR.
    • The constraint applies both to the frame linger time in the CPB and the DPB. See the specification.
  • Maximum number of slice segments in a frame:
    • FrameSegments <= Min(MaxSliceSegmentsPerPicture*MaxLumaSR/MaxLumaPS*FD, MaxSliceSegmentsPerPicture).
  • Maximum size of an encoded frame in the bitstream:
    • NumBytesInNALunit <= 1.5*MaxLumaSR*FD/MinCR.
  • Maximum number of tiles in a frame:
    • FrameTiles <= Min(MaxTileCols*MaxTileRows*120*FD, MaxTileCols*MaxTileRows).

Hypothetical Reference Decoder

Resume: The hypothetical reference decoder (HRD) is basically an evolved video buffering verifier. It is used to check the conformance of the decoder to a specific profile and level to make sure that the video can be correctly displayed (i.e. smooth playback without stalling).

The document "A Generalized Hypothetical Reference Decoder for H.264/AVC" is a good place to start to better understand the HRD. The paper"An Improved Hypothetical Reference Decoder for HEVC" presents additional information to better understand the HRD with the new HEVC coding tools.

C.1 General

Resume: Description of stream types and how tests are set up.

There are two types of streams:

  • Type I: only VCL NAL units and filler data units (non-VCL).
  • Type II: both VCL and non-VCL NAL units.

The test setups are written to work with sublayers. In the current version of the spec (Draft 10), nuh_layer_id is always 0 (no sublayers). For the moment, let us assume that this will always be true. Going through the 8 points in the list, we have the following simplified steps:

  1. OpLayerIdList = [0].
    OpTid = 0.
  2. TargetDecLayerIdList = OpLayerIdList = [0].
    HighestTid = OpTid = 0.
    The output of the sub-bitstream extraction process is identical to the input.
    BitstreamToDecode is the initial one.
  3. TargetDecLayerIdList = [0], the only valid nuh_layer_id.
    The HRD parameters sent in the SPS are the ones to use.
    • If the stream is Type I, use the VCL HRD parameter set.
    • If the stream is Type II, use either the VCL or the NAL HRD parameter set, according to what is signaled.
  4. Select the first access unit (VCL NAL unit containing the first CTB of a picture).
  5. Select the buffering period (the amount of time to wait before the encoder starts transmitting coded data to the decoder, and the amount of time the decoder has to wait before starting to decode the received information).
    Select the SEI timing info (if more than one slice per picture is used, also select the "sub pic" timing info).
  6. Choose the scheduling info (i.e. the transmission plan to see if playback can be ensured under the currently selected conditions of buffering, timing and transmission rate).
  7. There's a typo in point 7, as both conditions check for NalHrdModeFlag == 1...
  8. Select the operation mode. Either the HRD assumes that each VCL NAL carries an entire picture (simple scenario as operations on the CPB and DPB are all linked to one NAL unit), or the HRD assumes that slices are used (more complicated for the operations on both the CPB and DPB).

Next, derive the number of conformance tests required. There can be numerous tests, as we can determine if the stream conforms to different transmission parameters (e.g. bigger buffer and slower transmission rate VS. smaller buffer and faster transmission rate). Note that the bitrate of the coded video and the transmission rate are two separate concepts. For instance, if the bitrate is higher than the transmission rate, we know that we will have to delay the start of the playback if the video is streamed. On the other hand, if the bitrate is less than the transmission rate, then playback can start pretty much as soon as the bits of the first packet are received.

Finally, determine if the bytes associated to the non-VCL NAL units count in the transmission statistics or not.

Then process the stream, checking that the CPB does not overflow or underflow, and that the state of the DPB is correctly set for each picture (i.e. check that a picture does not reference a picture that was previously marked as unused for reference). Here, the spec also uses "decoding unit m", which refers to a coded slice when multiple slices are used per picture. Picture the following, where each picture is coded using three slices:

         Association of decoding units to access units
+--------------+-+-+-+-+-+-+-+-+-+-+--+--+--+--+--+---+-+---+---+---+
|  Access unit |0|0|0|1|1|1|2|2|2|3| 3| 3| 4| 4| 4|...|n| n | n |...|
+--------------+-+-+-+-+-+-+-+-+-+-+--+--+--+--+--+---+-+---+---+---+
|Decoding unit |0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|...|m|m+1|m+2|...|
+--------------+-+-+-+-+-+-+-+-+-+-+--+--+--+--+--+---+-+---+---+---+

The acronym HSS stands for Hypothetical Stream Scheduler, i.e. the transmission plan.

C.2 Operation of Coded Picture Buffer

Resume: Track the time at which NAL units enter and leave the CPB. When a CPB overflow occurs, change the schedule used for transmission by the HSS.

C.2.2 Timing of decoding unit arrival

Resume: Determine the arrival time of the first and last bit of an access unit (or a decoding unit).

  1. Determine the initial removal delay and the delay offset.
    • If the very first access unit is BLA_W_RADL or BLA_N_LP and the SEI message specifies an alternate initial removal delay.
    • Or if the very first access unit is BLA_W_LP or CRA and the SEI message specifies an alternate initial removal delay.
    • Or sub pictures are used and this is not the first decoding unit of the current access unit.
      • InitCpbRemovalDelay[SchedSelIdx] is set to the value specified in the SEI message.
      • InitCpbRemovalDelayOffset[SchedSelIdx] is set to the value specified in the SEI message.
    • Else
      • InitCpbRemovalDelay[SchedSelIdx] is set to the value specified in either the NAL parameter set (NalHrdModeFlag = 1), or the VCL parameter set (NalHrdModeFlag = 0).
      • InitCpbRemovalDelayOffset[SchedSelIdx] is set to the value specified in either the NAL parameter set (NalHrdModeFlag = 1), or the VCL parameter set (NalHrdModeFlag = 0).
  2. Derive the initial arrival time of a decoding unit.
    • If we assume constant bit rate:
      • The initial arrival time is set to the final arrival time of the previous access/decoding unit.
    • Else (we assume variable bit rate):
      • The initial arrival time is set to the latest of either the final arrival time of the previous access/decoding unit or the earliest arrival time of the access/decoding unit.
  3. Derive the final arrival time of a decoding unit.
    • The final arrival time is set to the initial arrival time plus the time it takes to receive the packet according to the bitrate (sizeInBits / BitRate).

When the HSS changes the transmission parameters (a different selected schedule index), the spec assumes the following:

  • The new bit rate comes into effect at the init CPB arrival of the current access unit. That is, all the decoding units of an access unit are sent with the same bit rate.
  • If the CPB size increases, the new value takes effect at the arrival time of the current access unit.
  • If the CPB size decreases, the new value takes effect at the removal time of the current access unit.

C.2.3 Timing of decoding unit removal and decoding of decoding unit

Resume: Determine when a unit may be removed from the CPB for decoding purposes.

  1. Init the removal delay and offset the same way as the previous section.
    • If the alternate values specified in the SEI message are used, init CpbDelayOffset and DpbDelayOffset to the values specified in the SEI message.
    • Else, both CpbDelayOffset and DpbDelayOffset are set to 0.
  2. Determine the nominal removal time.
    • If we are dealing with the very first access unit, set the nominal removal time to InitCpbRemovalDelay / 90000.
    • Else, we are dealing with an access unit that does not init the HRD.
      • See the spec for the correct computation. Note that ClockTick is specified in a VUI message.
  3. Determine the CPB removal time.
    • See the spec for the correct computation.
  4. Remove the current access unit from the CPB:
    • When dealing with access units, instantaneous decoding is possible at the CPB removal time.
    • When dealing with decoding units, instantaneous decoding is possible at the CPB removal time. When the last decoding unit of the current access unit is done, the following occurs:
      • The current picture is considered decoded.
      • The final CPB arrival time, the nominal removal time, and the CPB removal time are all set with the values of the last decoding unit associated to the current access unit.

C.3 Operation of the decoded picture buffer

Resume: Parse the slice headers and update the DPB accordingly. Match the picture display to the removal time in the CPB and the display delay.

C.3.2 Removal of pictures from the DPB

Resume: Remove a picture from the DPB when it has to be displayed and update the DPB's status.

  1. Parse the RPS info from the slice header.
  2. Derive the value of NoOutputOfPriorPics.
    • Either the current picture is a CRA and NoOutputOfPriorPics is set to 1.
    • Or, the previously activated SPS changes for the one signaled in the current slice header, so NoOutputOfPriorPics is set to 1.
    • Otherwise, NoOutputOfPriorPics is set to 0.
  3. If NoOutputOfPriorPics is set to 1, flush the DPB, and set DPB fullness to 0.
  4. Track all pictures marked as "unused for reference".
    • Remove the marked pictures from the DPB if they are not to be outputted or if they are ready to be displayed according to their removal time.
    • Decrement the DPB fullness accordingly.

C.3.3 Picture output

Resume: Display a picture when its presentation time is reached. This only applies to pictures marked for display (i.e. PicOutputFlag is set to 1).

  1. Set the output time using the CPB removal time of the access unit associated to the current picture and the DPB display delay.
  2. Determine whether to output the picture or store it in the DPB for later use:
    • Output the picture if DpbOutputTime matches AuCpbRemovalTime.
      • Crop the picture according to the conformance window specified in the active SPS.
    • Store the picture in the DPB if PicOutputFlag is set to 0 (see next section).
    • Store the picture in the DPB if DpbOutputTime is greater than AuCpbRemovalTime.

C.3.4 Current decoded picture marking and storage

Resume: Pictures that will not be outputted need to be stored in the DPB until they are marked as "unused for reference" in the RPS. Each of these pictures takes one spot in the DPB.

C.4 Bitstream conformance

Resume: Make sure that the tools used in the bitstream match those signaled in the parameter sets. Make sure the CPB never overflows nor underflows. Make sure pictures marked as unused for reference are never used as reference afterwards.

  • The first picture shall be either an IDR, a CRA or a BLA.
  1. deltaTime90k = 90000 * (AuNominalRemovalTime[n] - AuFinalArrivalTime[n-1])
    • If the bit rate is variable, then:
      • The initial CPB removal delay happens before or at the same time as ceil(deltaTime90k).
    • If the bit rate is constant, then:
      • The initial CPB removal delay happens in between floor(deltaTime90k) and ceil(deltaTime90k).
  2. A CPB overflow (too many undecoded units for the decoder's buffer) will never happen.
  3. A CPB underflow (the decoder is ready to process but its buffer is empty) will never happen. If a unit (access or decoding) is being received (i.e. we know the init arrival time, but not the final arrival time), that counts as an underflow.
  4. When dealing with a low delay transmission scenario, the decoder can start decoding a unit while it is being written in the CPB. That is, the nominal removal time can precede the final arrival time.
  5. Starting with the second picture, the nominal and CPB removal times are subject to the profile/level constraints specified in sections A.4.1 and A.4.2.
  6. After removing pictures from the DPB (according to C.3.2), the number of pictures still in the DPB will be less or equal to the number specified in the active SPS (sps_max_dec_pic_buffering_minus1).
  7. All pictures required for inter prediction referred to in the RPS shall be present in the DPB.
  8. The difference between the highest and lowest PicOrderCnt values shall always be less than MaxPicOrderCntLsb/2.
  9. The rate at which the pictures are displayed shall respect the profile/level constraints imposed in A.4.1. This is expressed through the difference of the DpbOutputInterval values of two consecutive pictures with PicOutputFlag set to 1.
  10. When dealing with decoding units in a low delay transmission scenario, the total removal delay is set to the sum of the CPB delay increments of all the decoding units associated to the current access unit. The sum multiplied by ClockSubTick (a value set by the VUI parameters) shall be equal to the difference between the time when the access unit is ready for CPB removal and the CPB removal time of the first decoding unit.
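
Constraint 1 can be sketched as follows (illustrative helper, not spec code; times in seconds, the delay in 90 kHz ticks; compile with -lm):

    #include <math.h>
    #include <stdbool.h>

    /* C.4, constraint 1: bound the initial CPB removal delay of access
     * unit n by the gap between its nominal removal time and the final
     * arrival time of access unit n-1, re-expressed on a 90 kHz clock. */
    static bool check_initial_removal_delay(double au_nominal_removal_time,
                                            double prev_au_final_arrival_time,
                                            double init_cpb_removal_delay,
                                            bool cbr)
    {
        double delta_time_90k =
            90000.0 * (au_nominal_removal_time - prev_au_final_arrival_time);

        if (cbr)  /* constant bit rate: pinned between floor and ceil */
            return init_cpb_removal_delay >= floor(delta_time_90k) &&
                   init_cpb_removal_delay <= ceil(delta_time_90k);
        /* variable bit rate: only the upper bound applies */
        return init_cpb_removal_delay <= ceil(delta_time_90k);
    }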

C.5 Decoder conformance

Resume: Make sure that a decoder claiming conformance to a target triplet (profile,level,tier) is truthful. Build conforming streams and give them to the decoder.

C.5.1 General

Resume: What a decoder needs to check conformance.

  • All VPSs, SPSs, and PPSs referred to in the stream shall be provided to the decoder.
  • Sufficient buffering is assumed to hold the stream according to the timing information in the SEI messages (the timing can be specified by external means).
  • Syntax elements and NAL units that use reserved values shall be ignored by a decoder checking for conformance.
  • A decoder can check for timing conformance or output order conformance.
    • Output order conformance checks that the outputted samples match what the spec would produce. In other words, the decoder must be bit exact with the reference software.
    • Timing conformance checks that the delivery of the stream will work under the profile/level constraints imposed in Annex A to make sure the maximum supported bit rate and maximum buffer size are respected. (Note: Timing constraints are different from what section C.2 checks.)

C.5.2.2 Output and removal of pictures from the DPB

Resume: After parsing the slice header, but before processing the CTBs, remove pictures marked as "unused for reference" from the DPB.

  • Process the RPS information in the slice header.
  • If the current picture is IRAP and NoRaslOutputFlag is set to 1:
    1. Derive NoOutputOfPriorPicsFlag (see C.3.2).
    2. If NoOutputOfPriorPicsFlag is set to 1:
      • Flush the entire DPB without outputting the pictures.
    3. Else:
      • Pictures marked as "unused for reference" or "not needed for output" are flushed from the DPB without being outputted.
      • The remaining pictures are "bumped" (see C.5.2.4) until the DPB is empty.
  • Else:
    • Flush all pictures marked as "unused for reference" and "not needed for output" from the DPB without outputting them.
    • Decrease the DPB fullness accordingly.
    • If the number of pictures marked as "needed for output" exceeds the SPS threshold (sps_max_num_reorder_pics), or at least one picture in the DPB marked as "needed for output" has a picture latency greater than or equal to the max latency defined in the active SPS, or the number of pictures stored in the DPB exceeds the limit (sps_max_dec_pic_buffering_minus1):
      • "Bump" pictures until the above condition is no longer true (the condition is sketched after this list).

C.5.2.3 Picture decoding, marking, additional bumping, and storage

Resume: Update the DPB when the last decoding unit of the current access unit is removed from the CPB.

  • Increment the picture latency by 1 for each of the pictures in the DPB marked as "needed for output" (the whole update is sketched after this list).
  • The current picture is considered decoded when the last decoding unit is decoded.
  • Store the current picture in an empty spot in the DPB.
    • If the picture is to be outputted (PicOutputFlag set to 1), mark it as "needed for output" and set the latency to 0.
    • If the picture is not to be outputted (PicOutputFlag set to 0), mark it as "not needed for output" and don't touch the latency.
  • Mark the current picture as "used for short term reference".
  • If the number of pictures marked as "needed for output" exceeds the threshold defined in the SPS (sps_max_num_reorder_pics), or there is at least one picture in the DPB marked as "needed for output" whose picture latency is greater than or equal to the max latency defined in the active SPS:
    • "Bump" pictures (see next section) until the above condition is no longer true.

C.5.2.4 "Bumping" process

Resume: Crop, output, and then release a picture from the DPB to make room for another picture.

  1. Find the picture with the smallest picture order count value among those marked as "needed for output".
  2. Crop the picture according to the conformance window settings in the active SPS, output the picture, then mark it as "not needed for output".
  3. Empty the DPB spot of any picture marked as "unused for reference" (the whole process is sketched below).
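
A sketch of one bumping iteration; crop_and_output() stands in for the conformance-window cropping and the actual display, and the types are illustrative:

    #include <stdbool.h>

    #define MAX_DPB_SIZE 16

    typedef struct {
        bool in_use;
        bool needed_for_output;
        bool unused_for_reference;
        int  poc;                   /* PicOrderCntVal */
    } Pic;

    static void crop_and_output(Pic *p)
    {
        (void)p;  /* conformance-window crop + display, omitted here */
    }

    /* C.5.2.4: output the pending picture with the smallest POC, then
     * free its spot if it is no longer referenced. */
    static void bump_one(Pic dpb[MAX_DPB_SIZE])
    {
        int best = -1;
        for (int i = 0; i < MAX_DPB_SIZE; i++)
            if (dpb[i].in_use && dpb[i].needed_for_output &&
                (best < 0 || dpb[i].poc < dpb[best].poc))
                best = i;
        if (best < 0)
            return;

        crop_and_output(&dpb[best]);
        dpb[best].needed_for_output = false;
        if (dpb[best].unused_for_reference)
            dpb[best].in_use = false;  /* the spot becomes free */
    }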

HRD variables

  • low_delay_hrd_flag:
    • True if big frames are allowed to arrive late.
    • Assumed to be false.
  • cpb_size_value_minus1: size of the CPB in bits (after the CPB size scale factor is applied).
  • MaxDpbSize (see specification, this is in terms of frames).
  • BitRate:
    • The current bit rate in bits/sec.
    • Assuming constant bit rate (CBR).
  • tc/Tc: clock tick. Only used for numerical precision.
  • b(m): size of the frame in the bitstream in bits.
  • initial_cpb_removal_delay_offset:
    • Time the encoder buffers the stream before sending it.
    • Assuming InitCpbRemovalDelay = initial_cpb_removal_delay.
  • initial_cpb_removal_delay:
    • Time the decoder buffers the stream before decoding it.
    • Assuming InitCpbRemovalDelay = initial_cpb_removal_delay.
  • cpb_removal_delay:
    • From the specification:

      cpb_removal_delay specifies how many clock ticks to wait after removal from the CPB of the access unit associated with the most recent buffering period SEI message in a preceding access unit before removing from the buffer the access unit data associated with the picture timing SEI message.

    • Assuming CpbRemovalDelay(m) = cpb_removal_delay.
  • dpb_output_delay:
    • Used to compute when a frame is displayed. Needed for B frame reordering.
    • From the specification:

      dpb_output_delay is used to compute the DPB output time of the picture. It specifies how many clock ticks to wait after removal of the last [picture] from the CPB before the decoded picture is output from the DPB.

  • tai(m):
    • tai = initial arrival time.
    • Time at which the first bit of the frame enters the CPB.
    • tai(0) = 0.
    • tai(m) = taf(m-1).
  • taf(m):
    • taf = final arrival time.
    • Time at which the last bit of the frame enters the CPB.
    • taf(m) = tai(m) + b(m)/BitRate.
  • tr,n(m):
    • tr,n = nominal removal time.
    • Time at which a frame should be removed from the CPB in general. In low delay mode, a big frame is allowed to arrive late. Assuming this doesn't happen.
    • tr,n(0) = InitCpbRemovalDelay/90000 (the delay is expressed in 90 kHz clock ticks).
    • tr,n(m) = tr,n(m-1) + Tc*CpbRemovalDelay(m).
  • tr(m):
    • tr = removal time.
    • Time at which a frame is actually removed from the CPB.
    • tr(m) = tr,n(m) in general.
  • to,dpb(n):
    • to,dpb = DPB output time.
    • to,dpb(n) = tr(n) + tc*dpb_output_delay(n).
    • Time at which a frame exits the DPB.
    • A frame exits the DPB as soon as it has been displayed and is no longer used for reference.
    • DELTAo,dpb(n) = to,dpb(n+1) - to,dpb(n).
    • The whole timing chain is exercised in the worked example below.
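
The formulas above chain together as follows. A self-contained worked example with made-up numbers (1 Mbit/s CBR, 25 fps, one clock tick of removal delay per frame; none of these values come from the spec):

    #include <stdio.h>

    int main(void)
    {
        const double bit_rate = 1000000.0;  /* BitRate, bits/sec */
        const double tc = 1.0 / 25.0;       /* clock tick Tc */
        const double init_delay = 0.2;      /* InitCpbRemovalDelay, sec */
        const int b[5] = { 120000, 30000, 30000, 90000, 30000 };  /* b(m) */
        const int out_delay[5] = { 1, 0, 0, 0, 0 };  /* dpb_output_delay */

        double taf_prev = 0.0, trn_prev = 0.0;

        for (int m = 0; m < 5; m++) {
            double tai = (m == 0) ? 0.0 : taf_prev;       /* tai(m) = taf(m-1) */
            double taf = tai + b[m] / bit_rate;           /* taf(m) */
            double trn = (m == 0) ? init_delay            /* tr,n(0) */
                                  : trn_prev + tc * 1.0;  /* CpbRemovalDelay = 1 */
            double to_dpb = trn + tc * out_delay[m];      /* to,dpb(m) */

            printf("frame %d: tai=%.3f taf=%.3f tr,n=%.3f to,dpb=%.3f\n",
                   m, tai, taf, trn, to_dpb);

            taf_prev = taf;
            trn_prev = trn;
        }
        return 0;
    }

With these numbers every taf(m) precedes tr,n(m), so the CPB never underflows; shrinking init_delay below 0.12 would make frame 0 miss its removal time.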
