RDO of H264


http://www.pixeltools.com/rate_control_paper.html



H.264 Rate-Distortion Optimization and Global Rate Control

H.264 provides 7 modes for inter (temporal) prediction, 9 modes for intra (spatial) prediction of 4x4 blocks, 4 modes for intra prediction of 16x16 macroblocks, and one skip mode. Each 16x16 macroblock can be partitioned in numerous ways. Thus, mode selection for each macroblock is a critical and time-consuming step that enables much of the dramatic bitrate reduction.

Selection of the optimal mode is done by an algorithm called rate-distortion optimization (RDO) [8], which essentially involves 1) an exhaustive pre-calculation of all feasible modes to determine the bits and distortion of each; 2) evaluation of a metric that considers both bitrate and distortion; and 3) selection of the mode that minimizes the metric.
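As an illustration of that three-step loop, here is a minimal C sketch of RDO mode selection. The metric is the Lagrangian cost J = D + lambda*R used in [8]; the mode_trial structure and function names are ours, not those of any particular encoder.

  /* Illustrative sketch of RDO mode selection.  For each candidate mode,
     the encoder has already measured the distortion of the reconstruction
     and the bits needed to code the mode, motion, and residual. */
  typedef struct {
      int    mode;        /* candidate prediction mode */
      double distortion;  /* e.g., SSD between source and reconstruction */
      int    bits;        /* bits consumed by this mode at the given QP */
  } mode_trial;

  /* Return the mode minimizing the Lagrangian cost J = D + lambda * R. */
  int rdo_select_mode(const mode_trial *trials, int n, double lambda)
  {
      int best = 0;
      double best_cost = trials[0].distortion + lambda * trials[0].bits;
      for (int i = 1; i < n; i++) {
          double cost = trials[i].distortion + lambda * trials[i].bits;
          if (cost < best_cost) {
              best_cost = cost;
              best = i;
          }
      }
      return trials[best].mode;
  }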

QP is an input to the RDO process; RDO itself neither regulates QP nor modifies the quality of the residual coefficients. RDO is complementary to rate control; these two aspects of the problem are decoupled because a fully coupled optimization would require a more expensive iterative solution.

The interplay with RDO, described in [4] as a "chicken and egg" dilemma, influences the implementation of a rate control algorithm. The MAD is needed by the rate control algorithm, but it is available only after the RDO has used a QP value to generate it. Thus, the rate control algorithm must use an estimate of MAD based upon the complexity of prior pictures in the sequence.
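A minimal sketch of that estimate, assuming the linear MAD prediction model of JVT-G012 [4]; the type and coefficient names here are ours:

  /* MAD prediction as described in [4]: estimate the current picture's
     (or basic unit's) MAD linearly from the MAD of the co-located region
     in the previous picture.  a1 and a2 are refreshed by regression after
     each picture is encoded; the initial values are assumptions. */
  typedef struct {
      double a1, a2;  /* linear model coefficients, e.g., starting at 1.0 and 0.0 */
  } mad_model;

  static double predict_mad(const mad_model *m, double mad_prev)
  {
      return m->a1 * mad_prev + m->a2;
  }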


ExpertH264 Implementation of Rate Control

PixelTools has implemented the H.264 rate control recommendations in a recent release of ExpertH264. For this release, we have provided picture-level control without frame skip. Especially for offline applications that encode to stored media, this algorithm provides excellent tracking of bitrates for GOPs of a wide variety of sizes.

Typical results track GOP bitrate within 1% without B pictures, or 2-3% with B pictures, with good stabilization of QP to prevent noticeable swings in quality. You can try this for yourself by requesting a free demo of ExpertH264 from PixelTools Corporation.

In subsequent releases, we plan to allow flexibility for smaller basic units, which will allow closer bitrate tracking at the individual picture level, as well as for smaller virtual buffer capacities. We will also support both frame skip and stuffing bits in a subsequent release; depending upon the end requirements, use of one or both of these techniques will reduce variations in bitrate.

The algorithm is a separate module with several interfaces that can be called by the encoder, plus callbacks to the encoder for retrieving key information such as residual bits and residual coefficients. Construction of the complexity metric (i.e., prediction-error MAD) is part of the rate control algorithm. The C interfaces and utility functions include:

  • init_rateControl
  • initRateControlParams
  • gopRateControl
  • frameRateControl
  • getQB
  • updateModel
  • updateBFrameState
  • getMbMAD
  • initialQP

Thus, developers of hardware and software encoders can consider integrating this algorithm into their own environments. For example, after the encoding step, a call to updateModel refreshes the empirical coefficients such as C1 and C2 in equation (2). Similarly, frameRateControl is called prior to encoding each picture and supplies the quantization parameter.
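Here is a hedged C sketch of that call sequence, built from the interface names listed above; every signature, the RcContext type, and the encode_picture callback are assumptions for illustration. (Equation (2) is not reproduced in this excerpt; in the JVT recommendation [4][5] it is the quadratic rate model R = C1*MAD/Qstep + C2*MAD/Qstep^2.)

  /* Hypothetical context type; the module's real types are not shown here. */
  typedef struct RcContext RcContext;

  /* Prototypes for the rate-control interfaces (signatures assumed). */
  void init_rateControl(RcContext *rc, double target_bitrate, double fps);
  void gopRateControl(RcContext *rc, int gop_size);
  int  frameRateControl(RcContext *rc);            /* supplies QP before encoding */
  void updateModel(RcContext *rc, int bits, double mad);

  /* Encoder callback, assumed: encodes one picture at the given QP and
     reports the bits produced and the prediction-error MAD. */
  void encode_picture(int qp, int *bits, double *mad);

  void encode_sequence(RcContext *rc, int n_gops, int gop_size)
  {
      init_rateControl(rc, /*target_bitrate=*/2.0e6, /*fps=*/30.0);
      for (int g = 0; g < n_gops; g++) {
          gopRateControl(rc, gop_size);            /* allocate the GOP bit budget */
          for (int p = 0; p < gop_size; p++) {
              int qp = frameRateControl(rc);       /* QP for this picture */
              int bits; double mad;
              encode_picture(qp, &bits, &mad);
              updateModel(rc, bits, mad);          /* refresh C1, C2 of eq. (2) */
          }
      }
  }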

Terminology
The following glossary is intended to help with a common understanding of rate control issues.

Prediction. Both H.264 and MPEG-* may predict a macroblock by traditional inter (temporal) prediction, i.e., motion estimation from previous reference pictures followed by transmission of the motion vector. Additionally, H.264 supports advanced intra (spatial) prediction of a macroblock from neighboring pixels that have already been encoded (e.g., in raster-scan order).

Residual. The difference between the source and prediction signals is called the residual, or the prediction error. A spatial transform is then applied to the residual to produce transformed coefficients that carry any spatial detail not captured by the prediction itself or its reference pictures.

Distortion. Distortion refers to the difference between the original source image x and the reconstructed image y after it has been decoded. In H.264, the sum of squared differences is used to quantify distortion as \( \frac{1}{N}\sum_{i=1}^{N} |y_i - x_i|^2 \), for any set of N pixels.

Complexity. As the saying goes, I can't define complexity, but I know it when I see it! A single source picture is complex if it is "busy" and has lots of spatial detail. The term spatial activity is synonymous with source complexity in this case. However, for a video sequence, the meaning of complexity is, well, more complex! For example, if a video sequence consists of one busy object that translates slowly across the field of view, it may not require very many bits, because the temporal prediction can easily capture the motion using a single reference picture and a series of motion vectors. It is difficult to define an inclusive video complexity metric that is also easy to calculate. See MAD.

MAD: Mean Absolute Difference of Prediction Error. For rate control, what matters more is the encoding complexity of the residuals that are left over after the inter or intra prediction process has finished. The Mean Absolute Difference of the prediction error is usually closely related to encoding complexity. Suppose \( x_i \) is the source value of the ith pixel and \( \hat{x}_i \) is its predicted value; then, for a set of N pixels:

\[ \mathrm{MAD} = \frac{1}{N} \sum_{i=1}^{N} \left| x_i - \hat{x}_i \right| \]

Spatial Activity. This term quantifies the amount of spatial variation within a part of the picture, normally a block of N pixels. Given N pixel values \( x_i \), \( i = 1, \dots, N \), the activity for those N pixels is \( \frac{1}{N}\sum_{i=1}^{N} (x_i - \langle x \rangle)^2 \), where \( \langle x \rangle = \frac{1}{N}\sum_{i=1}^{N} x_i \). In other words, the spatial activity is the sample variance of a block's values. It is the measure of local complexity used in MPEG-2.
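For concreteness, here is a minimal C sketch of the three per-block metrics just defined (distortion as mean squared difference, MAD of the prediction error, and spatial activity); the function names are ours:

  #include <stdlib.h>

  /* Distortion: mean squared difference between source x and reconstruction y. */
  double block_ssd(const int *x, const int *y, int n)
  {
      double acc = 0.0;
      for (int i = 0; i < n; i++) {
          double d = y[i] - x[i];
          acc += d * d;
      }
      return acc / n;
  }

  /* MAD of the prediction error: source x vs. prediction p. */
  double block_mad(const int *x, const int *p, int n)
  {
      double acc = 0.0;
      for (int i = 0; i < n; i++)
          acc += abs(x[i] - p[i]);
      return acc / n;
  }

  /* Spatial activity: the sample variance of the block's values. */
  double block_activity(const int *x, int n)
  {
      double mean = 0.0, acc = 0.0;
      for (int i = 0; i < n; i++)
          mean += x[i];
      mean /= n;
      for (int i = 0; i < n; i++)
          acc += (x[i] - mean) * (x[i] - mean);
      return acc / n;
  }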

Bitrate. Bitrate refers to the bits per second consumed by a sequence of pictures, i.e., bitrate = (average bits per picture) × (frames per second). In practice, it is equated to the reliable network bandwidth that is provisioned or available for the stream.

Quantization Parameter (QP). Residuals are transformed into the spatial frequency domain by an integer transform that approximates the familiar Discrete Cosine Transform (DCT). The Quantization Parameter determines the step size for associating the transformed coefficients with a finite set of steps. Large values of QP represent big steps that crudely approximate the spatial transform, so that most of the signal can be captured by only a few coefficients. Small values of QP more accurately approximate the block's spatial frequency spectrum, but at the cost of more bits. In H.264, each unit increase of QP lengthens the step size by 12% and reduces the bitrate by roughly 12%.
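The 12% figure follows from the step-size ladder defined by the standard: Qstep doubles for every increase of 6 in QP, so each unit step multiplies it by the sixth root of two:

\[ Q_{\mathrm{step}}(QP+6) = 2\,Q_{\mathrm{step}}(QP) \quad\Longrightarrow\quad Q_{\mathrm{step}}(QP+1) = 2^{1/6}\,Q_{\mathrm{step}}(QP) \approx 1.12\,Q_{\mathrm{step}}(QP). \]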

Group of Pictures (GOP). The Group of Pictures concept is inherited from MPEG and refers to an I picture, followed by all the P and B pictures until the next I picture. A typical MPEG GOP structure might be IBBPBBPBBI. Although H.264 does not strictly require more than one I picture per video sequence, the recommended rate control approach does require a repeating GOP structure to be effective. Thus, H.264 rate control will not work properly if the IntraPeriod parameter is set to 0.

Basic unit. The authors of references [4] and [5] introduced this useful term, which expresses the granularity at which QP is adjusted in the feedback control loop. If the basic unit is a picture, then the rate controller's adjustments to QP are uniform across the picture. In MPEG-2, the basic unit is a macroblock. Initially, most H.264 applications will probably use the picture as the basic unit, but ultimately a full or partial row of macroblocks is expected to yield the best compromise between uniform bitrate and uniform quality.


Summary

This white paper presents the basics of rate control for H.264 and compares them to the Test Model 5 approach of MPEG-2. Implementers needing a detailed description of the algorithm should see [5] or [6]. The structure shown in our Figure 5, the discussion of its modules, and the terminology glossary should provide a useful companion to help in understanding the densely packed equations found in these references.

References

1. C. Poynton, Digital Video and HDTV, Elsevier Science, 2003, pp. 491-492.
2. A. Vetro, "MPEG-4 Rate Control for Multiple Video Objects," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 1, February 1999.
3. G. Sullivan, T. Wiegand and K.P. Lim, "Joint Model Reference Encoding Methods and Decoding Concealment Methods; Section 2.6: Rate Control," JVT-I049, San Diego, September 2003.
4. Z. Li et al., "Adaptive Basic Unit Layer Rate Control for JVT," JVT-G012, 7th Meeting: Pattaya, Thailand, March 2003.
5. Z. Li et al., "Proposed Draft of Adaptive Rate Control," JVT-H017, 8th Meeting: Geneva, May 2003.
6. G. Sullivan, T. Wiegand and K.P. Lim, "Joint Model Reference Encoding Methods and Decoding Concealment Methods; Section 2.6: Rate Control," JVT-I049, San Diego, September 2003.
7. MPEG-2 Test Model 5, Rev. 2, Section 10: Rate Control and Quantization Optimization, ISO/IEC JTC1/SC29/WG11, April 1993.
8. T. Wiegand, H. Schwarz, A. Joch, F. Kossentini and G. Sullivan, "Rate-Constrained Coder Control and Comparison of Video Coding Standards," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, July 2003.

 


----------------------------------------------------------------------


http://akuvian.org/src/x264/trellis.txt


http://www.cim.mcgill.ca/~latorres/Viterbi/va_alg.htm


Notes on the implementation of trellis quantization in H.264
by Loren Merritt, 2005-11-03.

----------

While the decoding process is standardized, the encoder has complete leeway as to what it chooses to put in the bitstream. We would of course like the encoded stream to look similar to the input video, but there are still many choices to be made in all stages of the encoding process.

One such decision is which DCT coefficients to use. After selecting a macroblock type and motion vector, we compute the residual, which is the DCT of the difference between the input video and the inter prediction. Now we have to select the integer values used to represent those DCT coefficients: "quantization".

The most obvious scalar quantization method is division. For each coefficient, pick the quantized value that's closest to the desired value. This minimizes error at a given QP, but ignores bitrate.

A better method (the conventional one in most codecs) is "uniform deadzone". This works by biasing the result of division towards zero, because smaller values take fewer bits to code. This is RD-optimal assuming the coefficients are independent and follow a Laplacian distribution. (Exponential probabilities => the difference between the cost of N and the cost of N+1 is constant.)

JM includes an "adaptive deadzone", which allows the magnitude of the Laplacian to vary over time, and to vary for different frequencies. It still assumes independence between coefficients. In practice, this differs from uniform deadzone only at very high bitrates. (At low rates, the vast majority of coefficients are 0 or 1, so the spatial correlation matters much more than the order-0 distribution.)

Trellis further relaxes those assumptions: it uses the real CABAC costs, including the effect of other coefficients in the DCT block.

(Not implemented in x264 yet.) Lookahead-deadzone, instead of improving the estimate of bit cost, expands the scope of the optimization. Conventional RD considers the rate+distortion of a single macroblock at a time, holding all others constant. Lookahead directly includes the effect of quantization decisions on both the rate and distortion of future inter-blocks and neighboring intra-blocks.

Lookahead-trellis is theoretically possible, but might be computationally infeasible.

----------

Most literature assumes the distribution of coefficient magnitudes is i.i.d. Laplacian. This is a pretty good approximation.

If the entropy coder actually obeyed the Laplacian distribution, then uniform deadzone would be optimal. But the MPEG-1/2/4 residual coders differ from Laplacian in 2 ways:

1) They use VLC, so each token must take up a whole number of bits. For some coefficients, rounding up or down gives the same bit size, so you should round to nearest. For other coefficients, rounding down saves a whole 1 or 2 bits, so you should round with more bias than deadzone would.

2) They use run-level coding, so it's sometimes worth zeroing a coefficient in order to merge two zero runs. There are other such dependencies between the cost of coding a given magnitude and the number of adjacent zeros.

These have been exploited in [1].

H.264 CAVLC is similar, though more complicated. It shouldn't affect the potential gain of trellis, but will make it much harder to implement (maybe exponential time).

H.264 CABAC differs from Laplacian in other ways:

1) It does not require whole numbers of bits.

2) It does not use run-level coding; since CABAC doesn't have VLC's minimum bit cost per token, each coefficient can have its own significance flag. So you don't directly gain anything by merging runs.

3) If the local distribution of coefficients does not match the global Laplacian, or if the average value varies locally, CABAC adapts. This could also be handled by adaptive deadzone, though JVT-N011 takes a different approach than trellis: N011 sets the deadzone based on the current (pre-quantization) coefficient distribution and assumes that CABAC will adjust the bit costs to match, while trellis sets the effective deadzone based directly on the current CABAC bit costs.

4) The cost of coding a given nonzero coefficient depends on the number and magnitude of previous nonzero coefficients in the block.

So, MPEG-* trellis keeps a candidate encoding for each possible run length (magnitude can be decided once; no need to keep multiple versions). H.264 trellis keeps a candidate encoding for each possible combination of CABAC context numbers (point 4). Point 3 is impossible to globally optimize over: there are way too many possible states, so probably the best we can do is a greedy search.

----------

Goal: given a 4x4 or 8x8 block of DCT coefficients, and the initial CABAC context they will be encoded under, select quantized values for each coefficient so as to minimize the total rate-distortion cost.

Algorithm: Viterbi search / Dijkstra shortest path (with some simplifications due to the regularity of the graph).

The states to search over are (dct_index, cabac_context, level). This is implemented as a dynamic program, where any two states with the same (dct_index, cabac_context) are considered the same, and only the one with the best score is kept. The states are evaluated in decreasing order of dct_index, so the size of the search frontier is bounded by the number of different values of cabac_context (which is 8).

I chose decreasing dct_index because that's the order of magnitude coding in the real CABAC residual coding. Thus we can ignore the contents of each CABAC state, and let the entropy coder update them as normal. In the 4x4 transform, the nonzero and last_nonzero flags use a separate CABAC state for each coefficient, so their order of evaluation doesn't matter. In the 8x8 transform they are not all separate, and we would have to code them in reverse of the real bitstream order, so we have to approximate their states. My approximation is to simply not update the nonzero and last_nonzero CABAC states during the 8x8 trellis search. There might be better ways.

In a conventional DCT, the basis functions are orthonormal, so an SSD between original and dequantized DCT coefficients produces the same result as an SSD between original and reconstructed pixels. (This assumes no rounding error during the iDCT; rounding is negligible compared to quantization error.) H.264's transforms are not normalized, but they are still orthogonal, so the same principle works; it just requires a weighting based on the coefficient position [2].

I only search two possible levels for each coefficient. A larger search range is possible, but gave negligible PSNR improvement (.003 dB) and was very slow.

The pseudocode deals with coefficients' absolute values only. Signs are not entropy coded (always 1 bit each), so the optimal quantization always uses the same signs as the unquantized coefficients.

Implementation note: evaluating the CABAC cost of a decision is much faster than encoding that decision to the bitstream. In this algorithm we do not perform any bitstream writing. Rather, each cost evaluation can be a single table lookup of entropy as a function of CABAC state.

----------

Pseudocode:

typedef:
  node = ( context, cabac, score, levels )
    context  = ( # of coefs with level==1, # of coefs with level>1 ).
               There are only 8 different pairs for the purpose of cabac coding:
               (0,0), (1,0), (2,0), (>=3,0), (any,1), (any,2), (any,3), (any,>=4)
    cabac    = the states relevant to coding abs_level. (I don't have to store the
               nonzero and last_nonzero states in each node, since they aren't
               updated during the search.)
    score    = sum of distortion + lambda*rate over coefficients processed so far
    levels[] = the list of levels in the path leading to the current node

inputs:
  lambda     = the rate-distortion constant
  dct[]      = the absolute values from fdct before quantization, in zigzag scan order
  weight[]   = the normalization factors for fdct, in zigzag scan order
  quant_mf[] = the factors to divide by in conventional quantization, in zigzag scan order
  n_coefs    = the size of the dct block (64 (8x8), 16 (4x4), or 15 (4x4 AC))
  cabac_in   = the state of the cabac encoder

outputs:
  levels[]   = the absolute values of the quantized coefficients

code:
  nodes_cur[0] = { (.context = 0, .cabac = cabac_in, .score = 0, .levels = {}) }
  for( i = n_coefs-1; i >= 0; i-- )
    nodes_prev[] = nodes_cur[]
    q = round( dct[i] / quant_mf[i] )
    foreach level in { max(q-1,0), q }
      diff = ( dct[i] - level * quant_mf[i] ) * weight[i]
      ssd = diff**2
      foreach node in nodes_prev[] (copied, not aliased)
        node.score += ssd + lambda * (bit cost of coding level with node.cabac)
        update node.context and node.cabac after coding level
        append level to node.levels[]
        if( nodes_cur[node.context].score > node.score )
          nodes_cur[node.context] = node
  final_node = the element of nodes_cur[] with the lowest .score
  levels[] = final_node.levels[]

----------

[1] "Trellis-Based R-D Optimal Quantization in H.263+",
    J. Wen, M. Luttrell, J. Villasenor,
    IEEE Transactions on Image Processing, Vol. 9, No. 8, Aug. 2000.

[2] "Efficient Macroblock Coding-Mode Decision for H.264/AVC Video Coding",
    J. Xin, A. Vetro, H. Sun,
    http://www.merl.com/reports/docs/TR2004-079.pdf, Dec. 2004.
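For contrast with the trellis search above, here is a minimal C sketch of the two simpler quantizers these notes describe, nearest rounding and uniform deadzone. The bias constants mentioned in the comments are illustrative, not x264's actual implementation.

  #include <stdlib.h>

  /* Nearest-value quantization: minimizes distortion at a given QP,
     but ignores bitrate. */
  static int quant_nearest(int coef, int step)
  {
      int sign = coef < 0 ? -1 : 1;
      return sign * ((abs(coef) + step / 2) / step);
  }

  /* Uniform deadzone: bias the rounding point towards zero, since smaller
     values take fewer bits to code.  bias is in [0, 0.5); bias = 0.5
     reproduces nearest rounding.  Classic encoder defaults (illustrative)
     are around 1/3 for intra and 1/6 for inter blocks. */
  static int quant_deadzone(int coef, int step, double bias)
  {
      int sign = coef < 0 ? -1 : 1;
      return sign * (int)(abs(coef) / (double)step + bias);
  }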


