%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% - This chapter gives an informative overview - %
% -       of Dirac video coding concepts       - %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{comment}
\subsection{A general introduction to video compression for beginners}
This is a simple guide to video coding. You can skip this subsection
if you already know the basics.

Video signals are made up of a succession of still pictures, displayed
one after the other. Each picture is made up of a series of elementary
units called pixels, arranged in a raster of lines.

The raw total capacity required for the transmission of moving pictures
is the product of the number of pixels per picture, pictures per second,
colour components in use and the quantisation accuracy adopted. In nearly
every field of application, the resultant raw bandwidth required exceeds
that available by more than an order of magnitude. If we keep the same
size of picture, then the only variable in this simple system is the
quantisation accuracy, and lowering the accuracy rapidly leads to
poor-quality pictures.

The solution, in use since before 1970, is to make a prediction of the
value of each pixel using information that is available in the decoder.
In general, if the prediction is good, then the difference between
reality and the prediction is small. The entropy of the difference
signal is low, and so we need less capacity to deliver it. One simple
solution was to recognise that the quantisation accuracy required for
the difference signal could be coarser than that required for the
original picture, and so fewer bits are needed to deliver the
difference signal than the original.

This has formed the basis of many of the early video compression
systems. The elements that have changed over the years have been

\begin{itemize}
\item the algorithms used for making the prediction, and
\item the formats used for delivering the difference signal
efficiently to the receiver.
\end{itemize}

Many of the changes have been enabled by the availability of better
electronic circuitry. Real-time operation puts an upper limit on the
time available for processing. We can only use more sophisticated
algorithms if we can carry out the calculations in the time available.
Fortunately, the improvement in the speed of hardware has almost matched
the development of algorithms.

Early algorithms made use of simple predictions, predicting one pixel
from others nearby in the same picture (spatial prediction, in the
jargon).

Predictions were helped when field stores became affordable (and smaller
than a small garden shed) and it became possible to use information from
the previous field (temporal prediction, in the jargon). This process
works well when the picture is still, but is less effective when the
pictures depict a lot of movement.

This led to a whole raft of developments seeking to identify motion in
the picture. Knowing how parts of the scene are moving (motion
estimation, in the jargon) allowed much more accurate prediction than
had been possible hitherto. These developments led eventually to mature
products such as the MPEG-2 compression system.

However good the prediction, it will inevitably not be perfect. The
error signal is the signal we wish to convey to the decoder. In any
system, the raw error signal is at least as bandwidth-hungry as the
original, and in most cases slightly more so.

However, the error signal has some physical properties which help us to
reduce its bandwidth. In information-theory terms, the error signal is
rarely totally random: real pictures have properties which distinguish
them from random noise. We can therefore use some of these properties
to reduce the bit rate further.

The spectrum of the error signal is usually heavily biased towards the
low-frequency end. This is a direct consequence of the prediction
process usually being reasonably good. It also turns out that the eye is
less sensitive to small inaccuracies in the high-frequency components of
the error signal.

Taken together, these observations provide the potential for coarser
quantisation or omission of some of the components of the error signal.

There are several methods available if we wish to translate the signal
we started with into the frequency domain. Although many coding systems
use the Discrete Cosine Transform, we preferred to use the Discrete
Wavelet Transform. This approach divides the signal into higher- and
lower-frequency sub-bands. By quantising the different sub-bands
appropriately, we achieve a significant reduction in bandwidth.

Finally, when all the information for this system is packed together,
there is still structure: it is not statistically random. Information
theory says that we can use entropy coding to further increase the
randomness, and thereby reduce the bandwidth. One of the more powerful
of these techniques is arithmetic coding, the system adopted in Dirac.
So we have now outlined some of the key elements of a modern coder.

Fig XX. An outline of a typical modern video coder.

It is worth a quick look at the simple representation of the encoder in
Fig. XX, as it leads us to understand the receiver topology.

The input signal $V_{in}$ is compared with the prediction $P$ to produce
an error signal $e$. This is then compressed and passed (with the
various elements of compression metadata) to the arithmetic coder for
transmission.

The prediction is created from a local decoded version of the signal for
a very good reason: this is the signal that the decoder is able to
recreate with the information available to it. The signal delivered to
the receiver, $e_{TQ}$, allows the receiver to recreate a close copy
$e'$ of the prediction error $e$. If we compare the two signals,

\begin{align*}
V_{in} &= P + e \\
V_{local} &= P + e'
\end{align*}

we see that the difference between the two error signals $e$ and $e'$ is
the distortion introduced by the compression algorithm, so the local
signal is a close approximation to the input signal. Looking carefully
at Fig. XX, we can see that we can create a decoder using a subset of
the encoder. This is shown in Fig. XXX.

It is quite clear that the main elements of the receiver are duplicated
in the encoder. The encoder has to maintain a local decoder within it,
in part so that the result of the compression can be monitored at the
time of compression, but mainly because compressed pictures must be used
as reference frames for subsequent motion compensation, or else the
encoder and the decoder will not remain in synchronism.

The motion vectors are delivered from the encoder as metadata. This
avoids the need to estimate motion vectors in the receiver, and allows
the encoder considerable flexibility in the choice of appropriate motion
vectors. We will now move on to consider the application of this
technology to Dirac.

Fig. YY An outline of the decoder
\end{comment}
\subsection{Outline of the compression methods used in Dirac}
The Dirac video codec uses four main techniques to compress the signal:

\begin{itemize}
\item Prediction using motion compensation to exploit temporal redundancy
\item Wavelet transformation to exploit spatial redundancy
\item Quantisation of coefficients to form an approximate description of the picture data
\item Entropy coding of the resulting quantised values to maximise efficiency
\end{itemize}
Initially, similarities between frames (temporal redundancy) are
exploited to predict one frame from another. The process is aided by
motion vectors - metadata detailing where a particular pixel in the
predicted frame might have been in the reference frame. It would be
wasteful to assign a motion vector to each pixel, so pixels are
aggregated into blocks. In Dirac, blocks overlap, which reduces some of
the artefacts found at block boundaries in some earlier systems. The
motion is calculated to sub-pixel accuracy.

Once the prediction has been made, it is compared with the actual image
to be transmitted. The resulting difference (error) signal is
potentially greater in range than the original signal. To reduce it, the
difference signal is transformed using the discrete wavelet transform.
For the majority of video sequences, this process produces coefficients
which are largely zero or near zero, with most of the non-zero
coefficients concentrated at the lower-frequency end of the range. The
properties of the eye allow us to coarsely quantise the high-frequency
coefficients.

Both the motion vectors and the quantised coefficients are then further
compressed using arithmetic coding.
\subsection{Prediction using motion compensation}

The object of motion compensation is to predict the picture in one image
from others in the sequence. In television, the picture usually contains
moving objects, so a key part of the process is to identify the moving
objects and the details of their motion.

In Dirac, the motion vector is simply an indication of which pixel in
the reference picture can be used as a good prediction for a particular
pixel in the current picture.
\subsubsection{Types of picture}

Three types of picture are defined.

Intra pictures are coded without reference to other pictures in the
sequence. These pictures form a useful point at which to start the
decoding process. They also have uses if low-delay coding is required -
a sequence of Intra pictures has minimum delay, at the expense of
potentially greater bandwidth requirements. There is a third use for
Intra pictures: when the sequence is highly dynamic, prediction using
motion compensation may be impractical. In such instances, a coder may
choose to default to a sequence of Intra pictures.
Inter pictures are coded with reference to other pictures in the
sequence, and are split into two types:
\begin{itemize}
\item references for other pictures
\item not references for other pictures
\end{itemize}

Within the receiver, we have to arrange that all the reference pictures
are available at the right time. This means that pictures are often
delivered in a different order from the display order, to ensure that
the buffer memory in the receiver has the necessary information to
decode each picture when it is needed.
\subsubsection{Blocks}

In theory, we could define a motion vector for every pixel in the image.
This would be extremely data intensive and of no practical value.

Instead, pixels are grouped into small regions or blocks, with a single
motion vector assigned to each block. Ideally these blocks are large,
thus minimising the amount of information we have to transmit. However,
the larger the block, the greater the chance that it contains more than
one object or region, and so a single motion vector might be
inappropriate.

Although there are several standard block sizes identified within the
specification, you can choose your own, and that could be as large as
the original picture. This would not be unreasonable for a picture which
is a static scene with the camera being panned across it - although
there are other ways of signalling this in Dirac.

One of the harder tasks the coder has is to identify the optimum block
size, and the best compromise for the motion vector in each block.

When we have created the prediction, we then calculate the error. The
error is often greatest at the block boundaries, as this is where the
motion vector is often least accurate.

To overcome this, Dirac overlaps blocks. Each pixel in the overlap
regions uses a weighted prediction, incorporating information from all
the blocks it may lie in. This smooths the error signal, making the
following wavelet transform much more effective.
\subsubsection{Global motion parameters}

Although we said that it is of no practical value to specify a motion
vector for each pixel, there are exceptions.

Some types of motion are more likely to be features of the camera than
of the scene. For example, a camera may pan, tilt, rotate, zoom or
shear, or change perspective through, say, tracking.

If the coder can identify such motion, then Dirac permits the motion to
be signalled by a small set of parameters, which in effect assign a
motion vector to each pixel.
\subsubsection{Accuracy}
Often, we find that the motion vectors required to match a pixel in a
reference frame to one in the predicted frame are not integer values.

In Dirac, we can specify motion vectors to 1/8-pixel accuracy.

This means that the coder and decoder have to carry out a process that
is effectively an upconversion of the signal.
\subsubsection{Intra frames}
Intra pictures are not predicted using motion compensation.

This leaves us with a potential problem. The wavelet transform process
assumes that we have a signal which tends to a zero average.

Each Intra picture is therefore processed to give it a zero mean by
removing the DC component (average value) of each block from the signal.
The DC components are signalled separately. In effect, the DC components
are a local spatial prediction, rather than a temporal prediction as
applied in motion compensation.
\subsection{Wavelet transforms}
Processing the Intra frames by removing the DC values, and the Inter
frames by removing the predicted values, gives us a difference signal.
This difference signal is hopefully largely zero, but can have some
large peaks of either polarity. The signal usually has a large amount of
low-frequency energy, but with occasional elements of high frequency
where the prediction process gets it wrong.

Conventional theory says that we can manipulate this signal in the
frequency domain to reduce the amount of information we need to
transmit. The properties of the eye are such that many of the
higher-frequency components can tolerate coarse quantisation.

Dirac uses wavelet transforms. The wavelet transform decorrelates the
data in a roughly frequency-sensitive way, and preserves the fine
details of images better than the ubiquitous Discrete Cosine Transform.
\subsubsection{Wavelet analysis}

Put simply, wavelet analysis splits a signal into a low- and a
high-frequency component, and then subsamples the two partial streams by
a factor of two.

The two filters used to split the signal are called the analysis
filters. It is not possible to use just any pair of half-band filters to
do this. There is an extensive mathematical theory of wavelet filter
banks. Appropriately chosen filters allow us to undo the aliasing
introduced by the critical sampling in the downconversion process and
perfectly reconstruct the signal.

Iterative use of the wavelet process allows us to decompose the
low-frequency component into successively lower components.

For pictures, we apply wavelet filters in both the horizontal and
vertical directions. This results in four so-called subbands, termed
Low-Low (LL), Low-High (LH), High-Low (HL), and High-High (HH). The LL
band can be iteratively decomposed to create a wavelet transformation of
many levels.
\subsubsection{Parent-child relationships}
In wavelet analysis, each subband represents a filtered and subsampled
version of the picture. Because all the subbands are derived from a
single source image, there is likely to be some form of relationship
between the images in the different subbands.

The coefficients of each subband relate to specific areas in the image.
We find that there is often a correlation between these specific areas
in the different subbands.

The subsampling structure means that a coefficient in the lowest level
corresponds to a 2 by 2 block of coefficients in the next level, and so
on up the levels. In the jargon, the low-level coefficient is referred
to as the parent and each higher-level coefficient is referred to as a
child.

When coding picture features (edges of objects especially), significant
coefficients are found distributed across subbands, in positions related
by the parent-child structure.

A coefficient of a child is more likely to be significant if its parent
is also significant. Children with zero or small parents seem to have
different statistics from children with large parents or ancestors.

These features allow us to entropy code the wavelet coefficients
efficiently after they have been quantised.
\subsection{Entropy coding}

Transmission of the raw motion vectors and wavelet coefficients is
inefficient. There are still many redundancies in the data, and the form
of the data is itself suboptimal.

Coding of the motion vectors is especially important for codecs with a
high level of motion accuracy (quarter or eighth pixel, say). Motion
vector coding and decoding is quite complicated, since significant gains
in efficiency can be made by choosing a good prediction and entropy
coding structure.

The process of removing these redundancies is probably one of the most
complicated parts of the codec. Similar processes are used for coding
the motion vectors and the wavelet coefficients. In various ways they
use a combination of

\begin{itemize}
\item prediction,
\item binarisation,
\item context modelling and
\item adaptive arithmetic coding.
\end{itemize}
\subsubsection{Entropy coding of wavelets}

The entropy coding used by Dirac in wavelet subband coefficient coding
is based on three stages: binarisation, context modelling and adaptive
arithmetic coding.

Figure: Entropy coding block diagram

The purpose of the first stage is to provide a bitstream with easily
analysable statistics that can be encoded using arithmetic coding, which
can adapt to those statistics, reflecting any local statistical
features.

Binarisation is the process of transforming the multi-valued coefficient
symbols into bits. The resulting bitstream can then be arithmetic coded.

Transform coefficients tend to have a roughly Laplacian distribution,
which decays exponentially with magnitude. This suits so-called unary
binarisation. Unary codes are simple variable-length codes in which
every non-negative number $N$ is mapped to $N$ zeros followed by a 1:

\begin{verbatim}
U(0) = 1
U(1) = 0 1
U(2) = 0 0 1
U(3) = 0 0 0 1
U(4) = 0 0 0 0 1
U(5) = 0 0 0 0 0 1
U(6) = 0 0 0 0 0 0 1
Bins:  1 2 3 4 5 6 7
\end{verbatim}
For Laplacian distributed values, the probability of $N$ occurring is
$2^{-(|N|+1)}$, so the probability of a 0 or a 1 occurring in any unary
bin is constant. So for an ideal Laplacian source, only one context
would be needed for all the bins, leading to a very compact and reliable
description of the statistics. In practice, the coefficients do deviate
from the Laplacian ideal, and so the lower bins are modelled separately
and the larger bins are lumped into one context.
The process is best explained by example. Suppose one wished to encode
the sequence:

\begin{verbatim}
-3 0 1 0 -1
\end{verbatim}

When binarised, the sequence to be encoded is:

\begin{verbatim}
0 0 0 1 | 0 | 1 | 0 1 | 1 | 1 | 0 1 | 0
\end{verbatim}

The first 4 bits encode the magnitude, 3. The first bit is encoded using
the statistics for Bin 1, the second using those for Bin 2, and so on.
When the terminating 1 is detected, the magnitude is decoded and a sign
bit is expected. This is encoded using the sign context statistics; here
it is 0, signifying a negative sign. The next bit must be a magnitude
bit and is encoded using the Bin 1 context; since it is 1, the value is
0 and there is no need for a subsequent sign bit. And so on.
The context modelling in Dirac is based on the principle that whether a
coefficient is small (or zero, in particular) or not is well predicted
by its neighbours and its parents. The codec therefore conditions the
probabilities used by the arithmetic coder for coding bins 1 and 2 on
the size of the neighbouring coefficients and the parent coefficient.

The reason for this approach is that, whereas the wavelet transform
largely removes correlation between a coefficient and its neighbours,
they may not be statistically independent even if they are uncorrelated.
The main reason for this is that small and especially zero coefficients
in wavelet subbands tend to clump together, located at points
corresponding to smooth areas in the image, and, as discussed elsewhere,
are grouped together across subbands in the parent-child relationship.

Conceptually, an arithmetic coder can be thought of as a progressive way
of producing variable-length codes for entire sequences of symbols based
on the probabilities of their constituent symbols.
For example, if we know the probability of 0 and 1 in a binary sequence,
we also know the probability of the sequence itself occurring. So if

$P(0)=0.2,$

$P(1)=0.8$

then, for a sequence containing four 0s and sixteen 1s,

$P(11101111111011110101)=(0.2)^4 \times (0.8)^{16} \approx 4.5 \times
10^{-5}$ (assuming independent occurrences).

Information theory then says that optimal entropy coding of this
sequence requires $\log_2(\frac{1}{p}) \approx 14.4$ bits. Arithmetic
coding produces a code word very close to this optimal length, and
implementations can do so progressively, outputting bits when possible
as more arrive.
All that arithmetic coding requires is an estimate of the probability of
each symbol as it occurs, and this is where context modelling fits in.
Since arithmetic coding can, in effect, assign a fractional number of
bits to a symbol, it is very efficient for coding symbols with
probabilities very close to 1, without the additional complication of
run-length coding. The aim of context modelling within Dirac is to use
information about the symbol stream being encoded to produce probability
estimates that are accurate, and as close to 1 as possible.

Dirac computes these estimates for each context simply by counting
symbol occurrences. In order for the decoder to be in the same state as
the encoder, these statistics cannot be updated until after a binary
symbol has been encoded. This means that each context must be
initialised with a count for both 0 and 1, which is used for encoding
the first symbol in that context.
An additional source of redundancy lies in the local nature of the
statistics. If the contexts are not refreshed periodically, then later
data has less influence in shaping the statistics than earlier data,
resulting in bias, and local statistics are not exploited. Dirac adopts
a simple way of refreshing the contexts: halving the counts of 0 and 1
for a context at regular intervals. The effect is to maintain the
probabilities to a reasonable level of accuracy, but to keep the
influence of all coefficients roughly constant.
\subsubsection{Entropy coding of motion vectors}

The basic format of the coding of the motion vectors is similar to the
coding of the wavelet data: it consists of prediction, followed by
binarisation, context modelling and adaptive arithmetic coding.

Figure: motion vector entropy coding architecture

All the motion vector data are predicted from previously encoded data
from the nearest neighbours.
\subsection{Bytestream}\input{begin-bs}

The Dirac bytestream is the complete, compressed video stream, ready
for file storage or transmission.

Its features are:

\begin{itemize}
\item A parse structure which gives ready access to the video, even
in mid-file. This identifies all the static features of the stream.
Intelligent systems can use the redundancy in this structure to
recover from errors in the stream.

\item Blocks of data comprising the arithmetically coded motion
vector information

\item Blocks of data comprising the arithmetically coded wavelet
coefficients
\end{itemize}

Because of the arithmetic coding, it is difficult to provide a simple
specification which gives meaningful details of the bytestream without
some reference to either the earlier coding processes or the later
decoding process. A simple description of the bytestream would only
refer to compressed motion vectors and compressed wavelet coefficients.
We have therefore provided extra information to explain how to unpack
this data, and how to use the resulting structured data.