%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% - This chapter gives an informative overview - %
% -       of Dirac video coding concepts       - %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{comment}
\subsection{A general introduction to video compression for beginners}
This is a simple guide to video coding. You can skip this subsection
if you already know the basics.

Video signals are made up of a succession of still pictures, displayed
one after the other. Each picture is made up of a series of elementary
units called pixels, arranged in a raster of lines.

The raw total capacity required for the transmission of moving pictures
is the product of the number of pixels per picture, pictures per second,
colour components in use and the quantisation accuracy adopted. In nearly
every field of application, the resultant raw bandwidth required exceeds
that available by more than an order of magnitude. If we keep the same
size of picture, then the only variable in this simple system is the
quantisation accuracy, and lowering the accuracy rapidly leads to
poor-quality pictures.

The solution, in use since before 1970, is to make a prediction of the
value of each pixel using information that is available in the decoder.
In general, if the prediction is good, then the difference between
reality and the prediction is small. The entropy of the difference
signal is low, and so we need less capacity to deliver it. One simple
solution was to recognise that the quantisation accuracy required for
the difference signal could be coarser than that required for the
original picture, and so fewer bits are needed to deliver the
difference signal than the original.

This has formed the basis of many of the early video compression
systems. The elements that have changed over the years have been

\begin{itemize}
\item the algorithms used for making the prediction, and
\item the formats used for delivering the difference signal
efficiently to the receiver.
\end{itemize}

Many of the changes have been enabled by the availability of better
electronic circuitry. Real-time operation puts an upper limit on the
time available for processing. We can only use more sophisticated
algorithms if we can carry out the calculations in the time available.
Fortunately, the improvement in the speed of hardware has almost matched
the development of algorithms.

Early algorithms made use of simple predictions, predicting one pixel
from others nearby in the same picture (spatial prediction, in the
jargon).

Predictions were helped when field stores became affordable (and smaller
than a small garden shed) and it became possible to use information from
the previous field (temporal prediction, in the jargon). This process
works well when the picture is still, but is less effective when the
pictures depict a lot of movement.

This led to a whole raft of developments seeking to identify motion in
the picture. Knowing how parts of the scene are moving (motion
estimation, in the jargon) allowed much more accurate prediction than
had been possible hitherto. These developments led eventually to mature
products such as the MPEG-2 compression system.

However good the prediction, it will inevitably not be perfect. The
error signal is the signal we wish to convey to the decoder. In any
system, the raw error signal is at least as bandwidth-hungry as the
original, and in most cases slightly more so.

However, the error signal has some physical properties which help us to
reduce its bandwidth. In information-theory terms, the error signal is
rarely totally random: real pictures have properties which distinguish
them from random noise. We can therefore use some of these properties
to reduce the bit rate further.

The spectrum of the error signal is usually heavily biased towards the
low-frequency end. This is a direct consequence of the prediction
process usually being reasonably good. It also turns out that the eye is
less sensitive to small inaccuracies in the high-frequency components of
the error signal.

Taken together, these observations provide the potential for coarser
quantisation or omission of some of the components of the error signal.

There are several methods available if we wish to translate the signal
we started with into the frequency domain. Although many coding systems
use the Discrete Cosine Transform, we preferred to use the Discrete
Wavelet Transform. This approach divides the signal into higher- and
lower-frequency sub-bands. By quantising the different sub-bands
appropriately, we achieve a significant reduction in bandwidth.

Finally, when all the information for this system is packed together,
there is still structure: it is not statistically random. Information
theory says that we can use entropy coding to further increase the
randomness, and thereby reduce the bandwidth. One of the more powerful
of these techniques is arithmetic coding, the system adopted in Dirac.
So we have now outlined some of the key elements of a modern coder.

Fig XX. An outline of a typical modern video coder.

It is worth a quick look at the simple representation of the encoder in
Fig. XX, as it leads us to understand the receiver topology.

The input signal $V_{in}$ is compared with the prediction $P$ to produce
an error signal $e$. This is then compressed and passed (with the
various elements of compression metadata) to the arithmetic coder for
transmission.

The prediction is created from a local decoded version of the signal for
a very good reason: this is the signal that the decoder is able to
recreate with the information available to it. The signal delivered to
the receiver, $e_{TQ}$, allows the receiver to recreate a close copy
$e'$ of the prediction error $e$. If we compare the two signals,

\begin{align*}
V_{in} &= P + e \\
V_{local} &= P + e'
\end{align*}

we see that the difference between the two error signals $e$ and $e'$ is
the distortion introduced by the compression algorithm, so the local
signal is a close approximation to the input signal. Looking carefully
at Fig. XX, we can see that we can create a decoder using a subset of
the encoder. This is shown in Fig. XXX.

It is quite clear that the main elements of the receiver are duplicated
in the encoder. The encoder has to maintain a local decoder within it,
in part so that the result of the compression can be monitored at the
time of compression, but mainly because compressed pictures must be used
as reference frames for subsequent motion compensation, or else the
encoder and the decoder will not remain in synchronism.

The motion vectors are delivered from the encoder as metadata. This
avoids the need to estimate motion vectors in the receiver, and allows
the encoder considerable flexibility in the choice of appropriate motion
vectors. We will now move on to consider the application of this
technology to Dirac.

Fig. YY An outline of the decoder
\end{comment}
\subsection{Outline of the compression methods used in Dirac}
The Dirac video codec uses four main techniques to compress the signal:

\begin{itemize}
\item Prediction using motion compensation to exploit temporal redundancy
\item Wavelet transformation to exploit spatial redundancy
\item Quantisation of coefficients to form an approximate description of the picture data
\item Entropy coding of the resulting quantised values to maximise efficiency
\end{itemize}
Initially, similarities between frames (temporal redundancy) are
exploited to predict one frame from another. The process is aided by
motion vectors - metadata detailing where a particular pixel in the
predicted frame might have been in the reference frame. It would be
wasteful to assign a motion vector to each pixel, so pixels are
aggregated into blocks. In Dirac, blocks overlap, which reduces some of
the artefacts found at block boundaries in some earlier systems. The
motion is calculated to sub-pixel accuracy.

Once the prediction has been made, it is compared with the actual image
to be transmitted. The resulting difference (error) signal is
potentially greater in range than the original signal. To reduce it, the
difference signal is transformed using the discrete wavelet transform.
For the majority of video sequences, this process produces coefficients
which are largely zero or near zero, with most of the non-zero
coefficients concentrated at the lower-frequency end of the range. The
properties of the eye allow us to coarsely quantise the high-frequency
coefficients.

Both the motion vectors and the quantised coefficients are then further
compressed using arithmetic coding.
\subsection{Prediction using motion compensation}

The object of motion compensation is to predict the picture in one image
from others in the sequence. In television, the picture usually contains
moving objects, so a key part of the process is to identify the moving
objects and the details of their motion.

In Dirac, the motion vector is simply an indication of which pixel in
the reference picture can be used as a good prediction for a particular
pixel in the current picture.
\subsubsection{Types of picture}

Three types of picture are defined.

Intra pictures are coded without reference to other pictures in the
sequence. These pictures form a useful point at which to start the
decoding process. They also have uses if low-delay coding is required -
a sequence of Intra pictures has minimum delay, at the expense of
potentially greater bandwidth requirements. There is a third use for
Intra pictures: when the sequence is highly dynamic, prediction using
motion compensation may be impractical. In such instances, a coder may
choose to default to a sequence of Intra pictures.
Inter pictures are coded with reference to other pictures in the
sequence, and are split into two types:
\begin{itemize}
\item references for other pictures
\item not references for other pictures
\end{itemize}

Within the receiver, we have to arrange that all the reference pictures
are available at the right time. This means that pictures are often
delivered in a different order from the display order, to ensure that
the buffer memory in the receiver has the necessary information to
decode each picture when it is needed.
\subsubsection{Blocks}

In theory, we could define a motion vector for every pixel in the image.
This would be extremely data intensive and of no practical value.

Instead, pixels are grouped into small regions or blocks, with a single
motion vector assigned to each block. Ideally these blocks are large,
thus minimising the amount of information we have to transmit. However,
the larger the block, the greater the chance that it contains more than
one object or region, and so a single motion vector might be
inappropriate.

Although there are several standard block sizes identified within the
specification, you can choose your own, and that could be as large as
the original picture. This would not be unreasonable for a picture which
is a static scene with the camera being panned across it - although
there are other ways of signalling this in Dirac.

One of the harder tasks the coder has is to identify the optimum block
size, and the best compromise for the motion vector in each block.

When we have created the prediction, we then calculate the error. The
error is often greatest at the block boundaries, as this is where the
motion vector is often least accurate.

To overcome this, Dirac overlaps blocks. Each pixel in the overlap
regions uses a weighted prediction, incorporating information from all
the blocks it may lie in. This smooths the error signal, making the
following wavelet transform much more effective.
\subsubsection{Global motion parameters}

Although we said that it is of no practical value to specify a motion
vector for each pixel, there are exceptions.

Some types of motion are more likely to be features of the camera than
of the scene. For example, a camera may pan, tilt, rotate, zoom or
shear, or change perspective through, say, tracking.

If the coder can identify such motion, then Dirac permits the motion to
be signalled by a small set of parameters, which in effect assign a
motion vector to each pixel.
\subsubsection{Accuracy}
Often, we find that the motion vectors required to match a pixel in a
reference frame to one in the predicted frame are not integer values.

In Dirac, we can specify motion vectors to 1/8-pixel accuracy.

This means that the coder and decoder have to carry out a process that
is effectively an upconversion of the signal.
\subsubsection{Intra frames}
Intra pictures are not predicted using motion compensation.

This leaves us with a potential problem. The wavelet transform process
assumes that we have a signal which tends to a zero average.

Each Intra picture is therefore processed to give it a zero mean by
removing the DC component (average value) of each block from the signal.
The DC components are signalled separately. In effect, the DC components
are a local spatial prediction, rather than a temporal prediction as
applied in motion compensation.
\subsection{Wavelet transforms}
Processing the Intra frames by removing the DC values, and the Inter
frames by removing the predicted values, gives us a difference signal.
This difference signal is hopefully largely zero, but can have some
large peaks of either polarity. The signal usually has a large amount of
low-frequency energy, but with occasional elements of high frequency
where the prediction process gets it wrong.

Conventional theory says that we can manipulate this signal in the
frequency domain to reduce the amount of information we need to
transmit. The properties of the eye are such that many of the
higher-frequency components can tolerate coarse quantisation.

Dirac uses wavelet transforms. The wavelet transform decorrelates the
data in a roughly frequency-sensitive way, and preserves the fine
details of images better than the ubiquitous Discrete Cosine Transform.
\subsubsection{Wavelet analysis}

Put simply, wavelet analysis splits a signal into a low- and a
high-frequency component, and then subsamples the two partial streams by
a factor of two.

The two filters used to split the signal are called the analysis
filters. It is not possible to use just any pair of half-band filters to
do this. There is an extensive mathematical theory of wavelet filter
banks. Appropriately chosen filters allow us to undo the aliasing
introduced by the critical sampling in the downconversion process and
perfectly reconstruct the signal.

Iterative use of the wavelet process allows us to decompose the
low-frequency component into successively lower components.

For pictures, we apply wavelet filters in both the horizontal and
vertical directions. This results in four so-called subbands, termed
Low-Low (LL), Low-High (LH), High-Low (HL), and High-High (HH). The LL
band can be iteratively decomposed to create a wavelet transformation of
many levels.
\subsubsection{Parent-child relationships}
In wavelet analysis, each subband represents a filtered and subsampled
version of the picture. Because all the subbands are derived from a
single source image, there is likely to be some form of relationship
between the images in the different subbands.

The coefficients of each subband relate to specific areas in the image.
We find that there is often a correlation between these specific areas
in the different subbands.

The subsampling structure means that a coefficient in the lowest level
corresponds to a 2 by 2 block of coefficients in the next level, and so
on up the levels. In the jargon, the low-level coefficient is referred
to as the parent and each higher-level coefficient is referred to as a
child.

When coding picture features (edges of objects especially), significant
coefficients are found distributed across subbands, in positions related
by the parent-child structure.

A coefficient of a child is more likely to be significant if its parent
is also significant. Children with zero or small parents seem to have
different statistics from children with large parents or ancestors.

These features allow us to entropy code the wavelet coefficients
efficiently after they have been quantised.
\subsection{Entropy coding}

Transmission of the raw motion vectors and wavelet coefficients is
inefficient. There are still many redundancies in the data, and the form
of the data is itself suboptimal.

Coding of the motion vectors is especially important for codecs with a
high level of motion accuracy (quarter or eighth pixel, say). Motion
vector coding and decoding is quite complicated, since significant gains
in efficiency can be made by choosing a good prediction and entropy
coding structure.

The process of removing these redundancies is probably one of the most
complicated parts of the codec. Similar processes are used for coding
the motion vectors and the wavelet coefficients. In various ways they
use a combination of

\begin{itemize}
\item prediction,
\item binarisation,
\item context modelling and
\item adaptive arithmetic coding.
\end{itemize}
\subsubsection{Entropy coding of wavelets}

The entropy coding used by Dirac in wavelet subband coefficient coding
is based on three stages: binarisation, context modelling and adaptive
arithmetic coding.

Figure: Entropy coding block diagram

The purpose of the first stage is to provide a bitstream with easily
analysable statistics that can be encoded using arithmetic coding, which
can adapt to those statistics, reflecting any local statistical
features.

Binarisation is the process of transforming the multi-valued coefficient
symbols into bits. The resulting bitstream can then be arithmetic coded.

Transform coefficients tend to have a roughly Laplacian distribution,
which decays exponentially with magnitude. This suits so-called unary
binarisation. Unary codes are simple variable-length codes in which
every non-negative number $N$ is mapped to $N$ zeros followed by a 1:

\begin{verbatim}
U(0) = 1
U(1) = 0 1
U(2) = 0 0 1
U(3) = 0 0 0 1
U(4) = 0 0 0 0 1
U(5) = 0 0 0 0 0 1
U(6) = 0 0 0 0 0 0 1
Bins:  1 2 3 4 5 6 7
\end{verbatim}
For Laplacian distributed values, the probability of $N$ occurring is
$2^{-(|N|+1)}$, so the probability of a 0 or a 1 occurring in any unary
bin is constant. So for an ideal Laplacian source, only one context
would be needed for all the bins, leading to a very compact and reliable
description of the statistics. In practice, the coefficients do deviate
from the Laplacian ideal, and so the lower bins are modelled separately
and the larger bins are lumped into one context.
The process is best explained by example. Suppose one wished to encode
the sequence:

\begin{verbatim}
-3 0 1 0 -1
\end{verbatim}

When binarised, the sequence to be encoded is:

\begin{verbatim}
0 0 0 1 | 0 | 1 | 0 1 | 1 | 1 | 0 1 | 0
\end{verbatim}

The first 4 bits encode the magnitude, 3. The first bit is encoded using
the statistics for Bin 1, the second using those for Bin 2, and so on.
When the terminating 1 is detected, the magnitude is decoded and a sign
bit is expected. This is encoded using the sign context statistics; here
it is 0, signifying a negative sign. The next bit must be a magnitude
bit and is encoded using the Bin 1 context; since it is 1, the value is
0 and there is no need for a subsequent sign bit. And so on.
The context modelling in Dirac is based on the principle that whether a
coefficient is small (or zero, in particular) or not is well predicted
by its neighbours and its parents. The codec therefore conditions the
probabilities used by the arithmetic coder for coding bins 1 and 2 on
the size of the neighbouring coefficients and the parent coefficient.

The reason for this approach is that, whereas the wavelet transform
largely removes correlation between a coefficient and its neighbours,
they may not be statistically independent even if they are uncorrelated.
The main reason for this is that small and especially zero coefficients
in wavelet subbands tend to clump together, located at points
corresponding to smooth areas in the image, and, as discussed elsewhere,
are grouped together across subbands in the parent-child relationship.

Conceptually, an arithmetic coder can be thought of as a progressive way
of producing variable-length codes for entire sequences of symbols based
on the probabilities of their constituent symbols.
For example, if we know the probability of 0 and 1 in a binary sequence,
we also know the probability of the sequence itself occurring. So if

$P(0)=0.2,$

$P(1)=0.8$

then, for a sequence containing four 0s and sixteen 1s,

$P(11101111111011110101)=(0.2)^4 \times (0.8)^{16} \approx 4.5 \times
10^{-5}$ (assuming independent occurrences).

Information theory then says that optimal entropy coding of this
sequence requires $\log_2(\frac{1}{p}) \approx 14.4$ bits. Arithmetic
coding produces a code word very close to this optimal length, and
implementations can do so progressively, outputting bits when possible
as more arrive.
All that arithmetic coding requires is an estimate of the probability of
each symbol as it occurs, and this is where context modelling fits in.
Since arithmetic coding can, in effect, assign a fractional number of
bits to a symbol, it is very efficient for coding symbols with
probabilities very close to 1, without the additional complication of
run-length coding. The aim of context modelling within Dirac is to use
information about the symbol stream being encoded to produce probability
estimates that are accurate, and as close to 1 as possible.

Dirac computes these estimates for each context simply by counting
symbol occurrences. In order for the decoder to be in the same state as
the encoder, these statistics cannot be updated until after a binary
symbol has been encoded. This means that each context must be
initialised with a count for both 0 and 1, which is used for encoding
the first symbol in that context.
An additional source of redundancy lies in the local nature of the
statistics. If the contexts are not refreshed periodically, then later
data has less influence in shaping the statistics than earlier data,
resulting in bias, and local statistics are not exploited. Dirac adopts
a simple way of refreshing the contexts: halving the counts of 0 and 1
for a context at regular intervals. The effect is to maintain the
probabilities to a reasonable level of accuracy, but to keep the
influence of all coefficients roughly constant.
\subsubsection{Entropy coding of motion vectors}

The basic format of the coding of the motion vectors is similar to the
coding of the wavelet data: it consists of prediction, followed by
binarisation, context modelling and adaptive arithmetic coding.

Figure: motion vector entropy coding architecture

All the motion vector data are predicted from previously encoded data
from the nearest neighbours.
\subsection{Bytestream}\input{begin-bs}

The Dirac bytestream is the complete, compressed video stream, ready
for file storage or transmission.

Its features are:

\begin{itemize}
\item A parse structure which gives ready access to the video, even
in mid-file. This identifies all the static features of the stream.
Intelligent systems can use the redundancy in this structure to
recover from errors in the stream.

\item Blocks of data comprising the arithmetically coded motion
vector information

\item Blocks of data comprising the arithmetically coded wavelet
coefficients
\end{itemize}

Because of the arithmetic coding, it is difficult to provide a simple
specification which gives meaningful details of the bytestream without
some reference to either the earlier coding processes or the later
decoding process. A simple description of the bytestream would only
refer to compressed motion vectors and compressed wavelet coefficients.
We have therefore provided extra information to explain how to unpack
this data, and how to use the resulting structured data.