gfx/docs/RenderingOverview.rst

Rendering Overview
==================

This document is an overview of the steps to render a webpage, and how HTML
gets transformed and broken down, step by step, into commands that can execute
on the GPU.

If you're joining the graphics team without much background
in browsers, start here :)

.. contents::

High level overview
-------------------

.. image:: RenderingOverviewSimple.png
   :width: 100%

Layout
~~~~~~
Starting at the left in the above image, we have a document
represented by a DOM (a Document Object Model).  A JavaScript engine
will execute JS code, either to make changes to the DOM, or to respond to
events generated by the DOM (or both).

The DOM is a high level description and we don't know what to draw or
where until it is combined with a Cascading Style Sheet (CSS).
Combining these two and figuring out what, where and how to draw
things is the responsibility of the Layout team.  The
DOM is converted into a hierarchical Frame Tree, which nests visual
elements (boxes).  Each element points to some node in a Style Tree
that describes what it should look like -- color, transparency, etc.
The result is that now we know exactly what to render where, what goes
on top of what (layering and blending) and at what pixel coordinate.
This is the Display List.

The Display List is a light-weight data structure because it's shallow
-- it mostly points back to the Frame Tree.  There are two problems
with this.  First, we want to cross process boundaries at this point.
Everything up until now happens in a Content Process (of which there are
several).  Actual GPU rendering happens in a GPU Process (on some
platforms).  Second, everything up until now was written in C++, but
WebRender is written in Rust.  Thus the shallow Display List needs to
be serialized into a completely self-contained binary blob that will
survive Interprocess Communication (IPC) and a language switch (C++ to
Rust).  The result is the WebRender Display List.
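
To make "shallow" versus "self-contained" concrete, here is a minimal sketch
of the idea.  All names (``Frame``, ``ShallowItem``, ``OwnedItem``) and the
byte layout are hypothetical and chosen for illustration only; the real
WebRender Display List format is different.

```rust
// Hypothetical stand-in for a node in the Frame Tree.
struct Frame {
    x: f32,
    y: f32,
    width: f32,
    height: f32,
    color: [u8; 4],
}

// A shallow display item: it only points back at a frame.
struct ShallowItem {
    frame_index: usize,
}

// What the receiving process rebuilds; it owns all of its data.
#[derive(Debug, PartialEq)]
struct OwnedItem {
    rect: [f32; 4],
    color: [u8; 4],
}

// Copy everything the items refer to into a flat, pointer-free buffer.
// Assumed layout: u32 item count, then 4 x f32 rect + 4 x u8 color per item.
fn serialize(items: &[ShallowItem], frames: &[Frame]) -> Vec<u8> {
    let mut blob = Vec::new();
    blob.extend_from_slice(&(items.len() as u32).to_le_bytes());
    for item in items {
        let f = &frames[item.frame_index];
        for v in [f.x, f.y, f.width, f.height].iter() {
            blob.extend_from_slice(&v.to_le_bytes());
        }
        blob.extend_from_slice(&f.color);
    }
    blob
}

// Read four bytes starting at `offset`.
fn read4(blob: &[u8], offset: usize) -> [u8; 4] {
    let mut out = [0u8; 4];
    out.copy_from_slice(&blob[offset..offset + 4]);
    out
}

// Rebuild owned items from the blob alone, without the Frame Tree.
fn deserialize(blob: &[u8]) -> Vec<OwnedItem> {
    let count = u32::from_le_bytes(read4(blob, 0)) as usize;
    let mut items = Vec::with_capacity(count);
    let mut offset = 4;
    for _ in 0..count {
        let mut rect = [0.0f32; 4];
        for slot in rect.iter_mut() {
            *slot = f32::from_le_bytes(read4(blob, offset));
            offset += 4;
        }
        let color = read4(blob, offset);
        offset += 4;
        items.push(OwnedItem { rect, color });
    }
    items
}
```

The point of the sketch is that ``deserialize`` never touches ``frames``:
once the blob exists, the Frame Tree can stay behind in the Content Process.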

WebRender
~~~~~~~~~

The GPU process receives the WebRender Display List blob and
de-serializes it into a Scene.  This Scene contains more than the
strictly visible elements; for example, to anticipate scrolling, we
might have several paragraphs of text extending past the visible page.

For a given viewport, the Scene gets culled and stripped down to a
Frame.  This is also where we start preparing data structures for GPU
rendering, for example getting some font glyphs into an atlas for
rasterizing text.

The final step takes the Frame and submits commands to the GPU to
actually render it.  The GPU will execute the commands and composite
the final page.
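
As a rough illustration of the Scene-to-Frame culling step, the sketch below
keeps only the items whose bounds overlap the viewport.  ``Rect``,
``SceneItem`` and ``cull`` are hypothetical names for this example, not
WebRender's actual types.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x: f32,
    y: f32,
    w: f32,
    h: f32,
}

impl Rect {
    // True when the two rectangles share some area.
    fn intersects(&self, other: &Rect) -> bool {
        self.x < other.x + other.w
            && other.x < self.x + self.w
            && self.y < other.y + other.h
            && other.y < self.y + self.h
    }
}

// Hypothetical scene item: an id plus its bounds in page coordinates.
struct SceneItem {
    id: u32,
    bounds: Rect,
}

// Return the ids of the items that are visible in the given viewport.
fn cull(scene: &[SceneItem], viewport: &Rect) -> Vec<u32> {
    scene
        .iter()
        .filter(|item| item.bounds.intersects(viewport))
        .map(|item| item.id)
        .collect()
}
```

Scrolling only changes ``viewport``; the Scene itself stays the same, which
is why it pays to keep more in it than is currently visible.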

Software
~~~~~~~~

The above is the new WebRender-enabled way to do things.  But in the
schematic you'll note a second branch towards the bottom: this is the
legacy code path which does not use WebRender (nor Rust).  In this
case, the Display List is converted into a Layer Tree. The purpose of
this Tree is to try and avoid having to re-render absolutely
everything when the page needs to be refreshed. For example, when
scrolling we should be able to redraw the page by mostly shifting
things around. However, that requires those 'things' to still be around
from the last time we drew the page.  In other words, visual elements that
are likely to be static and reusable need to be drawn into their own
private "page" (a cache).  Then we can recombine (composite) all of
these when redrawing the actual page.

Figuring out which elements would be good candidates for this, and
striking a balance between good performance and excessive memory
use, is the purpose of the Layer Tree.  Each 'layer' is a cached image
of some element(s).  This logic also takes occlusion into account, e.g.
don't allocate and render a layer for elements that are known to be
completely obscured by something in front of them.

Redrawing the page by combining the Layer Tree with any newly
rasterized elements is the job of the Compositor.

Even when a layer cannot be reused in its entirety, it is likely
that only a small part of it was invalidated.  Thus there is an
elaborate system for tracking dirty rectangles, starting an update by
copying the area that can be salvaged, and then redrawing only what
cannot.
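
A much simplified sketch of dirty-rectangle tracking: merge all invalidated
rectangles into one bounding "dirty" rectangle and treat the rest of the
layer as salvageable.  The names are hypothetical, and the sketch assumes
every invalidation lies inside the layer; the real system is considerably
more elaborate.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x0: i32,
    y0: i32,
    x1: i32,
    y1: i32,
}

impl Rect {
    // Smallest rectangle containing both inputs.
    fn union(self, other: Rect) -> Rect {
        Rect {
            x0: self.x0.min(other.x0),
            y0: self.y0.min(other.y0),
            x1: self.x1.max(other.x1),
            y1: self.y1.max(other.y1),
        }
    }

    fn area(self) -> i64 {
        ((self.x1 - self.x0) as i64) * ((self.y1 - self.y0) as i64)
    }
}

// Bounding box of everything that was invalidated, if anything was.
fn dirty_bounds(invalidations: &[Rect]) -> Option<Rect> {
    let mut iter = invalidations.iter().copied();
    let first = iter.next()?;
    Some(iter.fold(first, Rect::union))
}

// Number of pixels of the layer that can be copied instead of redrawn.
// Assumes all invalidations are contained in `layer`.
fn salvaged_area(layer: Rect, invalidations: &[Rect]) -> i64 {
    match dirty_bounds(invalidations) {
        Some(dirty) => layer.area() - dirty.area(),
        None => layer.area(),
    }
}
```

Using a single bounding box is the crudest possible policy; it over-redraws
when two small invalidations are far apart, which is one reason real
implementations track regions rather than one rectangle.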

In fact, this idea can be extended to delta-tracking of display lists
themselves. Traversing the layout tree and building a display list is
also not cheap, so the code tries to partially invalidate and rebuild
the display list incrementally when possible.
This optimization is used by both the non-WebRender and WebRender paths.


Asynchronous Panning And Zooming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Earlier we mentioned that a Scene might contain more elements than are
strictly necessary for rendering what's visible (the Frame).  The
reason for that is Asynchronous Panning and Zooming, or APZ for short.
The browser will feel much more responsive if scrolling & zooming can
short-circuit all of these data transformations and IPC boundaries,
and instead directly update an offset of some layer and recomposite.
(Think of late-latching in a VR context.)

This simple idea introduces a lot of complexity: how much extra do you
rasterize, and in which direction?  How much memory can we afford?
What about JavaScript that responds to scroll events and perhaps does
something 'interesting' with the page in return?  What about nested
frames or nested scrollbars?  What if we scroll so much that we go
past the boundaries of the Scene that we know about?

See AsyncPanZoom.rst for all that and more.
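
To give a feel for the "how much extra" question, here is a one-dimensional
sketch (vertical scrolling only).  The ``DisplayPort`` struct, the fixed
``margin`` and both function names are assumptions made for this example;
they are not APZ's real heuristics.

```rust
// Range of the page (in page coordinates) that has been rasterized.
#[derive(Debug, PartialEq)]
struct DisplayPort {
    top: f32,
    bottom: f32,
}

// Rasterize the viewport plus a fixed margin above and below,
// clamped to the page.
fn display_port_for(scroll_y: f32, viewport_h: f32, margin: f32, page_h: f32) -> DisplayPort {
    DisplayPort {
        top: (scroll_y - margin).max(0.0),
        bottom: (scroll_y + viewport_h + margin).min(page_h),
    }
}

// While this returns false, scrolling only needs a new offset and a
// recomposite.  Once it returns true, new content has to be requested.
fn needs_repaint(port: &DisplayPort, scroll_y: f32, viewport_h: f32) -> bool {
    scroll_y < port.top || scroll_y + viewport_h > port.bottom
}
```

A larger ``margin`` means fewer round trips to the Content Process but more
memory and rasterization work up front, which is exactly the trade-off the
questions above are about.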

A Few More Details
~~~~~~~~~~~~~~~~~~

Here's another schematic which basically repeats the previous one, but
shows a little more detail.  Note that the direction is reversed
-- the data flow starts at the right.  Sorry about that :)

.. image:: RenderingOverviewDetail.png
   :width: 100%

Some things to note:

- there are multiple content processes, currently 4 of them.  This is
  for security reasons (sandboxing), stability (isolate crashes) and
  performance (multi-core machines);
- ideally each "webpage" would run in its own process for security;
  this is being developed under the term 'fission';
- there is only a single GPU process, if there is one at all;
  some platforms have it as part of the Parent;
- not shown here is the Extension process that isolates WebExtensions;
- for non-WebRender, rasterization happens in the Content Process, and
  we send entire Layers to the GPU/Compositor process (via shared
  memory, only using actual IPC for its metadata like width & height);
- if the GPU process crashes (a bug or a driver issue) we can simply
  restart it, resend the display list, and the browser itself doesn't crash;
- the browser UI is just another set of DOM+JS, albeit one that runs
  with elevated privileges. That is, its JS can do things that
  normal JS cannot.  It lives in the Parent Process, which then uses
  IPC to get it rendered, same as regular Content (the IPC arrow also
  goes to WebRender Display List but is omitted to reduce clutter);
- UI events get routed to APZ first, to minimize latency. By running
  inside the GPU process, we may have access to data such
  as rasterized clipping masks that enable finer-grained hit testing;
- the GPU process talks back to the content process; in particular,
  when APZ scrolls out of bounds, it asks Content to enlarge/shift the
  Scene with a new "display port";
- we still use the GPU when we can for compositing even in the
  non-WebRender case.


WebRender In Detail
-------------------

Converting a display list into GPU commands is broken down into a
number of steps and intermediate data structures.


.. image:: RenderingOverviewTrees.png
   :width: 75%
   :align: center

..

    *Each element in the picture tree points to exactly one node in the spatial
    tree. Only a few of these links are shown for clarity (the dashed lines).*

The Picture Tree
~~~~~~~~~~~~~~~~

The incoming display list uses "stacking contexts".  For example, to
render some text with a drop shadow, a display list will contain three
items:

- "enable shadow" with some parameters such as shadow color, blur size, and offset;
- the text item;
- "pop all shadows" to deactivate shadows.

WebRender will break this down into two distinct elements, or
"pictures".  The first represents the shadow, so it contains a copy of the
text item, but modified to use the shadow's color, and to shift the
text by the shadow's offset.  The second picture contains the original text
to draw on top of the shadow.

The fact that the first picture, the shadow, needs to be blurred is a
"compositing" property of the picture which we'll deal with later.

Thus, the stack-based display list gets converted into a list of pictures
-- or more generally, a hierarchy of pictures, since items are nested
as per the original HTML.
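
The shadow example can be sketched in a few lines.  The types below
(``Shadow``, ``Text``, ``DisplayItem``, ``Picture``) are simplified,
hypothetical stand-ins for this example; WebRender's real display items and
pictures carry far more state.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Shadow {
    color: u32,
    blur_radius: f32,
    offset: (f32, f32),
}

#[derive(Clone, Debug, PartialEq)]
struct Text {
    content: String,
    color: u32,
    origin: (f32, f32),
}

// The three kinds of item from the example above.
enum DisplayItem {
    PushShadow(Shadow),
    Text(Text),
    PopAllShadows,
}

// A picture holds what to draw; the blur is a compositing property.
#[derive(Debug, PartialEq)]
struct Picture {
    blur_radius: f32,
    text: Text,
}

// Turn the stack-based list into a flat list of pictures:
// one picture per active shadow, then one for the text itself.
fn build_pictures(list: &[DisplayItem]) -> Vec<Picture> {
    let mut active: Vec<Shadow> = Vec::new();
    let mut pictures = Vec::new();
    for item in list {
        match item {
            DisplayItem::PushShadow(shadow) => active.push(*shadow),
            DisplayItem::PopAllShadows => active.clear(),
            DisplayItem::Text(text) => {
                for shadow in &active {
                    pictures.push(Picture {
                        blur_radius: shadow.blur_radius,
                        text: Text {
                            content: text.content.clone(),
                            color: shadow.color,
                            origin: (
                                text.origin.0 + shadow.offset.0,
                                text.origin.1 + shadow.offset.1,
                            ),
                        },
                    });
                }
                pictures.push(Picture { blur_radius: 0.0, text: text.clone() });
            }
        }
    }
    pictures
}
```

Note that the shadow picture only records ``blur_radius``; nothing is
blurred at this stage.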

Example visual elements are a TextRun, a LineDecoration, or an Image
(like a .png file).

Compared to 3D rendering, the picture tree is similar to a scenegraph: it's a
parent/child hierarchy of all the drawable elements that make up the "scene", in
this case the webpage.  One important difference is that the transformations are
stored in a separate tree, the spatial tree.

The Spatial Tree
~~~~~~~~~~~~~~~~

The nodes in the spatial tree represent coordinate transforms.  Every time the
DOM hierarchy needs child elements to be transformed relative to their parent,
we add a new Spatial Node to the tree. All those child elements will then point
to this node as their "local space" reference (aka coordinate frame).  In
traditional 3D terms, it's a scenegraph but only containing transform nodes.

The nodes are called frames, as in "coordinate frame":

- a Reference Frame corresponds to a ``<div>``;
- a Scrolling Frame corresponds to a scrollable part of the page;
- a Sticky Frame corresponds to some fixed position CSS style.

Each element in the picture tree then points to a spatial node inside this tree,
so by walking up and down the tree we can find the absolute position of where
each element should render (traversing down) and how large each element needs to
be (traversing up).  Originally the transform information was part of the
picture tree, as in a traditional scenegraph, but visual elements and their
transforms were split apart for technical reasons.

Some of these nodes are dynamic.  A Scrolling Frame can obviously scroll, but a
Reference Frame might also use a property binding to enable a live link with
JavaScript, for dynamic updates of (currently) the transform and opacity.

Axis-aligned transformations (scales and translations) are considered "simple",
and are conceptually combined into a single "CoordinateSystem".  When we
encounter a non-axis-aligned transform, we start a new CoordinateSystem.  We
start in CoordinateSystem 0 at the root, and would bump this to CoordinateSystem
1 when we encounter a Reference Frame with a rotation or 3D transform, for
example.  This would then be the CoordinateSystem index for all its children,
until we run into another (nested) non-simple transform, and so on.  Roughly
speaking, as long as we're in the same CoordinateSystem, the transform stack is
simple enough that we have a reasonable chance of being able to flatten it. That
lets us directly rasterize text at its final scale, for example, optimizing
away some of the intermediate pictures (offscreen textures).

The layout code positions elements relative to their parent.  Thus to position
the element on the actual page, we need to walk the Spatial Tree all the way to
the root and apply each transform; the result is a ``LayoutToWorldTransform``.

One final step transforms from World to Device coordinates, which deals with
DPI scaling and such.
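
Both ideas, walking to the root to build a layout-to-world transform and
assigning CoordinateSystem indices, can be sketched with plain 2D affine
transforms.  ``Affine``, ``SpatialNode`` and the function names are
hypothetical; WebRender works with full 3D transforms, and the sketch assumes
parents are stored before their children.

```rust
// 2D affine transform: x' = a*x + c*y + e,  y' = b*x + d*y + f.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Affine {
    a: f32,
    b: f32,
    c: f32,
    d: f32,
    e: f32,
    f: f32,
}

impl Affine {
    fn translate(x: f32, y: f32) -> Affine {
        Affine { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: x, f: y }
    }

    fn scale(s: f32) -> Affine {
        Affine { a: s, b: 0.0, c: 0.0, d: s, e: 0.0, f: 0.0 }
    }

    fn rotate_90() -> Affine {
        Affine { a: 0.0, b: 1.0, c: -1.0, d: 0.0, e: 0.0, f: 0.0 }
    }

    // Scales and translations only: no rotation or skew terms.
    fn is_axis_aligned(&self) -> bool {
        self.b == 0.0 && self.c == 0.0
    }

    // Result applies `child` first, then `self` (the parent).
    fn compose(self, child: Affine) -> Affine {
        Affine {
            a: self.a * child.a + self.c * child.b,
            b: self.b * child.a + self.d * child.b,
            c: self.a * child.c + self.c * child.d,
            d: self.b * child.c + self.d * child.d,
            e: self.a * child.e + self.c * child.f + self.e,
            f: self.b * child.e + self.d * child.f + self.f,
        }
    }

    fn apply(&self, p: (f32, f32)) -> (f32, f32) {
        (self.a * p.0 + self.c * p.1 + self.e, self.b * p.0 + self.d * p.1 + self.f)
    }
}

struct SpatialNode {
    parent: Option<usize>,
    local: Affine,
}

// Walk from a node up to the root, accumulating each parent's transform.
fn layout_to_world(nodes: &[SpatialNode], index: usize) -> Affine {
    let mut result = nodes[index].local;
    let mut current = nodes[index].parent;
    while let Some(parent) = current {
        result = nodes[parent].local.compose(result);
        current = nodes[parent].parent;
    }
    result
}

// A node inherits its parent's CoordinateSystem unless its own transform
// is not axis-aligned, in which case it starts a fresh one.
fn coordinate_systems(nodes: &[SpatialNode]) -> Vec<usize> {
    let mut ids = Vec::with_capacity(nodes.len());
    let mut next_id = 1;
    for node in nodes {
        let parent_id = node.parent.map_or(0, |p| ids[p]);
        if node.local.is_axis_aligned() {
            ids.push(parent_id);
        } else {
            ids.push(next_id);
            next_id += 1;
        }
    }
    ids
}
```

For a chain "translate, scale, rotate, translate", the first two nodes share
CoordinateSystem 0 and the rotation starts CoordinateSystem 1, which its
translated child inherits.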

.. csv-table::
    :header: "WebRender term", "Rough analogy"

      Spatial Tree, Scenegraph -- transforms only
      Picture Tree, Scenegraph -- drawables only (grouping)
      Spatial Tree Rootnode, World Space
      Layout space, Local/Object Space
      Picture, RenderTarget (sort of; see RenderTask below)
      Layout-To-World transform, Local-To-World transform
      World-To-Device transform, World-To-Clipspace transform


The Clip Tree
~~~~~~~~~~~~~

Finally, we also have a Clip Tree, which contains Clip Shapes. For
example, a rounded corner div will produce a clip shape, and since
divs can be nested, you end up with another tree.  By pointing at a Clip Shape,
visual elements will be clipped against this shape plus all parent shapes above it
in the Clip Tree.

As with CoordinateSystems, a chain of simple 2D clip shapes can be collapsed
into something that can be handled in the vertex shader, at very little extra
cost.  More complex clips must be rasterized into a mask first, which we then
sample from to ``discard`` in the pixel shader as needed.
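
A sketch of collapsing a clip chain: walk from a clip node up through its
parents, intersect the rectangles, and note whether any shape along the way
is too complex for a plain rectangle.  ``ClipShape``, ``ClipNode`` and
``collapse`` are hypothetical names used only for this example.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

impl Rect {
    fn intersect(self, other: Rect) -> Rect {
        Rect {
            x0: self.x0.max(other.x0),
            y0: self.y0.max(other.y0),
            x1: self.x1.min(other.x1),
            y1: self.y1.min(other.y1),
        }
    }
}

enum ClipShape {
    Rectangle(Rect),
    RoundedRectangle { bounds: Rect, radius: f32 },
}

struct ClipNode {
    parent: Option<usize>,
    shape: ClipShape,
}

#[derive(Debug, PartialEq)]
struct CollapsedClip {
    // Intersection of the bounds of every shape in the chain.
    rect: Rect,
    // True if some shape cannot be expressed as a rectangle alone.
    needs_mask: bool,
}

fn collapse(nodes: &[ClipNode], leaf: usize) -> CollapsedClip {
    let mut rect = Rect {
        x0: f32::NEG_INFINITY,
        y0: f32::NEG_INFINITY,
        x1: f32::INFINITY,
        y1: f32::INFINITY,
    };
    let mut needs_mask = false;
    let mut current = Some(leaf);
    while let Some(index) = current {
        let bounds = match &nodes[index].shape {
            ClipShape::Rectangle(r) => *r,
            ClipShape::RoundedRectangle { bounds, radius } => {
                if *radius > 0.0 {
                    needs_mask = true;
                }
                *bounds
            }
        };
        rect = rect.intersect(bounds);
        current = nodes[index].parent;
    }
    CollapsedClip { rect, needs_mask }
}
```

When ``needs_mask`` is false, the single collapsed rectangle is all that is
required; otherwise it at least bounds the area for which a mask would be
needed.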

In summary, at the end of scene building the display list has been turned into
a picture tree, plus a spatial tree that tells us what goes where
relative to what, plus a clip tree.

RenderTask Tree
~~~~~~~~~~~~~~~

Now in a perfect world we could simply traverse the picture tree and start
drawing things: one drawcall per picture to render its contents, plus one
drawcall to draw the picture into its parent.  However, recall that the first
picture in our example is a "text shadow" that needs to be blurred.  We can't
just rasterize blurry text directly, so we need a number of steps or "render
passes" to get the intended effect:

.. image:: RenderingOverviewBlurTask.png
   :align: right
   :height: 400px

- rasterize the text into an offscreen rendertarget;
- apply one or more downscaling passes until the blur radius is reasonable;
- apply a horizontal Gaussian blur;
- apply a vertical Gaussian blur;
- use the result as an input for whatever comes next, or blit it to
  its final position on the page (or more generally, on the containing
  parent surface/picture).

In the general case, which passes we need and how many of them depends
on how the picture is supposed to be composited (CSS filters, SVG
filters, effects) and its parameters (very large vs. small blur
radius, say).
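
The blur steps listed above can be sketched as a function from a blur radius
to a list of passes.  The ``Pass`` enum and the ``MAX_BLUR_RADIUS`` threshold
are assumptions made for this example; the real values and task types in
WebRender differ.

```rust
#[derive(Debug, PartialEq)]
enum Pass {
    Rasterize,
    Downscale,
    BlurHorizontal,
    BlurVertical,
    Composite,
}

// Hypothetical limit on the radius a single blur pass handles well.
const MAX_BLUR_RADIUS: f32 = 16.0;

// Each downscale halves the image, and with it the radius we need.
fn blur_passes(mut radius: f32) -> Vec<Pass> {
    let mut passes = vec![Pass::Rasterize];
    while radius > MAX_BLUR_RADIUS {
        passes.push(Pass::Downscale);
        radius *= 0.5;
    }
    passes.push(Pass::BlurHorizontal);
    passes.push(Pass::BlurVertical);
    passes.push(Pass::Composite);
    passes
}
```

A small radius yields four passes, while a radius of 40 adds two downscale
passes before the blur, which illustrates how the parameters change the
shape of the work.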

Thus, we walk the picture tree and build a render task tree: each high
level abstraction like "blur me" gets broken down into the necessary
render passes to get the effect.  The result is again a tree because a
render pass can have multiple input dependencies (e.g. blending).

(Compare with games: this has echoes of the Frostbite FrameGraph in that it
dynamically builds up a renderpass DAG and dynamically allocates storage
for the outputs.)

If there are complicated clip shapes that need to be rasterized first,
so their output can be sampled as a texture for clip/discard
operations, those presumably also end up in this tree as a dependency.

Once we have the entire tree of dependencies, we analyze it to see
which tasks can be combined into a single pass for efficiency.  We
ping-pong rendertargets when we can, but sometimes the dependencies
cut across more than one level of the rendertask tree, and some
copying is necessary.

Once we've figured out the passes and allocated storage for anything
we wish to persist in the texture cache, we finally start rendering.

When rasterizing the elements into the Picture's offscreen texture, we
position them by walking the transform hierarchy as far up as the picture's
transform node, resulting in a ``Layout To Picture`` transform.  The picture
then goes onto the page using a ``Picture To World`` coordinate transform.

Caching
```````

Just as with layers in the software rasterizer, it is not always necessary to
redraw absolutely everything when parts of a document change.  The WebRender
equivalent of layers is Slices -- a grouping of pictures that are expected to
render and update together.  Slices are automatically created based on
heuristics and layout hints/flags.

Implementation-wise, slices reuse a lot of the existing machinery for Pictures;
in fact they're implemented as a "Virtual picture" of sorts.  The similarities
make sense: both need to allocate offscreen textures in a cache, both will
position and render all their children into it, and both then draw themselves
into their parent as part of the parent's draw.

If a slice isn't expected to change much, we give it a TileCacheInstance. It is
itself made up of Tiles, where each tile will track what's in it, what's
changing, and whether it needs to be invalidated and redrawn as a result.
Thus the "damage" from changes can be localized to single tiles, while we
salvage the rest of the cache.  If tiles keep seeing a lot of invalidations,
they will recursively divide themselves in a quad-tree like structure to try and
localize the invalidations.  (And conversely, they'll recombine children if
nothing is invalidating them "for a while").
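
The quad-tree splitting can be sketched as follows.  The threshold, the
minimum tile size and the ``Tile`` layout are all hypothetical values picked
for this example, and recombining children is left out for brevity.

```rust
// A square tile region in pixels.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x: i32,
    y: i32,
    size: i32,
}

impl Rect {
    fn contains(&self, px: i32, py: i32) -> bool {
        px >= self.x && px < self.x + self.size && py >= self.y && py < self.y + self.size
    }
}

// Hypothetical tuning values.
const SPLIT_THRESHOLD: u32 = 3;
const MIN_TILE_SIZE: i32 = 64;

struct Tile {
    rect: Rect,
    hits: u32,
    // Empty for a leaf tile, four entries once split.
    children: Vec<Tile>,
}

impl Tile {
    fn new(rect: Rect) -> Tile {
        Tile { rect, hits: 0, children: Vec::new() }
    }

    // Record an invalidation at a point; split a leaf that keeps getting hit.
    fn invalidate(&mut self, px: i32, py: i32) {
        if !self.rect.contains(px, py) {
            return;
        }
        if !self.children.is_empty() {
            for child in self.children.iter_mut() {
                child.invalidate(px, py);
            }
            return;
        }
        self.hits += 1;
        let half = self.rect.size / 2;
        if self.hits >= SPLIT_THRESHOLD && half >= MIN_TILE_SIZE {
            let (x, y) = (self.rect.x, self.rect.y);
            self.children = vec![
                Tile::new(Rect { x, y, size: half }),
                Tile::new(Rect { x: x + half, y, size: half }),
                Tile::new(Rect { x, y: y + half, size: half }),
                Tile::new(Rect { x: x + half, y: y + half, size: half }),
            ];
        }
    }

    fn leaf_count(&self) -> usize {
        if self.children.is_empty() {
            1
        } else {
            self.children.iter().map(|c| c.leaf_count()).sum::<usize>()
        }
    }
}
```

Repeated invalidations in one corner keep subdividing only that corner, so
the untouched three quarters of the original tile stay cached as they were.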

Interning
`````````

To spot invalidated tiles, we need a fast way to compare a tile's contents from
the previous frame with the current frame.  To speed this up, we use interning;
similar to string-interning, this means that each ``TextRun``, ``Decoration``,
``Image`` and so on is registered in a repository (a ``DataStore``) and
subsequently referred to by its unique ID. Cache contents can then be encoded as a
list of IDs (one such list per internable element type).  Diffing is then just a
fast list comparison.
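
A minimal sketch of interning, assuming a single element type.  ``TextRun``
and ``DataStore`` borrow their names from the text above, but their fields
and the ``intern`` / ``tile_changed`` functions are invented for this example
and do not mirror WebRender's actual implementation.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct TextRun {
    text: String,
    font_size: u32,
}

// Repository that hands out one stable ID per distinct item.
#[derive(Default)]
struct DataStore {
    ids: HashMap<TextRun, u32>,
    items: Vec<TextRun>,
}

impl DataStore {
    // Return the existing ID for an equal item, or register a new one.
    fn intern(&mut self, item: TextRun) -> u32 {
        if let Some(&id) = self.ids.get(&item) {
            return id;
        }
        let id = self.items.len() as u32;
        self.items.push(item.clone());
        self.ids.insert(item, id);
        id
    }

    // Look the full item up again from its ID.
    fn get(&self, id: u32) -> &TextRun {
        &self.items[id as usize]
    }
}

// A tile's contents are just a list of IDs, so diffing is a slice compare.
fn tile_changed(previous: &[u32], current: &[u32]) -> bool {
    previous != current
}
```

The expensive comparison (hashing the full item) happens once at intern time;
after that, every per-frame check is a comparison of small integers.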


Callbacks
`````````
GPU text rendering assumes that the individual font glyphs are already
available in a texture atlas.  Likewise, SVG is not rendered on
the GPU.  Both inputs are prepared during scene building: glyph
rasterization via a thread pool from within Rust itself, and SVG via
opaque callbacks (back to C++) that produce blobs.