docs/performance/memory/heap_scan_mode.md

   1 # DMD heap scan mode
   2
   3 Firefox's DMD heap scan mode tracks the set of all live blocks of
   4 malloc-allocated memory and their allocation stacks, and allows you to
   5 log these blocks, and the values stored in them, to a file. When
   6 combined with cycle collector logging, this can be used to investigate
   7 leaks of refcounted cycle collected objects, by figuring out what holds
   8 a strong reference to a leaked object.
   9
  10 **When should you use this?** DMD heap scan mode is intended to be used
  11 to investigate leaks of cycle collected (CCed) objects. DMD heap scan
  12 mode is a "tool of last resort" that should only be used when all
  13 other avenues have been tried and failed, except possibly [ref count
  14 logging](refcount_tracing_and_balancing.md).
  15 It is particularly useful if you have no idea what is causing the leak.
  16 If you have a patch that introduces a leak, you are probably better off
  17 auditing all of the strong references that your patch creates before
  18 trying this.
  19
  20 The particular steps given below are intended for the case where the
  21 leaked object is alive all the way through shutdown. You could modify
  22 these steps for leaks that go away in shutdown by collecting a CC and
  23 DMD log prior to shutdown. However, in that case it may be easier to use
  24 refcount logging, or rr with a conditional breakpoint set on calls to
  25 `Release()` for the leaking object, to see what object actually does the
  26 release that causes the leaked object to go away.
  27
  28 ## Prerequisites
  29
  30 -   A debug DMD build of Firefox. [This
  31     page](dmd.md)
  32     describes how to do that. This should probably be an optimized
  33     build. Non-optimized DMD builds will generate better stack traces,
  34     but they can be so slow as to be useless.
  35 -   The build is going to be very slow, so you may need to disable some
  36     shutdown checks. First, in
  37     `toolkit/components/terminator/nsTerminator.cpp`, delete everything
  38     in `RunWatchDog` but the call to `NS_SetCurrentThreadName`. This
  39     will keep the watch dog from killing the browser when shut down
  40     takes multiple minutes. Secondly, you may need to comment out the
  41     call to `MOZ_CRASH("NSS_Shutdown failed");` in
  42     `xpcom/build/XPCOMInit.cpp`, as this also seems to trigger when
  43     shutdown is extremely slow.
  44 -   You need the cycle collector analysis script `find_roots.py`, which
  45     can be downloaded as part of [this repo on
  46     Github](https://github.com/amccreight/heapgraph).
  47
  48 ## Generating Logs
  49
  50 The next step is to generate a number of log files. You need to get a
  51 shutdown CC log and a DMD log, for a single run.
  52
  53 **Definitions** I'll write `$objdir` for the object directory for your
  54 Firefox DMD build, `$srcdir` for the top level of the Firefox source
  55 directory, and `$heapgraph` for the location of the heapgraph repo, and
  56 `$logdir` for the location you want logs to go to. `$logdir` should end
  57 in a path separator. For instance, `~/logs/leak/`.
  58
  59 The command you need to run Firefox will look something like this:
  60
  61     XPCOM_MEM_BLOAT_LOG=1 MOZ_CC_LOG_SHUTDOWN=1 MOZ_DISABLE_CONTENT_SANDBOX=t MOZ_CC_LOG_DIRECTORY=$logdir
  62     MOZ_CC_LOG_PROCESS=content MOZ_CC_LOG_THREAD=main MOZ_DMD_SHUTDOWN_LOG=$logdir MOZ_DMD_LOG_PROCESS=tab ./mach run --dmd --mode=scan
  63
  64 Breaking this down:
  65
  66 -   `XPCOM_MEM_BLOAT_LOG=1`: This reports a list of the counts of every
  67     object created and destroyed and tracked by the XPCOM leak tracking
  68     system. From this chart, you can see how many objects of a
  69     particular type were leaked through shutdown. This can come in handy
  70     during the manual analysis phase later, to get evidence to support
  71     your hunches. For instance, if you think that an `nsFoo` object
  72     might be holding your leaking object alive, you can use this to
  73     easily see if we leaked an `nsFoo` object.
  74 -   `MOZ_CC_LOG_SHUTDOWN=1`: This generates a cycle collector log during
  75     shutdown. Creating this log during shutdown is nice because there
  76     are less things unrelated to the leak in the log, and various cycle
  77     collector optimizations are disabled. A garbage collector log will
  78     also be created, which you may not need.
  79 -   `MOZ_DISABLE_CONTENT_SANDBOX=t`: This disables the content process
  80     sandbox, which is needed because the DMD and CC log files are
  81     created directly by the child processes.
  82 -   `MOZ_CC_LOG_DIRECTORY=$logdir`: This selects the location for cycle
  83     collector logs to be saved.
  84 -   `MOZ_CC_LOG_PROCESS=content MOZ_CC_LOG_THREAD=main`: These options
  85     specify that we only want CC logs for the main thread of content
  86     processes, to make shutdown less slow. If your leak is happening in
  87     a different process or thread, change the options, which are listed
  88     in `xpcom/base/nsCycleCollector.cpp`.
  89 -   `MOZ_DMD_SHUTDOWN_LOG=$logdir`: This option specifies that we want a
  90     DMD log to be taken very late in XPCOM shutdown, and the location
  91     for that log to be saved. Like with the CC log, we want this log
  92     very late to avoid as many non-leaking things as possible.
  93 -   `MOZ_DMD_LOG_PROCESS=tab`: As with the CC, this means that we only
  94     want these logs in content processes, in order to make shutdown
  95     faster. The allowed values here are the same as those returned by
  96     `XRE_GetProcessType()`, so adjust as needed.
  97 -   Finally, the `--dmd` option need to be passed in so that DMD will be
  98     run. `--mode=scan` is needed so that when we get a DMD log the
  99     entire contents of each block of memory is saved for later analysis.
 100
 101 With that command line in hand, you can start Firefox. Be aware that
 102 this may take multiple minutes if you have optimization disabled.
 103
 104 Once it has started, go through the steps you need to reproduce your
 105 leak. If your leak is a ghost window, it can be handy to get an
 106 `about:memory` report and write down the PID of the leaking process. You
 107 may want to wait 10 or so seconds after this to make sure as much as
 108 possible is cleaned up.
 109
 110 Next, exit the browser. This will cause a lot of logs to be written out,
 111 so it can take a while.
 112
 113 ## Analyzing the Logs
 114
 115 ### Getting the PID and address of the leaking object
 116
 117 The first step is to figure out the **PID** of the leaking process. The
 118 second step is to figure out **the address of the leaking object**,
 119 usually a window. Conveniently, you can usually do both at once using
 120 the cycle collector log. If you are investigating a leak of
 121 `www.example.com`, then from `$logdir` you can do
 122 `"grep nsGlobalWindow cc-edges* | grep example.com"`. This looks through
 123 all of the windows in all of the CC logs (which may leaked, this late in
 124 shutdown), and then filters out windows where the URL contains
 125 `example.com`.
 126
 127 The result of that grep will contain output that looks something like
 128 this:
 129
 130     cc-edges.15873.log:0x7f0897082c00 [rc=1285] nsGlobalWindowInner # 2147483662 inner https://www.example.com/
 131
 132 cc-edges.15873.log: The first part is the file name where it was
 133 found. `15873` is the PID of the process that leaked. You'll want to
 134 write down the name of the file and the PID. Let's call the file
 135 `$cclog` and the pid `$pid`.
 136
 137 0x7f0897082c00: This is the address of the leaking window. You'll
 138 also want to write that down. Let's call this `$winaddr`.
 139
 140 If there are multiple files, you'll end up with one that looks like
 141 `cc-edges.$pid.log` and one or more that look like
 142 `cc-edges.$pid-$n.log` for various values of `$n`. You want the one with
 143 the largest `$n`, as this was recorded the latest, and so it will
 144 contain the least non-garbage.
 145
 146 ### Identifying the root in the cycle collector log
 147
 148 The next step is to figure out why the cycle collector could not collect
 149 the window, using the `find_roots.py` script from the heapgraph
 150 repository. The command to invoke this looks like this:
 151
 152     python $heapgraph/find_roots.py $cclog $winaddr
 153
 154 This may take a few seconds. It will eventually produce some output.
 155 You'll want to save a copy of this output for later.
 156
 157 The output will look something like this, after a message about loading
 158 progress:
 159
 160     0x7f0882fe3230 [FragmentOrElement (xhtml) script https://www.example.com]
 161     --[[via hash] mListenerManager]--> 0x7f0899b4e550 [EventListenerManager]
 162     --[mListeners event=onload listenerType=3 [i]]--> 0x7f0882ff8f80 [CallbackObject]
 163     --[mIncumbentGlobal]--> 0x7f0897082c00 [nsGlobalWindowInner # 2147483662 inner https://www.example.com]
 164
 165 Root 0x7f0882fe3230 is a ref counted object with 1 unknown edge(s).
 166     known edges:
 167     0x7f08975a24c0 [FragmentOrElement (xhtml) head https://www.example.com] --[mAttrsAndChildren[i]]--> 0x7f0882fe3230
 168     0x7f08967e7b20 [JS Object (HTMLScriptElement)] --[UnwrapDOMObject(obj)]--> 0x7f0882fe3230
 169
 170 The first two lines mean that the script element `0x7f0882fe3230`
 171 contains a strong reference to the EventListenerManager
 172 `0x7f0899b4e550`. "[via hash] mListenerManager" is a description of
 173 that strong reference. Together, these lines show a chain of strong
 174 references from an object the cycle collector thinks needs to be kept
 175 alive, `0x7f0899b4e550`, to the object` 0x7f0897082c00` that you asked
 176 about. Most of the time, the actual chain is not important, because the
 177 cycle collector can only tell us about what went right. Let us call the
 178 address of the leaking object (`0x7f0882fe3230` in this case)
 179 `$leakaddr`.
 180
 181 Besides `$leakaddr`, the other interesting part is the chunk at the
 182 bottom. It tells us that there is 1 unknown edge, and 2 known edges.
 183 What this means is that the leaking object has a refcount of 3, but the
 184 cycle collector was only told about these two references. In this case,
 185 a head element and a JS object (the JS reflector of the script element).
 186 We need to figure out what the unknown reference is from, as that is
 187 where our leak really is.
 188
 189 ### Figure out what is holding the leaking object alive.
 190
 191 Now we need to use the DMD heap scan logs. These contain the contents of
 192 every live block of memory.
 193
 194 The first step to using the DMD heap scan logs is to do some
 195 pre-processing for the DMD log. Stacks need to be symbolicated, and we
 196 need to clamp the values contained in the heap. Clamping is the same
 197 kind of analysis that a conservative GC does: if a word-aligned value in
 198 a heap block points to somewhere within another heap block, replace that
 199 value with the address of the block.
 200
 201 Both kinds of preprocessing are done by the `dmd.py` script, which can
 202 be invoked like this:
 203
 204     $objdir/dist/bin/dmd.py --clamp-contents dmd-$pid.log.gz
 205
 206 This can take a few minutes due to symbolification, but you only need to
 207 run it once on a log file.
 208
 209 You can also locally symbolicate stacks from DMD logs generated on TreeHerder,
 210 but it will [take a few extra steps](/contributing/debugging/local_symbols.rst)
 211 that you need to do before running `dmd.py`.
 212
 213 After that is done, we can finally find out which objects (possibly)
 214 point to other objects, using the block_analyzer script:
 215
 216     python $srcdir/memory/replace/dmd/block_analyzer.py dmd-$pid.log.gz $leakaddr
 217
 218 This will look through every block of memory in the log, and give some
 219 basic information about any block of memory that (possibly) contains a
 220 pointer to that object. You can pass some additional options to affect
 221 how the results are displayed. "-sfl 10000 -a" is useful. The -sfl 10000
 222 tells it to not truncate stack frames, and -a tells it to not display
 223 generic frames related to the allocator.
 224
 225 Caveat: I think block_analyzer.py does not attempt to clamp the address
 226 you pass into it, so if it is an offset from the start of the block, it
 227 won't find it.
 228
 229     block_analyzer.py` will return a series of entries that look like this
 230     with the [...] indicating where I have removed things):
 231     0x7f089306b000 size = 4096 bytes at byte offset 2168
 232     nsAttrAndChildArray::GrowBy[...]
 233     nsAttrAndChildArray::InsertChildAt[...]
 234     [...]
 235
 236 `0x7f089306b000` is the address of the block that contains `$leakaddr`.
 237 144 bytes is the size of that block. That can be useful for confirming
 238 your guess about what class the block actually is. The byte offset tells
 239 you were in the block the pointer is. This is mostly useful for larger
 240 objects, and you can potentially combine this with debugging information
 241 to figure out exactly what field this is. The rest of the entry is the
 242 stack trace for the allocation of the block, which is the most useful
 243 piece of information.
 244
 245 What you need to do now is to go through every one of these entries and
 246 place it into three categories: strong reference known to the cycle
 247 collector, weak reference, or something else! The goal is to eventually
 248 shrink down the "something else" category until there are only as many
 249 things in it as there are unknown references to the leaking object, and
 250 then you have your leaker.
 251
 252 To place an entry into one of the categories, you must look at the code
 253 locations given in the stack trace, and see if you can tell what the
 254 object is based on that, then compare that to what `find_roots.py` told
 255 you.
 256
 257 For instance, one of the strong references in the CC log is from a head
 258 element to its child via `mAttrsAndChildren`, and that sounds a lot like
 259 this, so we can mark it as being a strong known reference.
 260
 261 This is an iterative process, where you first go through and mark off
 262 the things that are easily categorizable, and repeat until you have a
 263 small list of things to analyze.
 264
 265 ### Example analysis of block_analyzer.py results
 266
 267 In one debugging session where I was investigating the leak from bug
 268 1451985, I eventually reduced the list of entries until this was the
 269 most suspicious looking entry:
 270
 271     0x7f0892f29630 size = 392 bytes at byte offset 56
 272     mozilla::dom::ScriptLoader::ProcessExternalScript[...]
 273     [...]
 274
 275 I went to that line of `ScriptLoader::ProcessExternalScript()`, and it
 276 contained a call to ScriptLoader::CreateLoadRequest(). Fortunately, this
 277 method mostly just contains two calls to `new`, one for
 278 `ScriptLoadRequest` and one for `ModuleLoadRequest`. (This is where an
 279 unoptimized build comes in handy, as it would have pointed out the exact
 280 line. Unfortunately, in this particular case, the unoptimized build was
 281 so slow I wasn't getting any logs.) I then looked through the list of
 282 leaked objects generated by `XPCOM_MEM_BLOAT_LOG` and saw that we were
 283 leaking a `ScriptLoadRequest`, so I went and looked at its class
 284 definition, where I noticed that `ScriptLoadRequest` had a strong
 285 reference to an element that it wasn't telling the cycle collector
 286 about, which seemed suspicious.
 287
 288 The first thing I did to try to confirm that this was the source of the
 289 leak was pass the address of this object into the cycle collector
 290 analysis log, `find_roots.py`, that we used at an earlier step. That
 291 gave a result that contained this:
 292
 293     0x7f0882fe3230 [FragmentOrElement (xhtml) script [...]
 294     --[mNodeInfo]--> 0x7f0897431f00 [NodeInfo (xhtml) script]
 295     [...]
 296     --[mLoadingAsyncRequests]--> 0x7f0892f29630 [ScriptLoadRequest]
 297
 298 This confirms that this block is actually a ScriptLoadRequest. Secondly,
 299 notice that the load request is being held alive by the very same script
 300 element that is causing the window leak! This strongly suggests that
 301 there is a cycle of strong references between the script element and the
 302 load request. I then added the missing field to the traverse and unlink
 303 methods of ScriptLoadRequest, and confirmed that I couldn't reproduce
 304 the leak.
 305
 306 Keep in mind that you may need to run `block_analyzer.py` multiple
 307 times. For instance, if the script element was being held alive by some
 308 container being held alive by a runnable, we'd first need to figure out
 309 that the container was holding the element. If it isn't possible to
 310 figure out what is holding that alive, you'd have to run block_analyzer
 311 again. This isn't too bad, because unlike ref count logging, we have the
 312 full state of memory in our existing log, so we don't need to run the
 313 browser again.