gcc/doc/analyzer.texi

   1 @c Copyright (C) 2019 Free Software Foundation, Inc.
   2 @c This is part of the GCC manual.
   3 @c For copying conditions, see the file gcc.texi.
   4 @c Contributed by David Malcolm <dmalcolm@redhat.com>.
   5
   6 @node Static Analyzer
   7 @chapter Static Analyzer
   8 @cindex analyzer
   9 @cindex static analysis
  10 @cindex static analyzer
  11
  12 @menu
  13 * Analyzer Internals::       Analyzer Internals
  14 * Debugging the Analyzer::   Useful debugging tips
  15 @end menu
  16
  17 @node Analyzer Internals
  18 @section Analyzer Internals
  19 @cindex analyzer, internals
  20 @cindex static analyzer, internals
  21
  22 @subsection Overview
  23
  24 The analyzer implementation works on the gimple-SSA representation.
  25 (I chose this in the hopes of making it easy to work with LTO to
  26 do whole-program analysis).
  27
  28 The implementation is read-only: it doesn't attempt to change anything,
  29 just emit warnings.
  30
  31 The gimple representation can be seen using @option{-fdump-ipa-analyzer}.
  32
  33 First, we build a @code{supergraph} which combines the callgraph and all
  34 of the CFGs into a single directed graph, with both interprocedural and
  35 intraprocedural edges.  The nodes and edges in the supergraph are called
  36 ``supernodes'' and ``superedges'', and often referred to in code as
  37 @code{snodes} and @code{sedges}.  Basic blocks in the CFGs are split at
  38 interprocedural calls, so there can be more than one supernode per
  39 basic block.  Most statements will be in just one supernode, but a call
  40 statement can appear in two supernodes: at the end of one for the call,
  41 and again at the start of another for the return.
  42
  43 The supergraph can be seen using @option{-fdump-analyzer-supergraph}.
  44
  45 We then build an @code{analysis_plan} which walks the callgraph to
  46 determine which calls might be suitable for being summarized (rather
  47 than fully explored) and thus in what order to explore the functions.
  48
  49 Next is the heart of the analyzer: we use a worklist to explore state
  50 within the supergraph, building an "exploded graph".
  51 Nodes in the exploded graph correspond to <point,@w{ }state> pairs, as in
  52      "Precise Interprocedural Dataflow Analysis via Graph Reachability"
  53      (Thomas Reps, Susan Horwitz and Mooly Sagiv).
  54
  55 We reuse nodes for <point, state> pairs we've already seen, and avoid
  56 tracking state too closely, so that (hopefully) we rapidly converge
  57 on a final exploded graph, and terminate the analysis.  We also bail
  58 out if the number of exploded <end-of-basic-block, state> nodes gets
  59 larger than a particular multiple of the total number of basic blocks
  60 (to ensure termination in the face of pathological state-explosion
  61 cases, or bugs).  We also stop exploring a point once we hit a limit
  62 of states for that point.
  63
  64 We can identify problems directly when processing a <point,@w{ }state>
  65 instance.  For example, if we're finding the successors of
  66
  67 @smallexample
  68    <point: before-stmt: "free (ptr);",
  69     state: @{"ptr": freed@}>
  70 @end smallexample
  71
  72 then we can detect a double-free of "ptr".  We can then emit a path
  73 to reach the problem by finding the simplest route through the graph.
  74
  75 Program points in the analysis are much more fine-grained than in the
  76 CFG and supergraph, with points (and thus potentially exploded nodes)
  77 for various events, including before individual statements.
  78 By default the exploded graph merges multiple consecutive statements
  79 in a supernode into one exploded edge to minimize the size of the
  80 exploded graph.  This can be suppressed via
  81 @option{-fanalyzer-fine-grained}.
  82 The fine-grained approach seems to make things simpler and more debuggable
  83 that other approaches I tried, in that each point is responsible for one
  84 thing.
  85
  86 Program points in the analysis also have a "call string" identifying the
  87 stack of callsites below them, so that paths in the exploded graph
  88 correspond to interprocedurally valid paths: we always return to the
  89 correct call site, propagating state information accordingly.
  90 We avoid infinite recursion by stopping the analysis if a callsite
  91 appears more than @code{analyzer-max-recursion-depth} in a callstring
  92 (defaulting to 2).
  93
  94 @subsection Graphs
  95
  96 Nodes and edges in the exploded graph are called ``exploded nodes'' and
  97 ``exploded edges'' and often referred to in the code as
  98 @code{enodes} and @code{eedges} (especially when distinguishing them
  99 from the @code{snodes} and @code{sedges} in the supergraph).
 100
 101 Each graph numbers its nodes, giving unique identifiers - supernodes
 102 are referred to throughout dumps in the form @samp{SN': @var{index}} and
 103 exploded nodes in the form @samp{EN: @var{index}} (e.g. @samp{SN: 2} and
 104 @samp{EN:29}).
 105
 106 The supergraph can be seen using @option{-fdump-analyzer-supergraph-graph}.
 107
 108 The exploded graph can be seen using @option{-fdump-analyzer-exploded-graph}
 109 and other dump options.  Exploded nodes are color-coded in the .dot output
 110 based on state-machine states to make it easier to see state changes at
 111 a glance.
 112
 113 @subsection State Tracking
 114
 115 There's a tension between:
 116 @itemize @bullet
 117 @item
 118 precision of analysis in the straight-line case, vs
 119 @item
 120 exponential blow-up in the face of control flow.
 121 @end itemize
 122
 123 For example, in general, given this CFG:
 124
 125 @smallexample
 126       A
 127      / \
 128     B   C
 129      \ /
 130       D
 131      / \
 132     E   F
 133      \ /
 134       G
 135 @end smallexample
 136
 137 we want to avoid differences in state-tracking in B and C from
 138 leading to blow-up.  If we don't prevent state blowup, we end up
 139 with exponential growth of the exploded graph like this:
 140
 141 @smallexample
 142
 143            1:A
 144           /   \
 145          /     \
 146         /       \
 147       2:B       3:C
 148        |         |
 149       4:D       5:D        (2 exploded nodes for D)
 150      /   \     /   \
 151    6:E   7:F 8:E   9:F
 152     |     |   |     |
 153    10:G 11:G 12:G  13:G    (4 exploded nodes for G)
 154
 155 @end smallexample
 156
 157 Similar issues arise with loops.
 158
 159 To prevent this, we follow various approaches:
 160
 161 @enumerate a
 162 @item
 163 state pruning: which tries to discard state that won't be relevant
 164 later on withing the function.
 165 This can be disabled via @option{-fno-analyzer-state-purge}.
 166
 167 @item
 168 state merging.  We can try to find the commonality between two
 169 program_state instances to make a third, simpler program_state.
 170 We have two strategies here:
 171
 172   @enumerate
 173   @item
 174      the worklist keeps new nodes for the same program_point together,
 175      and tries to merge them before processing, and thus before they have
 176      successors.  Hence, in the above, the two nodes for D (4 and 5) reach
 177      the front of the worklist together, and we create a node for D with
 178      the merger of the incoming states.
 179
 180   @item
 181      try merging with the state of existing enodes for the program_point
 182      (which may have already been explored).  There will be duplication,
 183      but only one set of duplication; subsequent duplicates are more likely
 184      to hit the cache.  In particular, (hopefully) all merger chains are
 185      finite, and so we guarantee termination.
 186      This is intended to help with loops: we ought to explore the first
 187      iteration, and then have a "subsequent iterations" exploration,
 188      which uses a state merged from that of the first, to be more abstract.
 189   @end enumerate
 190
 191 We avoid merging pairs of states that have state-machine differences,
 192 as these are the kinds of differences that are likely to be most
 193 interesting.  So, for example, given:
 194
 195 @smallexample
 196       if (condition)
 197         ptr = malloc (size);
 198       else
 199         ptr = local_buf;
 200
 201       .... do things with 'ptr'
 202
 203       if (condition)
 204         free (ptr);
 205
 206       ...etc
 207 @end smallexample
 208
 209 then we end up with an exploded graph that looks like this:
 210
 211 @smallexample
 212
 213                    if (condition)
 214                      / T      \ F
 215             ---------          ----------
 216            /                             \
 217       ptr = malloc (size)             ptr = local_buf
 218           |                               |
 219       copy of                         copy of
 220         "do things with 'ptr'"          "do things with 'ptr'"
 221       with ptr: heap-allocated        with ptr: stack-allocated
 222           |                               |
 223       if (condition)                  if (condition)
 224           | known to be T                 | known to be F
 225       free (ptr);                         |
 226            \                             /
 227             -----------------------------
 228                          | ('ptr' is pruned, so states can be merged)
 229                         etc
 230
 231 @end smallexample
 232
 233 where some duplication has occurred, but only for the places where the
 234 the different paths are worth exploringly separately.
 235
 236 Merging can be disabled via @option{-fno-analyzer-state-merge}.
 237 @end enumerate
 238
 239 @subsection Region Model
 240
 241 Part of the state stored at a @code{exploded_node} is a @code{region_model}.
 242 This is an implementation of the region-based ternary model described in
 243 @url{http://lcs.ios.ac.cn/~xuzb/canalyze/memmodel.pdf,
 244 "A Memory Model for Static Analysis of C Programs"}
 245 (Zhongxing Xu, Ted Kremenek, and Jian Zhang).
 246
 247 A @code{region_model} encapsulates a representation of the state of
 248 memory, with a tree of @code{region} instances, along with their associated
 249 values.  The representation is graph-like because values can be pointers
 250 to regions.  It also stores a constraint_manager, capturing relationships
 251 between the values.
 252
 253 Because each node in the @code{exploded_graph} has a @code{region_model},
 254 and each of the latter is graph-like, the @code{exploded_graph} is in some
 255 ways a graph of graphs.
 256
 257 Here's an example of printing a @code{region_model}, showing the ASCII-art
 258 used to visualize the region hierarchy (colorized when printing to stderr):
 259
 260 @smallexample
 261 (gdb) call debug (*this)
 262 r0: @{kind: 'root', parent: null, sval: null@}
 263 |-stack: r1: @{kind: 'stack', parent: r0, sval: sv1@}
 264 |  |: sval: sv1: @{poisoned: uninit@}
 265 |  |-frame for 'test': r2: @{kind: 'frame', parent: r1, sval: null, map: @{'ptr_3': r3@}, function: 'test', depth: 0@}
 266 |  |  `-'ptr_3': r3: @{kind: 'map', parent: r2, sval: sv3, type: 'void *', map: @{@}@}
 267 |  |    |: sval: sv3: @{type: 'void *', unknown@}
 268 |  |    |: type: 'void *'
 269 |  `-frame for 'calls_malloc': r4: @{kind: 'frame', parent: r1, sval: null, map: @{'result_3': r7, '_4': r8, '<anonymous>': r5@}, function: 'calls_malloc', depth: 1@}
 270 |    |-'<anonymous>': r5: @{kind: 'map', parent: r4, sval: sv4, type: 'void *', map: @{@}@}
 271 |    |  |: sval: sv4: @{type: 'void *', &r6@}
 272 |    |  |: type: 'void *'
 273 |    |-'result_3': r7: @{kind: 'map', parent: r4, sval: sv4, type: 'void *', map: @{@}@}
 274 |    |  |: sval: sv4: @{type: 'void *', &r6@}
 275 |    |  |: type: 'void *'
 276 |    `-'_4': r8: @{kind: 'map', parent: r4, sval: sv4, type: 'void *', map: @{@}@}
 277 |      |: sval: sv4: @{type: 'void *', &r6@}
 278 |      |: type: 'void *'
 279 `-heap: r9: @{kind: 'heap', parent: r0, sval: sv2@}
 280   |: sval: sv2: @{poisoned: uninit@}
 281   `-r6: @{kind: 'symbolic', parent: r9, sval: null, map: @{@}@}
 282 svalues:
 283   sv0: @{type: 'size_t', '1024'@}
 284   sv1: @{poisoned: uninit@}
 285   sv2: @{poisoned: uninit@}
 286   sv3: @{type: 'void *', unknown@}
 287   sv4: @{type: 'void *', &r6@}
 288 constraint manager:
 289   equiv classes:
 290     ec0: @{sv0 == '1024'@}
 291     ec1: @{sv4@}
 292   constraints:
 293 @end smallexample
 294
 295 This is the state at the point of returning from @code{calls_malloc} back
 296 to @code{test} in the following:
 297
 298 @smallexample
 299 void *
 300 calls_malloc (void)
 301 @{
 302   void *result = malloc (1024);
 303   return result;
 304 @}
 305
 306 void test (void)
 307 @{
 308   void *ptr = calls_malloc ();
 309   /* etc.  */
 310 @}
 311 @end smallexample
 312
 313 The ``root'' region (``r0'') has a ``stack'' child (``r1''), with two
 314 children: a frame for @code{test} (``r2''), and a frame for
 315 @code{calls_malloc} (``r4'').  These frame regions have child regions for
 316 storing their local variables.  For example, the return region
 317 and that of various other regions within the ``calls_malloc'' frame all have
 318 value ``sv4'', a pointer to a heap-allocated region ``r6''.  Within the parent
 319 frame, @code{ptr_3} has value ``sv3'', an unknown @code{void *}.
 320
 321 @subsection Analyzer Paths
 322
 323 We need to explain to the user what the problem is, and to persuade them
 324 that there really is a problem.  Hence having a @code{diagnostic_path}
 325 isn't just an incidental detail of the analyzer; it's required.
 326
 327 Paths ought to be:
 328 @itemize @bullet
 329 @item
 330 interprocedurally-valid
 331 @item
 332 feasible
 333 @end itemize
 334
 335 Without state-merging, all paths in the exploded graph are feasible
 336 (in terms of constraints being satisified).
 337 With state-merging, paths in the exploded graph can be infeasible.
 338
 339 We collate warnings and only emit them for the simplest path
 340 e.g. for a bug in a utility function, with lots of routes to calling it,
 341 we only emit the simplest path (which could be intraprocedural, if
 342 it can be reproduced without a caller).  We apply a check that
 343 each duplicate warning's shortest path is feasible, rejecting any
 344 warnings for which the shortest path is infeasible (which could lead to
 345 false negatives).
 346
 347 We use the shortest feasible @code{exploded_path} through the
 348 @code{exploded_graph} (a list of @code{exploded_edge *}) to build a
 349 @code{diagnostic_path} (a list of events for the diagnostic subsystem) -
 350 specifically a @code{checker_path}.
 351
 352 Having built the @code{checker_path}, we prune it to try to eliminate
 353 events that aren't relevant, to minimize how much the user has to read.
 354
 355 After pruning, we notify each event in the path of its ID and record the
 356 IDs of interesting events, allowing for events to refer to other events
 357 in their descriptions.  The @code{pending_diagnostic} class has various
 358 vfuncs to support emitting more precise descriptions, so that e.g.
 359
 360 @itemize @bullet
 361 @item
 362 a deref-of-unchecked-malloc diagnostic might use:
 363 @smallexample
 364   returning possibly-NULL pointer to 'make_obj' from 'allocator'
 365 @end smallexample
 366 for a @code{return_event} to make it clearer how the unchecked value moves
 367 from callee back to caller
 368 @item
 369 a double-free diagnostic might use:
 370 @smallexample
 371   second 'free' here; first 'free' was at (3)
 372 @end smallexample
 373 and a use-after-free might use
 374 @smallexample
 375   use after 'free' here; memory was freed at (2)
 376 @end smallexample
 377 @end itemize
 378
 379 At this point we can emit the diagnostic.
 380
 381 @subsection Limitations
 382
 383 @itemize @bullet
 384 @item
 385 Only for C so far
 386 @item
 387 The implementation of call summaries is currently very simplistic.
 388 @item
 389 Lack of function pointer analysis
 390 @item
 391 The constraint-handling code assumes reflexivity in some places
 392 (that values are equal to themselves), which is not the case for NaN.
 393 @item
 394 The region model code creates lots of little mutable objects at each
 395 @code{region_model} (and thus per @code{exploded_node}) rather than
 396 sharing immutable objects and having the mutable state in the
 397 @code{program_state} or @code{region_model}.  The latter approach might be
 398 more efficient, and might avoid dealing with IDs rather than pointers
 399 (which requires us to impose an ordering to get meaningful equality).
 400 @item
 401 The region model code doesn't yet support @code{memcpy}.  At the
 402 gimple-ssa level these have been optimized to statements like this:
 403 @smallexample
 404 _10 = MEM <long unsigned int> [(char * @{ref-all@})&c]
 405 MEM <long unsigned int> [(char * @{ref-all@})&d] = _10;
 406 @end smallexample
 407 Perhaps they could be supported via a new @code{compound_svalue} type.
 408 @item
 409 There are various other limitations in the region model (grep for TODO/xfail
 410 in the testsuite).
 411 @item
 412 The constraint_manager's implementation of transitivity is currently too
 413 expensive to enable by default and so must be manually enabled via
 414 @option{-fanalyzer-transitivity}).
 415 @item
 416 The checkers are currently hardcoded and don't allow for user extensibility
 417 (e.g. adding allocate/release pairs).
 418 @item
 419 Although the analyzer's test suite has a proof-of-concept test case for
 420 LTO, LTO support hasn't had extensive testing.  There are various
 421 lang-specific things in the analyzer that assume C rather than LTO.
 422 For example, SSA names are printed to the user in ``raw'' form, rather
 423 than printing the underlying variable name.
 424 @end itemize
 425
 426 Some ideas for other checkers
 427 @itemize @bullet
 428 @item
 429 File-descriptor-based APIs
 430 @item
 431 Linux kernel internal APIs
 432 @item
 433 Signal handling
 434 @end itemize
 435
 436 @node Debugging the Analyzer
 437 @section Debugging the Analyzer
 438 @cindex analyzer, debugging
 439 @cindex static analyzer, debugging
 440
 441 @subsection Special Functions for Debugging the Analyzer
 442
 443 The analyzer recognizes various special functions by name, for use
 444 in debugging the analyzer.  Declarations can be seen in the testsuite
 445 in @file{analyzer-decls.h}.  None of these functions are actually
 446 implemented.
 447
 448 Add:
 449 @smallexample
 450   __analyzer_break ();
 451 @end smallexample
 452 to the source being analyzed to trigger a breakpoint in the analyzer when
 453 that source is reached.  By putting a series of these in the source, it's
 454 much easier to effectively step through the program state as it's analyzed.
 455
 456 @smallexample
 457 __analyzer_dump ();
 458 @end smallexample
 459
 460 will dump the copious information about the analyzer's state each time it
 461 reaches the call in its traversal of the source.
 462
 463 @smallexample
 464 __analyzer_dump_path ();
 465 @end smallexample
 466
 467 will emit a placeholder ``note'' diagnostic with a path to that call site,
 468 if the analyzer finds a feasible path to it.
 469
 470 The builtin @code{__analyzer_dump_exploded_nodes} will dump information
 471 after analysis on all of the exploded nodes at that program point:
 472
 473 @smallexample
 474   __analyzer_dump_exploded_nodes (0);
 475 @end smallexample
 476
 477 will dump just the number of nodes, and their IDs.
 478
 479 @smallexample
 480   __analyzer_dump_exploded_nodes (1);
 481 @end smallexample
 482
 483 will also dump all of the states within those nodes.
 484
 485 @smallexample
 486    __analyzer_dump_region_model ();
 487 @end smallexample
 488 will dump the region_model's state to stderr.
 489
 490 @smallexample
 491 __analyzer_eval (expr);
 492 @end smallexample
 493 will emit a warning with text "TRUE", FALSE" or "UNKNOWN" based on the
 494 truthfulness of the argument.  This is useful for writing DejaGnu tests.
 495
 496
 497 @subsection Other Debugging Techniques
 498
 499 One approach when tracking down where a particular bogus state is
 500 introduced into the @code{exploded_graph} is to add custom code to
 501 @code{region_model::validate}.
 502
 503 For example, this custom code (added to @code{region_model::validate})
 504 breaks with an assertion failure when a variable called @code{ptr}
 505 acquires a value that's unknown, using
 506 @code{region_model::get_value_by_name} to locate the variable
 507
 508 @smallexample
 509     /* Find a variable matching "ptr".  */
 510     svalue_id sid = get_value_by_name ("ptr");
 511     if (!sid.null_p ())
 512       @{
 513         svalue *sval = get_svalue (sid);
 514         gcc_assert (sval->get_kind () != SK_UNKNOWN);
 515       @}
 516 @end smallexample
 517
 518 making it easier to investigate further in a debugger when this occurs.