boehm-gc/doc/gcdescr.html

   1 <HTML>
   2 <HEAD>
   3     <TITLE> Conservative GC Algorithmic Overview </TITLE>
   4     <AUTHOR> Hans-J. Boehm, Silicon Graphics</author>
   5 </HEAD>
   6 <BODY>
   7 <H1> <I>This is under construction</i> </h1>
   8 <H1> Conservative GC Algorithmic Overview </h1>
   9 <P>
  10 This is a description of the algorithms and data structures used in our
  11 conservative garbage collector.  I expect the level of detail to increase
  12 with time.  For a survey of GC algorithms, see for example
  13 <A HREF="ftp://ftp.cs.utexas.edu/pub/garbage/gcsurvey.ps"> Paul Wilson's
  14 excellent paper</a>.  For an overview of the collector interface,
  15 see <A HREF="gcinterface.html">here</a>.
  16 <P>
  17 This description is targeted primarily at someone trying to understand the
  18 source code.  It specifically refers to variable and function names.
  19 It may also be useful for understanding the algorithms at a higher level.
  20 <P>
  21 The description here assumes that the collector is used in default mode.
  22 In particular, we assume that it used as a garbage collector, and not just
  23 a leak detector.  We initially assume that it is used in stop-the-world,
  24 non-incremental mode, though the presence of the incremental collector
  25 will be apparent in the design.
  26 We assume the default finalization model, but the code affected by that
  27 is very localized.
  28 <H2> Introduction </h2>
  29 The garbage collector uses a modified mark-sweep algorithm.  Conceptually
  30 it operates roughly in four phases:
  31
  32 <OL>
  33
  34 <LI>
  35 <I>Preparation</i> Clear all mark bits, indicating that all objects
  36 are potentially unreachable.
  37
  38 <LI>
  39 <I>Mark phase</i> Marks all objects that can be reachable via chains of
  40 pointers from variables.  Normally the collector has no real information
  41 about the location of pointer variables in the heap, so it
  42 views all static data areas, stacks and registers as potentially containing
  43 containing pointers.  Any bit patterns that represent addresses inside
  44 heap objects managed by the collector are viewed as pointers.
  45 Unless the client program has made heap object layout information
  46 available to the collector, any heap objects found to be reachable from
  47 variables are again scanned similarly.
  48
  49 <LI>
  50 <I>Sweep phase</i> Scans the heap for inaccessible, and hence unmarked,
  51 objects, and returns them to an appropriate free list for reuse.  This is
  52 not really a separate phase; even in non incremental mode this is operation
  53 is usually performed on demand during an allocation that discovers an empty
  54 free list.  Thus the sweep phase is very unlikely to touch a page that
  55 would not have been touched shortly thereafter anyway.
  56
  57 <LI>
  58 <I>Finalization phase</i> Unreachable objects which had been registered
  59 for finalization are enqueued for finalization outside the collector.
  60
  61 </ol>
  62
  63 <P>
  64 The remaining sections describe the memory allocation data structures,
  65 and then the last 3 collection phases in more detail. We conclude by
  66 outlining some of the additional features implemented in the collector.
  67
  68 <H2>Allocation</h2>
  69 The collector includes its own memory allocator.  The allocator obtains
  70 memory from the system in a platform-dependent way.  Under UNIX, it
  71 uses either <TT>malloc</tt>, <TT>sbrk</tt>, or <TT>mmap</tt>.
  72 <P>
  73 Most static data used by the allocator, as well as that needed by the
  74 rest of the garbage collector is stored inside the
  75 <TT>_GC_arrays</tt> structure.
  76 This allows the garbage collector to easily ignore the collectors own
  77 data structures when it searches for root pointers.  Other allocator
  78 and collector internal data structures are allocated dynamically
  79 with <TT>GC_scratch_alloc</tt>. <TT>GC_scratch_alloc</tt> does not
  80 allow for deallocation, and is therefore used only for permanent data
  81 structures.
  82 <P>
  83 The allocator allocates objects of different <I>kinds</i>.
  84 Different kinds are handled somewhat differently by certain parts
  85 of the garbage collector.  Certain kinds are scanned for pointers,
  86 others are not.  Some may have per-object type descriptors that
  87 determine pointer locations.  Or a specific kind may correspond
  88 to one specific object layout.  Two built-in kinds are uncollectable.
  89 One (<TT>STUBBORN</tt>) is immutable without special precautions.
  90 In spite of that, it is very likely that most applications currently
  91 use at most two kinds: <TT>NORMAL</tt> and <TT>PTRFREE</tt> objects.
  92 <P>
  93 The collector uses a two level allocator.  A large block is defined to
  94 be one larger than half of <TT>HBLKSIZE</tt>, which is a power of 2,
  95 typically on the order of the page size.
  96 <P>
  97 Large block sizes are rounded up to
  98 the next multiple of <TT>HBLKSIZE</tt> and then allocated by
  99 <TT>GC_allochblk</tt>.  This uses roughly what Paul Wilson has termed
 100 a "next fit" algorithm, i.e. first-fit with a rotating pointer.
 101 The implementation does check for a better fitting immediately
 102 adjacent block, which gives it somewhat better fragmentation characteristics.
 103 I'm now convinced it should use a best fit algorithm.  The actual
 104 implementation of <TT>GC_allochblk</tt>
 105 is significantly complicated by black-listing issues
 106 (see below).
 107 <P>
 108 Small blocks are allocated in blocks of size <TT>HBLKSIZE</tt>.
 109 Each block is
 110 dedicated to only one object size and kind.  The allocator maintains
 111 separate free lists for each size and kind of object.
 112 <P>
 113 In order to avoid allocating blocks for too many distinct object sizes,
 114 the collector normally does not directly allocate objects of every possible
 115 request size.  Instead request are rounded up to one of a smaller number
 116 of allocated sizes, for which free lists are maintained.  The exact
 117 allocated sizes are computed on demand, but subject to the constraint
 118 that they increase roughly in geometric progression.  Thus objects
 119 requested early in the execution are likely to be allocated with exactly
 120 the requested size, subject to alignment constraints.
 121 See <TT>GC_init_size_map</tt> for details.
 122 <P>
 123 The actual size rounding operation during small object allocation is
 124 implemented as a table lookup in <TT>GC_size_map</tt>.
 125 <P>
 126 Both collector initialization and computation of allocated sizes are
 127 handled carefully so that they do not slow down the small object fast
 128 allocation path.  An attempt to allocate before the collector is initialized,
 129 or before the appropriate <TT>GC_size_map</tt> entry is computed,
 130 will take the same path as an allocation attempt with an empty free list.
 131 This results in a call to the slow path code (<TT>GC_generic_malloc_inner</tt>)
 132 which performs the appropriate initialization checks.
 133 <P>
 134 In non-incremental mode, we make a decision about whether to garbage collect
 135 whenever an allocation would otherwise have failed with the current heap size.
 136 If the total amount of allocation since the last collection is less than
 137 the heap size divided by <TT>GC_free_space_divisor</tt>, we try to
 138 expand the heap.  Otherwise, we initiate a garbage collection.  This ensures
 139 that the amount of garbage collection work per allocated byte remains
 140 constant.
 141 <P>
 142 The above is in fat an oversimplification of the real heap expansion
 143 heuristic, which adjusts slightly for root size and certain kinds of
 144 fragmentation.  In particular, programs with a large root set size and
 145 little live heap memory will expand the heap to amortize the cost of
 146 scanning the roots.
 147 <P>
 148 Versions 5.x of the collector actually collect more frequently in
 149 nonincremental mode.  The large block allocator usually refuses to split
 150 large heap blocks once the garbage collection threshold is
 151 reached.  This often has the effect of collecting well before the
 152 heap fills up, thus reducing fragmentation and working set size at the
 153 expense of GC time.  6.x will chose an intermediate strategy depending
 154 on how much large object allocation has taken place in the past.
 155 (If the collector is configured to unmap unused pages, versions 6.x
 156 will use the 5.x strategy.)
 157 <P>
 158 (It has been suggested that this should be adjusted so that we favor
 159 expansion if the resulting heap still fits into physical memory.
 160 In many cases, that would no doubt help.  But it is tricky to do this
 161 in a way that remains robust if multiple application are contending
 162 for a single pool of physical memory.)
 163
 164 <H2>Mark phase</h2>
 165
 166 The marker maintains an explicit stack of memory regions that are known
 167 to be accessible, but that have not yet been searched for contained pointers.
 168 Each stack entry contains the starting address of the block to be scanned,
 169 as well as a descriptor of the block.  If no layout information is
 170 available for the block, then the descriptor is simply a length.
 171 (For other possibilities, see <TT>gc_mark.h</tt>.)
 172 <P>
 173 At the beginning of the mark phase, all root segments are pushed on the
 174 stack by <TT>GC_push_roots</tt>.  If <TT>ALL_INTERIOR_PTRS</tt> is not
 175 defined, then stack roots require special treatment.  In this case, the
 176 normal marking code ignores interior pointers, but <TT>GC_push_all_stack</tt>
 177 explicitly checks for interior pointers and pushes descriptors for target
 178 objects.
 179 <P>
 180 The marker is structured to allow incremental marking.
 181 Each call to <TT>GC_mark_some</tt> performs a small amount of
 182 work towards marking the heap.
 183 It maintains
 184 explicit state in the form of <TT>GC_mark_state</tt>, which
 185 identifies a particular sub-phase.  Some other pieces of state, most
 186 notably the mark stack, identify how much work remains to be done
 187 in each sub-phase.  The normal progression of mark states for
 188 a stop-the-world collection is:
 189 <OL>
 190 <LI> <TT>MS_INVALID</tt> indicating that there may be accessible unmarked
 191 objects.  In this case <TT>GC_objects_are_marked</tt> will simultaneously
 192 be false, so the mark state is advanced to
 193 <LI> <TT>MS_PUSH_UNCOLLECTABLE</tt> indicating that it suffices to push
 194 uncollectable objects, roots, and then mark everything reachable from them.
 195 <TT>Scan_ptr</tt> is advanced through the heap until all uncollectable
 196 objects are pushed, and objects reachable from them are marked.
 197 At that point, the next call to <TT>GC_mark_some</tt> calls
 198 <TT>GC_push_roots</tt> to push the roots.  It the advances the
 199 mark state to
 200 <LI> <TT>MS_ROOTS_PUSHED</tt> asserting that once the mark stack is
 201 empty, all reachable objects are marked.  Once in this state, we work
 202 only on emptying the mark stack.  Once this is completed, the state
 203 changes to
 204 <LI> <TT>MS_NONE</tt> indicating that reachable objects are marked.
 205 </ol>
 206
 207 The core mark routine <TT>GC_mark_from_mark_stack</tt>, is called
 208 repeatedly by several of the sub-phases when the mark stack starts to fill
 209 up.  It is also called repeatedly in <TT>MS_ROOTS_PUSHED</tt> state
 210 to empty the mark stack.
 211 The routine is designed to only perform a limited amount of marking at
 212 each call, so that it can also be used by the incremental collector.
 213 It is fairly carefully tuned, since it usually consumes a large majority
 214 of the garbage collection time.
 215 <P>
 216 The marker correctly handles mark stack overflows.  Whenever the mark stack
 217 overflows, the mark state is reset to <TT>MS_INVALID</tt>.
 218 Since there are already marked objects in the heap,
 219 this eventually forces a complete
 220 scan of the heap, searching for pointers, during which any unmarked objects
 221 referenced by marked objects are again pushed on the mark stack.  This
 222 process is repeated until the mark phase completes without a stack overflow.
 223 Each time the stack overflows, an attempt is made to grow the mark stack.
 224 All pieces of the collector that push regions onto the mark stack have to be
 225 careful to ensure forward progress, even in case of repeated mark stack
 226 overflows.  Every mark attempt results in additional marked objects.
 227 <P>
 228 Each mark stack entry is processed by examining all candidate pointers
 229 in the range described by the entry.  If the region has no associated
 230 type information, then this typically requires that each 4-byte aligned
 231 quantity (8-byte aligned with 64-bit pointers) be considered a candidate
 232 pointer.
 233 <P>
 234 We determine whether a candidate pointer is actually the address of
 235 a heap block.  This is done in the following steps:
 236 <NL>
 237 <LI> The candidate pointer is checked against rough heap bounds.
 238 These heap bounds are maintained such that all actual heap objects
 239 fall between them.  In order to facilitate black-listing (see below)
 240 we also include address regions that the heap is likely to expand into.
 241 Most non-pointers fail this initial test.
 242 <LI> The candidate pointer is divided into two pieces; the most significant
 243 bits identify a <TT>HBLKSIZE</tt>-sized page in the address space, and
 244 the least significant bits specify an offset within that page.
 245 (A hardware page may actually consist of multiple such pages.
 246 HBLKSIZE is usually the page size divided by a small power of two.)
 247 <LI>
 248 The page address part of the candidate pointer is looked up in a
 249 <A HREF="tree.html">table</a>.
 250 Each table entry contains either 0, indicating that the page is not part
 251 of the garbage collected heap, a small integer <I>n</i>, indicating
 252 that the page is part of large object, starting at least <I>n</i> pages
 253 back, or a pointer to a descriptor for the page.  In the first case,
 254 the candidate pointer i not a true pointer and can be safely ignored.
 255 In the last two cases, we can obtain a descriptor for the page containing
 256 the beginning of the object.
 257 <LI>
 258 The starting address of the referenced object is computed.
 259 The page descriptor contains the size of the object(s)
 260 in that page, the object kind, and the necessary mark bits for those
 261 objects.  The size information can be used to map the candidate pointer
 262 to the object starting address.  To accelerate this process, the page header
 263 also contains a pointer to a precomputed map of page offsets to displacements
 264 from the beginning of an object.  The use of this map avoids a
 265 potentially slow integer remainder operation in computing the object
 266 start address.
 267 <LI>
 268 The mark bit for the target object is checked and set.  If the object
 269 was previously unmarked, the object is pushed on the mark stack.
 270 The descriptor is read from the page descriptor.  (This is computed
 271 from information <TT>GC_obj_kinds</tt> when the page is first allocated.)
 272 </nl>
 273 <P>
 274 At the end of the mark phase, mark bits for left-over free lists are cleared,
 275 in case a free list was accidentally marked due to a stray pointer.
 276
 277 <H2>Sweep phase</h2>
 278
 279 At the end of the mark phase, all blocks in the heap are examined.
 280 Unmarked large objects are immediately returned to the large object free list.
 281 Each small object page is checked to see if all mark bits are clear.
 282 If so, the entire page is returned to the large object free list.
 283 Small object pages containing some reachable object are queued for later
 284 sweeping.
 285 <P>
 286 This initial sweep pass touches only block headers, not
 287 the blocks themselves.  Thus it does not require significant paging, even
 288 if large sections of the heap are not in physical memory.
 289 <P>
 290 Nonempty small object pages are swept when an allocation attempt
 291 encounters an empty free list for that object size and kind.
 292 Pages for the correct size and kind are repeatedly swept until at
 293 least one empty block is found.  Sweeping such a page involves
 294 scanning the mark bit array in the page header, and building a free
 295 list linked through the first words in the objects themselves.
 296 This does involve touching the appropriate data page, but in most cases
 297 it will be touched only just before it is used for allocation.
 298 Hence any paging is essentially unavoidable.
 299 <P>
 300 Except in the case of pointer-free objects, we maintain the invariant
 301 that any object in a small object free list is cleared (except possibly
 302 for the link field).  Thus it becomes the burden of the small object
 303 sweep routine to clear objects.  This has the advantage that we can
 304 easily recover from accidentally marking a free list, though that could
 305 also be handled by other means.  The collector currently spends a fair
 306 amount of time clearing objects, and this approach should probably be
 307 revisited.
 308 <P>
 309 In most configurations, we use specialized sweep routines to handle common
 310 small object sizes.  Since we allocate one mark bit per word, it becomes
 311 easier to examine the relevant mark bits if the object size divides
 312 the word length evenly.  We also suitably unroll the inner sweep loop
 313 in each case.  (It is conceivable that profile-based procedure cloning
 314 in the compiler could make this unnecessary and counterproductive.  I
 315 know of no existing compiler to which this applies.)
 316 <P>
 317 The sweeping of small object pages could be avoided completely at the expense
 318 of examining mark bits directly in the allocator.  This would probably
 319 be more expensive, since each allocation call would have to reload
 320 a large amount of state (e.g. next object address to be swept, position
 321 in mark bit table) before it could do its work.  The current scheme
 322 keeps the allocator simple and allows useful optimizations in the sweeper.
 323
 324 <H2>Finalization</h2>
 325 Both <TT>GC_register_disappearing_link</tt> and
 326 <TT>GC_register_finalizer</tt> add the request to a corresponding hash
 327 table.  The hash table is allocated out of collected memory, but
 328 the reference to the finalizable object is hidden from the collector.
 329 Currently finalization requests are processed non-incrementally at the
 330 end of a mark cycle.
 331 <P>
 332 The collector makes an initial pass over the table of finalizable objects,
 333 pushing the contents of unmarked objects onto the mark stack.
 334 After pushing each object, the marker is invoked to mark all objects
 335 reachable from it.  The object itself is not explicitly marked.
 336 This assures that objects on which a finalizer depends are neither
 337 collected nor finalized.
 338 <P>
 339 If in the process of marking from an object the
 340 object itself becomes marked, we have uncovered
 341 a cycle involving the object.  This usually results in a warning from the
 342 collector.  Such objects are not finalized, since it may be
 343 unsafe to do so.  See the more detailed
 344 <A HREF="finalization.html"> discussion of finalization semantics</a>.
 345 <P>
 346 Any objects remaining unmarked at the end of this process are added to
 347 a queue of objects whose finalizers can be run.  Depending on collector
 348 configuration, finalizers are dequeued and run either implicitly during
 349 allocation calls, or explicitly in response to a user request.
 350 <P>
 351 The collector provides a mechanism for replacing the procedure that is
 352 used to mark through objects.  This is used both to provide support for
 353 Java-style unordered finalization, and to ignore certain kinds of cycles,
 354 <I>e.g.</i> those arising from C++ implementations of virtual inheritance.
 355
 356 <H2>Generational Collection and Dirty Bits</h2>
 357 We basically use the parallel and generational GC algorithm described in
 358 <A HREF="papers/pldi91.ps.gz">"Mostly Parallel Garbage Collection"</a>,
 359 by Boehm, Demers, and Shenker.
 360 <P>
 361 The most significant modification is that
 362 the collector always runs in the allocating thread.
 363 There is no separate garbage collector thread.
 364 If an allocation attempt either requests a large object, or encounters
 365 an empty small object free list, and notices that there is a collection
 366 in progress, it immediately performs a small amount of marking work
 367 as described above.
 368 <P>
 369 This change was made both because we wanted to easily accommodate
 370 single-threaded environments, and because a separate GC thread requires
 371 very careful control over the scheduler to prevent the mutator from
 372 out-running the collector, and hence provoking unneeded heap growth.
 373 <P>
 374 In incremental mode, the heap is always expanded when we encounter
 375 insufficient space for an allocation.  Garbage collection is triggered
 376 whenever we notice that more than
 377 <TT>GC_heap_size</tt>/2 * <TT>GC_free_space_divisor</tt>
 378 bytes of allocation have taken place.
 379 After <TT>GC_full_freq</tt> minor collections a major collection
 380 is started.
 381 <P>
 382 All collections initially run interrupted until a predetermined
 383 amount of time (50 msecs by default) has expired.  If this allows
 384 the collection to complete entirely, we can avoid correcting
 385 for data structure modifications during the collection.  If it does
 386 not complete, we return control to the mutator, and perform small
 387 amounts of additional GC work during those later allocations that
 388 cannot be satisfied from small object free lists. When marking completes,
 389 the set of modified pages is retrieved, and we mark once again from
 390 marked objects on those pages, this time with the mutator stopped.
 391 <P>
 392 We keep track of modified pages using one of three distinct mechanisms:
 393 <OL>
 394 <LI>
 395 Through explicit mutator cooperation.  Currently this requires
 396 the use of <TT>GC_malloc_stubborn</tt>.
 397 <LI>
 398 By write-protecting physical pages and catching write faults.  This is
 399 implemented for many Unix-like systems and for win32.  It is not possible
 400 in a few environments.
 401 <LI>
 402 By retrieving dirty bit information from /proc.  (Currently only Sun's
 403 Solaris supports this.  Though this is considerably cleaner, performance
 404 may actually be better with mprotect and signals.)
 405 </ol>
 406
 407 <H2>Thread support</h2>
 408 We support several different threading models.  Unfortunately Pthreads,
 409 the only reasonably well standardized thread model, supports too narrow
 410 an interface for conservative garbage collection.  There appears to be
 411 no portable way to allow the collector to coexist with various Pthreads
 412 implementations.  Hence we currently support only a few of the more
 413 common Pthreads implementations.
 414 <P>
 415 In particular, it is very difficult for the collector to stop all other
 416 threads in the system and examine the register contents.  This is currently
 417 accomplished with very different mechanisms for different Pthreads
 418 implementations.  The Solaris implementation temporarily disables much
 419 of the user-level threads implementation by stopping kernel-level threads
 420 ("lwp"s).  The Irix implementation sends signals to individual Pthreads
 421 and has them wait in the signal handler.  The Linux implementation
 422 is similar in spirit to the Irix one.
 423 <P>
 424 The Irix implementation uses
 425 only documented Pthreads calls, but relies on extensions to their semantics,
 426 notably the use of mutexes and condition variables from signal
 427 handlers.  The Linux implementation should be far closer to
 428 portable, though impirically it is not completely portable.
 429 <P>
 430 All implementations must
 431 intercept thread creation and a few other thread-specific calls to allow
 432 enumeration of threads and location of thread stacks.  This is current
 433 accomplished with <TT># define</tt>'s in <TT>gc.h</tt>, or optionally
 434 by using ld's function call wrapping mechanism under Linux.
 435 <P>
 436 Comments are appreciated.  Please send mail to
 437 <A HREF="mailto:boehm@acm.org"><TT>boehm@acm.org</tt></a>
 438 </body>