2003-12-26 Guilhem Lavaux <guilhem@kaffe.org>
[official-gcc.git] / boehm-gc / doc / gcdescr.html
blob8ecbac8db62e6205e0cd3f9d2375ed2c5efd16b4
1 <HTML>
2 <HEAD>
3 <TITLE> Conservative GC Algorithmic Overview </TITLE>
4 <AUTHOR> Hans-J. Boehm, HP Labs (Much of this was written at SGI)</author>
5 </HEAD>
6 <BODY>
7 <H1> <I>This is under construction</i> </h1>
8 <H1> Conservative GC Algorithmic Overview </h1>
9 <P>
10 This is a description of the algorithms and data structures used in our
11 conservative garbage collector. I expect the level of detail to increase
12 with time. For a survey of GC algorithms, see for example
13 <A HREF="ftp://ftp.cs.utexas.edu/pub/garbage/gcsurvey.ps"> Paul Wilson's
14 excellent paper</a>. For an overview of the collector interface,
15 see <A HREF="gcinterface.html">here</a>.
16 <P>
17 This description is targeted primarily at someone trying to understand the
18 source code. It specifically refers to variable and function names.
19 It may also be useful for understanding the algorithms at a higher level.
20 <P>
21 The description here assumes that the collector is used in default mode.
22 In particular, we assume that it used as a garbage collector, and not just
23 a leak detector. We initially assume that it is used in stop-the-world,
24 non-incremental mode, though the presence of the incremental collector
25 will be apparent in the design.
26 We assume the default finalization model, but the code affected by that
27 is very localized.
28 <H2> Introduction </h2>
29 The garbage collector uses a modified mark-sweep algorithm. Conceptually
30 it operates roughly in four phases:
32 <OL>
34 <LI>
35 <I>Preparation</i> Clear all mark bits, indicating that all objects
36 are potentially unreachable.
38 <LI>
39 <I>Mark phase</i> Marks all objects that can be reachable via chains of
40 pointers from variables. Normally the collector has no real information
41 about the location of pointer variables in the heap, so it
42 views all static data areas, stacks and registers as potentially containing
43 containing pointers. Any bit patterns that represent addresses inside
44 heap objects managed by the collector are viewed as pointers.
45 Unless the client program has made heap object layout information
46 available to the collector, any heap objects found to be reachable from
47 variables are again scanned similarly.
49 <LI>
50 <I>Sweep phase</i> Scans the heap for inaccessible, and hence unmarked,
51 objects, and returns them to an appropriate free list for reuse. This is
52 not really a separate phase; even in non incremental mode this is operation
53 is usually performed on demand during an allocation that discovers an empty
54 free list. Thus the sweep phase is very unlikely to touch a page that
55 would not have been touched shortly thereafter anyway.
57 <LI>
58 <I>Finalization phase</i> Unreachable objects which had been registered
59 for finalization are enqueued for finalization outside the collector.
61 </ol>
63 <P>
64 The remaining sections describe the memory allocation data structures,
65 and then the last 3 collection phases in more detail. We conclude by
66 outlining some of the additional features implemented in the collector.
68 <H2>Allocation</h2>
69 The collector includes its own memory allocator. The allocator obtains
70 memory from the system in a platform-dependent way. Under UNIX, it
71 uses either <TT>malloc</tt>, <TT>sbrk</tt>, or <TT>mmap</tt>.
72 <P>
73 Most static data used by the allocator, as well as that needed by the
74 rest of the garbage collector is stored inside the
75 <TT>_GC_arrays</tt> structure.
76 This allows the garbage collector to easily ignore the collectors own
77 data structures when it searches for root pointers. Other allocator
78 and collector internal data structures are allocated dynamically
79 with <TT>GC_scratch_alloc</tt>. <TT>GC_scratch_alloc</tt> does not
80 allow for deallocation, and is therefore used only for permanent data
81 structures.
82 <P>
83 The allocator allocates objects of different <I>kinds</i>.
84 Different kinds are handled somewhat differently by certain parts
85 of the garbage collector. Certain kinds are scanned for pointers,
86 others are not. Some may have per-object type descriptors that
87 determine pointer locations. Or a specific kind may correspond
88 to one specific object layout. Two built-in kinds are uncollectable.
89 One (<TT>STUBBORN</tt>) is immutable without special precautions.
90 In spite of that, it is very likely that most applications currently
91 use at most two kinds: <TT>NORMAL</tt> and <TT>PTRFREE</tt> objects.
92 <P>
93 The collector uses a two level allocator. A large block is defined to
94 be one larger than half of <TT>HBLKSIZE</tt>, which is a power of 2,
95 typically on the order of the page size.
96 <P>
97 Large block sizes are rounded up to
98 the next multiple of <TT>HBLKSIZE</tt> and then allocated by
99 <TT>GC_allochblk</tt>. Recent versions of the collector
100 use an approximate best fit algorithm by keeping free lists for
101 several large block sizes.
102 The actual
103 implementation of <TT>GC_allochblk</tt>
104 is significantly complicated by black-listing issues
105 (see below).
107 Small blocks are allocated in chunks of size <TT>HBLKSIZE</tt>.
108 Each chunk is
109 dedicated to only one object size and kind. The allocator maintains
110 separate free lists for each size and kind of object.
112 Once a large block is split for use in smaller objects, it can only
113 be used for objects of that size, unless the collector discovers a completely
114 empty chunk. Completely empty chunks are restored to the appropriate
115 large block free list.
117 In order to avoid allocating blocks for too many distinct object sizes,
118 the collector normally does not directly allocate objects of every possible
119 request size. Instead request are rounded up to one of a smaller number
120 of allocated sizes, for which free lists are maintained. The exact
121 allocated sizes are computed on demand, but subject to the constraint
122 that they increase roughly in geometric progression. Thus objects
123 requested early in the execution are likely to be allocated with exactly
124 the requested size, subject to alignment constraints.
125 See <TT>GC_init_size_map</tt> for details.
127 The actual size rounding operation during small object allocation is
128 implemented as a table lookup in <TT>GC_size_map</tt>.
130 Both collector initialization and computation of allocated sizes are
131 handled carefully so that they do not slow down the small object fast
132 allocation path. An attempt to allocate before the collector is initialized,
133 or before the appropriate <TT>GC_size_map</tt> entry is computed,
134 will take the same path as an allocation attempt with an empty free list.
135 This results in a call to the slow path code (<TT>GC_generic_malloc_inner</tt>)
136 which performs the appropriate initialization checks.
138 In non-incremental mode, we make a decision about whether to garbage collect
139 whenever an allocation would otherwise have failed with the current heap size.
140 If the total amount of allocation since the last collection is less than
141 the heap size divided by <TT>GC_free_space_divisor</tt>, we try to
142 expand the heap. Otherwise, we initiate a garbage collection. This ensures
143 that the amount of garbage collection work per allocated byte remains
144 constant.
146 The above is in fact an oversimplification of the real heap expansion
147 and GC triggering heuristic, which adjusts slightly for root size
148 and certain kinds of
149 fragmentation. In particular:
150 <UL>
151 <LI> Programs with a large root set size and
152 little live heap memory will expand the heap to amortize the cost of
153 scanning the roots.
154 <LI> Versions 5.x of the collector actually collect more frequently in
155 nonincremental mode. The large block allocator usually refuses to split
156 large heap blocks once the garbage collection threshold is
157 reached. This often has the effect of collecting well before the
158 heap fills up, thus reducing fragmentation and working set size at the
159 expense of GC time. Versions 6.x choose an intermediate strategy depending
160 on how much large object allocation has taken place in the past.
161 (If the collector is configured to unmap unused pages, versions 6.x
162 use the 5.x strategy.)
163 <LI> In calculating the amount of allocation since the last collection we
164 give partial credit for objects we expect to be explicitly deallocated.
165 Even if all objects are explicitly managed, it is often desirable to collect
166 on rare occasion, since that is our only mechanism for coalescing completely
167 empty chunks.
168 </ul>
170 It has been suggested that this should be adjusted so that we favor
171 expansion if the resulting heap still fits into physical memory.
172 In many cases, that would no doubt help. But it is tricky to do this
173 in a way that remains robust if multiple application are contending
174 for a single pool of physical memory.
176 <H2>Mark phase</h2>
178 The marker maintains an explicit stack of memory regions that are known
179 to be accessible, but that have not yet been searched for contained pointers.
180 Each stack entry contains the starting address of the block to be scanned,
181 as well as a descriptor of the block. If no layout information is
182 available for the block, then the descriptor is simply a length.
183 (For other possibilities, see <TT>gc_mark.h</tt>.)
185 At the beginning of the mark phase, all root segments are pushed on the
186 stack by <TT>GC_push_roots</tt>. If <TT>ALL_INTERIOR_PTRS</tt> is not
187 defined, then stack roots require special treatment. In this case, the
188 normal marking code ignores interior pointers, but <TT>GC_push_all_stack</tt>
189 explicitly checks for interior pointers and pushes descriptors for target
190 objects.
192 The marker is structured to allow incremental marking.
193 Each call to <TT>GC_mark_some</tt> performs a small amount of
194 work towards marking the heap.
195 It maintains
196 explicit state in the form of <TT>GC_mark_state</tt>, which
197 identifies a particular sub-phase. Some other pieces of state, most
198 notably the mark stack, identify how much work remains to be done
199 in each sub-phase. The normal progression of mark states for
200 a stop-the-world collection is:
201 <OL>
202 <LI> <TT>MS_INVALID</tt> indicating that there may be accessible unmarked
203 objects. In this case <TT>GC_objects_are_marked</tt> will simultaneously
204 be false, so the mark state is advanced to
205 <LI> <TT>MS_PUSH_UNCOLLECTABLE</tt> indicating that it suffices to push
206 uncollectable objects, roots, and then mark everything reachable from them.
207 <TT>Scan_ptr</tt> is advanced through the heap until all uncollectable
208 objects are pushed, and objects reachable from them are marked.
209 At that point, the next call to <TT>GC_mark_some</tt> calls
210 <TT>GC_push_roots</tt> to push the roots. It the advances the
211 mark state to
212 <LI> <TT>MS_ROOTS_PUSHED</tt> asserting that once the mark stack is
213 empty, all reachable objects are marked. Once in this state, we work
214 only on emptying the mark stack. Once this is completed, the state
215 changes to
216 <LI> <TT>MS_NONE</tt> indicating that reachable objects are marked.
217 </ol>
219 The core mark routine <TT>GC_mark_from</tt>, is called
220 repeatedly by several of the sub-phases when the mark stack starts to fill
221 up. It is also called repeatedly in <TT>MS_ROOTS_PUSHED</tt> state
222 to empty the mark stack.
223 The routine is designed to only perform a limited amount of marking at
224 each call, so that it can also be used by the incremental collector.
225 It is fairly carefully tuned, since it usually consumes a large majority
226 of the garbage collection time.
228 The fact that it perform a only a small amount of work per call also
229 allows it to be used as the core routine of the parallel marker. In that
230 case it is normally invoked on thread-private mark stacks instead of the
231 global mark stack. More details can be found in
232 <A HREF="scale.html">scale.html</a>
234 The marker correctly handles mark stack overflows. Whenever the mark stack
235 overflows, the mark state is reset to <TT>MS_INVALID</tt>.
236 Since there are already marked objects in the heap,
237 this eventually forces a complete
238 scan of the heap, searching for pointers, during which any unmarked objects
239 referenced by marked objects are again pushed on the mark stack. This
240 process is repeated until the mark phase completes without a stack overflow.
241 Each time the stack overflows, an attempt is made to grow the mark stack.
242 All pieces of the collector that push regions onto the mark stack have to be
243 careful to ensure forward progress, even in case of repeated mark stack
244 overflows. Every mark attempt results in additional marked objects.
246 Each mark stack entry is processed by examining all candidate pointers
247 in the range described by the entry. If the region has no associated
248 type information, then this typically requires that each 4-byte aligned
249 quantity (8-byte aligned with 64-bit pointers) be considered a candidate
250 pointer.
252 We determine whether a candidate pointer is actually the address of
253 a heap block. This is done in the following steps:
254 <NL>
255 <LI> The candidate pointer is checked against rough heap bounds.
256 These heap bounds are maintained such that all actual heap objects
257 fall between them. In order to facilitate black-listing (see below)
258 we also include address regions that the heap is likely to expand into.
259 Most non-pointers fail this initial test.
260 <LI> The candidate pointer is divided into two pieces; the most significant
261 bits identify a <TT>HBLKSIZE</tt>-sized page in the address space, and
262 the least significant bits specify an offset within that page.
263 (A hardware page may actually consist of multiple such pages.
264 HBLKSIZE is usually the page size divided by a small power of two.)
265 <LI>
266 The page address part of the candidate pointer is looked up in a
267 <A HREF="tree.html">table</a>.
268 Each table entry contains either 0, indicating that the page is not part
269 of the garbage collected heap, a small integer <I>n</i>, indicating
270 that the page is part of large object, starting at least <I>n</i> pages
271 back, or a pointer to a descriptor for the page. In the first case,
272 the candidate pointer i not a true pointer and can be safely ignored.
273 In the last two cases, we can obtain a descriptor for the page containing
274 the beginning of the object.
275 <LI>
276 The starting address of the referenced object is computed.
277 The page descriptor contains the size of the object(s)
278 in that page, the object kind, and the necessary mark bits for those
279 objects. The size information can be used to map the candidate pointer
280 to the object starting address. To accelerate this process, the page header
281 also contains a pointer to a precomputed map of page offsets to displacements
282 from the beginning of an object. The use of this map avoids a
283 potentially slow integer remainder operation in computing the object
284 start address.
285 <LI>
286 The mark bit for the target object is checked and set. If the object
287 was previously unmarked, the object is pushed on the mark stack.
288 The descriptor is read from the page descriptor. (This is computed
289 from information <TT>GC_obj_kinds</tt> when the page is first allocated.)
290 </nl>
292 At the end of the mark phase, mark bits for left-over free lists are cleared,
293 in case a free list was accidentally marked due to a stray pointer.
295 <H2>Sweep phase</h2>
297 At the end of the mark phase, all blocks in the heap are examined.
298 Unmarked large objects are immediately returned to the large object free list.
299 Each small object page is checked to see if all mark bits are clear.
300 If so, the entire page is returned to the large object free list.
301 Small object pages containing some reachable object are queued for later
302 sweeping, unless we determine that the page contains very little free
303 space, in which case it is not examined further.
305 This initial sweep pass touches only block headers, not
306 the blocks themselves. Thus it does not require significant paging, even
307 if large sections of the heap are not in physical memory.
309 Nonempty small object pages are swept when an allocation attempt
310 encounters an empty free list for that object size and kind.
311 Pages for the correct size and kind are repeatedly swept until at
312 least one empty block is found. Sweeping such a page involves
313 scanning the mark bit array in the page header, and building a free
314 list linked through the first words in the objects themselves.
315 This does involve touching the appropriate data page, but in most cases
316 it will be touched only just before it is used for allocation.
317 Hence any paging is essentially unavoidable.
319 Except in the case of pointer-free objects, we maintain the invariant
320 that any object in a small object free list is cleared (except possibly
321 for the link field). Thus it becomes the burden of the small object
322 sweep routine to clear objects. This has the advantage that we can
323 easily recover from accidentally marking a free list, though that could
324 also be handled by other means. The collector currently spends a fair
325 amount of time clearing objects, and this approach should probably be
326 revisited.
328 In most configurations, we use specialized sweep routines to handle common
329 small object sizes. Since we allocate one mark bit per word, it becomes
330 easier to examine the relevant mark bits if the object size divides
331 the word length evenly. We also suitably unroll the inner sweep loop
332 in each case. (It is conceivable that profile-based procedure cloning
333 in the compiler could make this unnecessary and counterproductive. I
334 know of no existing compiler to which this applies.)
336 The sweeping of small object pages could be avoided completely at the expense
337 of examining mark bits directly in the allocator. This would probably
338 be more expensive, since each allocation call would have to reload
339 a large amount of state (e.g. next object address to be swept, position
340 in mark bit table) before it could do its work. The current scheme
341 keeps the allocator simple and allows useful optimizations in the sweeper.
343 <H2>Finalization</h2>
344 Both <TT>GC_register_disappearing_link</tt> and
345 <TT>GC_register_finalizer</tt> add the request to a corresponding hash
346 table. The hash table is allocated out of collected memory, but
347 the reference to the finalizable object is hidden from the collector.
348 Currently finalization requests are processed non-incrementally at the
349 end of a mark cycle.
351 The collector makes an initial pass over the table of finalizable objects,
352 pushing the contents of unmarked objects onto the mark stack.
353 After pushing each object, the marker is invoked to mark all objects
354 reachable from it. The object itself is not explicitly marked.
355 This assures that objects on which a finalizer depends are neither
356 collected nor finalized.
358 If in the process of marking from an object the
359 object itself becomes marked, we have uncovered
360 a cycle involving the object. This usually results in a warning from the
361 collector. Such objects are not finalized, since it may be
362 unsafe to do so. See the more detailed
363 <A HREF="http://www.hpl.hp.com/personal/Hans_Boehm/gc/finalization.html"> discussion of finalization semantics</a>.
365 Any objects remaining unmarked at the end of this process are added to
366 a queue of objects whose finalizers can be run. Depending on collector
367 configuration, finalizers are dequeued and run either implicitly during
368 allocation calls, or explicitly in response to a user request.
369 (Note that the former is unfortunately both the default and not generally safe.
370 If finalizers perform synchronization, it may result in deadlocks.
371 Nontrivial finalizers generally need to perform synchronization, and
372 thus require a different collector configuration.)
374 The collector provides a mechanism for replacing the procedure that is
375 used to mark through objects. This is used both to provide support for
376 Java-style unordered finalization, and to ignore certain kinds of cycles,
377 <I>e.g.</i> those arising from C++ implementations of virtual inheritance.
379 <H2>Generational Collection and Dirty Bits</h2>
380 We basically use the concurrent and generational GC algorithm described in
381 <A HREF="http://www.hpl.hp.com/personal/Hans_Boehm/gc/papers/pldi91.ps.Z">"Mostly Parallel Garbage Collection"</a>,
382 by Boehm, Demers, and Shenker.
384 The most significant modification is that
385 the collector always starts running in the allocating thread.
386 There is no separate garbage collector thread. (If parallel GC is
387 enabled, helper threads may also be woken up.)
388 If an allocation attempt either requests a large object, or encounters
389 an empty small object free list, and notices that there is a collection
390 in progress, it immediately performs a small amount of marking work
391 as described above.
393 This change was made both because we wanted to easily accommodate
394 single-threaded environments, and because a separate GC thread requires
395 very careful control over the scheduler to prevent the mutator from
396 out-running the collector, and hence provoking unneeded heap growth.
398 In incremental mode, the heap is always expanded when we encounter
399 insufficient space for an allocation. Garbage collection is triggered
400 whenever we notice that more than
401 <TT>GC_heap_size</tt>/2 * <TT>GC_free_space_divisor</tt>
402 bytes of allocation have taken place.
403 After <TT>GC_full_freq</tt> minor collections a major collection
404 is started.
406 All collections initially run interrupted until a predetermined
407 amount of time (50 msecs by default) has expired. If this allows
408 the collection to complete entirely, we can avoid correcting
409 for data structure modifications during the collection. If it does
410 not complete, we return control to the mutator, and perform small
411 amounts of additional GC work during those later allocations that
412 cannot be satisfied from small object free lists. When marking completes,
413 the set of modified pages is retrieved, and we mark once again from
414 marked objects on those pages, this time with the mutator stopped.
416 We keep track of modified pages using one of several distinct mechanisms:
417 <OL>
418 <LI>
419 Through explicit mutator cooperation. Currently this requires
420 the use of <TT>GC_malloc_stubborn</tt>, and is rarely used.
421 <LI>
422 (<TT>MPROTECT_VDB</tt>) By write-protecting physical pages and
423 catching write faults. This is
424 implemented for many Unix-like systems and for win32. It is not possible
425 in a few environments.
426 <LI>
427 (<TT>PROC_VDB</tt>) By retrieving dirty bit information from /proc.
428 (Currently only Sun's
429 Solaris supports this. Though this is considerably cleaner, performance
430 may actually be better with mprotect and signals.)
431 <LI>
432 (<TT>PCR_VDB</tt>) By relying on an external dirty bit implementation, in this
433 case the one in Xerox PCR.
434 <LI>
435 (<TT>DEFAULT_VDB</tt>) By treating all pages as dirty. This is the default if
436 none of the other techniques is known to be usable, and
437 <TT>GC_malloc_stubborn</tt> is not used. Practical only for testing, or if
438 the vast majority of objects use <TT>GC_malloc_stubborn</tt>.
439 </ol>
441 <H2>Black-listing</h2>
443 The collector implements <I>black-listing</i> of pages, as described
445 <A HREF="http://www.acm.org/pubs/citations/proceedings/pldi/155090/p197-boehm/">
446 Boehm, ``Space Efficient Conservative Collection'', PLDI '93</a>, also available
447 <A HREF="papers/pldi93.ps.Z">here</a>.
449 During the mark phase, the collector tracks ``near misses'', i.e. attempts
450 to follow a ``pointer'' to just outside the garbage-collected heap, or
451 to a currently unallocated page inside the heap. Pages that have been
452 the targets of such near misses are likely to be the targets of
453 misidentified ``pointers'' in the future. To minimize the future
454 damage caused by such misidentifications they will be allocated only to
455 small pointerfree objects.
457 The collector understands two different kinds of black-listing. A
458 page may be black listed for interior pointer references
459 (<TT>GC_add_to_black_list_stack</tt>), if it was the target of a near
460 miss from a location that requires interior pointer recognition,
461 <I>e.g.</i> the stack, or the heap if <TT>GC_all_interior_pointers</tt>
462 is set. In this case, we also avoid allocating large blocks that include
463 this page.
465 If the near miss came from a source that did not require interior
466 pointer recognition, it is black-listed with
467 <TT>GC_add_to_black_list_normal</tt>.
468 A page black-listed in this way may appear inside a large object,
469 so long as it is not the first page of a large object.
471 The <TT>GC_allochblk</tt> routine respects black-listing when assigning
472 a block to a particular object kind and size. It occasionally
473 drops (i.e. allocates and forgets) blocks that are completely black-listed
474 in order to avoid excessively long large block free lists containing
475 only unusable blocks. This would otherwise become an issue
476 if there is low demand for small pointerfree objects.
478 <H2>Thread support</h2>
479 We support several different threading models. Unfortunately Pthreads,
480 the only reasonably well standardized thread model, supports too narrow
481 an interface for conservative garbage collection. There appears to be
482 no completely portable way to allow the collector to coexist with various Pthreads
483 implementations. Hence we currently support only a few of the more
484 common Pthreads implementations.
486 In particular, it is very difficult for the collector to stop all other
487 threads in the system and examine the register contents. This is currently
488 accomplished with very different mechanisms for some Pthreads
489 implementations. The Solaris implementation temporarily disables much
490 of the user-level threads implementation by stopping kernel-level threads
491 ("lwp"s). The Linux/HPUX/OSF1 and Irix implementations sends signals to
492 individual Pthreads and has them wait in the signal handler.
494 The Linux and Irix implementations use
495 only documented Pthreads calls, but rely on extensions to their semantics.
496 The Linux implementation <TT>linux_threads.c</tt> relies on only very
497 mild extensions to the pthreads semantics, and already supports a large number
498 of other Unix-like pthreads implementations. Our goal is to make this the
499 only pthread support in the collector.
501 (The Irix implementation is separate only for historical reasons and should
502 clearly be merged. The current Solaris implementation probably performs
503 better in the uniprocessor case, but does not support thread operations in the
504 collector. Hence it cannot support the parallel marker.)
506 All implementations must
507 intercept thread creation and a few other thread-specific calls to allow
508 enumeration of threads and location of thread stacks. This is current
509 accomplished with <TT># define</tt>'s in <TT>gc.h</tt>
510 (really <TT>gc_pthread_redirects.h</tt>), or optionally
511 by using ld's function call wrapping mechanism under Linux.
513 Comments are appreciated. Please send mail to
514 <A HREF="mailto:boehm@acm.org"><TT>boehm@acm.org</tt></a> or
515 <A HREF="mailto:Hans.Boehm@hp.com"><TT>Hans.Boehm@hp.com</tt></a>
517 This is a modified copy of a page written while the author was at SGI.
518 The original was <A HREF="http://reality.sgi.com/boehm/gcdescr.html">here</a>.
519 </body>
520 </html>