use typename
[prop.git] / docs / gc.tex
blob0008f594f57fa62ea0ff994458a415dd9d934a36
1 \def\ADLib{{\sf ADLib}}
2 \def\Prop{{\sf Prop}}
3 \def\CPP{C{\tt ++}}
5 \documentstyle{article}
6 \title{Garbage Collection in the \ADLib{} Library(Draft)}
7 \author{Allen Leung \\ {\tt leun7056@cs.nyu.edu}}
9 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
10 % Readjust the margins
11 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
12 \setlength{\textheight}{8.3in}
13 \setlength{\textwidth}{6.5in}
14 \setlength{\oddsidemargin}{-.10in}
15 \setlength{\topmargin}{-.5in}
17 \begin{document}
18 \maketitle
19 \begin{abstract}
20 We describe the design and implementation of a garbage collector
21 based on the Customisable Memory Management framework(CMM)\cite{CMM} in our
22 \ADLib{} \CPP{} class library. Like the previous approach, we have
23 implemented a mostly copying conservative collector based on the work
24 of Bartlett\cite{Mostly-copying}.
25 Similar to CMM, our architecture provides a protocol to allow
26 multiple garbage collectors using different algorithms to coexist in the
27 same memory space. A few improvements are made to
28 improve the performance, the flexibility and the functionality of
29 our collector: to reduce retention due to false roots identification,
30 blacklisting\cite{Boehm} is used to identify troublesome heap
31 addresses; the architecture of the system has been generalized so that it is
32 now possible to have multiple instantiations of Bartlett-style heaps;
33 finally, object finalization and weak pointer support
34 are added. Our collector has been tested on gcc 2.5.8 under Linux
35 and SunOS 4.1.3.
36 \end{abstract}
38 \bibliographystyle{alpha}
39 \section{Introduction}
40 The \Prop{} project involves the design and implementation of
41 an environment and an extension language based on \CPP{}
42 for compiler construction and program transformation.
43 An outgrowth of this project is the \ADLib{}
44 \CPP{} class library, which contains an extensive set of support algorithms
45 and data structures. Since a typical compiler manipulates many complex tree,
46 dag and graph objects with extended lifetime, manual memory management using
47 \CPP's constructors and destructors, reference counting smart pointers, or
48 some other techniques is frequently complex and error prone.
49 Furthermore, with the advent of new algorithms for garbage collection,
50 manual memory management techniques do not necessarily provide better
51 performance or space utilization. To make it possible to make use of garbage
52 collection as a viable alternative for memory management in
53 \CPP\cite{Safe-C++-GC}, we have implemented a garbage collection class
54 hierarchy as part of the \ADLib{} library. The class library can be used
55 directly by the users who'd like to program in \CPP; it can also be
56 used as part of the runtime system of a highly level language implemented
57 in \CPP, as we have done for \Prop.
59 We have several good reasons to prefer garbage collection over manual
60 memory management. The \Prop{} language contains many declarative constructs
61 and extensions such as algebraic datatypes, pattern matching, rewriting, and
62 logical inference. When a user programs in \Prop{} using these constructs,
63 an applicative style of programming is the most natural paradigm.
65 \section{The Framework}
67 We base our design on the work on the Customisable Memory Management(CMM)
68 system\cite{CMM}. In this framework, multiple independent heaps(including
69 the usually non-collectable heap) can coexist with each other.
70 Bartlett's mostly copying garbage collector is used as the primary collector.
71 CMM extends the work of Bartlett\cite{Mostly-copying}
72 by allowing cross heap pointers and unrestricted interior pointers.
74 However, all collectable objects are required to derive from a base class
75 and reimplement a tracing method, which identifies the internal pointers of
76 an object. This burden is usually quite small and in fact can be
77 automated from the type definition\footnote{In \Prop, a special keyword
78 \verb|collectable| is used to identify garbage collected classes and types.
79 The \Prop{} translator uses this type annotation to derive the appropriate
80 tracing methods.}
82 One of the major advantages of the CMM framework is that multiple
83 coorperating collectors can coexist at the same time, which makes it
84 possible to customize the behavior of each collector with respect to
85 the lifetime behavior of the objects. In \cite{CMM}, examples are
86 given in which the lifetime of certain class of objects exhibit
87 first-in/first-out behavior. In this case a special collector can be
88 written to take full advantage of this property.
90 \subsection{Our Framework}
91 Our framework retains all the benefits and flexibilities of CMM,
92 while extending it in several minor but useful ways:
93 \begin{itemize}
94 \item Like CMM, interior pointers(i.e. pointers to the middle of an object)
95 and crossheap pointers (i.e. complex data structures linking nodes locating
96 in multiple logical heaps.) are supported. Pointers to collectable
97 objects are non-intrusive: i.e. they are normal \CPP{} pointers
98 and not encapsulated in templates.
99 \item Also like CMM, we allow multiple garbage collectors using different
100 algorithms to coexist. Unlike CMM, however, we allow multiple
101 Bartlett-style collectors to be instantiated. Most of the services
102 involving low level page management and object marking have been relegated
103 to a separate heap manager.
104 \item We provide support for finalization and weakpointers.
105 \item We have implemented blacklisting\cite{Boehm} to reduce the chance
106 of false roots identification.
107 \end{itemize}
109 \subsection{The Implementation}
111 Our implementation has been completely written from scratch since
112 we do not desire to utilize copyrighted code and, more importantly,
113 we have to make a few important architectural changes:
114 All low level memory management services, such as management of
115 the page table and the object bitmap, is now relegated to the class
116 {\sf GCHeapManager}. Collectors now act as clients as of the heap manager
117 and the Bartlett-style collector no longer has any special status.
119 \subsection{Architecture}
121 The architecture of the memory management system is partitioned into
122 a few classes, each responsible for providing distinct services:
124 \begin{itemize}
125 \item {\sf GCHeapManager} --- The heap manager. The heap manager
126 manages the heap table, performs page level
127 allocation and deallocation, and provides miscellaneous services
128 like blacklisting. It also manages the object bitmaps.
129 \item {\sf GC} --- The base class for all garbage collectors.
130 This base class describes the protocol used by all the collector
131 classes\footnote{This base class is also inherited from class
132 {\sf Mem}, so that it adheres to the \ADLib{} memory management
133 protocol.}
134 \item {\sf CGC} --- The base class for conservative collectors.
135 This class is inherited from class {\sf GC} and implements some
136 methods for locating the stack, heap, and static data areas.
137 \item {\sf BGC} --- Bartlett-style mostly copying collector.
138 This class is inherited from class {\sf CGC} and implements
139 the Bartlett mostly copying algorithm.
140 \item {\sf MarkSweepGC} --- Mark/sweep style conservative collector.
141 This class is inherited from class {\sf CGC} and implements
142 a mark/sweep collection algorithm.
143 \item {\sf WeakPointerManager} --- The weakpointer manager.
144 This class manages the weak pointer table and provides a few
145 weak pointer scavenging services for the collectors.
146 \end{itemize}
148 \section{The Programmatic Interface}
150 The programmatic interface to the garbage collector involves two
151 base classes, {\sf GC} and {\sf GCObject}. The base class {\sf GC}
152 provides an uniform interface to all types of collectors and memory
153 managers while class {\sf GCObject} provides the interface to all
154 collectable classes. Table~\ref{GC-Classes} contains a listing
155 of the classes in our hierarchy.
157 \begin{table}
158 \begin{center}
159 \begin{tabular}{|l|l|} \hline
160 Class & Purpose \\ \hline \hline
161 \sf GCObject & Collectable object base class \\
162 \sf GC & Garbage collector base class \\
163 \sf CGC & Conservative garbage collector base class \\
164 \sf BGC & Bartlett style mostly copying collector \\
165 \sf MarkSweepGC & A mark-sweep collector \\
166 \sf GCVerifier & A heap walker that verifies the integrity of a
167 structure \\
168 \hline
169 \end{tabular}
170 \end{center}
171 \caption{\label{GC-Classes} Garbage Collection Classes.}
172 \end{table}
174 Class {\sf GCObject} is extremely simple: it redefines the operators
175 \verb|new| and \verb|delete| to allocate memory from the default collector,
176 and declares a virtual method ``\verb|trace|'' to be defined by subclasses
177 (more on this later.)
179 Memory for a {\sf GCObject} is allocated and freed using the usually
180 \verb|new| and \verb|delete| operators. Of course, freeing memory explicitly
181 using \verb|delete| is optional for many subclasses of {\sf GC}.
183 \begin{verbatim}
184 class GCObject {
185 public:
186 inline void * operator new(size_t n, GC& gc = *GC::default_gc)
187 { return gc.m_alloc(n); }
188 inline void * operator new(size_t n, size_t N, GC& gc = *GC::default_gc)
189 { return gc.m_alloc(n > N ? n : N); }
190 inline void operator delete(void *);
191 virtual void trace(GC *) = 0;
193 \end{verbatim}
195 \subsection{Memory Allocation}
197 The base class {\sf GC} is slightly more complex, as it has to provide
198 a few different functionalities. The first service that class {\sf GC} must
199 provide is of course memory allocation and deallocation. As a time saving
200 device we can specify what the default collector is using the methods
201 \verb|get_default_gc| and \verb|set_default_gc|. Method \verb|GCObject::new|
202 will use this collector by default, unless placement syntax is used.
203 Method \verb|GCObject::delete|, on the other hand, can correctly infer the
204 proper collector to use by the address of the object.
206 The low level methods to allocate and deallocate memory are \verb|m_alloc|
207 and \verb|free| respectively. The programmers usually do not have to
208 use these methods directly.
210 The method to invoke a garbage collection of a specific level is
211 \verb|collect(int level)|. This forces an explicit collection.
212 Method \verb|grow_heap(size_t)| can also be used to explicitly increase
213 the heap size of a collector. Depending of the actual behavior
214 of the subclasses, these methods may have different effects.
216 \begin{verbatim}
217 class GC {
218 protected:
219 static GC * default_gc;
220 public:
221 static GC& get_default_gc() { return *default_gc; }
222 static void set_default_gc(GC& gc) { default_gc = &gc; }
223 virtual void * m_alloc (size_t) = 0;
224 virtual void free (void *);
225 virtual void collect (int level = 0) = 0;
226 virtual void grow_heap (size_t) = 0;
227 static void garbage_collect() { default_gc->collect(); }
228 virtual void set_gc_ratio(int);
229 virtual void set_initial_heap_size (size_t);
230 virtual void set_min_heap_growth (size_t);
232 \end{verbatim}
234 \subsection{The GC Protocol}
235 The collector and collectable objects must cooperate with
236 each other by abiding to a simple protocol:
237 \begin{itemize}
238 \item All objects that are to be garbage collected must be derived
239 from {\sf GCObject}. The application programmer must also supply a
240 ``{\bf tracing}'' method. The purpose of this method is to identify
241 all internal pointers to other {\sf GCObject}'s. This method is not
242 used by the application programmer directly but only used internally
243 by the garbage collectors.
244 \item The tracing method of each collectable must in turn call
245 the tracing method in the class {\sf GC} with each pointer to
246 a collectable object that the object owns:
247 \begin{verbatim}
248 class GC {
249 public:
250 virtual GCObject * trace (GCObject *) = 0;
251 inline void trace (GCObject& obj);
253 \end{verbatim}
255 Briefly, the rules are as follows:
256 \begin{enumerate}
257 \item The tracing method of a collectable {\sf Foo} has the following
258 general form:
259 \begin{verbatim}
260 void Foo::trace(GC * gc)
264 \end{verbatim}
265 \item If class {\sf Foo} has a member that is a pointer \verb|p|
266 to a collectable object of type {\sf Bar}, then add:
267 \begin{verbatim}
268 p = (Bar)gc->trace(p);
269 \end{verbatim}
270 to the body of \verb|Foo::trace|.
271 \item If class {\sf Foo} has a member object
272 \verb|x| that is a subclass of {\sf GCObject}, also add:
273 \begin{verbatim}
274 gc->trace(x);
275 \end{verbatim}
276 to the body of \verb|Foo::trace|.
277 \item If class {\sf Foo} is derived from a class \verb|Bar| that
278 is a subclass of {\sf GCObject}, add:
279 \begin{verbatim}
280 Bar::trace(gc);
281 \end{verbatim}
282 to the body of \verb|Foo::trace|.
283 \end{enumerate}
284 Notice that these methods can be arranged in any order.
285 \end{itemize}
287 This protocol can be used by both copying and non-copying collectors.
288 In addition, the class {\sf GCVerifier} also uses this protocol to
289 walk the heap in order to verify the integrity of a garbage collected
290 data structure.
292 \subsection{Messages and Statistics}
293 All garbage collectors use the following protocols for status
294 reporting and statistics gathering.
296 \begin{verbatim}
297 class GC {
298 public:
299 enum GCNotify {
300 gc_no_notify = 0
301 gc_notify_minor_collection = 1,
302 gc_notify_major_collection = 2,
303 gc_print_collection_time = 4,
304 gc_print_debugging_info = 8
306 virtual int verbosity() const;
307 virtual void set_verbosity(int);
308 virtual ostream& get_console() const;
309 virtual void set_console(ostream&);
311 \end{verbatim}
313 \begin{verbatim}
314 class GC {
315 public:
316 struct Statistics {
317 const char * algorithm;
318 const char * version;
319 size_t bytes_used;
320 size_t bytes_managed;
321 size_t bytes_free;
322 struct timeval gc_user_time;
323 struct timeval gc_system_time;
324 struct timeval total_gc_user_time;
325 struct timeval total_gc_system_time;
327 virtual Statistics statistics();
329 \end{verbatim}
331 Each collector has a current verbosity level, which can be set
332 and reset using the methods \verb|set_verbosity(int)| and
333 \verb|verbosity() const|. The verbosity level is actually a bit
334 set containing flags of the type \verb|GC::GCNotify|. A combination
335 of these options can be used.
336 \begin{itemize}
337 \item \verb|gc_no_notify| --- no messages at all.
338 \item \verb|gc_notify_minor_collection| --- all minor collections
339 will be notified by printing a message on the current console.
340 \item \verb|gc_notify_major_collection| --- similarly for all major
341 collections.
342 \item \verb|gc_print_collection_time| --- this option will let the
343 collector to print the time spent during collection (and finalization).
344 \item \verb|gc_print_debugging_info| --- this option will print additional
345 information such as stack and heap addresses during garbage collection.
346 \end{itemize}
348 The current console of a collector is defaulted to the \CPP{} stream
349 \verb|cerr|. It can be set and inspected with the methods
350 \verb|set_console(ostream&)| and \verb|ostream& get_console()| respectively.
351 For example, the user can redirect all garbage collection messages
352 to a log file \verb|"gc.log"| by executing the following during initialization:
354 \begin{verbatim}
355 ofstream gc_log("gc.log");
356 GC::get_default_gc().set_console(gc_log);
357 \end{verbatim}
359 \subsection{The Bartlett style mostly copying collector}
361 Currently a Bartlett-style mostly copying collector has been implemented.
362 The current version is non-generational but we expect that a generational
363 version will be available in the near future.
365 The Bartlett-style collector is implemented as the concrete class
366 {\sf BGC}. Aside from all the services provided by class {\sf GC},
367 {\sf BGC} also provides the following method:
369 \begin{verbatim}
370 class BGC : public CGC {
371 public:
372 virtual void set_gc_ratio(int);
373 virtual void set_initial_heap_size (size_t);
374 virtual void set_min_heap_growth (size_t);
376 \end{verbatim}
378 The gc ratio refers to the ratio of used heap space versus
379 total heap space. Class {\sf BGC} will invoke the garbage collection
380 if this is exceeded. The default gc ratio for {\sf BGC} is 75\%.
382 The initial size of the heap and the minimal number of bytes to
383 request from the system during a heap expansion can be adjusted using
384 the methods \verb|set_initial_heap_size| and \verb|set_min_heap_growth|
385 respectively. These methods are declared in the {\sf GC}
386 protocol. By default, the initial heap size for this collector
387 is 128K and each heap expansion will increase the heap by 512K.
389 If an application uses a lot garbage collected storage it is a good
390 idea to set the heap size to larger capacity during initialization.
391 Otherwise, the collector will have to perform more garbage collection
392 and heap expansions to reach a stable state.
394 \subsection{The Mark Sweep collector}
396 An alternative to the copying collector is the the mark sweep collector.
397 Like the previous collector, this collector also uses conservative scanning
398 to locate roots. Unlikely the Boehm collector, type accurate marking
399 is used through the user supplied tracing method.
401 Since the default collector is of the {\sf BGC} variety, the user must
402 create an instance of {\sf MarkSweepGC} if the mark sweep collector
403 is to be used.
405 \begin{verbatim}
406 class MarkSweepGC : public CGC {
407 public:
408 virtual void set_gc_ratio(int);
409 virtual void set_initial_heap_size (size_t);
410 virtual void set_min_heap_growth (size_t);
412 \end{verbatim}
414 The gc ratio in class {\sf MarkSweepGC} determines whether heap
415 expansion is to be performed after a collection. The default gc ratio
416 is 50\%, thus if after garbage collection the ratio of used versus total
417 heap space exceeds one half, the heap will be expanded.
419 For most purposes the two collectors are interchangeable. Since
420 all types of garbage collectors under our framework use the same protocol,
421 applications can be written without fixing a specific garbage collection
422 algorithm before hand.
424 \subsection{Finalization}
426 One common \CPP{} idiom uses constructor and destructor for
427 resource allocation: resources (including memory but may include others,
428 such as file handles, graphical widgets, etc) that are acquired
429 in a class constructor are released(finalized) in the class'es destructor.
431 There are, however, two opposing views in how finalization should be handled
432 in a garbage collector. The first assumes that garbage collection
433 simulates an infinite amount of memory and so automatic finalization is
434 inapproriate. The second takes a more pragmatic view, and assumes that
435 automatic finalization should be provided since otherwise explicit resource
436 tracking must be provided by the user for collectable datatypes, making garbage
437 collection much less useful than it can be.
439 We will refrain from participating in any religious and philosophical
440 wars on which dogma is the One True Way. Instead, both types of collectors
441 are provided.
443 By default, class {\sf BGC} does not perform finalization on garbage
444 collected objects. If object finalization is desired then the user
445 can do one of two things:
446 \begin{itemize}
447 \item Enable the finalization of the default heap by putting
448 \begin{verbatim}
449 GC::get_default_gc().set_finalization(true);
450 \end{verbatim}
451 in the initialization code, or
452 \item Instantiate a different instance of {\sf BGC} and
453 allocate objects needing finalization from this collector.
454 This may be a better method since not all objects need finalization
455 and those that do not can still be allocated from the default collector.
456 This way we only have to pay for the performance penalty (if any)
457 proportional to the usage of finalization.
458 \end{itemize}
460 For example, if a collectable class named {\sf Foo} should be properly
461 finalized we can declare it in the following manner:
463 \begin{verbatim}
464 extern BGC bgc_f; // somewhere an instance is defined
466 class Foo : public GCObject {
467 Resource r;
468 public:
469 Foo() { r.allocate(); /* allocate resource */ }
470 ~Foo() { r.release(); /* release all resource */ }
472 // Make object allocate from bgc_f instead
473 // of the default non-collectable gc.
474 void * operator new (size_t n) { return bgc_f.m_alloc(n); }
476 \end{verbatim}
478 When an instance of class \verb|Foo| is identified as garbage, its
479 destructor will be called. [Aside: {\it currently issue such as
480 finalization synchronization is not handled directly by the collector.
481 So if there is synchronization constraints during finalization it must
482 be handled by the object destructor. Future versions of \ADLib{} will
483 provide subclasses with more sophisticated finalization mechanisms.} ]
485 Notice that the default \verb|delete| operator of class {\sf BGC} will
486 call the object explicitly destructor. It is a good idea to use
487 explicit \verb|delete| if finalization is a rare need since this service
488 involves an explicit scan of the garbage memory, which may degrade performace
489 with excessive swapping: without finalization the garbage pages will not
490 have to be referenced with a copying collector.
492 It should also be noted that since the collector is conservative, there
493 is no guarantee that garbage will be identified as such:
494 there is no guarantee that all garbage resources will be released.
495 In general, however, the efficacy of the collector is quite good and
496 so non-critical or plentiful resources can be safely finalized with
497 this collector.
499 \subsection{Weak Pointers}
501 It is frequently very useful to be able to keep track of garbage
502 collectable objects through the use of ``{\bf weak pointers}.''
503 Collectors ignore the presense of weak pointers to garbage collected
504 objects; objects with only referencing weak pointers will still be
505 collected. Weak pointers to objects that become garbage will
506 be reset to null.
508 Weak pointers are provided thru a smart pointer template
509 {\sf WeakPointer}, whose definition is shown below:
511 \begin{verbatim}
512 template <class T> class WeakPointer {
513 public:
514 inline WeakPointer();
515 inline WeakPointer(const WeakPointer<T>& wp);
516 inline WeakPointer(T * ptr);
517 inline WeakPointer<T>& operator = (const WeakPointer<T>& wp);
518 inline WeakPointer<T>& operator = (T * ptr);
519 inline bool is_null() const;
520 inline operator const T * () const;
521 inline operator T * ();
522 inline const T * operator -> () const;
523 inline T * operator -> ();
524 inline const T& operator * () const;
525 inline T& operator * ();
527 \end{verbatim}
529 \subsection{The Heap Walker}
531 There is a simple class called {\sf GCVerify} that can be used
532 to verify the integrity of a complex allocated data structure.
533 This is frequently useful during the development of a new collector,
534 or a complex class derive class of {\sf GCObject.}
536 The interface of this class is shown below.
538 \begin{verbatim}
539 class GCVerifier : protected GC {
540 public:
541 virtual bool is_valid_pointer (GCObject *);
542 virtual bool is_valid_interior_pointer (GCObject *);
543 virtual bool is_valid_structure (GCObject *);
544 virtual bool is_valid_pointer (GCObject *, GC *);
545 virtual bool is_valid_interior_pointer (GCObject *, GC *);
546 virtual bool is_valid_structure (GCObject *, GC *);
547 size_t number_of_nodes() const;
549 \end{verbatim}
551 Class {\sf GCVerify} is derived from class {\sf GC} so that the
552 same object tracing protocol can be used. Briefly, three different methods
553 are provided for verification:
554 \begin{itemize}
555 \item Call to \verb|is_valid_pointer(p,gc)| verifies that
556 \verb|p| is a valid pointer to an object within the collector \verb|gc|.
557 \item Call to \verb|is_valid_pointer(p,gc)| verifies that
558 \verb|p| is a valid interior pointer to an object
559 within the collector \verb|gc|.
560 \item Call to \verb|is_valid_structure(p,gc)| verifies the entire
561 structure reachable by \verb|p| is valid. This method uses
562 \verb|GCObject::trace| to locate the correct sub-structures.
563 \end{itemize}
565 There are alternative versions of the same methods that assumes
566 the default collector is used. Furthermore
567 \begin{itemize}
568 \item Method \verb|number_of_nodes()| returns the node count of the
569 last structure traced using \verb|is_valid_structure|, and
570 \item If \verb|set_verbosity(GC::gc_print_debugging_info)| is used
571 then error messages will be printed during structure tracing.
572 \end{itemize}
574 \bibliography{gc}
575 \end{document}