A new JIT compiler for the Mono Project

Miguel de Icaza (miguel@{ximian.com,gnome.org}),
Paolo Molaro (lupus@{ximian.com,debian.org})

This document describes the overall design of the Mono JIT up
to version 2.0. After Mono 2.0 the JIT engine was upgraded from
a tree-based intermediate representation to a linear
intermediate representation.

The Linear IL is documented here:

    http://www.mono-project.com/Linear_IL

* Abstract

Mini is a new compilation engine for the Mono runtime. The new
engine is designed to bring new code generation optimizations,
portability and pre-compilation.

In this document we describe the design decisions and the
architecture of the new compilation engine.

* Introduction

Mono is an Open Source implementation of the .NET Framework: it
is made up of a runtime engine that implements the ECMA Common
Language Infrastructure (CLI), a set of compilers that target
the CLI and a large collection of class libraries.

This article discusses the new code generation facilities that
have been added to the Mono runtime.

First we discuss the overall architecture of the Mono runtime,
and how code generation fits into it; then we discuss the
development and basic architecture of our first JIT compiler
for the ECMA CIL framework. The next section covers the
objectives for the work on the new JIT compiler, then we
discuss the new features available in the new JIT compiler,
and finally we give a technical description of the new code
generation engine.

* Architecture of the Mono Runtime

The Mono runtime is an implementation of the ECMA Common
Language Infrastructure (CLI), whose aim is to be a common
platform for executing code in multiple languages.

Languages that target the CLI generate images that contain code
in a high-level intermediate representation called the "Common
Intermediate Language" (CIL). This intermediate language is
rich enough to allow programs and pre-compiled libraries to be
reflected. The execution environment is object oriented, with
single inheritance and multiple interface implementations.

This runtime provides a number of services for programs that
are targeted to it: Just-in-Time compilation of CIL code into
native code, garbage collection, thread management, I/O
routines, single, double and decimal floating point,
asynchronous method invocation, application domains, a
framework for building arbitrary RPC systems (remoting) and
integration with system libraries through the Platform Invoke
functionality.

The focus of this document is on the services provided by the
Mono runtime to transform CIL bytecodes into code that is
native to the underlying architecture.

The code generation interface is a set of macros that allow a C
programmer to generate code on the fly; these macros are found
in the mono/jit/arch/ directory and are used by the JIT
compiler to generate native code.

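As an illustration of this style of interface, the sketch below
(an assumption about typical usage, not a quote from the Mono
sources; treat the include path and the mmap setup as details
of this example) uses the x86 emitter macros to generate a tiny
function that returns a constant:

    #include <string.h>
    #include <sys/mman.h>
    #include "x86-codegen.h"

    /* Illustrative only: emit "mov eax, 42; ret" into an
     * executable buffer and call it. Error handling omitted. */
    typedef int (*const_fn) (void);

    int
    main (void)
    {
        unsigned char buffer [32], *code = buffer;
        void *page;

        x86_mov_reg_imm (code, X86_EAX, 42);   /* eax <- 42 */
        x86_ret (code);                        /* return    */

        page = mmap (NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memcpy (page, buffer, code - buffer);

        return ((const_fn) page) () == 42 ? 0 : 1;
    }
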
The platform invoke code is interesting, as it generates CIL
code on the fly to marshal parameters, and this code is in turn
processed by the JIT engine.

Mono has now gone through three different JIT engines; these
are:

    * Original JIT engine: 2002. Hard to port, hard to
      implement optimizations.

    * Second JIT engine, used up until Mono 2.0: very
      portable, many new optimizations.

    * Third JIT engine: replaced the code generation layer from
      being based on a tree representation to being based on a
      linear representation.

For more information on the code generation changes, see our
web site for the details on the Linear IL:

    http://www.mono-project.com/Linear_IL

* Previous Experiences

Mono has built a JIT engine which has been used to bootstrap
Mono since January 2002. This JIT engine has reasonable
performance, and uses a tree-pattern-matching instruction
selector based on the BURS technology. This JIT compiler was
designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza.

The existing JIT compiler has three phases:

    * Re-creation of the semantic tree from the CIL
      byte codes.

    * Instruction selection, with a cost-driven
      engine.

    * Code generation and register allocation.

It is also hooked into the rest of the runtime to provide
services like marshaling, just-in-time compilation and
invocation of "internal calls".

This engine constructed a collection of trees, which we
referred to as the "forest of trees"; this forest is created by
"hydrating" the CIL instruction stream.

The first step was to identify the basic blocks in the method
and to compute the control flow graph (cfg) for it. Once this
information was computed, a stack analysis on each basic block
was performed to create a forest of trees for each one of the
basic blocks.

So, for example, the following statement:

    int a, b;
    ...
    b = a + 1;

would be represented in CIL as:

    ldloc.0
    ldc.i4.1
    add
    stloc.1

After the stack analysis, this would be turned into the
following tree:

    (STIND_I4 ADDR_L[EBX|2] (
         ADD (LDIND_I4 ADDR_L[ESI|1])
              CONST_I4[1]))

This tree contains information from the stack analysis: for
instance, notice that the operations explicitly encode the data
types they are operating on; there is no longer any ambiguity
about the types, because this information has been inferred.

At this point the JIT would pass the constructed forest of
trees to the architecture-dependent JIT compiler.

The architecture-dependent code then performed register
allocation (optionally using linear scan allocation for
variables, based on liveness analysis).

Once variables had been assigned, tree pattern matching with
dynamic programming was used to select instructions (the tree
pattern matcher is custom built for each architecture, using a
code generator: monoburg). The instruction selector used cost
functions to select the best instruction patterns.

The instruction selector is able to produce instructions that
take advantage of the x86 indexed addressing modes, for
example.

One problem, though, is that the code emitter and the register
allocator did not have any visibility outside the current tree,
which meant that some redundant instructions were generated. A
peephole optimizer with this architecture was hard to write,
given the tree-based representation that is used.

This JIT was functional, but it did not provide a good
architecture on which to base future optimizations. Also, the
line between architecture-neutral and architecture-specific
code and optimizations was hard to draw.

The JIT engine supported two code generation modes for
applications that host multiple application domains: generate
code that will be shared across application domains, or
generate code that will not be shared across application
domains.

* Second Generation JIT engine.

We wanted to support a number of features that were missing:

    * Ahead-of-time compilation.

      The idea is to allow developers to pre-compile their code
      to native code to reduce startup time and the working set
      that is used at runtime in the just-in-time compiler.

      Although in Mono this has not been a visible problem, we
      wanted to address it pro-actively.

      When an assembly (a Mono/.NET executable) is installed on
      the system, it would then be possible to pre-compile the
      code, and have the JIT compiler tune the generated code
      to the particular CPU on which the software is installed.

      This is done in the Microsoft .NET world with a tool
      called ngen.exe.

    * Have a good platform for doing code optimizations.

      The design called for a good architecture that would
      enable various levels of optimizations: some
      optimizations are better performed on high-level
      intermediate representations, some on medium-level and
      some at low-level representations.

      It should also be possible to turn these optimizations on
      or off conditionally (a sketch of how such a switch can
      be expressed follows this list). Some optimizations are
      too expensive to be used in just-in-time compilation
      scenarios, but these expensive optimizations can be
      turned on for ahead-of-time compilations or when using
      profile-guided optimizations on a subset of the executed
      methods.

    * Reduce the effort required to port the Mono code
      generator to new architectures.

      For Mono to gain wide adoption in the Unix world, it is
      necessary that the JIT engine works on most of today's
      commercial hardware platforms.

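In the engine this kind of switch is naturally expressed as a
bitmask of optimization flags. The fragment below is a
simplified sketch of the idea; the MONO_OPT_* names follow the
style used by mini, but the values and the pass functions are
placeholders invented for this example:

    #include <stdio.h>

    /* Sketch only: flag values and pass functions are
     * placeholders, not the real mini definitions. */
    typedef enum {
        MONO_OPT_INLINE = 1 << 0,
        MONO_OPT_SSA    = 1 << 1,
        MONO_OPT_DEADCE = 1 << 2
    } OptFlags;

    static void run_inlining_pass (void)  { printf ("inline\n"); }
    static void run_ssa_conversion (void) { printf ("ssa\n"); }
    static void run_dead_code_elim (void) { printf ("dce\n"); }

    /* Run optional passes only when the corresponding flag is
     * set; cheap passes for JIT compilation, expensive ones
     * reserved for AOT or profile-guided runs. */
    static void
    optimize_method (OptFlags opts)
    {
        if (opts & MONO_OPT_INLINE)
            run_inlining_pass ();
        if (opts & MONO_OPT_SSA) {
            run_ssa_conversion ();
            if (opts & MONO_OPT_DEADCE)
                run_dead_code_elim ();
        }
    }

    int
    main (void)
    {
        optimize_method (MONO_OPT_INLINE | MONO_OPT_SSA);
        return 0;
    }
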
* Features of the Second JIT engine.

The new JIT engine was architected by Dietmar Maurer and Paolo
Molaro, based on the new objectives.

Mono provides a number of services to applications running
with the new JIT compiler:

    * Just-in-Time compilation of CLI code into native code.

    * Ahead-of-Time compilation of CLI code, to reduce
      startup time of applications.

A number of software development features are also available:

    * Execution time profiling (--profile)

      Generates a report of the time consumed by each routine,
      the number of invocations, and the callers.

    * Memory usage profiling (--profile)

      Generates a report of the memory usage by a program
      that is run under the Mono JIT.

    * Code coverage (--coverage)

People who are interested in developing and improving the Mini
JIT compiler will also find a few useful routines:

    * Compilation times (--ncompile)

      This is used to time the execution time of the JIT
      when compiling a routine.

    * Control Flow Graph and Dominator Tree drawing.

      These are visual aids for the JIT developer: they
      render representations of the Control Flow graph, and,
      for the more advanced optimizations, they draw the
      dominator tree graph.

      This requires Dot (from the Graphviz package) and
      Ghostview.

    * Code generator regression tests.

      The engine contains support for running regression
      tests on the virtual machine, which is very helpful to
      developers interested in improving the engine.

    * Optimization benchmark framework.

      The JIT engine will generate graphs that compare
      various benchmarks embedded in an assembly, and run the
      various tests with different optimization flags.

      This requires Perl and GD::Graph.

* Flexibility

This is probably the most important component of the new code
generation engine. The internals are relatively easy to replace
and update; even large passes can be replaced and implemented
differently.

Compiling a method begins with the `mini_method_to_ir' routine,
which converts the CIL representation into a medium-level
intermediate representation.

The mini_method_to_ir routine performs a number of operations:

    * Flow analysis and control flow graph computation.

      Unlike the previous version, stack analysis and control
      flow graphs are computed in a single pass in the
      mini_method_to_ir function; this is done for performance
      reasons: although the complexity increases, the benefit
      for a JIT compiler is that there is more time available
      for performing other optimizations.

    * Basic block computation.

      mini_method_to_ir populates the MonoCompile structure
      with an array of basic blocks, each of which contains a
      forest of trees made up of MonoInst structures.

    * Inlining.

      Inlining is no longer restricted to methods containing
      one single basic block; instead it is possible to inline
      arbitrarily complex methods.

      The heuristics used to choose what to inline are likely
      going to be tuned in the future.

    * Method to opcode conversion.

      Some method call invocations, like `call Math.Sin', are
      transformed into an opcode: this transforms the call
      into a semantically rich node, which is later inlined
      into an FPU instruction (a sketch of this recognition
      step follows this list).

      Various Array method invocations are turned into
      opcodes as well (the Get, Set and Address methods).

    * Tail recursion elimination.

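The following fragment is a simplified illustration of that
recognition step, not the actual mini code; the structure
layouts and the OP_* values here are stand-ins invented for the
example:

    #include <string.h>

    /* Simplified stand-ins for the runtime structures; the real
     * MonoMethod/MonoInst types live in the Mono sources. */
    typedef struct { const char *klass_name, *method_name; } Method;

    enum { OP_NONE, OP_SIN, OP_SQRT };

    /* If the callee is recognized as an intrinsic, return an
     * opcode that replaces the whole call; otherwise OP_NONE,
     * and the caller emits a normal call instruction. */
    static int
    intrinsic_for_call (const Method *m)
    {
        if (strcmp (m->klass_name, "Math") == 0) {
            if (strcmp (m->method_name, "Sin") == 0)
                return OP_SIN;
            if (strcmp (m->method_name, "Sqrt") == 0)
                return OP_SQRT;
        }
        return OP_NONE;
    }
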
The MonoInst structure holds the actual decoded instruction,
with the semantic information from the stack analysis.
MonoInst is interesting because initially it is part of a tree
structure. Here is a sample of the same tree with the new JIT
engine:

    (stind.i4 regoffset[0xffffffd4(%ebp)]
        (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
             iconst[1]))

This is a medium-level intermediate representation (MIR).

Some complex opcodes are decomposed at this stage into a
collection of simpler opcodes. Not every complex opcode is
decomposed at this stage, as we need to preserve the semantic
information during the various optimization phases.

For example, a NEWARR opcode carries the length and the type of
the array, which could be used later to avoid type checking or
array bounds checking.

There are a number of operations supported on this
representation:

    * Branch optimizations.

    * Loop optimizations: the dominator trees are
      computed, loops are detected, and their nesting
      level computed.

    * Conversion of the method into static single assignment
      form (SSA form).

    * Dead code elimination.

    * Constant propagation.

Once the above optimizations are optionally performed, a
decomposition phase is used to turn some complex opcodes into
internal method calls. In the initial version of the JIT
engine, various operations on longs are emulated instead of
being inlined. Also, the newarr invocation is turned into a
call into the runtime.

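For instance, a 64-bit division on a 32-bit target is not
expanded inline; the decomposition turns it into a call to a
small runtime helper, conceptually like the one below (the name
and wrapper are illustrative, not the actual mini emulation
entry points):

    #include <stdint.h>

    /* Illustrative emulation helper: on a 32-bit target the JIT
     * emits a call to something like this instead of inline code. */
    int64_t
    emul_ldiv (int64_t a, int64_t b)
    {
        return a / b;   /* the C compiler supplies the 64-bit division */
    }
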
At this point, after computing variable liveness, it is
possible to use the linear scan algorithm for allocating
variables to registers. The linear scan pass uses the
information that was previously gathered by the loop nesting
and loop structure computation to favor variables in inner
loops. This process updates the basic block `nesting' field,
which is later used during liveness analysis.

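A heavily simplified sketch of the linear scan idea follows;
this is not mini's implementation, and the interval data and
the "first free register wins" policy are assumptions made for
the example:

    #include <stdio.h>

    /* Each variable is live over [start, end); a real allocator
     * would also use the loop nesting depth gathered earlier to
     * prioritize variables in inner loops (ignored here). */
    typedef struct {
        const char *name;
        int start, end;
        int reg;            /* -1 means spilled to the stack */
    } Interval;

    #define NREGS 2

    static void
    linear_scan (Interval *vars, int n)
    {
        Interval *active [NREGS] = { 0 };
        int i, r;

        /* assumes vars[] is sorted by increasing start point */
        for (i = 0; i < n; ++i) {
            vars [i].reg = -1;
            for (r = 0; r < NREGS; ++r) {
                /* expire an interval that ended before this one starts */
                if (active [r] && active [r]->end <= vars [i].start)
                    active [r] = NULL;
                if (!active [r]) {
                    active [r] = &vars [i];
                    vars [i].reg = r;
                    break;
                }
            }
            /* no free register: this variable stays on the stack */
        }
    }

    int
    main (void)
    {
        Interval v [] = { {"a", 0, 8, -1}, {"b", 2, 4, -1}, {"c", 5, 9, -1} };
        int i;

        linear_scan (v, 3);
        for (i = 0; i < 3; ++i) {
            if (v [i].reg < 0)
                printf ("%s -> stack\n", v [i].name);
            else
                printf ("%s -> reg %d\n", v [i].name, v [i].reg);
        }
        return 0;
    }
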
Stack space is then reserved for the local variables and any
temporary variables generated during the various optimization
phases.

** Instruction selection: Only used up until Mono 2.0

At this point, the BURS instruction selector is invoked to
transform the tree-based representation into a list of
instructions. This is done using a tree pattern matcher that
is generated for the architecture using the `monoburg' tool.

Monoburg takes as input a file that describes tree patterns,
which are matched against the trees that were produced by the
engine in the previous stages.

The pattern matching might have more than one match for a
particular tree. In this case, the match selected is the one
whose cost is the smallest. A cost can be attached to each
rule, and if no cost is provided, the implicit cost is one.
Smaller costs are selected over higher costs.

The cost function can be used to select particular blocks of
code for a given architecture, or, by using a prohibitively
high number, to avoid having the rule match.

The various rules that our JIT engine uses transform a tree of
MonoInsts into a list of MonoInsts:

    +----------------------------------------------------------+
    |  Tree                                          List       |
    |   of     ===>  Instruction selection  ===>      of        |
    | MonoInst                                      MonoInst    |
    +----------------------------------------------------------+

During this process various "types" of MonoInst kinds disappear
and are turned into lower-level representations. The JIT
compiler just happens to reuse the same structure (this is done
to reduce memory usage and improve memory locality).

The instruction selection rules are split into a number of
files, each one with a particular purpose:

    inssel.brg
        Contains the generic instruction selection
        patterns.

    inssel-x86.brg
        Contains x86 specific rules.

    inssel-ppc.brg
        Contains PowerPC specific rules.

    inssel-long32.brg
        burg file for 64bit instructions on 32bit
        architectures.

    inssel-long.brg
        burg file for 64bit architectures.

    inssel-float.brg
        burg file for floating point instructions.

For a given build, a set of those files is included. For
example, for the build of Mono on the x86, the following files
are used:

    inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg

** Native method generation

The native method generation has a number of steps:

    * Architecture specific register allocation.

      The information about loop nesting that was previously
      gathered is used here to hint the register allocator.

    * Generating the method prolog/epilog (a sketch of this
      step follows this list).

    * Optionally generating code to introduce tracing
      facilities.

    * Hooking into the debugger.

    * Performing any pending fixups.

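The fragment below is a minimal illustration of emitting a
prolog and epilog with the x86 emitter macros; the frame size
handling and the choice of saved registers are assumptions made
for the example, not what mini actually emits:

    #include "x86-codegen.h"

    /* Sketch: emit a standard frame setup and teardown into the
     * buffer pointed to by "code" and return the updated pointer. */
    static unsigned char *
    emit_prolog (unsigned char *code, int frame_size)
    {
        x86_push_reg (code, X86_EBP);                   /* push ebp     */
        x86_mov_reg_reg (code, X86_EBP, X86_ESP, 4);    /* mov ebp, esp */
        if (frame_size)
            x86_alu_reg_imm (code, X86_SUB, X86_ESP, frame_size);
        return code;
    }

    static unsigned char *
    emit_epilog (unsigned char *code)
    {
        x86_leave (code);          /* mov esp, ebp; pop ebp */
        x86_ret (code);
        return code;
    }
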
The actual code generation is contained in the architecture
specific portion of the compiler. The input to the code
generator is each one of the basic blocks with its list of
instructions that were produced in the instruction selection
phase.

During the instruction selection phase, virtual registers are
assigned. Just before the peephole optimization is performed,
physical registers are assigned.

A simple peephole and algebraic optimizer is run at this stage.

The peephole optimizer removes some redundant operations at
this point. This is possible because the code generation at
this point has visibility into the basic block that spans the
original trees (a sketch of typical peephole and algebraic
rewrites follows).

The algebraic optimizer performs some simple algebraic
optimizations that replace expensive operations with cheaper
operations where possible.

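The fragment below is an illustrative sketch of both ideas on a
toy instruction list; the instruction structure and opcode
names are invented for this example and do not match mini's:

    #include <stddef.h>

    /* Toy three-address instruction: dreg <- sreg OP imm */
    typedef struct Ins {
        enum { INS_MOVE, INS_MUL_IMM, INS_SHL_IMM, INS_NOP } op;
        int dreg, sreg, imm;
        struct Ins *next;
    } Ins;

    static int
    power_of_two (int x)
    {
        int shift = 0;
        if (x <= 0 || (x & (x - 1)))
            return -1;
        while ((1 << shift) != x)
            shift++;
        return shift;
    }

    static void
    optimize_bb (Ins *head)
    {
        Ins *ins;
        for (ins = head; ins; ins = ins->next) {
            /* peephole: "mov rX <- rX" does nothing, drop it */
            if (ins->op == INS_MOVE && ins->dreg == ins->sreg)
                ins->op = INS_NOP;
            /* algebraic: multiply by a power of two becomes a shift */
            if (ins->op == INS_MUL_IMM) {
                int shift = power_of_two (ins->imm);
                if (shift >= 0) {
                    ins->op = INS_SHL_IMM;
                    ins->imm = shift;
                }
            }
        }
    }
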
The rest of the code generation is fairly simple: a switch
statement is used to generate code for each of the MonoInsts.
This lives in the mono/mini/mini-ARCH.c files, in a method
called "mono_arch_output_basic_block".

We always try to allocate code in sequence, instead of just
using malloc. This way we increase spatial locality, which
gives a massive speedup on most architectures.

*** Ahead of Time compilation

Ahead-of-Time (AOT) compilation is a new feature of our new
compilation engine. The compilation engine is shared by the
Just-in-Time (JIT) compiler and the Ahead-of-Time compiler.

The difference is in the set of optimizations that are turned
on for each mode: Just-in-Time compilation should be as fast
as possible, while Ahead-of-Time compilation can take as long
as required, because it is not performed at a time-critical
point.

With AOT compilation, we can afford to turn on all of the
computationally expensive optimizations.

After the code generation phase is done, the code and any
required fixup information are saved into a file that is
readable by "as" (the native assembler available on all
systems). This assembly file is then passed to the native
assembler, which generates a loadable module.

At execution time, when an assembly is loaded from the disk,
the runtime engine will probe for the existence of a
pre-compiled image. If the pre-compiled image exists, then it
is loaded, and the method invocations are resolved to the code
contained in the loaded module.

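As an illustration of that probing step (a simplification; the
actual lookup in the runtime is more involved, and the image
naming used here is an assumption of the example), a
pre-compiled image can be located by deriving a shared-object
name from the assembly path and trying to dlopen it:

    #include <dlfcn.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch: given "Foo.dll", try to load a pre-compiled image
     * named "Foo.dll.so" next to it. Returns NULL when no AOT
     * image is available and the JIT must be used instead. */
    static void *
    probe_aot_image (const char *assembly_path)
    {
        size_t len = strlen (assembly_path);
        char *name = malloc (len + 4);
        void *module;

        if (!name)
            return NULL;
        sprintf (name, "%s.so", assembly_path);
        module = dlopen (name, RTLD_LAZY);
        free (name);
        return module;
    }
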
The code generated under the AOT scenario is slightly different
from the code generated in the JIT scenario. It generates code
that is application-domain relative and that can be shared
among multiple application domains.

This is the same code generation that is used when the runtime
is instructed to maximize code sharing in a multi-application
domain scenario.

* SSA-based optimizations

SSA form simplifies many optimizations because each variable
has exactly one definition site. This means that each variable
is only initialized once.

For example, code like this:

    a = 1
    ...
    a = 2
    call (a)

is internally turned into:

    a1 = 1
    ...
    a2 = 2
    call (a2)

In the presence of branches, like:

    if (x)
        a = 1
    else
        a = 2
    call (a)

the code is turned into:

    if (x)
        a1 = 1
    else
        a2 = 2
    a3 = phi (a1, a2)
    call (a3)

All uses of a variable are "dominated" by its definition.

This representation is useful as it simplifies the
implementation of a number of optimizations like conditional
constant propagation, array bounds check removal and dead code
elimination.

* Register allocation.

Global register allocation is performed on the medium-level
intermediate representation just before instruction selection
is performed on the method. Local register allocation is
performed later, at the basic-block level, on the lower-level
representation.

Global register allocation uses the following input:

    1) the set of register-sized variables that can be
       allocated to a register (this is an architecture
       specific setting; for x86 these registers are the
       callee-saved registers ESI, EDI and EBX).

    2) liveness information for the variables.

    3) (optionally) loop info, to favor variables that are
       used in inner loops.

During the instruction selection phase, symbolic registers are
assigned to temporary values in expressions.

Local register allocation assigns hard registers to the
symbolic registers; it is performed just before the code is
actually emitted, and works at the basic block level. A CPU
description file describes the input registers, output
registers, fixed registers and clobbered registers of each
operation.

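For instance, an integer addition entry in such a description
file looks roughly like the following line, which is an
illustrative sketch patterned on the cpu-pentium.md format; the
exact fields used for any given opcode are an assumption here:

    add: dest:i src1:i src2:i clob:1 len:2

This would tell the local allocator that the instruction
produces an integer register, consumes two integer registers,
overwrites its first source register, and needs at most 2 bytes
of native code.
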
* BURG Code Generator Generator: Only used up to Mono 2.0

monoburg was written by Dietmar Maurer. It is based on the
papers from Christopher W. Fraser, Robert R. Henry and Todd
A. Proebsting: "BURG - Fast Optimal Instruction Selection and
Tree Parsing" and "Engineering a Simple, Efficient Code
Generator Generator".

The original BURG implementation is unable to work on DAGs;
only trees are allowed. Our monoburg implementation is able to
generate a tree matcher which works on DAGs, and we use this
feature in the new JIT. This simplifies the code because we can
directly pass DAGs and don't need to convert them to trees.

* Adding IL opcodes: an exercise (from a post by Paolo Molaro)

mini.c is the file that reads the IL code stream and decides
how each single IL instruction is implemented (the
mono_method_to_ir () function), so you always have to add an
entry to the big switch inside that function: there are plenty
of examples in that file.

An IL opcode can be implemented in a number of ways, depending
on what it does and how it needs to do it.

Some opcodes are implemented using a helper function: one of
the simpler examples is the CEE_STELEM_REF implementation.

In this case the opcode implementation is written in a C
function. You will need to register the function with the JIT
before you can use it (mono_register_jit_call) and you need to
emit the call to the helper using the mono_emit_jit_icall()
function.

This is the simplest way to add a new opcode and it doesn't
require any arch-specific change (though it's limited to what
you can do in C code, and the performance may be limited by the
function call overhead).

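As a sketch of what such a helper might look like (a
hypothetical helper with a simplified signature, for
illustration only; the actual CEE_STELEM_REF code also performs
the array element type check against the runtime type
information):

    /* Hypothetical helper called from JITted code to implement
     * "store a reference into an element of an object array".
     * The object and array layouts are simplified stand-ins. */
    typedef struct { void *vtable; } ObjectStub;
    typedef struct {
        ObjectStub obj;
        unsigned length;
        ObjectStub *vector [];
    } ArrayStub;

    void
    helper_stelem_ref (ArrayStub *array, unsigned index, ObjectStub *value)
    {
        if (index >= array->length) {
            /* the real helper raises IndexOutOfRangeException here */
            return;
        }
        /* the real helper also checks that the dynamic type of
         * "value" is compatible with the array's element type */
        array->vector [index] = value;
    }
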
Other opcodes can be implemented with one or more of the
already implemented low-level instructions.

An example is the OP_STRLEN opcode, which implements
String.Length using a simple load from memory. In this case
you need to add a rule to the appropriate burg file, describing
what the arguments of the opcode are and what its 'return'
value, if any, is.

The OP_STRLEN case is:

    reg: OP_STRLEN (reg) {
        MONO_EMIT_LOAD_MEMBASE_OP (s, tree, OP_LOADI4_MEMBASE, state->reg1,
            state->left->reg1, G_STRUCT_OFFSET (MonoString, length));
    }

The above means: OP_STRLEN takes a register as an argument and
returns its value in a register. The implementation is included
in the braces.

The opcode returns a value in an integer register
(state->reg1) by performing an int32 load of the length field
of the MonoString represented by the input register
(state->left->reg1): before the burg rules are applied, the
internal representation is based on trees, so you get the
left/right pointers (state->left and state->right
respectively; the result is stored in state->reg1).

This instruction implementation doesn't require arch-specific
changes (it is using the MONO_EMIT_LOAD_MEMBASE_OP macro, which
is available on all platforms), and usually the produced code
is fast.

Next we have opcodes that must be implemented with new
low-level architecture specific instructions (either because of
performance considerations or because the functionality can't
be implemented in other ways).

You need a burg rule in this case, too. For example, consider
the OP_CHECK_THIS opcode (used to raise an exception if the
this pointer is null). The burg rule simply reads:

    stmt: OP_CHECK_THIS (reg) {
        mono_bblock_add_inst (s->cbb, tree);
    }

Note that this opcode does not return a value (hence the
"stmt") and it takes a register as input.

mono_bblock_add_inst (s->cbb, tree) just adds the instruction
(the tree variable) to the current basic block (s->cbb). In
mini this is the place where the internal representation
switches from the tree format to the low-level format (the
list of simple instructions).

In this case the actual opcode implementation is delegated to
the arch-specific code. A low-level opcode needs an entry in
the machine description (the *.md files in mini/). This entry
describes what kind of registers, if any, are used by the
instruction, as well as other details such as constraints or
other hints to the low-level engine, which are architecture
specific.

cpu-pentium.md, for example, has the following entry:

    checkthis: src1:b len:3

This means the instruction uses an integer register as a base
pointer (basically a load or store is done on it) and it takes
3 bytes of native code to implement.

Now you just need to provide the low-level implementation for
the opcode in one of the mini-$arch.c files, in the
mono_arch_output_basic_block() function. There is a big switch
here too. The x86 implementation is:

    case OP_CHECK_THIS:
        /* ensure ins->sreg1 is not NULL */
        x86_alu_membase_imm (code, X86_CMP, ins->sreg1, 0, 0);
        break;

If the $arch-codegen.h header file doesn't have the code to
emit the low-level native code, you'll need to write that as
well.

Complex opcodes with register constraints may require other
changes to the local register allocator, but usually they are
not needed.

Profile-based optimization is something that we are very
interested in supporting. There are two possible usage
scenarios:

    * Based on the profile information gathered during
      the execution of a program, hot methods can be compiled
      with the highest level of optimizations, while bootstrap
      code and cold methods can be compiled with the least set
      of optimizations and placed in a discardable list.

    * Code reordering: this profile-based optimization would
      only make sense for pre-compiled code. The profile
      information is used to re-order the assembly code on disk
      so that the code is placed on the disk in a way that
      improves locality.

      This is the same principle under which SGI's cord program
      operates.

The nature of the CIL allows the above optimizations to be easy
to implement and deploy. Since we live in and define our own
universe for these things, no interactions with system tools
are required, nor are upgrades of the underlying infrastructure
needed.

Instruction scheduling is important for certain kinds of
processors, and some of the framework to cope with this exists
today in our register allocator and the instruction selector,
but it has not been finished. The instruction scheduling would
happen at the same time as local register allocation.