1 Mono JIT porting guide.
2 Paolo Molaro (lupus@ximian.com)
6 This document describes the process of porting the mono JIT
7 to a new CPU architecture. The new mono JIT has been designed
8 to make porting easier while still enabling the port
9 to take full advantage of the new architecture's features and
10 instructions. Knowledge of the mini architecture (described in
11 the mini-doc.txt file) is a requirement for understanding this
12 guide, as well as an earlier document about porting the mono
13 interpreter (available on the web site).
15 There are six main areas that a port needs to implement to
16 have a fully-functional JIT for a given architecture:
18 1) instruction selection
19 2) native code emission
20 3) call conventions and register allocation
21 4) method trampolines
22 5) exception handling
23 6) minor helper methods
25 To take advantage of some not-so-common processor features
26 (for example conditional execution of instructions as may be
27 found on ARM or ia64), it may be necessary to develop a
28 high-level optimization, but doing so is not a requirement for
29 getting the JIT to work.
31 We'll look at each of the required steps in more detail. Note,
32 though, that a new port may just as well start from a
33 cut&paste of an existing port to a similar architecture (for
34 example from x86 to amd64, or from powerpc to sparc).
36 The architecture-specific code is split from the rest of the
37 JIT; for example, the x86-specific code and data are all
38 included in the following files in the distribution:
46 I suggest a similar split for other architectures as well.
48 Note that this document is still incomplete: some sections are
49 only sketched and some are missing, but the important info to
50 get a port going is already described.
53 * Architecture-specific instructions and instruction selection.
55 The JIT already provides a set of instructions that can be
56 easily mapped to a great variety of different processor
57 instructions. Sometimes it may be necessary or advisable to
58 add a new instruction that more closely represents an
59 instruction in the architecture. A mini instruction
60 can also be used to represent a short sequence of low-level
61 CPU instructions, but each mini instruction
62 represents the minimum amount of code the instruction
63 scheduler will handle (i.e., the scheduler won't schedule the
64 instructions that compose the low-level sequence as individual
65 instructions, but just the whole sequence, as an indivisible
68 New instructions are created by adding a line in the
69 mini-ops.h file, assigning an opcode and a name. To specify
70 the input and output for the instruction, there are two
71 different places, depending on the context in which the
72 instruction gets used.
74 If the instruction is used in the tree representation, the
75 input and output types are defined by the BURG rules in the
76 *.brg files (the usual non-terminals are 'reg' to represent a
77 normal register, 'lreg' to represent a register or two that
78 hold a 64 bit value, and 'freg' for a floating point register).
80 If an instruction is used as a low-level CPU instruction, the
81 info is specified in a machine description file. The
82 description file is processed by the genmdesc program to
83 provide a data structure that can be easily used from C code
84 to query the needed info about the instruction.
86 As an example, let's consider the add instruction for both x86
90 add: dest:i src1:i src2:i len:2 clob:1
92 add: dest:i src1:i src2:i len:4
94 Note that the instruction takes two input integer registers on
95 both CPUs, but on x86 the first source register is clobbered
96 (clob:1) and the length in bytes of the instruction differs.
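
To make the role of these fields more concrete, here is a
minimal sketch of how the two entries above might look once
turned into a C table. The SketchInsDesc layout and the
sketch_* names are assumptions made for illustration; the data
structure that genmdesc actually generates is different. The
second table is named "other" because it stands for the second
entry shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor layout; not the real genmdesc output. */
typedef struct {
    const char *name;    /* opcode name as written in the .md file */
    char dest;           /* register class of the result ('i' = integer) */
    char src1;           /* register class of the first input */
    char src2;           /* register class of the second input */
    unsigned char len;   /* len: field, length in bytes */
    char clob;           /* clob: field, or 0 when the entry has none */
} SketchInsDesc;

/* add: dest:i src1:i src2:i len:2 clob:1 */
static const SketchInsDesc sketch_x86_desc[] = {
    { "add", 'i', 'i', 'i', 2, '1' },
};

/* add: dest:i src1:i src2:i len:4 */
static const SketchInsDesc sketch_other_desc[] = {
    { "add", 'i', 'i', 'i', 4, 0 },
};

/* Linear search by name; returns NULL when the opcode is not described. */
static const SketchInsDesc *
sketch_desc_lookup (const SketchInsDesc *table, size_t count, const char *name)
{
    size_t i;

    assert (table != NULL && name != NULL);
    for (i = 0; i < count; i++) {
        if (strcmp (table[i].name, name) == 0)
            return &table[i];
    }
    return NULL;
}
```

Arch-specific code can then ask, for example, whether the first
source register of an instruction gets clobbered before it
decides which register to assign to the result.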
98 Note that integer adds and floating point adds use different
99 opcodes, unlike the IL language (64 bit add is done with two
100 instructions on 32 bit architectures, using an add that sets
101 the carry and an add with carry).
103 A specific CPU port may assign any meaning to the clob field
104 for an instruction since the value will be processed in an
105 arch-specific file anyway.
107 See the top of the existing cpu-pentium.md file for more info
108 on other fields: the info may or may not be applicable to a
109 different CPU; in the latter case the info can be ignored.
111 The code in mini.c together with the BURG rules in inssel.brg,
112 inssel-float.brg and inssel-long32.brg provides general
113 purpose mappings from the tree representation to a set of
114 instructions that should be easily implemented in any
115 architecture. To allow for additional arch-specific
116 functionality, an arch-specific BURG file can be used: in this
117 file arch-specific instructions can be selected that provide
118 better performance than the general instructions or that
119 provide functionality that is needed by the JIT but that
120 cannot be expressed in a general enough way.
122 As an example, x86 has the special instruction "push" to make
123 it easier to implement the default call convention (passing
124 arguments on the stack): almost all the other architectures
125 don't have such an instruction (and don't need it anyway), so
126 we added a special rule in the inssel-x86.brg file for it.
128 So, one of the first things needed in a port is to write a
129 cpu-$(arch).md machine description file and fill it with the
130 needed info. As a start, only a few instructions can be
131 specified, like the ones required to do simple integer
132 operations. The default rules of the instruction selector will
133 emit the common instructions and so we're ready to go for the
134 next step in porting the JIT.
137 * Native code emission
139 Since the first step in porting mono to a new CPU is to port
140 the interpreter, there should already be a file that allows
141 the emission of binary native code in a buffer for the
142 architecture. This file should be placed in the
148 The bulk of the code emission happens in the mini-$(arch).c
149 file, in a function called mono_arch_output_basic_block ().
150 This function takes a basic block, walks the list of
151 instructions in the block and emits the binary code for each.
152 Optionally a peephole optimization pass is done on the basic
153 block, but this can be left for later, when the port actually
156 This function is very simple: there is just a big switch on
157 the instruction opcode and in the corresponding case the
158 functions or macros to emit the binary native code are
159 used. Note that in this function the lengths of the
160 instructions are used to determine if the buffer for the code
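
The shape of such a function can be sketched as follows. The
opcodes, the SketchIns record and every sketch_* name are
hypothetical and much simpler than the real JIT structures; the
byte values are the standard x86 encodings of nop, "add r/m32,
r32" and ret. Here the buffer check simply fails instead of
growing the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes and instruction record, for illustration only. */
typedef enum { SK_NOP, SK_ADD, SK_RET } SketchOpcode;

typedef struct SketchIns {
    SketchOpcode opcode;
    int dreg, sreg;              /* x86 register numbers, 0-7 */
    struct SketchIns *next;
} SketchIns;

/* Maximum encoded length of each opcode, as a len: field would give it. */
static size_t
sketch_max_len (SketchOpcode opcode)
{
    return opcode == SK_ADD ? 2 : 1;
}

/* Walk the list and emit code; returns bytes written, or -1 on failure. */
static int
sketch_output_basic_block (const SketchIns *ins, uint8_t *buf, size_t size)
{
    uint8_t *code = buf;

    assert (buf != NULL);
    for (; ins != NULL; ins = ins->next) {
        /* Use the instruction length to check the remaining space. */
        if ((size_t) (code - buf) + sketch_max_len (ins->opcode) > size)
            return -1;
        switch (ins->opcode) {
        case SK_NOP:
            *code++ = 0x90;
            break;
        case SK_ADD:
            /* add dreg, sreg: opcode 0x01, ModRM = 11 sreg dreg */
            *code++ = 0x01;
            *code++ = (uint8_t) (0xC0 | (ins->sreg << 3) | ins->dreg);
            break;
        case SK_RET:
            *code++ = 0xC3;
            break;
        default:
            return -1;           /* unknown opcode */
        }
    }
    return (int) (code - buf);
}

/* Usage: emit "add eax, ecx; ret" into the caller's buffer. */
static int
sketch_emit_demo (uint8_t *buf, size_t size)
{
    SketchIns ret = { SK_RET, 0, 0, NULL };
    SketchIns add = { SK_ADD, 0 /* eax */, 1 /* ecx */, &ret };

    return sketch_output_basic_block (&add, buf, size);
}
```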
163 To complete the code emission for a method, a few other
164 functions need implementing as well:
166 mono_arch_emit_prolog ()
167 mono_arch_emit_epilog ()
168 mono_arch_patch_code ()
170 mono_arch_emit_prolog () will emit the code to set up the stack
171 frame for a method, optionally call the callbacks used in
172 profiling and tracing, and move the arguments to their home
173 location (in a callee-saved register if the variable was
174 allocated to one, or in a stack location if the argument was
175 passed in a volatile register and wasn't allocated a
176 non-volatile one). Callee-saved registers used by the function
177 are saved in the prolog as well.
179 mono_arch_emit_epilog () will emit the code needed to return
180 from the function, optionally calling the profiling or tracing
181 callbacks. At this point the basic blocks or the code that was
182 moved out of the normal flow for the function can be emitted
183 as well (this is usually done to provide better info for the
184 static branch predictor). In the epilog, callee-saved
185 registers are restored if they were used.
187 Note that, to help exception handling and stack unwinding,
188 when there is a transition from managed to unmanaged code,
189 some special processing needs to be done (basically, saving
190 all the registers and setting up the links in the Last Managed
193 When the epilog has been emitted, the upper level code
194 arranges for the buffer of memory that contains the native
195 code to be copied in an area of executable memory and at this
196 point, instructions that use relative addressing need to be
197 patched to have the right offsets: this work is done by
198 mono_arch_patch_code ().
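
As an illustration of the patching step, the sketch below
rewrites the displacement of an x86 "call rel32" (0xE8) or
"jmp rel32" (0xE9) instruction. It is not the real
mono_arch_patch_code (): the function name is made up, and it
works with offsets inside a single buffer instead of absolute
addresses, which is enough here because the displacement is
relative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Patch the rel32 field of the call/jmp found at ins_offset so that it
 * transfers control to target_offset. Both offsets are relative to the
 * start of the same code buffer (an assumption of this sketch).
 */
static void
sketch_patch_rel32 (uint8_t *code, size_t ins_offset, size_t target_offset)
{
    /* The displacement is relative to the end of the 5-byte instruction. */
    int64_t delta = (int64_t) target_offset - (int64_t) (ins_offset + 5);
    uint32_t disp = (uint32_t) (int32_t) delta;
    int i;

    assert (code != NULL);
    assert (code[ins_offset] == 0xE8 || code[ins_offset] == 0xE9);
    assert (delta >= INT32_MIN && delta <= INT32_MAX);

    /* x86 stores the displacement in little-endian byte order. */
    for (i = 0; i < 4; i++)
        code[ins_offset + 1 + i] = (uint8_t) (disp >> (8 * i));
}
```

Because the displacement depends on where the instruction ends
up, this kind of fix-up can only be done once the final location
of the code is known.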
201 * Call conventions and register allocation
203 To account for the differences in the call conventions, a few functions need to
206 mono_arch_allocate_vars () assigns to both arguments and local
207 variables the offset relative to the frame register where they
208 are stored; dead variables are simply discarded. The total
209 amount of stack needed is calculated.
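
A minimal sketch of this kind of offset assignment is shown
below. The SketchVar record is hypothetical, the frame is
assumed to grow towards lower addresses, and the frame register
itself is assumed to be suitably aligned; the real
mono_arch_allocate_vars () also has to handle arguments and
platform-specific details.

```c
#include <assert.h>

/* Hypothetical variable record, for illustration only. */
typedef struct {
    int size;      /* size in bytes */
    int align;     /* required alignment, a power of two */
    int is_dead;   /* non-zero if the variable is never used */
    int offset;    /* out: offset from the frame register */
} SketchVar;

/* Assign frame offsets to the live variables; returns the stack size. */
static int
sketch_allocate_vars (SketchVar *vars, int count)
{
    int used = 0;
    int i;

    for (i = 0; i < count; i++) {
        int align = vars[i].align;

        if (vars[i].is_dead)
            continue;                     /* dead variables get no slot */
        assert (align > 0 && (align & (align - 1)) == 0);
        used += vars[i].size;
        used = (used + align - 1) & ~(align - 1);   /* round up */
        vars[i].offset = -used;           /* slot lies below the frame register */
    }
    return used;
}
```

With a 4-byte variable, a dead variable and an 8-byte variable,
the live ones end up at offsets -4 and -16 and the function
reports 16 bytes of stack.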
211 mono_arch_call_opcode () is the function that most closely
212 deals with the call convention on a given system. For each
213 argument to a function call, an instruction is created that
214 actually puts the argument where needed, be it the stack or a
215 specific register. This function can also re-arrange the order
216 of evaluation when multiple arguments are involved if needed
217 (for example, on x86 arguments are pushed on the stack in reverse
218 order). The function needs to carefully take into account
219 platform specific issues, like how structures are returned as
220 well as the differences in size and/or alignment of managed
221 and corresponding unmanaged structures.
223 The other chunk of code that needs to deal with the call
224 convention and other specifics of a CPU is the local register
225 allocator, implemented in a function named
226 mono_arch_local_regalloc (). The local allocator deals with a
227 basic block at a time and basically just allocates registers
228 for temporary values during expression evaluation, spilling
229 and unspilling as necessary.
231 The local allocator needs to take into account clobbering
232 information, both during simple instructions and during
233 function calls and it needs to deal with other
234 architecture-specific weirdnesses, like instructions that take
235 inputs only in specific registers or produce output only in some.
237 Some effort will later be put into moving most of the local
238 register allocator to a common file so that the code can be
239 shared more for similar, RISC-like CPUs. The register
240 allocator does a first pass on the instructions in a block,
241 collecting liveness information and in a backward pass on the
242 same list performs the actual register allocation, inserting
243 the instructions needed to spill values, if necessary.
245 The cross-platform local register allocator is now implemented
246 and it is documented in the jit-regalloc file.
248 When this part of code is implemented, some testing can be
249 done with the generated code for the new architecture. Most
250 helpful is the use of the --regression command line switch to
251 run the regression tests (basic.cs, for example).
253 Note that the JIT will try to initialize the runtime, but it
254 may not be able yet to compile and execute complex code:
255 commenting out most of the code in the mini_init() function in
256 mini.c is needed to let the JIT just compile the regression
257 tests. Also, using multiple -v switches on the command line
258 makes the JIT dump an increasing amount of information during
261 Values loaded into registers need to be extended as needed by
264 *) integers smaller than 4 bytes are extended to int32 values
265 *) 32 bit floats are extended to double precision (in particular
266 this means that currently all the floating point operations operate
271 To get better startup performance, the JIT actually compiles a
272 method only when needed. To achieve this, when a call to a
273 method is compiled, we actually emit a call to a magic
274 trampoline. The magic trampoline is a function written in
275 assembly that invokes the compiler to compile the given method
276 and jumps to the newly compiled code, ensuring the arguments
277 it received are passed correctly to the actual method.
279 Before jumping to the new code, though, the magic trampoline
280 takes care of patching the call site so that next time the
281 call will go directly to the method instead of the
282 trampoline. How does this all work?
284 mono_arch_create_jit_trampoline () creates a small function
285 that just preserves the arguments passed to it and adds an
286 additional argument (the method to compile) before calling the
287 generic trampoline. This small function is called the specific
288 trampoline, because it is method-specific (the method to
289 compile is hard-coded in the instruction stream).
291 The generic trampoline saves all the arguments that could get
292 clobbered and calls a C function that will do two things:
294 *) actually call the JIT to compile the method
295 *) identify the calling code so that it can be patched to call directly
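
The lazy compilation scheme can be modelled in portable C with
function pointers. In this sketch a call site is a structure
whose target starts out empty (standing for "points at the
trampoline"), and "compiling" a method just publishes a
ready-made C function. All the names are invented for the
example; real trampolines are generated machine code and patch
the call instruction itself.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*SketchCode) (int);

/* Hypothetical method record: body stands in for the JIT output. */
typedef struct {
    SketchCode compiled;     /* NULL until the method is compiled */
    SketchCode body;         /* what the JIT would generate */
    int compile_count;       /* how many times the JIT was invoked */
} SketchMethod;

/* Hypothetical call site: target == NULL means "call the trampoline". */
typedef struct {
    SketchMethod *method;    /* the method-specific part of the trampoline */
    SketchCode target;
} SketchCallSite;

static int
sketch_double (int x)
{
    return 2 * x;
}

/* Stands in for the JIT: compiles the method at most once. */
static SketchCode
sketch_compile (SketchMethod *method)
{
    if (method->compiled == NULL) {
        method->compiled = method->body;
        method->compile_count++;
    }
    return method->compiled;
}

/* Compile the method, patch the call site, then run the new code. */
static int
sketch_generic_trampoline (SketchCallSite *site, int arg)
{
    SketchCode code = sketch_compile (site->method);

    site->target = code;     /* next call goes straight to the method */
    return code (arg);       /* forward the original argument */
}

static int
sketch_call (SketchCallSite *site, int arg)
{
    assert (site != NULL && site->method != NULL);
    if (site->target != NULL)
        return site->target (arg);
    return sketch_generic_trampoline (site, arg);
}
```

The first sketch_call () through a site goes through the
trampoline and compiles the method; later calls through the same
site reach the compiled code directly.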
298 If the 'this' argument to a method is a boxed valuetype that
299 is passed to a method that expects just a pointer to the data,
300 an additional unboxing trampoline will need to be inserted as
306 Exception handling is likely the most difficult part of the
307 port, as it needs to deal with unwinding (both managed and
308 unmanaged code) and calling catch and filter blocks. It also
309 needs to deal with signals, because mono takes advantage of
310 the MMU in the CPU and of the operating system to handle
311 dereferences of the NULL pointer. Some of the functions needed
312 to implement the mechanisms are:
314 mono_arch_get_throw_exception () returns a function that takes
315 an exception object and invokes an arch-specific function that
316 will enter the exception processing. To do so, all the
317 relevant registers need to be saved and passed on.
319 mono_arch_handle_exception () takes the
320 exception thrown and a context that describes the state of the
321 CPU at the time the exception was thrown. The function needs
322 to implement the exception handling mechanism, so it
323 searches for a handler for the exception and if none is found,
324 it follows the unhandled exception path (that can print a
325 trace and exit or just abort the current thread). The
326 difficulty here is to unwind the stack correctly, by restoring
327 the register state at each call site in the call chain,
328 calling finally, filter and handler blocks while doing so.
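
The handler search alone can be sketched as below. This is a
simplification under several assumptions: frames are given as
a ready-made array (innermost first) instead of being recovered
by unwinding, exception types are plain integers, and finally
and filter blocks are ignored. None of the names come from the
mono sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical exception clause: protects code offsets [try_start, try_end). */
typedef struct {
    unsigned try_start, try_end;
    unsigned handler;            /* code offset of the catch handler */
    int exception_type;          /* type id caught; 0 catches everything */
} SketchClause;

/* Hypothetical frame: where it was executing, plus its method's clauses. */
typedef struct {
    unsigned ip;
    const SketchClause *clauses;
    int num_clauses;
} SketchFrame;

/*
 * Search from the innermost frame outwards for a clause that covers the
 * frame's ip and accepts the exception type. Returns the frame index and
 * stores the clause in *found, or returns -1 (unhandled exception).
 */
static int
sketch_find_handler (const SketchFrame *frames, int num_frames,
                     int exception_type, const SketchClause **found)
{
    int f, c;

    assert (found != NULL);
    for (f = 0; f < num_frames; f++) {
        for (c = 0; c < frames[f].num_clauses; c++) {
            const SketchClause *cl = &frames[f].clauses[c];
            int in_range = frames[f].ip >= cl->try_start &&
                           frames[f].ip < cl->try_end;
            int type_ok = cl->exception_type == 0 ||
                          cl->exception_type == exception_type;

            if (in_range && type_ok) {
                *found = cl;
                return f;
            }
        }
    }
    *found = NULL;
    return -1;
}

/* Usage data: the innermost frame has no clauses, its caller has one. */
static const SketchClause sketch_outer_clauses[] = {
    { 0x10, 0x40, 0x80, 7 },
};

static const SketchFrame sketch_demo_frames[] = {
    { 0x20, NULL, 0 },
    { 0x30, sketch_outer_clauses, 1 },
};
```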
330 As part of exception handling a couple of internal calls need
331 to be implemented as well.
333 ves_icall_get_frame_info () returns info about a specific
336 mono_jit_walk_stack () walks the stack and calls a callback with info for
339 ves_icall_get_trace () returns an array of StackFrame objects.
341 ** Code generation for filter/finally handlers
343 Filter and finally handlers are called from 2 different locations:
345 1.) from within the method containing the exception clauses
346 2.) from the stack unwinding code
348 To make this possible we implement them like subroutines,
349 ending with a "return" statement. The subroutine does not save
350 the base pointer, because we need access to the local
351 variables of the enclosing method. It is possible that
352 instructions inside those handlers modify the stack pointer,
353 thus we save the stack pointer at the start of the handler,
354 and restore it at the end. We have to use a "call" instruction
355 to execute such finally handlers.
357 The MIR code for filter and finally handlers looks like:
361 OP_END_FINALLY | OP_ENDFILTER(reg)
363 OP_START_HANDLER: should save the stack pointer somewhere
364 OP_END_FINALLY: restores the stack pointer and returns.
365 OP_ENDFILTER (reg): restores the stack pointer and returns the value in "reg".
367 ** Calling finally/filter handlers
369 There is a special opcode to call those handlers; it is called
370 OP_CALL_HANDLER. It simply emits a call instruction.
372 It's a bit more complex to call a handler from outside (in the
373 stack unwinding code), because we have to restore the whole
374 context of the method first. After that we simply emit a call
375 instruction to invoke the handler. It's usually possible to use
376 the same code to call filter and finally handlers (see
377 arch_get_call_filter).
379 ** Calling catch handlers
381 Catch handlers are always called from the stack unwinding
382 code. Unlike finally clauses or filters, catch handlers never
383 return. Instead we simply restore the whole context, and
384 restart execution at the catch handler.
386 ** Passing Exception objects to catch handlers and filters.
388 We use a local variable to store exception objects. The stack
389 unwinding code must store the exception object into this
390 variable before calling a catch handler or filter.
392 * Minor helper methods
394 A few minor helper methods are referenced from the arch-independent code.
397 *) mono_arch_cpu_optimizations ()
398 This function returns a mask of optimizations that
399 should be enabled for the current CPU and a mask of
400 optimizations that should be excluded, instead.
402 *) mono_arch_regname ()
403 Returns the name for a numeric register.
405 *) mono_arch_get_allocatable_int_vars ()
406 Returns a list of variables that can be allocated to
407 the integer registers in the current architecture.
409 *) mono_arch_get_global_int_regs ()
410 Returns a list of callee-saved registers that can be
411 used to allocate variables in the current method.
413 *) mono_arch_instrument_mem_needs ()
414 *) mono_arch_instrument_prolog ()
415 *) mono_arch_instrument_epilog ()
416 Functions needed to implement the profiling interface.
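
Two of these helpers are simple enough to sketch. The versions
below are illustrations, not the real implementations: the
sketch_* names, the optimization flags and the cpu_has_cmov
parameter (which replaces actual CPU detection) are all
assumptions. The register names follow the standard x86
register numbering.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical optimization flags, for illustration only. */
enum {
    SKETCH_OPT_PEEPHOLE = 1 << 0,
    SKETCH_OPT_CMOV     = 1 << 1
};

/* Returns the name of an x86 integer register given its number. */
static const char *
sketch_regname (int reg)
{
    static const char *const names[] = {
        "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
    };

    if (reg < 0 || reg >= (int) (sizeof (names) / sizeof (names[0])))
        return "unknown";
    return names[reg];
}

/*
 * Returns the optimizations to enable and stores in *exclude_mask the
 * ones that must stay disabled on this CPU.
 */
static unsigned
sketch_cpu_optimizations (int cpu_has_cmov, unsigned *exclude_mask)
{
    unsigned opts = SKETCH_OPT_PEEPHOLE;

    assert (exclude_mask != NULL);
    *exclude_mask = 0;
    if (cpu_has_cmov)
        opts |= SKETCH_OPT_CMOV;
    else
        *exclude_mask |= SKETCH_OPT_CMOV;
    return opts;
}
```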
419 * Writing regression tests
421 Regression tests for the JIT should be written for any bug
422 found in the JIT in one of the *.cs files in the mini
423 directory. Eventually all the operations of the JIT should be
424 tested (including the ones that get selected only when some
425 specific optimization is enabled).
428 * Platform specific optimizations
430 An example of a platform-specific optimization is the peephole
431 optimization: we look at a small window of code at a time and
432 we replace one or more instructions with others that perform
433 better for the given architecture or CPU.
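
As a small sketch of the idea, the pass below looks at pairs of
adjacent instructions and rewrites a load that immediately
follows a store to the same stack slot. The instruction record
and opcodes are hypothetical, and the sketch assumes that both
accesses have the same size and use the same base register; it
is not taken from any existing port.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical low-level opcodes, for illustration only. */
typedef enum { PEEP_STORE, PEEP_LOAD, PEEP_MOVE, PEEP_NOP } PeepOp;

typedef struct {
    PeepOp op;
    int reg;       /* STORE: source, LOAD/MOVE: destination */
    int reg2;      /* MOVE: source register, unused otherwise */
    int offset;    /* STORE/LOAD: stack slot offset */
} PeepIns;

/*
 * Window of two instructions:
 *   store [offset] <- r1 ; load r1 <- [offset]   becomes   store ; nop
 *   store [offset] <- r1 ; load r2 <- [offset]   becomes   store ; move r2 <- r1
 * Returns the number of instructions rewritten.
 */
static int
sketch_peephole (PeepIns *ins, int count)
{
    int changed = 0;
    int i;

    assert (ins != NULL || count == 0);
    for (i = 1; i < count; i++) {
        PeepIns *prev = &ins[i - 1];
        PeepIns *cur = &ins[i];

        if (prev->op != PEEP_STORE || cur->op != PEEP_LOAD ||
            prev->offset != cur->offset)
            continue;
        if (cur->reg == prev->reg) {
            cur->op = PEEP_NOP;          /* the value is already there */
        } else {
            cur->op = PEEP_MOVE;         /* avoid the memory access */
            cur->reg2 = prev->reg;
        }
        changed++;
    }
    return changed;
}
```

Whether such a rewrite pays off depends on the CPU, which is why
this kind of pass belongs to the architecture-specific code.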
435 * 64 bit support tips, by Zoltan Varga (vargaz@gmail.com)
437 For a 64-bit port of the Mono runtime, you will typically do
440 * need to use inssel-long.brg instead of
443 * need to implement lots of new opcodes:
444 OP_I<OP> is 32 bit op
445 OP_L<OP> and CEE_<OP> are 64 bit ops
448 The 64 bit version of an existing port might share the code
449 with the 32 bit port (for example SPARC/SPARCV9), or it might
450 be separate (x86/AMD64).
452 That will depend on the similarities of the two instructions
455 The runtime and most parts of the JIT are 64 bit clean
456 at this point, so the only parts which require changing are
457 the arch dependent files.