docs/mini-porting.txt

   1                        Mono JIT porting guide.
   2                    Paolo Molaro (lupus@ximian.com)
   3
   4 * Introduction
   5
   6         This document describes the process of porting the mono JIT
   7         to a new CPU architecture. The new mono JIT has been designed
   8         to make porting easier while still enabling the port to take
   9         full advantage of the new architecture's features and
  10         instructions. Knowledge of the mini architecture (described in
  11         the mini-doc.txt file) is a requirement for understanding this
  12         guide, as well as an earlier document about porting the mono
  13         interpreter (available on the web site).
  14
  15         There are six main areas that a port needs to implement to
  16         have a fully-functional JIT for a given architecture:
  17
  18                 1) instruction selection
  19                 2) native code emission
  20                 3) call conventions and register allocation
  21                 4) method trampolines
  22                 5) exception handling
  23                 6) minor helper methods
  24
  25         To take advantage of some not-so-common processor features
  26         (for example conditional execution of instructions as may be
  27         found on ARM or ia64), it may be necessary to develop a
  28         high-level optimization, but doing so is not a requirement for
  29         getting the JIT to work.
  30
  31         We'll look at each of the required steps in more detail. Note,
  32         though, that a new port may just as well start from a
  33         cut&paste of an existing port to a similar architecture (for
  34         example from x86 to amd64, or from powerpc to sparc).
  35
  36         The architecture specific code is split from the rest of the
  37         JIT, for example the x86 specific code and data is all
  38         included in the following files in the distribution:
  39
  40                 mini-x86.h mini-x86.c
  41                 inssel-x86.brg
  42                 cpu-pentium.md
  43                 tramp-x86.c
  44                 exceptions-x86.c
  45
  46         I suggest a similar split for other architectures as well.
  47
  48         Note that this document is still incomplete: some sections are
  49         only sketched and some are missing, but the important info to
  50         get a port going is already described.
  51
  52
  53 * Architecture-specific instructions and instruction selection.
  54
  55         The JIT already provides a set of instructions that can be
  56         easily mapped to a great variety of different processor
  57         instructions.  Sometimes it may be necessary or advisable to
  58         add a new instruction that represents more closely an
  59         instruction in the architecture.  A mini instruction can also
  60         be used to represent a short sequence of low-level CPU
  61         instructions, but note that each mini instruction
  62         represents the minimum amount of code the instruction
  63         scheduler will handle (i.e., the scheduler won't schedule the
  64         instructions that compose the low-level sequence as individual
  65         instructions, but just the whole sequence, as an indivisible
  66         block).
  67
  68         New instructions are created by adding a line in the
  69         mini-ops.h file, assigning an opcode and a name. To specify
  70         the input and output for the instruction, there are two
  71         different places, depending on the context in which the
  72         instruction gets used.
  73
  74         If the instruction is used in the tree representation, the
  75         input and output types are defined by the BURG rules in the
  76         *.brg files (the usual non-terminals are 'reg' to represent a
  77         normal register, 'lreg' to represent a register or two that
  78         hold a 64 bit value, 'freg' for a floating point register).
  79
  80         If an instruction is used as a low-level CPU instruction, the
  81         info is specified in a machine description file. The
  82         description file is processed by the genmdesc program to
  83         provide a data structure that can be easily used from C code
  84         to query the needed info about the instruction.
  85
  86         As an example, let's consider the add instruction for both x86
  87         and ppc:
  88
  89         x86 version:
  90                 add: dest:i src1:i src2:i len:2 clob:1
  91         ppc version:
  92                 add: dest:i src1:i src2:i len:4
  93
  94         Note that the instruction takes two input integer registers on
  95         both CPUs, but on x86 the first source register is clobbered
  96         (clob:1) and the length in bytes of the instruction differs.
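
             As a rough illustration of what "a data structure that can be
             easily used from C code" might look like, the sketch below
             encodes the two add lines above in a small table and looks
             them up by name. The InsDesc type, the table names and
             ins_desc_lookup () are hypothetical; the real output of
             genmdesc is laid out differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one machine description entry.
 * The real data produced by genmdesc is laid out differently. */
typedef struct {
    const char *name;   /* instruction name, e.g. "add" */
    char dest;          /* 'i' = integer register, 0 = none */
    char src1;
    char src2;
    int  len;           /* length in bytes of the native code */
    char clob;          /* arch-defined clobber info, 0 = none */
} InsDesc;

/* The two "add" lines from the text, one table per architecture. */
static const InsDesc x86_desc[] = {
    { "add", 'i', 'i', 'i', 2, '1' },
};
static const InsDesc ppc_desc[] = {
    { "add", 'i', 'i', 'i', 4, 0 },
};

/* Look up an entry by name; returns NULL if the instruction is unknown. */
static const InsDesc *
ins_desc_lookup (const InsDesc *table, size_t count, const char *name)
{
    assert (table != NULL && name != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp (table [i].name, name) == 0)
            return &table [i];
    }
    return NULL;
}
```

             With these tables, ins_desc_lookup (x86_desc, 1, "add")->len
             is 2 while the ppc entry reports 4, and only the x86 entry
             has a non-zero clob field.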
  97
  98         Note that integer adds and floating point adds use different
  99         opcodes, unlike in IL (a 64 bit add is done with two
 100         instructions on 32 bit architectures, using an add that sets
 101         the carry and an add with carry).
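
             The two-instruction 64 bit add can be modelled in portable C
             as follows. This is only a sketch of the arithmetic: the
             Long32Pair type and the function name are made up for the
             example, and a real port emits the two native instructions
             instead.

```c
#include <stdint.h>

/* A 64 bit value held in a pair of 32 bit "registers". */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} Long32Pair;

/* 64 bit add split in two 32 bit steps: the first add produces a
 * carry, the second one consumes it (add + add-with-carry). */
static Long32Pair
add64_with_carry (Long32Pair a, Long32Pair b)
{
    Long32Pair res;
    uint32_t carry;

    res.lo = a.lo + b.lo;            /* add, sets the carry */
    carry = res.lo < a.lo ? 1u : 0u; /* unsigned wrap-around means carry out */
    res.hi = a.hi + b.hi + carry;    /* add with carry */
    return res;
}
```

             Adding 1 to the pair (lo = 0xffffffff, hi = 1) wraps the low
             word to 0 and carries into the high word, giving hi = 2.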
 102
 103         A specific CPU port may assign any meaning to the clob field
 104         for an instruction since the value will be processed in an
 105         arch-specific file anyway.
 106
 107         See the top of the existing cpu-pentium.md file for more info
 108         on other fields: the info may or may not be applicable to a
 109         different CPU; in the latter case the info can be ignored.
 110
 111         The code in mini.c together with the BURG rules in inssel.brg,
 112         inssel-float.brg and inssel-long32.brg provides general
 113         purpose mappings from the tree representation to a set of
 114         instructions that should be easily implemented in any
 115         architecture.  To allow for additional arch-specific
 116         functionality, an arch-specific BURG file can be used: in this
 117         file arch-specific instructions can be selected that provide
 118         better performance than the general instructions or that
 119         provide functionality that is needed by the JIT but that
 120         cannot be expressed in a general enough way.
 121
 122         As an example, x86 has the special instruction "push" to make
 123         it easier to implement the default call convention (passing
 124         arguments on the stack): almost all the other architectures
 125         don't have such an instruction (and don't need it anyway), so
 126         we added a special rule in the inssel-x86.brg file for it.
 127
 128         So, one of the first things needed in a port is to write a
 129         cpu-$(arch).md machine description file and fill it with the
 130         needed info. As a start, only a few instructions can be
 131         specified, like the ones required to do simple integer
 132         operations. The default rules of the instruction selector will
 133         emit the common instructions and so we're ready to go for the
 134         next step in porting the JIT.
 135
 136
 137 * Native code emission
 138
 139         Since the first step in porting mono to a new CPU is to port
 140         the interpreter, there should already be a file that allows
 141         the emission of binary native code in a buffer for the
 142         architecture. This file should be placed in the
 143
 144                 mono/arch/$(arch)/
 145
 146         directory.
 147
 148         The bulk of the code emission happens in the mini-$(arch).c
 149         file, in a function called mono_arch_output_basic_block ().
 150         This function takes a basic block, walks the list of
 151         instructions in the block and emits the binary code for each.
 152         Optionally a peephole optimization pass is done on the basic
 153         block, but this can be left for later, when the port actually
 154         works.
 155
 156         This function is very simple: there is just a big switch on
 157         the instruction opcode and in the corresponding case the
 158         functions or macros to emit the binary native code are
 159         used. Note that in this function the lengths of the
 160         instructions are used to determine if the buffer for the code
 161         needs enlarging.
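
             The shape of such a function can be sketched as below. The
             opcodes, the SketchIns and CodeBuf types and the max_len
             table are all hypothetical stand-ins (the real function
             works on the mini data structures and uses the emission
             macros from mono/arch/$(arch)/); the three byte values are
             the x86 encodings of nop, int3 and ret.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical opcodes and instruction node; not the real mini types. */
enum { SKETCH_OP_NOP, SKETCH_OP_BREAK, SKETCH_OP_RET };

typedef struct SketchIns {
    int opcode;
    struct SketchIns *next;
} SketchIns;

typedef struct {
    uint8_t *code;   /* start of the native code buffer */
    size_t   len;    /* bytes emitted so far */
    size_t   size;   /* allocated size */
} CodeBuf;

/* Maximum native length per opcode, as a machine description would give. */
static const size_t max_len[] = { 1, 1, 1 };

/* Grow the buffer when the next instruction might not fit. */
static void
ensure_space (CodeBuf *buf, size_t needed)
{
    if (buf->len + needed <= buf->size)
        return;
    while (buf->len + needed > buf->size)
        buf->size = buf->size ? buf->size * 2 : 16;
    buf->code = realloc (buf->code, buf->size);
    assert (buf->code != NULL);
}

/* Walk the instruction list and emit x86 bytes for each opcode. */
static void
output_basic_block (CodeBuf *buf, const SketchIns *ins)
{
    for (; ins != NULL; ins = ins->next) {
        assert (ins->opcode >= SKETCH_OP_NOP && ins->opcode <= SKETCH_OP_RET);
        ensure_space (buf, max_len [ins->opcode]);
        switch (ins->opcode) {
        case SKETCH_OP_NOP:   buf->code [buf->len++] = 0x90; break; /* nop  */
        case SKETCH_OP_BREAK: buf->code [buf->len++] = 0xcc; break; /* int3 */
        case SKETCH_OP_RET:   buf->code [buf->len++] = 0xc3; break; /* ret  */
        default: break;
        }
    }
}
```

             The length check happens before the switch, so each case can
             write its bytes without further bounds checks.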
 162
 163         To complete the code emission for a method, a few other
 164         functions need implementing as well:
 165
 166                 mono_arch_emit_prolog ()
 167                 mono_arch_emit_epilog ()
 168                 mono_arch_patch_code ()
 169
 170         mono_arch_emit_prolog () will emit the code to set up the stack
 171         frame for a method, optionally call the callbacks used in
 172         profiling and tracing, and move the arguments to their home
 173         location (in a callee-saved register if the variable was
 174         allocated to one, or in a stack location if the argument was
 175         passed in a volatile register and wasn't allocated a
 176         non-volatile one). Callee-saved registers used by the function
 177         are saved in the prolog as well.
 178
 179         mono_arch_emit_epilog () will emit the code needed to return
 180         from the function, optionally calling the profiling or tracing
 181         callbacks. At this point the basic blocks or the code that was
 182         moved out of the normal flow for the function can be emitted
 183         as well (this is usually done to provide better info for the
 184         static branch predictor).  In the epilog, callee-saved
 185         registers are restored if they were used.
 186
 187         Note that, to help exception handling and stack unwinding,
 188         when there is a transition from managed to unmanaged code,
 189         some special processing needs to be done (basically, saving
 190         all the registers and setting up the links in the Last Managed
 191         Frame structure).
 192
 193         When the epilog has been emitted, the upper level code
 194         arranges for the buffer of memory that contains the native
 195         code to be copied to an area of executable memory. At this
 196         point, instructions that use relative addressing need to be
 197         patched to have the right offsets: this work is done by
 198         mono_arch_patch_code ().
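
             As an example of the kind of fix-up involved, the sketch
             below patches the displacement of an x86 "call rel32"
             instruction. It is a simplification: patch_call_rel32 () is
             a hypothetical helper, and both the call site and the target
             are assumed to be offsets inside the same buffer, whereas
             real patching works with the final addresses.

```c
#include <stddef.h>
#include <stdint.h>

/* Patch the 32 bit displacement of an x86 "call rel32" (opcode 0xe8).
 * The displacement is relative to the end of the 5 byte instruction.
 * 'call_offset' is the offset of the 0xe8 byte and 'target_offset' the
 * offset of the callee, both inside the buffer 'code'. */
static void
patch_call_rel32 (uint8_t *code, size_t call_offset, size_t target_offset)
{
    /* Unsigned wrap-around yields the two's complement encoding of a
     * negative (backward) displacement. */
    uint32_t disp = (uint32_t) target_offset - (uint32_t) (call_offset + 5);

    /* x86 stores immediates in little endian order: low byte first. */
    code [call_offset + 1] = (uint8_t) (disp & 0xff);
    code [call_offset + 2] = (uint8_t) ((disp >> 8) & 0xff);
    code [call_offset + 3] = (uint8_t) ((disp >> 16) & 0xff);
    code [call_offset + 4] = (uint8_t) ((disp >> 24) & 0xff);
}
```

             A call at offset 0 to a target at offset 16 gets the
             displacement 11 (16 minus the 5 bytes of the call itself).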
 199
 200
 201 * Call conventions and register allocation
 202
 203         To account for the differences in the call conventions, a few functions need to
 204         be implemented.
 205
 206         mono_arch_allocate_vars () assigns to both arguments and local
 207         variables the offset relative to the frame register where they
 208         are stored; dead variables are simply discarded. The total
 209         amount of stack needed is calculated as well.
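
             A minimal sketch of this kind of offset assignment follows,
             assuming a stack that grows downwards and a frame register
             that is itself suitably aligned. SketchVar and
             allocate_vars () are hypothetical and much simpler than the
             real code, which also has to follow the platform's argument
             layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variable record; not the real mini data structure. */
typedef struct {
    size_t size;    /* size in bytes */
    size_t align;   /* required alignment, a power of two */
    int    dead;    /* non-zero if the variable is never used */
    long   offset;  /* out: offset from the frame register */
} SketchVar;

/* Assign each live variable a negative offset from the frame register
 * and return the total amount of stack needed. */
static size_t
allocate_vars (SketchVar *vars, size_t count)
{
    size_t used = 0;
    for (size_t i = 0; i < count; i++) {
        SketchVar *v = &vars [i];
        if (v->dead)
            continue;                        /* dead variables get no slot */
        assert (v->align != 0 && (v->align & (v->align - 1)) == 0);
        used += v->size;
        used = (used + v->align - 1) & ~(v->align - 1);  /* round up */
        v->offset = -(long) used;
    }
    return used;
}
```

             A real port would additionally round the returned total up
             to the stack alignment required by the ABI.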
 210
 211         mono_arch_call_opcode () is the function that most closely
 212         deals with the call convention on a given system. For each
 213         argument to a function call, an instruction is created that
 214         actually puts the argument where needed, be it the stack or a
 215         specific register. This function can also re-arrange the order
 216         of evaluation when multiple arguments are involved if needed
 217         (for example, on x86 arguments are pushed on the stack in
 218         reverse order). The function needs to carefully take into
 219         account platform specific issues, like how structures are
 220         returned as well as the differences in size and/or alignment
 221         of managed and corresponding unmanaged structures.
 222
 223         The other chunk of code that needs to deal with the call
 224         convention and other specifics of a CPU is the local register
 225         allocator, implemented in a function named
 226         mono_arch_local_regalloc (). The local allocator deals with
 227         one basic block at a time and basically just allocates
 228         registers for temporary values during expression evaluation,
 229         spilling and unspilling as necessary.
 230
 231         The local allocator needs to take into account clobbering
 232         information, both during simple instructions and during
 233         function calls, and it needs to deal with other
 234         architecture-specific weirdnesses, like instructions that take
 235         inputs only in specific registers or output only in some.
 236
 237         Some effort will be put later into moving most of the local
 238         register allocator to a common file so that the code can be
 239         shared more for similar, risc-like CPUs.  The register
 240         allocator does a first pass on the instructions in a block,
 241         collecting liveness information, and in a backward pass on the
 242         same list performs the actual register allocation, inserting
 243         the instructions needed to spill values, if necessary.
 244
 245         The cross-platform local register allocator is now implemented
 246         and it is documented in the jit-regalloc file.
 247
 248         When this part of code is implemented, some testing can be
 249         done with the generated code for the new architecture. Most
 250         helpful is the use of the --regression command line switch to
 251         run the regression tests (basic.cs, for example).
 252
 253         Note that the JIT will try to initialize the runtime, but it
 254         may not be able yet to compile and execute complex code:
 255         commenting out most of the code in the mini_init() function in
 256         mini.c is needed to let the JIT just compile the regression
 257         tests.  Also, using multiple -v switches on the command line
 258         makes the JIT dump an increasing amount of information during
 259         compilation.
 260
 261         Values loaded into registers need to be extended as needed by
 262         the ECMA specs:
 263
 264         *) integers smaller than 4 bytes are extended to int32 values
 265         *) 32 bit floats are extended to double precision (in particular
 266         this means that currently all the floating point operations operate
 267         on doubles)
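
             In C terms, the two widening rules above correspond to the
             conversions below. The load_* helper names are made up for
             the example; a port would emit the matching sign-extending,
             zero-extending or float-converting load instructions.

```c
#include <stdint.h>

/* Integers smaller than 4 bytes are widened to int32 when loaded:
 * signed types are sign-extended, unsigned types are zero-extended. */
static int32_t load_i1 (const int8_t *p)   { return (int32_t) *p; }
static int32_t load_u1 (const uint8_t *p)  { return (int32_t) *p; }
static int32_t load_i2 (const int16_t *p)  { return (int32_t) *p; }
static int32_t load_u2 (const uint16_t *p) { return (int32_t) *p; }

/* 32 bit floats are widened to double precision when loaded. */
static double load_r4 (const float *p) { return (double) *p; }
```

             The byte 0xff therefore loads as -1 through load_i1 () but
             as 255 through load_u1 ().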
 268
 269 * Method trampolines
 270
 271         To get better startup performance, the JIT actually compiles a
 272         method only when needed. To achieve this, when a call to a
 273         method is compiled, we actually emit a call to a magic
 274         trampoline. The magic trampoline is a function written in
 275         assembly that invokes the compiler to compile the given method
 276         and jumps to the newly compiled code, ensuring the arguments
 277         it received are passed correctly to the actual method.
 278
 279         Before jumping to the new code, though, the magic trampoline
 280         takes care of patching the call site so that next time the
 281         call will go directly to the method instead of the
 282         trampoline. How does this all work?
 283
 284         mono_arch_create_jit_trampoline () creates a small function
 285         that just preserves the arguments passed to it and adds an
 286         additional argument (the method to compile) before calling the
 287         generic trampoline. This small function is called the specific
 288         trampoline, because it is method-specific (the method to
 289         compile is hard-coded in the instruction stream).
 290
 291         The generic trampoline saves all the arguments that could get
 292         clobbered and calls a C function that will do two things:
 293
 294         *) actually call the JIT to compile the method
 295         *) identify the calling code so that it can be patched to call directly
 296         the actual method
 297
 298         If the 'this' argument to a method is a boxed valuetype that
 299         is passed to a method that expects just a pointer to the data,
 300         an additional unboxing trampoline will need to be inserted as
 301         well.
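
             The compile-on-first-call and call-site patching logic can
             be modelled in plain C, with a function pointer slot playing
             the role of the call instruction. Everything here
             (SketchMethod, CallSite, magic_trampoline () and so on) is a
             hypothetical model: the real trampolines are generated
             machine code and patch the native call site itself.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*MethodCode) (int);

/* Hypothetical method record: 'compiled' stays NULL until first use. */
typedef struct {
    MethodCode compiled;
    int compile_count;
} SketchMethod;

/* A call site: the slot is what the "call" instruction jumps through. */
typedef struct {
    SketchMethod *method;
    MethodCode slot;     /* NULL means "still pointing at the trampoline" */
} CallSite;

static int method_body (int x) { return x * 2; }

/* Stand-in for the JIT: "compiles" the method on first request only. */
static MethodCode
compile_method (SketchMethod *m)
{
    if (m->compiled == NULL) {
        m->compiled = method_body;
        m->compile_count++;
    }
    return m->compiled;
}

/* Plays the role of the magic trampoline: compile, patch the call
 * site, then forward the original argument to the real code. */
static int
magic_trampoline (CallSite *site, int arg)
{
    MethodCode code = compile_method (site->method);
    site->slot = code;            /* patch: next call goes direct */
    return code (arg);
}

/* What executing the call instruction at the site amounts to. */
static int
call_through_site (CallSite *site, int arg)
{
    assert (site != NULL && site->method != NULL);
    if (site->slot == NULL)
        return magic_trampoline (site, arg);
    return site->slot (arg);
}
```

             The first call through a site compiles the method and
             patches the slot; later calls through the same site reach
             the compiled code directly and never compile again.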
 302
 303
 304 * Exception handling
 305
 306         Exception handling is likely the most difficult part of the
 307         port, as it needs to deal with unwinding (both managed and
 308         unmanaged code) and calling catch and filter blocks. It also
 309         needs to deal with signals, because mono takes advantage of
 310         the MMU in the CPU and of the operating system to handle
 311         dereferences of the NULL pointer. Some of the functions needed
 312         to implement the mechanisms are:
 313
 314         mono_arch_get_throw_exception () returns a function that takes
 315         an exception object and invokes an arch-specific function that
 316         will enter the exception processing.  To do so, all the
 317         relevant registers need to be saved and passed on.
 318
 319         mono_arch_handle_exception () takes the exception thrown and
 320         a context that describes the state of the CPU at the time the
 321         exception was thrown. The function needs to implement the
 322         exception handling mechanism, so it searches for a handler
 323         for the exception and if none is found, it follows the
 324         unhandled exception path (that can print a trace and exit or
 325         just abort the current thread). The difficulty here is to
 326         unwind the stack correctly, by restoring the register state
 327         at each call site in the call chain, calling finally, filter
 328         and handler blocks while doing so.
 329
 330         As part of exception handling a couple of internal calls need
 331         to be implemented as well.
 332
 333         ves_icall_get_frame_info () returns info about a specific
 334         frame.
 335
 336         mono_jit_walk_stack () walks the stack and calls a callback with info for
 337         each frame found.
 338
 339         ves_icall_get_trace () returns an array of StackFrame objects.
 340
 341 ** Code generation for filter/finally handlers
 342
 343         Filter and finally handlers are called from 2 different locations:
 344
 345                1.) from within the method containing the exception clauses
 346                2.) from the stack unwinding code
 347
 348         To make this possible we implement them like subroutines,
 349         ending with a "return" statement. The subroutine does not save
 350         the base pointer, because we need access to the local
 351         variables of the enclosing method. It is possible that
 352         instructions inside those handlers modify the stack pointer,
 353         thus we save the stack pointer at the start of the handler,
 354         and restore it at the end. We have to use a "call" instruction
 355         to execute such finally handlers.
 356
 357         The MIR code for filter and finally handlers looks like:
 358
 359             OP_START_HANDLER
 360             ...
 361             OP_END_FINALLY | OP_ENDFILTER(reg)
 362
 363         OP_START_HANDLER: should save the stack pointer somewhere.
 364         OP_END_FINALLY: restores the stack pointer and returns.
 365         OP_ENDFILTER (reg): restores the stack pointer and returns the value in "reg".
 366
 367 ** Calling finally/filter handlers
 368
 369         There is a special opcode to call those handlers, called
 370         OP_CALL_HANDLER. It simply emits a call instruction.
 371
 372         It's a bit more complex to call a handler from outside (in the
 373         stack unwinding code), because we have to restore the whole
 374         context of the method first. After that we simply emit a call
 375         instruction to invoke the handler. It's usually possible to use
 376         the same code to call filter and finally handlers (see
 377         arch_get_call_filter).
 378
 379 ** Calling catch handlers
 380
 381         Catch handlers are always called from the stack unwinding
 382         code. Unlike finally clauses or filters, catch handlers never
 383         return. Instead we simply restore the whole context, and
 384         restart execution at the catch handler.
 385
 386 ** Passing Exception objects to catch handlers and filters.
 387
 388         We use a local variable to store exception objects. The stack
 389         unwinding code must store the exception object into this
 390         variable before calling a catch handler or filter.
 391
 392 * Minor helper methods
 393
 394         A few minor helper methods are referenced from the arch-independent code.
 395         Some of them are:
 396
 397         *) mono_arch_cpu_optimizations ()
 398                 This function returns a mask of optimizations that
 399                 should be enabled for the current CPU and a mask of
 400                 optimizations that should be excluded, instead.
 401
 402         *) mono_arch_regname ()
 403                 Returns the name for a numeric register.
 404
 405         *) mono_arch_get_allocatable_int_vars ()
 406                 Returns a list of variables that can be allocated to
 407                 the integer registers in the current architecture.
 408
 409         *) mono_arch_get_global_int_regs ()
 410                 Returns a list of callee-saved registers that can be
 411                 used to allocate variables in the current method.
 412
 413         *) mono_arch_instrument_mem_needs ()
 414         *) mono_arch_instrument_prolog ()
 415         *) mono_arch_instrument_epilog ()
 416                 Functions needed to implement the profiling interface.
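
             To illustrate the two masks described for
             mono_arch_cpu_optimizations (), the sketch below uses
             made-up flag names and a made-up has_cmov probe. It only
             shows how an "enable" mask and an "exclude" mask can be
             produced and combined; the actual flags and signature should
             be taken from an existing port.

```c
#include <stdint.h>

/* Hypothetical optimization flags; the real ones live in the JIT headers. */
enum {
    SKETCH_OPT_PEEPHOLE = 1u << 0,
    SKETCH_OPT_CMOV     = 1u << 1,
    SKETCH_OPT_FCMOV    = 1u << 2
};

/* Returns the optimizations to enable for this CPU and stores in
 * *exclude_mask the ones that must be turned off. 'has_cmov' stands in
 * for a real CPU feature probe. */
static uint32_t
sketch_cpu_optimizations (int has_cmov, uint32_t *exclude_mask)
{
    uint32_t opts = 0;

    *exclude_mask = 0;
    if (has_cmov)
        opts |= SKETCH_OPT_CMOV;
    else
        *exclude_mask |= SKETCH_OPT_CMOV | SKETCH_OPT_FCMOV;
    return opts;
}

/* How a caller could combine its defaults with the two masks. */
static uint32_t
apply_cpu_optimizations (uint32_t defaults, uint32_t enable, uint32_t exclude)
{
    return (defaults | enable) & ~exclude;
}
```

             Excluded optimizations win over both the defaults and the
             enabled set, so a feature the CPU lacks can never be
             selected by accident.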
 417
 418 * Testing the port
 419
 420     The JIT has a set of regression tests in *.cs files inside the
 421     mini directory.
 422
 423     The usual method of testing a port is by compiling these tests on
 424     another machine with a working runtime by typing 'make rcheck',
 425     then copying TestDriver.dll and *.exe to the mini directory. The
 426     tests can be run by typing:
 427
 428         ./mono --regression <exe file name>
 429
 430     The suggested order for working through these tests is the
 431     following:
 432
 433         - basic.exe
 434         - basic-long.exe
 435         - basic-float.exe
 436         - basic-calls.exe
 437         - objects.exe
 438         - arrays.exe
 439         - exceptions.exe
 440         - iltests.exe
 441         - generics.exe
 442
 443 * Writing regression tests
 444
 445         Regression tests for the JIT should be written for any bug
 446         found in the JIT in one of the *.cs files in the mini
 447         directory. Eventually all the operations of the JIT should be
 448         tested (including the ones that get selected only when some
 449         specific optimization is enabled).
 450
 451
 452 * Platform specific optimizations
 453
 454         An example of a platform-specific optimization is the peephole
 455         optimization: we look at a small window of code at a time and
 456         we replace one or more instructions with others that perform
 457         better for the given architecture or CPU.
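
             A toy peephole pass with a window of two instructions might
             look like the sketch below: a load that reads back the stack
             slot just written from the same register is redundant and is
             replaced by a nop. The PhIns type and the PH_* opcodes are
             hypothetical, not mini opcodes.

```c
#include <stddef.h>

/* Hypothetical low-level opcodes for the example. */
enum { PH_NOP, PH_STORE, PH_LOAD, PH_ADD };

typedef struct {
    int opcode;
    int reg;     /* register operand */
    int slot;    /* stack slot for PH_STORE / PH_LOAD, unused otherwise */
} PhIns;

/* Two-instruction window: a load that reads back the slot just written
 * from the same register is redundant and is turned into a nop.
 * Returns the number of instructions rewritten. */
static size_t
peephole_pass (PhIns *ins, size_t count)
{
    size_t changed = 0;
    for (size_t i = 0; i + 1 < count; i++) {
        PhIns *a = &ins [i];
        PhIns *b = &ins [i + 1];
        if (a->opcode == PH_STORE && b->opcode == PH_LOAD &&
            a->reg == b->reg && a->slot == b->slot) {
            b->opcode = PH_NOP;
            changed++;
        }
    }
    return changed;
}
```

             A store followed by a load of the same slot into a different
             register is left alone, since that load is still needed.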
 458
 459 * 64 bit support tips, by Zoltan Varga (vargaz@gmail.com)
 460
 461         For a 64-bit port of the Mono runtime, you will typically
 462         need to do the following:
 463
 464                 * use inssel-long.brg instead of
 465                   inssel-long32.brg.
 466
 467                 * implement lots of new opcodes:
 468                        OP_I<OP> is 32 bit op
 469                        OP_L<OP> and CEE_<OP> are 64 bit ops
 470
 471
 472         The 64 bit version of an existing port might share the code
 473         with the 32 bit port (for example SPARC/SPARCV9), or it might
 474         be separate (x86/AMD64).
 475
 476         That will depend on the similarities of the two instruction
 477         sets/ABIs etc.
 478
 479         The runtime and most parts of the JIT are 64 bit clean
 480         at this point, so the only parts which require changing are
 481         the arch dependent files.
 482
 483 * Function descriptors
 484
 485         Some ABIs, like those for IA64 and PPC64, don't use direct
 486         function pointers, but so-called function descriptors.  A
 487         function descriptor is a short data structure which contains
 488         at least a pointer to the code of the function and a pointer
 489         to a GOT/TOC, which needs to be loaded into a specific
 490         register prior to jumping to the function.  Global variables
 491         and large constants are accessed through that register.
 492
 493         Mono does not need function descriptors for the JITted code,
 494         but we need to handle them when calling unmanaged code and we
 495         need to create them when passing managed code to unmanaged
 496         code.
 497
 498         mono_create_ftnptr() creates a function descriptor for a piece
 499         of generated code within a specific domain.
 500
 501         mono_get_addr_from_ftnptr() returns the pointer to the native
 502         code in a function descriptor.  Never use this function to
 503         generate a jump to a function without loading the GOT/TOC
 504         register unless the function descriptor was created by
 505         mono_create_ftnptr().
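
             Conceptually, a descriptor and the two operations on it can
             be pictured as below. This is only a model: the
             SketchFtnDesc layout and the sketch_* functions are
             hypothetical, they ignore domains, and the real layout is
             dictated by each ABI (some descriptors carry more than two
             words).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical two-word descriptor: code address plus GOT/TOC pointer. */
typedef struct {
    void *code;
    void *toc;
} SketchFtnDesc;

/* Wrap a code address in a freshly allocated descriptor. */
static SketchFtnDesc *
sketch_create_ftnptr (void *code, void *toc)
{
    SketchFtnDesc *desc = malloc (sizeof (SketchFtnDesc));
    assert (desc != NULL);
    desc->code = code;
    desc->toc = toc;
    return desc;
}

/* Recover the code address from a descriptor. */
static void *
sketch_get_addr_from_ftnptr (const SketchFtnDesc *desc)
{
    assert (desc != NULL);
    return desc->code;
}
```

             Jumping straight to the recovered code address skips the
             GOT/TOC load, which is why doing so is only safe for code
             that does not rely on that register.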
 506
 507         See the sources for IA64 and PPC64 on when to create and when
 508         to dereference function descriptors.  On PPC64 function
 509         descriptors for various generated helper functions (in
 510         exceptions-ppc.c and tramp-ppc.c) are generated in front of
 511         the code they refer to (see ppc_create_pre_code_ftnptr()).  On
 512         IA64 they are created separately.
 513
 514 * Emulated opcodes
 515
 516         Mini has code for emulating quite a few opcodes, most notably
 517         operations on longs, int/float conversions and atomic
 518         operations.  If an architecture wishes such an opcode to be
 519         emulated, mini produces icalls instead of those opcodes.  This
 520         should only be considered when the operation cannot be
 521         implemented efficiently and thus the overhead incurred by the
 522         icall is relatively small.  Emulation of operations is
 523         controlled by #defines in the arch header, but the naming is
 524         not consistent.  They usually start with MONO_ARCH_EMULATE_,
 525         MONO_ARCH_NO_EMULATE_ or MONO_ARCH_HAVE_.
 526
 527 * Generic code sharing
 528
 529         Generic code sharing is optional.  See the file
 530         "generic-sharing" for information on how to support it on an
 531         architecture.