docs/mini-doc.txt

   1
   2                A new JIT compiler for the Mono Project
   3
   4            Miguel de Icaza (miguel@{ximian.com,gnome.org}),
   5            Paolo Molaro (lupus@{ximian.com,debian.org})
   6
   7         This documents overall design of the Mono JIT up to version
   8         2.0.   After Mono 2.0 the JIT engine was upgraded from
   9         a tree-based intermediate representation to a linear
  10         intermediate representation.
  11
  12         The Linear IL is documented here:
  13
  14             http://www.mono-project.com/Linear_IL
  15
  16 * Abstract
  17
  18         Mini is a new compilation engine for the Mono runtime.  The
  19         new engine is designed to bring new code generation
  20         optimizations, portability and pre-compilation.
  21
  22         In this document we describe the design decisions and the
  23         architecture of the new compilation engine.
  24
  25 * Introduction
  26
  27         Mono is a Open Source implementation of the .NET Framework: it
  28         is made up of a runtime engine that implements the ECMA Common
  29         Language Infrastructure (CLI), a set of compilers that target
  30         the CLI and a large collection of class libraries.
  31
  32         This article discusses the new code generation facilities that
  33         have been added to the Mono runtime.
  34
  35         First we discuss the overall architecture of the Mono runtime,
  36         and how code generation fits into it; Then we discuss the
  37         development and basic architecture of our first JIT compiler
  38         for the ECMA CIL framework.  The next section covers the
  39         objectives for the work on the new JIT compiler, then we
  40         discuss the new features available in the new JIT compiler,
  41         and finally a technical description of the new code generation
  42         engine.
  43
  44 * Architecture of the Mono Runtime
  45
  46         The Mono runtime is an implementation of the ECMA Common
  47         Language Infrastructure (CLI), whose aim is to be a common
  48         platform for executing code in multiple languages.
  49
  50         Languages that target the CLI generate images that contain
  51         code in high-level intermediate representation called the
  52         "Common Intermediate Language".  This intermediate language is
  53         rich enough to allow for programs and pre-compiled libraries
  54         to be reflected.  The execution environment allows for an
  55         object oriented execution environment with single inheritance
  56         and multiple interface implementations.
  57
  58         This runtime provides a number of services for programs that
  59         are targeted to it: Just-in-Time compilation of CIL code into
  60         native code, garbage collection, thread management, I/O
  61         routines, single, double and decimal floating point,
  62         asynchronous method invocation, application domains, and a
  63         framework for building arbitrary RPC systems (remoting) and
  64         integration with system libraries through the Platform Invoke
  65         functionality.
  66
  67         The focus of this document is on the services provided by the
  68         Mono runtime to transform CIL bytecodes into code that is
  69         native to the underlying architecture.
  70
  71         The code generation interface is a set of macros that allow a
  72         C programmer to generate code on the fly, this is done
  73         through a set of macros found in the mono/jit/arch/ directory.
  74         These macros are used by the JIT compiler to generate native
  75         code.
  76
  77         The platform invocation code is interesting, as it generates
  78         CIL code on the fly to marshal parameters, and then this
  79         code is in turned processed by the JIT engine.
  80
  81         Mono has now gone through three different JIT engines, these
  82         are:
  83
  84         * Original JIT engine: 2002, hard to port, hard to
  85           implement optimizations.
  86
  87         * Second JIT engine, used up until Mono 2.0, very
  88           portable, many new optimizations.
  89
  90         * Third JIT engine, replaced the code generation layer from
  91           being based on a tree representation to be based on a linear
  92           representation.
  93
  94         For more information on the code generation changes, see our
  95         web site for the details on the Linear IL:
  96
  97             http://www.mono-project.com/Linear_IL
  98
  99 * Previous Experiences
 100
 101         Mono has built a JIT engine, which has been used to bootstrap
 102         Mono since January, 2002.  This JIT engine has reasonable
 103         performance, and uses an tree pattern matching instruction
 104         selector based on the BURS technology.  This JIT compiler was
 105         designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza.
 106
 107         The existing JIT compiler has three phases:
 108
 109                 * Re-creation of the semantic tree from CIL
 110                   byte-codes.
 111
 112                 * Instruction selection, with a cost-driven
 113                   engine.
 114
 115                 * Code generation and register allocation.
 116
 117         It is also hooked into the rest of the runtime to provide
 118         services like marshaling, just-in-time compilation and
 119         invocation of "internal calls".
 120
 121         This engine constructed a collection of trees, which we
 122         referred to as the "forest of trees", this forest is created by
 123         "hydrating" the CIL instruction stream.
 124
 125         The first step was to identify the basic blocks on the method,
 126         and computing the control flow graph (cfg) for it.  Once this
 127         information was computed, a stack analysis on each basic block
 128         was performed to create a forest of trees for each one of
 129         them.
 130
 131         So for example, the following statement:
 132
 133                int a, b;
 134                ...
 135                b = a + 1;
 136
 137         Which would be represented in CIL as:
 138
 139                         ldloc.0
 140                         ldc.i4.1
 141                         add
 142                         stloc.1
 143
 144         After the stack analysis would create the following tree:
 145
 146                (STIND_I4 ADDR_L[EBX|2] (
 147                          ADD (LDIND_I4 ADDR_L[ESI|1])
 148                          CONST_I4[1]))
 149
 150         This tree contains information from the stack analysis: for
 151         instance, notice that the operations explicitly encode the
 152         data types they are operating on, there is no longer an
 153         ambiguity on the types, because this information has been
 154         inferred.
 155
 156         At this point the JIT would pass the constructed forest of
 157         trees to the architecture-dependent JIT compiler.
 158
 159         The architecture dependent code then performed register
 160         allocation (optionally using linear scan allocation for
 161         variables, based on life analysis).
 162
 163         Once variables had been assigned, a tree pattern matching with
 164         dynamic programming is used (the tree pattern matcher is
 165         custom build for each architecture, using a code
 166         generator: monoburg). The instruction selector used cost
 167         functions to select the best instruction patterns.
 168
 169         The instruction selector is able to produce instructions that
 170         take advantage of the x86 instruction indexing instructions
 171         for example.
 172
 173         One problem though is that the code emitter and the register
 174         allocator did not have any visibility outside the current
 175         tree, which meant that some redundant instructions were
 176         generated.  A peephole optimizer with this architecture was
 177         hard to write, given the tree-based representation that is
 178         used.
 179
 180         This JIT was functional, but it did not provide a good
 181         architecture to base future optimizations on.  Also the
 182         line between architecture neutral and architecture
 183         specific code and optimizations was hard to draw.
 184
 185         The JIT engine supported two code generation modes to support
 186         the two optimization modes for applications that host multiple
 187         application domains: generate code that will be shared across
 188         application domains, or generate code that will not be shared
 189         across application domains.
 190
 191 * Second Generation JIT engine.
 192
 193         We wanted to support a number of features that were missing:
 194
 195            * Ahead-of-time compilation.
 196
 197              The idea is to allow developers to pre-compile their code
 198              to native code to reduce startup time, and the working
 199              set that is used at runtime in the just-in-time compiler.
 200
 201              Although in Mono this has not been a visible problem, we
 202              wanted to pro-actively address this problem.
 203
 204              When an assembly (a Mono/.NET executable) is installed in
 205              the system, it would then be possible to pre-compile the
 206              code, and have the JIT compiler tune the generated code
 207              to the particular CPU on which the software is
 208              installed.
 209
 210              This is done in the Microsoft.NET world with a tool
 211              called ngen.exe
 212
 213            * Have a good platform for doing code optimizations.
 214
 215              The design called for a good architecture that would
 216              enable various levels of optimizations: some
 217              optimizations are better performed on high-level
 218              intermediate representations, some on medium-level and
 219              some at low-level representations.
 220
 221              Also it should be possible to conditionally turn these on
 222              or off.  Some optimizations are too expensive to be used
 223              in just-in-time compilation scenarios, but these
 224              expensive optimizations can be turned on for
 225              ahead-of-time compilations or when using profile-guided
 226              optimizations on a subset of the executed methods.
 227
 228            * Reduce the effort required to port the Mono code
 229              generator to new architectures.
 230
 231              For Mono to gain wide adoption in the Unix world, it is
 232              necessary that the JIT engine works in most of today's
 233              commercial hardware platforms.
 234
 235 * Features of the Second JIT engine.
 236
 237         The new JIT engine was architected by Dietmar Maurer and Paolo
 238         Molaro, based on the new objectives.
 239
 240         Mono provides a number of services to applications running
 241         with the new JIT compiler:
 242
 243              * Just-in-Time compilation of CLI code into native code.
 244
 245              * Ahead-of-Time compilation of CLI code, to reduce
 246                startup time of applications.
 247
 248         A number of software development features are also available:
 249
 250              * Execution time profiling (--profile)
 251
 252                Generates a report of the times consumed by routines,
 253                as well as the invocation times, as well as the
 254                callers.
 255
 256              * Memory usage profiling (--profile)
 257
 258                Generates a report of the memory usage by a program
 259                that is ran under the Mono JIT.
 260
 261              * Code coverage (--coverage)
 262
 263              * Execution tracing.
 264
 265         People who are interested in developing and improving the Mini
 266         JIT compiler will also find a few useful routines:
 267
 268              * Compilation times
 269
 270                This is used to time the execution time for the JIT
 271                when compiling a routine.
 272
 273              * Control Flow Graph and Dominator Tree drawing.
 274
 275                These are visual aids for the JIT developer: they
 276                render representations of the Control Flow graph, and
 277                for the more advanced optimizations, they draw the
 278                dominator tree graph.
 279
 280                This requires Dot (from the graphwiz package) and Ghostview.
 281
 282              * Code generator regression tests.
 283
 284                The engine contains support for running regression
 285                tests on the virtual machine, which is very helpful to
 286                developers interested in improving the engine.
 287
 288              * Optimization benchmark framework.
 289
 290                The JIT engine will generate graphs that compare
 291                various benchmarks embedded in an assembly, and run the
 292                various tests with different optimization flags.
 293
 294                This requires Perl, GD::Graph.
 295
 296 * Flexibility
 297
 298         This is probably the most important component of the new code
 299         generation engine.  The internals are relatively easy to
 300         replace and update, even large passes can be replaced and
 301         implemented differently.
 302
 303 * New code generator
 304
 305         Compiling a method begins with the `mini_method_to_ir' routine
 306         that converts the CIL representation into a medium
 307         intermediate representation.
 308
 309         The mini_method_to_ir routine performs a number of operations:
 310
 311             * Flow analysis and control flow graph computation.
 312
 313               Unlike the previous version, stack analysis and control
 314               flow graphs are computed in a single pass in the
 315               mini_method_to_ir function, this is done for performance
 316               reasons: although the complexity increases, the benefit
 317               for a JIT compiler is that there is more time available
 318               for performing other optimizations.
 319
 320             * Basic block computation.
 321
 322               mini_method_to_ir populates the MonoCompile structure
 323               with an array of basic blocks each of which contains
 324               forest of trees made up of MonoInst structures.
 325
 326             * Inlining
 327
 328               Inlining is no longer restricted to methods containing
 329               one single basic block, instead it is possible to inline
 330               arbitrary complex methods.
 331
 332               The heuristics to choose what to inline are likely going
 333               to be tuned in the future.
 334
 335             * Method to opcode conversion.
 336
 337               Some method call invocations like `call Math.Sin' are
 338               transformed into an opcode: this transforms the call
 339               into a semantically rich node, which is later inline
 340               into an FPU instruction.
 341
 342               Various Array methods invocations are turned into
 343               opcodes as well (The Get, Set and Address methods)
 344
 345             * Tail recursion elimination
 346
 347         Basic blocks ****
 348
 349         The MonoInst structure holds the actual decoded instruction,
 350         with the semantic information from the stack analysis.
 351         MonoInst is interesting because initially it is part of a tree
 352         structure, here is a sample of the same tree with the new JIT
 353         engine:
 354
 355                  (stind.i4 regoffset[0xffffffd4(%ebp)]
 356                            (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
 357                                  iconst[1]))
 358
 359         This is a medium-level intermediate representation (MIR).
 360
 361         Some complex opcodes are decomposed at this stage into a
 362         collection of simpler opcodes.  Not every complex opcode is
 363         decomposed at this stage, as we need to preserve the semantic
 364         information during various optimization phases.
 365
 366         For example a NEWARR opcode carries the length and the type of
 367         the array that could be used later to avoid type checking or
 368         array bounds check.
 369
 370         There are a number of operations supported on this
 371         representation:
 372
 373                 * Branch optimizations.
 374
 375                 * Variable liveness.
 376
 377                 * Loop optimizations: the dominator trees are
 378                   computed, loops are detected, and their nesting
 379                   level computed.
 380
 381                 * Conversion of the method into static single assignment
 382                   form (SSA form).
 383
 384                 * Dead code elimination.
 385
 386                 * Constant propagation.
 387
 388                 * Copy propagation.
 389
 390                 * Constant folding.
 391
 392         Once the above optimizations are optionally performed, a
 393         decomposition phase is used to turn some complex opcodes into
 394         internal method calls.  In the initial version of the JIT
 395         engine, various operations on longs are emulated instead of
 396         being inlined.  Also the newarr invocation is turned into a
 397         call to the runtime.
 398
 399         At this point, after computing variable liveness, it is
 400         possible to use the linear scan algorithm for allocating
 401         variables to registers.  The linear scan pass uses the
 402         information that was previously gathered by the loop nesting
 403         and loop structure computation to favor variables in inner
 404         loops.   This process updates the basic block `nesting' field
 405         which is later used during liveness analysis.
 406
 407         Stack space is then reserved for the local variables and any
 408         temporary variables generated during the various
 409         optimizations.
 410
 411 ** Instruction selection: Only used up until Mono 2.0
 412
 413         At this point, the BURS instruction selector is invoked to
 414         transform the tree-based representation into a list of
 415         instructions.  This is done using a tree pattern matcher that
 416         is generated for the architecture using the `monoburg' tool.
 417
 418         Monoburg takes as input a file that describes tree patterns,
 419         which are matched against the trees that were produced by the
 420         engine in the previous stages.
 421
 422         The pattern matching might have more than one match for a
 423         particular tree.  In this case, the match selected is the one
 424         whose cost is the smallest.  A cost can be attached to each
 425         rule, and if no cost is provided, the implicit cost is one.
 426         Smaller costs are selected over higher costs.
 427
 428         The cost function can be used to select particular blocks of
 429         code for a given architecture, or by using a prohibitive high
 430         number to avoid having the rule match.
 431
 432         The various rules that our JIT engine uses transform a tree of
 433         MonoInsts into a list of monoinsts:
 434
 435         +-----------------------------------------------------------+
 436         | Tree                                           List       |
 437         | of           ===> Instruction selection ===>   of         |
 438         | MonoInst                                       MonoInst.  |
 439         +-----------------------------------------------------------+
 440
 441         During this process various "types" of MonoInst kinds
 442         disappear and turned into lower-level representations.  The
 443         JIT compiler just happens to reuse the same structure (this is
 444         done to reduce memory usage and improve memory locality).
 445
 446         The instruction selection rules are split in a number of
 447         files, each one with a particular purpose:
 448
 449                 inssel.brg
 450                         Contains the generic instruction selection
 451                         patterns.
 452
 453                 inssel-x86.brg
 454                         Contains x86 specific rules.
 455
 456                 inssel-ppc.brg
 457                         Contains PowerPC specific rules.
 458
 459                 inssel-long32.brg
 460                         burg file for 64bit instructions on 32bit architectures.
 461
 462                 inssel-long.brg
 463                         burg file for 64bit architectures.
 464
 465                 inssel-float.brg
 466                         burg file for floating point instructions
 467
 468         For a given build, a set of those files would be included.
 469         For example, for the build of Mono on the x86, the following
 470         set is used:
 471
 472             inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg
 473
 474 ** Native method generation
 475
 476         The native method generation has a number of steps:
 477
 478                 * Architecture specific register allocation.
 479
 480                   The information about loop nesting that was
 481                   previously gathered is used here to hint the
 482                   register allocator.
 483
 484                 * Generating the method prolog/epilog.
 485
 486                 * Optionally generate code to introduce tracing facilities.
 487
 488                 * Hooking into the debugger.
 489
 490                 * Performing any pending fixups.
 491
 492                 * Code generation.
 493
 494 *** Code Generation
 495
 496         The actual code generation is contained in the architecture
 497         specific portion of the compiler.  The input to the code
 498         generator is each one of the basic blocks with its list of
 499         instructions that were produced in the instruction selection
 500         phase.
 501
 502         During the instruction selection phase, virtual registers are
 503         assigned.  Just before the peephole optimization is performed,
 504         physical registers are assigned.
 505
 506         A simple peephole and algebraic optimizer is ran at this
 507         stage.
 508
 509         The peephole optimizer removes some redundant operations at
 510         this point.  This is possible because the code generation at
 511         this point has visibility into the basic block that spans the
 512         original trees.
 513
 514         The algebraic optimizer performs some simple algebraic
 515         optimizations that replace expensive operations with cheaper
 516         operations if possible.
 517
 518         The rest of the code generation is fairly simple: a switch
 519         statement is used to generate code for each of the MonoInsts,
 520         in the mono/mini/mini-ARCH.c files, the method is called
 521         "mono_arch_output_basic_block".
 522
 523         We always try to allocate code in sequence, instead of just using
 524         malloc. This way we increase spatial locality which gives a massive
 525         speedup on most architectures.
 526
 527 *** Ahead of Time compilation
 528
 529         Ahead-of-Time compilation is a new feature of our new
 530         compilation engine.  The compilation engine is shared by the
 531         Just-in-Time (JIT) compiler and the Ahead-of-Time compiler
 532         (AOT).
 533
 534         The difference is on the set of optimizations that are turned
 535         on for each mode: Just-in-Time compilation should be as fast
 536         as possible, while Ahead-of-Time compilation can take as long
 537         as required, because this is not done at a time critical
 538         time.
 539
 540         With AOT compilation, we can afford to turn all of the
 541         computationally expensive optimizations on.
 542
 543         After the code generation phase is done, the code and any
 544         required fixup information is saved into a file that is
 545         readable by "as" (the native assembler available on all
 546         systems). This assembly file is then passed to the native
 547         assembler, which generates a loadable module.
 548
 549         At execution time, when an assembly is loaded from the disk,
 550         the runtime engine will probe for the existence of a
 551         pre-compiled image.  If the pre-compiled image exists, then it
 552         is loaded, and the method invocations are resolved to the code
 553         contained in the loaded module.
 554
 555         The code generated under the AOT scenario is slightly
 556         different than the JIT scenario.  It generates code that is
 557         application-domain relative and that can be shared among
 558         multiple thread.
 559
 560         This is the same code generation that is used when the runtime
 561         is instructed to maximize code sharing on a multi-application
 562         domain scenario.
 563
 564 * SSA-based optimizations
 565
 566         SSA form simplifies many optimization because each variable
 567         has exactly one definition site.  This means that each
 568         variable is only initialized once.
 569
 570         For example, code like this:
 571
 572             a = 1
 573             ..
 574             a = 2
 575             call (a)
 576
 577         Is internally turned into:
 578
 579             a1 = 1
 580             ..
 581             a2 = 2
 582             call (a2)
 583
 584         In the presence of branches, like:
 585
 586             if (x)
 587                  a = 1
 588             else
 589                  a = 2
 590
 591             call (a)
 592
 593         The code is turned into:
 594
 595             if (x)
 596                  a1 = 1;
 597             else
 598                  a2 = 2;
 599             a3 = phi (a1, a2)
 600             call (a3)
 601
 602         All uses of a variable are "dominated" by its definition
 603
 604         This representation is useful as it simplifies the
 605         implementation of a number of optimizations like conditional
 606         constant propagation, array bounds check removal and dead code
 607         elimination.
 608
 609 * Register allocation.
 610
 611         Global register allocation is performed on the medium
 612         intermediate representation just before instruction selection
 613         is performed on the method.  Local register allocation is
 614         later performed at the basic-block level on the
 615
 616         Global register allocation uses the following input:
 617
 618         1) set of register-sized variables that can be allocated to a
 619         register (this is an architecture specific setting, for x86
 620         these registers are the callee saved register ESI, EDI and
 621         EBX).
 622
 623         2) liveness information for the variables
 624
 625         3) (optionally) loop info to favor variables that are used in
 626         inner loops.
 627
 628         During instruction selection phase, symbolic registers are
 629         assigned to temporary values in expressions.
 630
 631         Local register allocation assigns hard registers to the
 632         symbolic registers, and it is performed just before the code
 633         is actually emitted and is performed at the basic block level.
 634         A CPU description file describes the input registers, output
 635         registers, fixed registers and clobbered registers by each
 636         operation.
 637
 638 * BURG Code Generator Generator: Only used up to Mono 2.0
 639
 640        monoburg was written by Dietmar Maurer. It is based on the
 641        papers from Christopher W. Fraser, Robert R. Henry and Todd
 642        A. Proebsting: "BURG - Fast Optimal Instruction Selection and
 643        Tree Parsing" and "Engineering a Simple, Efficient Code
 644        Generator Generator".
 645
 646        The original BURG implementation is unable to work on DAGs, instead only
 647        trees are allowed. Our monoburg implementations is able to generate tree
 648        matcher which works on DAGs, and we use this feature in the new
 649        JIT. This simplifies the code because we can directly pass DAGs and
 650        don't need to convert them to trees.
 651
 652 * Adding IL opcodes: an excercise (from a post by Paolo Molaro)
 653
 654         mini.c is the file that read the IL code stream and decides
 655         how any single IL instruction is implemented
 656         (mono_method_to_ir () func), so you always have to add an
 657         entry to the big switch inside the function: there are plenty
 658         of examples in that file.
 659
 660         An IL opcode can be implemented in a number of ways, depending
 661         on what it does and how it needs to do it.
 662
 663         Some opcodes are implemented using a helper function: one of
 664         the simpler examples is the CEE_STELEM_REF implementation.
 665
 666         In this case the opcode implementation is written in a C
 667         function.  You will need to register the function with the jit
 668         before you can use it (mono_register_jit_call) and you need to
 669         emit the call to the helper using the mono_emit_jit_icall()
 670         function.
 671
 672         This is the simpler way to add a new opcode and it doesn't
 673         require any arch-specific change (though it's limited to what
 674         you can do in C code and the performance may be limited by the
 675         function call).
 676
 677         Other opcodes can be implemented with one or more of the already
 678         implemented low-level instructions.
 679
 680         An example is the OP_STRLEN opcode which implements
 681         String.Length using a simple load from memory.  In this case
 682         you need to add a rule to the appropriate burg file,
 683         describing what are the arguments of the opcode and what is,
 684         if any, it's 'return' value.
 685
 686         The OP_STRLEN case is:
 687
 688         reg: OP_STRLEN (reg) {
 689                 MONO_EMIT_LOAD_MEMBASE_OP (s, tree, OP_LOADI4_MEMBASE, state->reg1,
 690                         state->left->reg1, G_STRUCT_OFFSET (MonoString, length));
 691         }
 692
 693         The above means: the OP_STRLEN takes a register as an argument
 694         and returns its value in a register.  And the implementation
 695         of this is included in the braces.
 696
 697         The opcode returns a value in an integer register
 698         (state->reg1) by performing a int32 load of the length field
 699         of the MonoString represented by the input register
 700         (state->left->reg1): before the burg rules are applied, the
 701         internal representation is based on trees, so you get the
 702         left/right pointers (state->left and state->right
 703         respectively, the result is stored in state->reg1).
 704
 705         This instruction implementation doesn't require arch-specific
 706         changes (it is using the MONO_EMIT_LOAD_MEMBASE_OP which is
 707         available on all platforms), and usually the produced code is
 708         fast.
 709
 710         Next we have opcodes that must be implemented with new low-level
 711         architecture specific instructions (either because of performance
 712         considerations or because the functionality can't get implemented in
 713         other ways).
 714
 715         You also need a burg rule in this case, too. For example,
 716         consider the OP_CHECK_THIS opcode (used to raise an exception
 717         if the this pointer is null). The burg rule simply reads:
 718
 719         stmt: OP_CHECK_THIS (reg) {
 720                 mono_bblock_add_inst (s->cbb, tree);
 721         }
 722
 723         Note that this opcode does not return a value (hence the
 724         "stmt") and it takes a register as input.
 725
 726         mono_bblock_add_inst (s->cbb, tree) just adds the instruction
 727         (the tree variable) to the current basic block (s->cbb). In
 728         mini this is the place where the internal representation
 729         switches from the tree format to the low-level format (the
 730         list of simple instructions).
 731
 732         In this case the actual opcode implementation is delegated to
 733         the arch-specific code.  A low-level opcode needs an entry in
 734         the machine description (the *.md files in mini/). This entry
 735         describes what kind of registers are used if any by the
 736         instruction, as well as other details such as constraints or
 737         other hints to the low-level engine which are architecture
 738         specific.
 739
 740         cpu-pentium.md, for example has the following entry:
 741
 742         checkthis: src1:b len:3
 743
 744         This means the instruction uses an integer register as a base
 745         pointer (basically a load or store is done on it) and it takes
 746         3 bytes of native code to implement it.
 747
 748         Now you just need to provide the low-level implementation for
 749         the opcode in one of the mini-$arch.c files, in the
 750         mono_arch_output_basic_block() function. There is a big switch
 751         here too. The x86 implementation is:
 752
 753                 case OP_CHECK_THIS:
 754                         /* ensure ins->sreg1 is not NULL */
 755                         x86_alu_membase_imm (code, X86_CMP, ins->sreg1, 0, 0);
 756                         break;
 757
 758         If the $arch-codegen.h header file doesn't have the code to
 759         emit the low-level native code, you'll need to write that as
 760         well.
 761
 762         Complex opcodes with register constraints may require other
 763         changes to the local register allocator, but usually they are
 764         not needed.
 765
 766 * Future
 767
 768         Profile-based optimization is something that we are very
 769         interested in supporting.  There are two possible usage
 770         scenarios:
 771
 772            * Based on the profile information gathered during
 773              the execution of a program, hot methods can be compiled
 774              with the highest level of optimizations, while bootstrap
 775              code and cold methods can be compiled with the least set
 776              of optimizations and placed in a discardable list.
 777
 778            * Code reordering: this profile-based optimization would
 779              only make sense for pre-compiled code.  The profile
 780              information is used to re-order the assembly code on disk
 781              so that the code is placed on the disk in a way that
 782              increments locality.
 783
 784              This is the same principle under which SGI's cord program
 785              works.
 786
 787         The nature of the CIL allows the above optimizations to be
 788         easy to implement and deploy.  Since we live and define our
 789         universe for these things, there are no interactions with
 790         system tools required, nor upgrades on the underlying
 791         infrastructure required.
 792
 793         Instruction scheduling is important for certain kinds of
 794         processors, and some of the framework exists today in our
 795         register allocator and the instruction selector to cope with
 796         this, but has not been finished.  The instruction selection
 797         would happen at the same time as local register allocation. <