   1                        The Internals of the Mono C# Compiler
   2
   3                                 Miguel de Icaza
   4                               (miguel@ximian.com)
   5                                       2002
   6
   7 * Abstract
   8
   9         The Mono C# compiler is a C# compiler written in C# itself.
  10         Its goal is to provide a free, alternative implementation
  11         of the C# language.  The Mono C# compiler generates ECMA CIL
  12         images through the use of the System.Reflection.Emit API, which
  13         enables the compiler to be platform independent.
  14
  15 * Overview: How the compiler fits together
  16
  17         The compilation process is managed by the compiler driver (it
  18         lives in driver.cs).
  19
  20         The compiler reads a set of C# source code files, and parses
  21         them.  Any assemblies or modules that the user might want to
  22         use with their project are loaded after parsing is done.
  23
  24         Once all the files have been parsed, the type hierarchy is
  25         resolved.  First interfaces are resolved, then types and
  26         enumerations.
  27
  28         Once the type hierarchy is resolved, every type is populated:
  29         fields, methods, indexers, properties, events and delegates
  30         are entered into the type system.
  31
  32         At this point the program skeleton has been completed.  The
  33         next process is to actually emit the code for each of the
  34         executable methods.  The compiler drives this from
  35         RootContext.EmitCode.
  36
  37         Each type then has to populate its methods: populating a
  38         method requires creating a structure that is used as the state
  39         of the block being emitted (this is the EmitContext class) and
  40         then generating code for the topmost statement (the Block).
  41
  42         Code generation has two steps: the first step is the semantic
  43         analysis (Resolve method) that resolves any pending tasks, and
  44         guarantees that the code is correct.  The second phase is the
  45         actual code emission.  All errors are flagged during the
  46         "Resolution" process.
  47
  48         After all the code has been emitted, the compiler closes all
  49         the types (this tells the Reflection.Emit library to finish
  50         up the types).  Resources and the definition of the entry
  51         point are handled at this point, and then the output is saved
  52         to disk.
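
             As a rough illustration of the last steps (define a type, emit
             a method body, close the type), the sketch below drives
             System.Reflection.Emit directly.  It is not code from the
             compiler: the names `EmitSketch', `Hello' and `Answer' are made
             up, and the assembly is only run in memory instead of being
             saved to disk.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public static class EmitSketch
{
    // Builds a type with one static method that returns 42, then calls it.
    public static int BuildAndRun()
    {
        var name = new AssemblyName("SketchAssembly");
        AssemblyBuilder asm =
            AssemblyBuilder.DefineDynamicAssembly(name, AssemblyBuilderAccess.Run);
        ModuleBuilder module = asm.DefineDynamicModule("SketchModule");

        // "Program skeleton": the type and its method exist, but have no code yet.
        TypeBuilder tb = module.DefineType("Hello", TypeAttributes.Public);
        MethodBuilder mb = tb.DefineMethod(
            "Answer",
            MethodAttributes.Public | MethodAttributes.Static,
            typeof(int),
            Type.EmptyTypes);

        // Code emission for the method body.
        ILGenerator il = mb.GetILGenerator();
        il.Emit(OpCodes.Ldc_I4, 42);
        il.Emit(OpCodes.Ret);

        // "Closing" the type asks Reflection.Emit to finish it up.
        Type closed = tb.CreateType();
        return (int) closed.GetMethod("Answer").Invoke(null, null);
    }

    static void Main()
    {
        Console.WriteLine(BuildAndRun());   // prints 42
    }
}
```

             The order matters: the method has to be defined on the
             TypeBuilder before its body is emitted, and the type can only
             be used once it has been closed.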
  53
  54         The following list will give you an idea of where the
  55         different pieces of the compiler live:
  56
  57         Infrastructure:
  58
  59             driver.cs:
  60                 This drives the compilation process: loading of
  61                 command line options; parsing the input files;
  62                 loading the referenced assemblies; resolving the type
  63                 hierarchy and emitting the code.
  64
  65             codegen.cs:
  66
  67                 The state tracking for code generation.
  68
  69             attribute.cs:
  70
  71                 Code to do semantic analysis and emit the attributes
  72                 is here.
  73
  74             rootcontext.cs:
  75
  76                 Keeps track of the types defined in the source code,
  77                 as well as the assemblies loaded.
  78
  79             typemanager.cs:
  80
  81                 This contains the MCS type system.
  82
  83             report.cs:
  84
  85                 Error and warning reporting methods.
  86
  87             support.cs:
  88
  89                 Assorted utility functions used by the compiler.
  90
  91         Parsing:
  92
  93             cs-tokenizer.cs:
  94
  95                 The tokenizer for the C# language; it also includes
  96                 the C# pre-processor.
  97
  98             cs-parser.jay, cs-parser.cs:
  99
 100                 The parser is implemented using a C# port of the Yacc
 101                 parser.  The parser lives in the cs-parser.jay file,
 102                 and cs-parser.cs is the generated parser.
 103
 104             location.cs:
 105
 106                 The `location' structure is a compact representation
 107                 of a file, line, column where a token, or a high-level
 108                 construct appears.  This is used to report errors.
 109
 110         Expressions:
 111
 112             ecore.cs:
 113
 114                 Basic expression classes and interfaces; most shared
 115                 code and static methods are here.
 116
 117             expression.cs:
 118
 119                 Most of the different kinds of expression classes
 120                 live in this file.
 121
 122             assign.cs:
 123
 124                 The assignment expression got its own file.
 125
 126             constant.cs:
 127
 128                 The classes that represent the constant expressions.
 129
 130             literal.cs:
 131
 132                 Literals are constants that have been entered manually
 133                 in the source code, like `1' or `true'.  The compiler
 134                 needs to tell constants and literals apart during the
 135                 compilation process, as literals sometimes have some
 136                 implicit extra conversions defined for them.
 137
 138             cfold.cs:
 139
 140                 The constant folder for binary expressions.
 141
 142         Statements:
 143
 144             statement.cs:
 145
 146                 All of the abstract syntax tree elements for
 147                 statements live in this file.  This also drives the
 148                 semantic analysis process.
 149
 150             iterators.cs:
 151
 152                 Contains the support for implementing iterators from
 153                 the C# 2.0 specification.
 154
 155         Declarations, Classes, Structs, Enumerations:
 156
 157             decl.cs:
 158
 159                 This contains the base class for Members and
 160                 Declaration Spaces.   A declaration space introduces
 161                 new names in types, so classes, structs, delegates and
 162                 enumerations derive from it.
 163
 164             class.cs:
 165
 166                 Methods for holding and defining class and struct
 167                 information, and every member that can be in these
 168                 (methods, fields, delegates, events, etc).
 169
 170                 The most interesting type here is the `TypeContainer',
 171                 which derives from `DeclSpace'.
 172
 173             delegate.cs:
 174
 175                 Handles delegate definition and use.
 176
 177             enum.cs:
 178
 179                 Handles enumerations.
 180
 181             interface.cs:
 182
 183                 Holds and defines interfaces.  All the code related to
 184                 interface declaration lives here.
 185
 186             parameter.cs:
 187
 188                 During the parsing process, the compiler encapsulates
 189                 parameters in the Parameter and Parameters classes.
 190                 These classes provide definition and resolution tools
 191                 for them.
 192
 193             pending.cs:
 194
 195                 Routines to track pending implementations of abstract
 196                 methods and interfaces.  These are used by the
 197                 TypeContainer-derived classes to track whether every
 198                 method required is implemented.
 199
 200
 201 * The parsing process
 202
 203         All the input files that make up a program need to be read in
 204         advance, because C# allows declarations to happen after an
 205         entity is used, for example, the following is a valid program:
 206
 207         class X : Y {
 208                 static void Main ()
 209                 {
 210                         a = "hello"; b = "world";
 211                 }
 212                 static string a;
 213         }
 214
 215         class Y {
 216                 public static string b;
 217         }
 218
 219         At the time the assignment expression `a = "hello"' is parsed,
 220         it is not known whether a is a class field from this class, or
 221         its parents, or whether it is a property access or a variable
 222         reference.  The actual meaning of `a' will not be discovered
 223         until the semantic analysis phase.
 224
 225 ** The Tokenizer and the pre-processor
 226
 227         The tokenizer is contained in the file `cs-tokenizer.cs', and
 228         the main entry point is the `token ()' method.  The tokenizer
 229         implements the `yyParser.yyInput' interface, which is what the
 230         Yacc/Jay parser will use when fetching tokens.
 231         process, and those can be referenced from the tokenizer class
 232         Token definitions are generated by jay during the compilation
 233         process, and those can be references from the tokenizer class
 234         with the `Token.' prefix.
 235
 236         Each time a token is returned, the location for the token is
 237         recorded into the `Location' property, which can be accessed by
 238         the parser.  The parser retrieves the Location properties as
 239         it builds its internal representation to allow the semantic
 240         analysis phase to produce error messages that can pinpoint
 241         the location of the problem.
 242
 243         Some tokens have values associated with them; for example, when
 244         the tokenizer encounters a string, it will return a
 245         LITERAL_STRING token, and the actual string parsed will be
 246         available in the `Value' property of the tokenizer.   The same
 247         mechanism is used to return integers and floating point
 248         numbers.
 249
 250         C# has a limited pre-processor that allows conditional
 251         compilation, but it is not as fully featured as the C
 252         pre-processor, and most notably, macros are missing.  This
 253         makes it simple to implement in very few lines and mesh it
 254         with the tokenizer.
 255
 256         The `handle_preprocessing_directive' method in the tokenizer
 257         handles all the pre-processing, and it is invoked when the '#'
 258         symbol is found as the first token in a line.
 259
 260         The state of the pre-processor is contained in a Stack called
 261         `ifstack'.  This state is used to track the if/elif/else/endif
 262         nesting and the current state.  The state is encoded in the
 263         top of the stack as a combination of the values `TAKING',
 264         `TAKEN_BEFORE', `ELSE_SEEN' and `PARENT_TAKING'.
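
             The following sketch shows one way such a stack can drive
             conditional compilation.  It is an illustration and not the
             tokenizer's code: the flag values, the `IfStackSketch' class
             and its `Filter' helper are assumptions, and a condition is
             reduced to a single symbol name instead of a full expression.

```csharp
using System;
using System.Collections.Generic;

public class IfStackSketch
{
    // Assumed bit values; only the names come from the text above.
    const int TAKING = 1, TAKEN_BEFORE = 2, ELSE_SEEN = 4, PARENT_TAKING = 8;

    readonly Stack<int> ifstack = new Stack<int>();

    // True when the current line belongs to a region that is compiled.
    bool Taking {
        get { return ifstack.Count == 0 || (ifstack.Peek() & TAKING) != 0; }
    }

    void If(bool cond)
    {
        bool parentTaking = Taking;
        int state = parentTaking ? PARENT_TAKING : 0;
        if (parentTaking && cond)
            state |= TAKING | TAKEN_BEFORE;
        ifstack.Push(state);
    }

    // Handles both #elif (isElse == false) and #else (isElse == true).
    void ElifOrElse(bool cond, bool isElse)
    {
        if (ifstack.Count == 0)
            throw new InvalidOperationException("no matching #if");
        int state = ifstack.Pop();
        if ((state & ELSE_SEEN) != 0)
            throw new InvalidOperationException("directive after #else");
        state &= ~TAKING;
        if (isElse)
            state |= ELSE_SEEN;
        bool eligible = (state & PARENT_TAKING) != 0 && (state & TAKEN_BEFORE) == 0;
        if (eligible && cond)
            state |= TAKING | TAKEN_BEFORE;
        ifstack.Push(state);
    }

    void Endif()
    {
        if (ifstack.Count == 0)
            throw new InvalidOperationException("no matching #if");
        ifstack.Pop();
    }

    // Returns the lines that survive conditional compilation.
    public static List<string> Filter(IEnumerable<string> lines, ICollection<string> defined)
    {
        var pp = new IfStackSketch();
        var kept = new List<string>();
        foreach (string raw in lines) {
            string line = raw.Trim();
            if (line.StartsWith("#if ", StringComparison.Ordinal))
                pp.If(defined.Contains(line.Substring(4).Trim()));
            else if (line.StartsWith("#elif ", StringComparison.Ordinal))
                pp.ElifOrElse(defined.Contains(line.Substring(6).Trim()), false);
            else if (line == "#else")
                pp.ElifOrElse(true, true);
            else if (line == "#endif")
                pp.Endif();
            else if (pp.Taking)
                kept.Add(raw);
        }
        return kept;
    }

    static void Main()
    {
        string[] source = { "a", "#if X", "b", "#elif Y", "c", "#else", "d", "#endif", "e" };
        var defined = new List<string> { "Y" };
        Console.WriteLine(string.Join(",", Filter(source, defined)));   // prints a,c,e
    }
}
```

             `TAKEN_BEFORE' is what keeps a later #elif or #else from being
             compiled once an earlier branch has been taken, and
             `PARENT_TAKING' keeps every branch of a nested #if disabled
             when the enclosing region is itself skipped.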
 265
 266 ** Locations
 267
 268         Locations are encoded as a 32-bit number (the Location
 269         struct) that maps each input source line to a linear number.
 270         As new files are parsed, the Location manager is informed of
 271         the new file, to allow it to map back from an int constant to
 272         a file + line number.
 273
 274         Prior to parsing/tokenizing any source files, the compiler
 275         generates a list of all the source files and then reserves the
 276         low N bits of the location to hold the source file, where N is
 277         large enough to hold at least twice as many source files as were
 278         specified on the command line (to allow for a #line in each file).
 279         The upper 32-N bits are the line number in that file.
 280
 281         The value 0 is reserved for ``anonymous'' locations, i.e. when
 282         we don't know the location (Location.Null).
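
             The packing described above can be sketched as follows.  This
             is a sketch under stated assumptions, not the Location struct
             itself: `LocationTableSketch' and its members are hypothetical
             names, files are identified by a plain index, and line numbers
             are assumed to start at 1 so that no real location encodes
             to 0.

```csharp
using System;

public sealed class LocationTableSketch
{
    public const int Null = 0;     // reserved for unknown locations

    readonly int fileBits;         // N: number of low bits holding the file index
    readonly int fileMask;

    public LocationTableSketch(int sourceFileCount)
    {
        // Room for at least twice as many files as were given on the command line.
        int slots = Math.Max(1, sourceFileCount) * 2;
        while ((1 << fileBits) < slots)
            fileBits++;
        fileMask = (1 << fileBits) - 1;
    }

    public int FileBits { get { return fileBits; } }

    public int Encode(int fileIndex, int line)
    {
        if (fileIndex < 0 || fileIndex > fileMask)
            throw new ArgumentOutOfRangeException("fileIndex");
        if (line < 1 || line > (int) (uint.MaxValue >> fileBits))
            throw new ArgumentOutOfRangeException("line");
        // Upper 32-N bits: line number.  Low N bits: file index.
        return unchecked((int) (((uint) line << fileBits) | (uint) fileIndex));
    }

    public int File(int location)
    {
        return location & fileMask;
    }

    public int Line(int location)
    {
        return (int) (unchecked((uint) location) >> fileBits);
    }

    static void Main()
    {
        var table = new LocationTableSketch(3);     // 3 files -> 6 slots -> N = 3
        int loc = table.Encode(5, 1234);
        Console.WriteLine(table.FileBits);          // prints 3
        Console.WriteLine(loc);                     // prints 9877
        Console.WriteLine(table.File(loc));         // prints 5
        Console.WriteLine(table.Line(loc));         // prints 1234
    }
}
```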
 283
 284         The tokenizer also tracks the column number for a token, but
 285         this is currently not being used or encoded.  It could
 286         probably be encoded in the low 9 bits, allowing for columns
 287         from 1 to 512 to be encoded.
 288
 289 * The Parser
 290
 291         The parser is written using Jay, which is a port of Berkeley
 292         Yacc to Java, which I later ported to C#.
 293
 294         Many people ask why the grammar of the parser does not match
 295         exactly the definition in the C# specification.  The reason is
 296         simple: the grammar in the C# specification is designed to be
 297         consumed by humans, and not by a computer program.  Before
 298         you can feed this grammar to a tool, it needs to be simplified
 299         to allow the tool to generate a correct parser for it.
 300
 301         In the Mono C# compiler, we use a class for each of the
 302         statements and expressions in the C# language.  For example,
 303         there is a `While' class for the `while' statement, a
 304         `Cast' class to represent a cast expression and so on.
 305
 306         There is a Statement class, and an Expression class which are
 307         the base classes for statements and expressions.
 308
 309 ** Namespaces
 310
 311         Using list.
 312
 313 * Internal Representation
 314
 315 ** Expressions
 316
 317         Expressions in the Mono C# compiler are represented by the
 318         `Expression' class.  This is an abstract class that particular
 319         kinds of expressions have to inherit from and override a few
 320         methods.
 321
 322         The base Expression class contains two fields: `eclass' which
 323         represents the "expression classification" (from the C#
 324         specs) and the type of the expression.
 325
 326         Expressions have to be resolved before they can be used.
 327         The resolution process is implemented by overriding the
 328         `DoResolve' method.  The DoResolve method has to set the
 329         `eclass' field and the `type', perform all error checking and
 330         computations that will be required for code generation at this
 331         stage.
 332
 333         The return value from DoResolve is an expression.  Most of the
 334         time an Expression derived class will return itself (return
 335         this) when it will handle the emission of the code itself, or
 336         it can return a new Expression.
 337
 338         For example, the parser will create an "ElementAccess" class
 339         for:
 340
 341                 a [0] = 1;
 342
 343         During the resolution process, the compiler will know whether
 344         this is an array access or an indexer access, and DoResolve
 345         will return either an ArrayAccess expression or an
 346         IndexerAccess expression accordingly.
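
             The sketch below models this replacement step in a very
             reduced form.  Only the class names `Expression',
             `ElementAccess', `ArrayAccess' and `IndexerAccess' and the
             members `eclass', `type' and `DoResolve' come from the text;
             the signatures, the variable table passed to DoResolve and the
             use of reflection to find an indexer are assumptions made for
             the example, and the index arguments are ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Reduced set of classifications, for the sketch only.
public enum ExprClass { Invalid, Value, Variable, IndexerAccess }

public abstract class Expression
{
    public ExprClass eclass = ExprClass.Invalid;
    public Type type;

    // Returns the expression that will emit the code, or null on error.
    public abstract Expression DoResolve(IDictionary<string, Type> variables);
}

public sealed class ArrayAccess : Expression
{
    public ArrayAccess(Type elementType)
    {
        type = elementType;
        eclass = ExprClass.Variable;
    }

    public override Expression DoResolve(IDictionary<string, Type> variables) { return this; }
}

public sealed class IndexerAccess : Expression
{
    public IndexerAccess(PropertyInfo indexer)
    {
        type = indexer.PropertyType;
        eclass = ExprClass.IndexerAccess;
    }

    public override Expression DoResolve(IDictionary<string, Type> variables) { return this; }
}

// What the parser creates for `a [0]' before it knows what `a' is.
public sealed class ElementAccess : Expression
{
    readonly string target;

    public ElementAccess(string target) { this.target = target; }

    public override Expression DoResolve(IDictionary<string, Type> variables)
    {
        Type t;
        if (!variables.TryGetValue(target, out t))
            return null;                    // a real compiler reports an error here
        if (t.IsArray)
            return new ArrayAccess(t.GetElementType());
        foreach (MemberInfo m in t.GetDefaultMembers()) {
            var indexer = m as PropertyInfo;
            if (indexer != null)
                return new IndexerAccess(indexer);
        }
        return null;                        // neither an array nor an indexer
    }
}

public static class ResolveDemo
{
    static void Main()
    {
        var variables = new Dictionary<string, Type> {
            { "a", typeof(int[]) },
            { "l", typeof(List<int>) }
        };
        Console.WriteLine(new ElementAccess("a").DoResolve(variables).GetType().Name); // ArrayAccess
        Console.WriteLine(new ElementAccess("l").DoResolve(variables).GetType().Name); // IndexerAccess
    }
}
```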
 347
 348
 349
 350 *** The Expression Class
 351
 352         The Expression class also provides the utility functions that
 353         can be called by all children of Expression.
 354
 355 ** Constants
 356
 357         Constants in the Mono C# compiler are represented by the
 358         abstract class `Constant'.  Constant is in turn derived from
 359         Expression.  The base constructor for `Constant' just sets the
 360         expression class to be an `ExprClass.Value'.  Constants are
 361         born in a fully resolved state, so the `DoResolve' method
 362         only returns a reference to itself.
 363
 364         Each Constant should implement the `GetValue' method, which
 365         returns an object with the actual contents of this constant.  A
 366         utility virtual method called `AsString' is used to render a
 367         diagnostic message.  The output of AsString is shown to the
 368         developer when an error or a warning is triggered.
 369
 370         Constant classes also participate in the constant folding
 371         process.  Expressions that can be constant folded trigger
 372         constant folding by invoking the functionality provided by
 373         the ConstantFold class (cfold.cs).
 374
 375         Each Constant has to implement a number of methods to convert
 376         itself into a Constant of a different type.  These methods are
 377         called `ConvertToXXXX' and they are invoked by the wrapper
 378         functions `ToXXXX'.  These methods only perform implicit
 379         numeric conversions.  Explicit conversions are handled by the
 380         `Cast' expression class.
 381
 382         The `ToXXXX' methods are the entry point, and provide error
 383         reporting in case a conversion can not be performed.
 384
 385 ** Constant Folding
 386
 387         The C# language requires constant folding to be implemented.
 388         Constant folding is hooked up in the Binary.Resolve method.
 389         If both sides of a binary expression are constants, then the
 390         ConstantFold.BinaryFold routine is invoked.
 391
 392         This routine implements all the binary operator rules.  It
 393         mirrors the code that generates code for binary operators,
 394         except that the generated code is evaluated at runtime.
 395
 396         If the constants can be folded, then a new constant expression
 397         is returned; if not, then the null value is returned (for
 398         example, the concatenation of a string constant and a numeric
 399         constant is deferred to the runtime).
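
             A minimal sketch of this contract (a new constant when folding
             succeeds, null when the operation is left for the runtime)
             could look like the code below.  The class names ending in
             `Sketch' and the reduction to three integer operators plus
             string concatenation are assumptions for the example; the real
             routine covers all the binary operator rules.

```csharp
using System;

public abstract class ConstantSketch
{
    public abstract object GetValue();
}

public sealed class IntConstantSketch : ConstantSketch
{
    public readonly int Value;
    public IntConstantSketch(int value) { Value = value; }
    public override object GetValue() { return Value; }
}

public sealed class StringConstantSketch : ConstantSketch
{
    public readonly string Value;
    public StringConstantSketch(string value) { Value = value; }
    public override object GetValue() { return Value; }
}

public static class ConstantFoldSketch
{
    // Returns the folded constant, or null if the operation is not folded.
    public static ConstantSketch BinaryFold(char op, ConstantSketch left, ConstantSketch right)
    {
        var li = left as IntConstantSketch;
        var ri = right as IntConstantSketch;
        if (li != null && ri != null) {
            // An overflow throws here; a real compiler would report an error instead.
            switch (op) {
            case '+': return new IntConstantSketch(checked(li.Value + ri.Value));
            case '-': return new IntConstantSketch(checked(li.Value - ri.Value));
            case '*': return new IntConstantSketch(checked(li.Value * ri.Value));
            }
            return null;
        }

        var ls = left as StringConstantSketch;
        var rs = right as StringConstantSketch;
        if (op == '+' && ls != null && rs != null)
            return new StringConstantSketch(ls.Value + rs.Value);

        // Mixed operands, such as string + int, are deferred to the runtime.
        return null;
    }

    static void Main()
    {
        ConstantSketch sum = BinaryFold('+', new IntConstantSketch(2), new IntConstantSketch(3));
        Console.WriteLine(sum.GetValue());                       // prints 5

        ConstantSketch mixed = BinaryFold('+', new StringConstantSketch("n = "), new IntConstantSketch(3));
        Console.WriteLine(mixed == null ? "not folded" : "folded");   // prints not folded
    }
}
```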
 400
 401 ** Side effects
 402
 403         a [i++]++
 404         a [i++] += 5;
 405
 406 ** Statements
 407
 408 * The semantic analysis
 409
 410         Because C# allows declarations to appear after their use, the
 411         compiler driver has to parse all the input files first.  Once
 412         they have been parsed, and an internal representation of the
 413         input program exists, the following steps are taken:
 414
 415                 * The interface hierarchy is resolved first.
 416                   As the interface hierarchy is constructed,
 417                   TypeBuilder objects are created for each one of
 418                   them.
 419
 420                 * The class and structure hierarchy is resolved next;
 421                   TypeBuilder objects are created for them.
 422
 423                 * Constants and enumerations are resolved.
 424
 425                 * Method, indexer, property, delegate and event
 426                   definitions are now entered into the TypeBuilders.
 427
 428                 * Elements that contain code are now invoked to
 429                   perform semantic analysis and code generation.
 430
 431 * Output Generation
 432
 433 ** Code Generation
 434
 435         The EmitContext class is created any time that IL code is to
 436         be generated (methods, properties, indexers and attributes all
 437         create EmitContexts).
 438
 439         The EmitContext keeps track of the current namespace and type
 440         container.  This is used during name resolution.
 441
 442         An EmitContext is used by the underlying code generation
 443         facilities to track the state of code generation:
 444
 445                 * The ILGenerator used to generate code for this
 446                   method.
 447
 448                 * The TypeContainer where the code lives; this is used
 449                   to access the TypeBuilder.
 450
 451                 * The DeclSpace; this is used to resolve names through
 452                   RootContext.LookupType in the various statements and
 453                   expressions.
 454
 455         Code generation state is also tracked here:
 456
 457                 * CheckState:
 458
 459                   This variable tracks the `checked' state of the
 460                   compilation; it controls whether we should generate
 461                   code that does overflow checking, or if we generate
 462                   code that ignores overflows.
 463
 464                   The default setting comes from the command line
 465                   option to generate checked or unchecked code plus
 466                   any source code changes using the checked/unchecked
 467                   statements or expressions.  Contrast this with the
 468                   ConstantCheckState flag.
 469
 470                 * ConstantCheckState
 471
 472                   The constant check state is always set to `true' and
 473                   cannot be changed from the command line.  The source
 474                   code can change this setting with the `checked' and
 475                   `unchecked' statements and expressions.
 476
 477                 * IsStatic
 478
 479                   Whether we are emitting code inside a static or
 480                   instance method.
 481
 482                 * ReturnType
 483
 484                   The type that is allowed to be returned, or null if
 485                   there is no return type.
 486
 487                 * ReturnLabel
 488
 489                   A `Label' used by the code if it must jump to it.
 490                   This is used by a few routines that deal with exception
 491                   handling.
 492
 493                 * HasReturnLabel
 494
 495                   Whether we have a return label defined by the toplevel
 496                   driver.
 497
 498                 * ContainerType
 499
 500                   Points to the Type (extracted from the
 501                   TypeContainer) that declares this body of code.
 502
 503
 504
 505                 * IsConstructor
 506
 507                   Whether this is generating code for a constructor
 508
 509                 * CurrentBlock
 510
 511                   Tracks the current block being generated.
 512
 513                 * ReturnLabel
 514
 515                   The location where return has to jump to return the
 516                   value.
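
             To make the effect of the CheckState entry above concrete, the
             sketch below emits the same addition twice, once with the
             plain `add' IL instruction and once with its overflow-checking
             form `add.ovf'.  That a checked state selects between such
             opcode pairs is an assumption of this sketch, and
             `CheckStateSketch' and `EmitAdd' are hypothetical names that
             are not part of the EmitContext.

```csharp
using System;
using System.Reflection.Emit;

public static class CheckStateSketch
{
    // Emits a method that adds two ints; checkState selects the opcode.
    public static Func<int, int, int> EmitAdd(bool checkState)
    {
        var method = new DynamicMethod(
            "Add", typeof(int), new[] { typeof(int), typeof(int) });
        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(checkState ? OpCodes.Add_Ovf : OpCodes.Add);
        il.Emit(OpCodes.Ret);
        return (Func<int, int, int>) method.CreateDelegate(typeof(Func<int, int, int>));
    }

    static void Main()
    {
        // Unchecked code ignores the overflow and wraps around.
        Console.WriteLine(EmitAdd(false)(int.MaxValue, 1));   // prints -2147483648

        // Checked code detects the overflow at runtime.
        try {
            EmitAdd(true)(int.MaxValue, 1);
        } catch (OverflowException) {
            Console.WriteLine("overflow detected");
        }
    }
}
```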
 517
 518         A few variables are used to track the state for checking in
 519         for loops, or in try/catch statements:
 520
 521                 * InFinally
 522
 523                   Whether we are in a Finally block
 524
 525                 * InTry
 526
 527                   Whether we are in a Try block
 528
 529                 * InCatch
 530
 531                   Whether we are in a Catch block
 532
 533                 * InUnsafe
 534                   Whether we are inside an unsafe block
 535
 536         Methods exposed by the EmitContext:
 537
 538                 * EmitTopBlock()
 539
 540                   This emits a toplevel block.
 541
 542                   This routine is very simple, to allow the anonymous
 543                   method support to roll its two-stage version of this
 544                   routine on its own.
 545
 546                 * NeedReturnLabel ()
 547
 548                   This is used to flag during the resolution phase that
 549                   the driver needs to initialize the `ReturnLabel'.
 550
 551 * Anonymous Methods
 552
 553         The introduction of anonymous methods in the compiler changed
 554         various ways of doing things in the compiler.  The most
 555         significant one is the hard split between the resolution phase
 556         and the emission phases of the compiler.
 557
 558         For instance, routines that reference local variables can no
 559         longer safely create temporary variables during the
 560         resolution phase: they must do so from the emission phase,
 561         since the variable might have been "captured", and then it
 562         can not be accessed with the runtime's local-variable operations.
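
             A small C# program shows what "captured" means from the
             language side.  The local `counter' below is modified through
             an anonymous method and the change is visible in the enclosing
             method, so both must share one storage location, which is why
             an ordinary IL local slot is not enough.  `CaptureSketch' is
             just an illustrative name.

```csharp
using System;

public static class CaptureSketch
{
    public static int CountWithDelegate()
    {
        int counter = 0;                              // captured by the anonymous method
        Action increment = delegate { counter++; };   // C# 2.0 anonymous method syntax
        increment();
        increment();
        return counter;                               // sees both increments
    }

    static void Main()
    {
        Console.WriteLine(CountWithDelegate());       // prints 2
    }
}
```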
 563
 564         The code emission is in:
 565
 566                 EmitTopBlock ()
 567
 568         which drives the process: it first resolves the topblock, then
 569         emits the required metadata (local variable definitions) and
 570         finally emits the code.
 571
 572 * Miscellaneous
 573
 574 ** Error Processing
 575
 576         Errors are reported during the various stages of the
 577         compilation process.  The compiler stops its processing if
 578         there are errors between the various phases.  This simplifies
 579         the code, because it is always safe to assume that the data
 580         structures that the compiler is operating on are
 581         consistent.
 582
 583         The error codes in the Mono C# compiler are the same as those
 584         found in the Microsoft C# compiler, with a few exceptions
 585         (where we report a few more errors, those are documented in
 586         mcs/errors/errors.txt).  The goal is to reduce confusion to
 587         the users, and also to help us track the progress of the
 588         compiler in terms of the errors we report.
 589
 590         The Report class provides error and warning display functions,
 591         and also keeps an error count which is used to stop the
 592         compiler between the phases.
 593
 594         A couple of debugging tools are available here, and are useful
 595         when extending or fixing bugs in the compiler.  If the
 596         `--fatal' flag is passed to the compiler, the Report.Error
 597         routine will throw an exception.  This can be used to pinpoint
 598         the location of the bug and examine the variables around the
 599         error location.
 600
 601         Warnings can be turned into errors by using the `--werror'
 602         flag to the compiler.
 603
 604         The report class also ignores warnings that have been
 605         specified on the command line with the `--nowarn' flag.
 606
 607         Finally, code in the compiler uses the global variable
 608         RootContext.WarningLevel in a few places to decide whether a
 609         warning is worth reporting to the user or not.
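
             The behaviour described in this section can be summarised with
             a small model.  This is a hypothetical stand-in and not the
             compiler's Report class: every name below (`ReportSketch',
             `SetIgnoreWarning', `RunPhases' and the fields) is invented
             for the example, and the message format is only an
             approximation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the compiler's Report class.
public class ReportSketch
{
    public int Errors;                 // error count, checked between phases
    public int Warnings;
    public bool Fatal;                 // models the --fatal flag
    public bool WarningsAreErrors;     // models the --werror flag
    readonly HashSet<int> ignored = new HashSet<int>();   // models --nowarn

    public void SetIgnoreWarning(int code) { ignored.Add(code); }

    public void Error(int code, string where, string text)
    {
        Errors++;
        string msg = string.Format("{0}: error CS{1:0000}: {2}", where, code, text);
        Console.Error.WriteLine(msg);
        if (Fatal)
            throw new InvalidOperationException(msg);   // leaves a stack trace
    }

    public void Warning(int code, string where, string text)
    {
        if (ignored.Contains(code))
            return;
        if (WarningsAreErrors) {
            Error(code, where, text);
            return;
        }
        Warnings++;
        Console.Error.WriteLine("{0}: warning CS{1:0000}: {2}", where, code, text);
    }
}

public static class ReportDemo
{
    // Runs the phases in order and stops after the first one that leaves errors.
    public static int RunPhases(ReportSketch report, params Action[] phases)
    {
        int completed = 0;
        foreach (Action phase in phases) {
            phase();
            if (report.Errors > 0)
                break;
            completed++;
        }
        return completed;
    }

    static void Main()
    {
        var report = new ReportSketch();
        report.SetIgnoreWarning(168);
        int done = RunPhases(report,
            () => report.Warning(168, "a.cs(3)", "variable declared but never used"),
            () => report.Error(103, "a.cs(7)", "the name `x' does not exist"),
            () => Console.WriteLine("emit (never reached)"));
        Console.WriteLine("phases completed: {0}, errors: {1}", done, report.Errors);
    }
}
```

             Stopping between phases is what lets later phases assume
             consistent data structures: in the demo the third phase never
             runs because the second one reported an error.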
 610
 611 * Debugging the compiler
 612
 613         Sometimes it is convenient to find out *where* a particular error
 614         message is being reported from.  To do that, you might want to use
 615         the --fatal flag to mcs.  The flag will instruct the compiler to
 616         abort with a stack trace when the error is reported.
 617
 618         You can use this with -warnaserror to obtain the same effect
 619         with warnings.
 620
 621 * Editing the compiler sources
 622
 623         The compiler sources are intended to be edited in a window 134 columns wide.
 624