docs/book/ch08_pct.pod

   1 =pod
   2
   3 =head0 PCT: Parrot Compiler Tools
   4
   5 Z<CHP-8>
   6
   7 So far we've talked a lot about low-level Parrot programming with
   8 PIR and PASM. However, the true power of Parrot is it's ability to
   9 host programs written in high level languages such as Perl 6,
  10 Python, Ruby, Tcl, and PHP. In order to write code in these languages
  11 developers need there to be compilers that convert from the language
  12 into PIR or PASM (or even directly convert to Parrot Bytecode).
  13 People who have worked on compilers before may be anticipating us
  14 to use terms like "Lex and Yacc" here, but we promise that we wont.
  15
  16 Instead of traditional lexical analyzers and parser-generators that
  17 have been the mainstay of compiler designers for decades, Parrot
  18 uses an advanced set of parsing tools called the Parrot Compiler
  19 Tools (PCT)X<Parrot Compiler Tools>. PCT uses a subset of the Perl 6
  20 programming language called I<Not Quite Perl>X<Not Quite Perl> (NQP)
  21 and an implementation of the Perl 6 Grammar Engine X<Perl 6 Grammar
  22 Engine> (PGE) to build compilers for Parrot. Instead of using
  23 traditional low-level languages to write compilers, we can use a
  24 modern dynamic language like Perl 6 to write it instead. On a more
  25 interesting note, this means that the Perl 6 compiler is itself
  26 being written in Perl 6, a mind-boggling process known as
  27 C<bootstrapping>.
  28
  29 =head1 PCT Overview
  30
  31 PCT is a collection of classes which handle the creation of a
  32 compiler and driver program for a high-level language. The
  33 C<PCT::HLLCompiler> class handles building the compiler front end
  34 while the C<PCT::Grammar>  and C<PCT::Grammar::Actions> classes handle
  35 building the parser and lexical analyzer. Creating a new HLL compiler
  36 is as easy as subclassing these three entities with methods specific
  37 to that high-level language.
  38
  39 =head2 Grammars and Action Files
  40
  41 Creating a compiler using PCT requires three basic files, plus any
  42 additional files needed to implement the languages logic and library:
  43
  44 =over 4
  45
  46 =item* A main file
  47
  48 The main file should contain the C<:main> function that is the driver
  49 program for the compiler. Here, a new C<PCT::HLLCompiler> object is
  50 instantiated, libraries are loaded, and necessary special global
  51 variables are created. The driver program is typically written in PIR,
  52 although thankfully they tend to be very short. Most of the action
  53 happens elsewhere.
  54
  55 =item* A parser file
  56
  57 The grammar for the high level language is specified using the Perl 6
  58 grammar engine (PGE) and is stored in a C<.pg> file. This file should
  59 subclass the C<PCT::Grammar> class and implement all the necessary
  60 rules to successfully parse the language.
  61
  62 =item* An actions file
  63
  64 Actions files are written in NQP. They take match objects generated by
  65 the grammar file and convert them into an Abstract Syntax Tree (AST)
  66 X<Abstract Syntax Tree;Parrot Abstract Syntax Tree;AST;PAST>
  67 which is converted by PCT into PIR for compiling and execution.
  68 The PIR implementation of these AST trees and nodes is called the
  69 Parrot Abstract Syntax Tree (PAST).
  70
  71 =back
  72
  73 =head2 C<make_language_shell.pl>
  74
  75 =head2 Parsing Fundamentals
  76
  77 Compilers typically consist of three components: The lexical analyzer,
  78 the parser, and the code generator. The lexical analyzer converts the
  79 HLL input file into individual tokens. A token may consist of an
  80 individual punctuation mark("+"), an indentifier ("myVar"), or a keyword
  81 ("while"), or any other artifact that cannot be sensibly broken down.
  82 The parser takes a stream of these input tokens, and attempts to match
  83 them against a given pattern, or grammar. The matching process orders
  84 the input tokens into an abstract syntax tree (AST), which is a form that
  85 the computer can easily work with. This AST is passed to the code
  86 generator which converts it into code of the target language. For
  87 something like the GCC C compiler, the target language is machine code.
  88 For PCT and Parrot, the target language is PIR and PBC.
  89
  90 Parsers come in two general varieties: Top-down and bottom-up. Top-down
  91 parsers start with a top-level rule, a rule which is supposed to
  92 represent the entire input. It attempts to match various combination of
  93 subrules until the entire input is matched. Bottom-down parsers, on the
  94 other hand, start with individual tokens from the lexical analyzer and
  95 attempt to combine them together into larger and larger patterns until
  96 they produce a top-level token.
  97
  98 PGE itself is a top-down parser, although it also contains a bottom-up
  99 I<operator precidence> parser, for things like mathematical expressions
 100 where bottom-up methods are more efficient. We'll discuss both, and the
 101 methods for switching between the two, throughout this chapter.
 102
 103 =head1 Driver Programs
 104
 105 =head2 C<HLLCompiler> class
 106
 107
 108 =head1 Parrot Grammar Engine
 109
 110 The Parrot Grammar Engine X<Parrot Grammar Engine;PGE> is an
 111 implementation of the Perl 6 grammar syntax in Parrot. The grammar
 112 engine in Perl 6 is much more advanced and flexible then the regular
 113 expression syntax of Perl 5. Since most other languages who implement
 114 regular expressions use a copy (or a subset) of Perl 5's regular
 115 expressions, PGE is much more powerful then those too.
 116
 117 PGE uses a recursive descent algorithm N<it also uses an operator
 118 precedence parser, when you ask it nicely>which should be very familar
 119 to users of the Perl 5 module C<Parse::RecDescent>. In fact, both were
 120 originally designed by the same developer, Damian Conway. Recursive
 121 Descent, for those who get into the algorithmic details, is a top-down
 122 parsing algorithm, unlike the top-down LALR algorithm used in parser-
 123 generators like yacc and Bison. Most programmers won't notice the
 124 difference between the two for most applications, although there are
 125 some specific places where the behavior will be different.
 126
 127 =head2 Rules and Actions
 128
 129 A recursive descent parser, like the one used in PGE, is a top-down
 130 parser. This means it attempts to start at the highest-level rule and
 131 work it's way down to the individual input tokens in order to match
 132 the given input. Once the parser has matched the entire input N<a
 133 source code file, or a line of input at the terminal in interactive
 134 mode> the parse is considered successful and the generated AST is
 135 delivered to the code generator for conversion into PIR.
 136
 137 The parser is formed by creating a I<grammar>. Grammars consist of
 138 three primary components: rules, tokens, and protoregexes.
 139
 140 =over 4
 141
 142 =item* Rules
 143
 144 Rules are the most basic grammar element. Rules may call subrules and
 145 may also contain arbitrary whitespace, which is ignored.
 146
 147 =item* Tokens
 148
 149 Tokens represent basic regular expressions. They may not call subrules
 150 and whitespace is treated literally. {{I don't think this is right, but
 151 I'm adding it here anyway as a placeholder -- Whiteknight}}
 152
 153 =item* Protoregex
 154
 155 A protoregex is like a rule or a token, except it can be overloaded
 156 dynamically.
 157
 158 =back
 159
 160 When a rule matches a sequence of input tokens, an associated method
 161 in NQP is called to convert that match into an AST node. This node
 162 is then inserted into the I<parse tree>.
 163
 164 =head2 Basic Rules
 165
 166 Let's start off with a simple rule:
 167
 168  rule persons_name {
 169     <first_name> <last_name>
 170  }
 171
 172 We also define the two name tokens as:
 173
 174  token first_name { <alpha>+ }
 175  token last_name { <alpha>+ }
 176
 177 The special token C<< <alpha> >> is a built-in construct that only
 178 accepts upper case and lower case letters. The "+" after the
 179 C<< <alpha> >> tag is a short way of saying "one or more". Our rule
 180 C<persons_name> would match something like C<Darth Vader> N<Actually,
 181 it matches a lot of things that aren't people's names>but wouldn't
 182 match something like C<C 3P0>. Notice that the rule above would match
 183 C<Jar Jar Binks>, but not the way you would expect: It would match the
 184 first "Jar" as C<< <first_name> >> and the second "Jar" as
 185 C<< <last_name> >> and wouldn't match "Binks" at all.
 186
 187 In PGE. The top-level rule which starts the match process and must
 188 successfully match in order for the compiler to continue is always
 189 called C<TOP>. Without a TOP rule, your compiler won't do anything
 190 N<Actually, it will print out an error saying that you have no
 191 TOP rule, and then it will exit, but that isn't anything that we
 192 would want it to do>.
 193
 194 Here's an example TOP rule that uses our definition for
 195 C<< <persons_name> >> above and matches a file that contains a comma-
 196 separated list of names:
 197
 198  rule TOP {
 199     [ <persons_name> ',' ]*
 200  }
 201
 202 this example shows another new construct, the square brackets. Square
 203 brackets are ways to group things together. The star at the end means
 204 that we take all the things inside the brackets zero or more times.
 205 This is similar to the plus, except the plus matches one or more times.
 206 Notice, however, that the above rule always matches a comma at the end,
 207 so we would need to have something like:
 208
 209  Darth Vader, Luke Skywalker,
 210
 211 Instead of something more natural like:
 212
 213  Darth Vader, Luke Skywalker
 214
 215 We can modify the rule a little bit so that it always ends with a name
 216 instead of a comma:
 217
 218  rule TOP {
 219     [ <persons_name> ',' ]* <persons_name>
 220  }
 221
 222 Now we don't need a trailing comma, but at the same time we can't match
 223 an empty file because it always expects to have at least one name at the
 224 end. If we still want to match empty files successfully, we need to make
 225 the whole rule optional:
 226
 227  rule TOP {
 228     [ [ <persons_name> ',' ]* <persons_name> ]?
 229  }
 230
 231 We've grouped the whole rule together in another set of brackets, and
 232 put a "?" question mark at the end. The question mark means zero or
 233 one of the prior item.
 234
 235 The symbols "*" (zero or more), "+" (one or more) and "?" are called
 236 I<quantifiers>, and allow an item in the rule to match a variable
 237 number of times. These aren't the only quantifiers, but they are the
 238 most common. We will talk about other quantifiers later on.
 239
 240 =head2 Calling Actions
 241
 242 We haven't covered actions yet, but it's still important now to talk
 243 about how we will call them when we are ready. We call an action
 244 by inserting the C<{*}> token into the rule. When the C<{*}> rule is
 245 encountered, PGE calls the associated action method with the current
 246 match object as an argument. Let's take our C<persons_name> rule
 247 from above, and sprinkle liberally with action calls:
 248
 249  rule persons_name {
 250     {*} <first_name> {*} <last_name> {*}
 251  }
 252
 253 The first call to the action method contains an empty match object
 254 because the parser hasn't had a chance to match anything yet. The
 255 second call contains only the first name of the match. The third and
 256 final call contains both the matched first and last name. Notice that
 257 if the match fails halfway through, we still call the actions where
 258 we succeeded, but do not call the actions after the failure. So, if
 259 we try to match the string "Leia", the action is called before the
 260 name and after the first name. When the rule tries to match the last
 261 name, it fails because no last name is provided, and the third action
 262 method call is never made.
 263
 264 =head2 Alternations and Keys
 265
 266 In addition to sub-rules, groups, and quantifiers, we also are able to
 267 take alternations between options that are either-or. The vertical bar
 268 token "|" can be used to distinguish between options where only one
 269 may match, Here's an example:
 270
 271  rule hero {
 272     ['Luke' | 'Leia'] 'Skywalker'
 273  }
 274
 275 This rule will match either "Luke Skywalker" or "Leia Skywalker" but
 276 won't match "Luke Leia Skywalker" N<or anything else, for that matter>.
 277 With things like alternations, if we want to call an action method it's
 278 helpful to distinguish which combination we matched:
 279
 280  rule hero {
 281     [
 282       'Luke' {*}    #= Luke
 283     | 'Leia' {*}    #= Leia
 284     ]
 285     'Skywalker'
 286  }
 287
 288 This is the same rule, except now it passes two arguments to it's
 289 action method: the match object and the name of the person who
 290 got matched.
 291
 292 =head2 Warning: Left Recursion
 293
 294 Getting into all the nitty-gritty theory behind parsers is well beyond
 295 the scope of this book. However, there is one potential pitfall that
 296 developers should be made aware of that is not immediately obvious.
 297 Like functions in ordinary procedural or functional languages, the
 298 methods in the PGE parser grammar can call themselves recursively.
 299 Consider the following rules derived in part from the grammar for the
 300 C programming language:
 301
 302  rule if_statement {
 303     'if' <condition> '{' <statement>* '}' <else_block>?
 304  }
 305
 306  rule statement {
 307     <if_statement> | <expression>
 308  }
 309
 310  rule else_block {
 311     'else' '{' <statements>* '}'
 312  }
 313
 314 Notice that an C<if_statement> can contain a list of C<statement>s, and
 315 that each statement may itself be an C<if_statement>? This is called
 316 I<recursion> X<Recursion>, and is part of where the "Recursive Descent"
 317 algorithm gets it's name from.
 318
 319 Now, let's look at a more direct example of a comma-separated list of
 320 integer digits to form an array. We can define this recursively as
 321 follows:
 322
 323  rule list {
 324      <list> ',' <digit> | <digit>
 325  }
 326
 327 The intention is that if there is only one digit, we match the second
 328 option in the alternation, and if there are more digits we can match
 329 them recursively in the first alternation. However, take a close look
 330 at the insideous result. The recursive descent parser enters the C<list>
 331 rule. It's first option is to enter the list rule again, so it does.
 332 Recursive descent is a depth-first algorithm, and it will continue to
 333 descent down a particular path until it finds a successful match or a
 334 match failure. In this case, it matches C<list> and then it matches
 335 C<list> again, then it matches C<list> again, and so on and so forth.
 336 What we have created is an infinite loop pattern called I<left recursion>.
 337
 338 Left recursion is caused when the left-most item of the left-most
 339 alternation is a recursion. The rule above can be easily resolved
 340 by writing:
 341
 342  rule list {
 343     <digit> | <list> ',' <digit>
 344  }
 345
 346 Or even
 347
 348  rule list {
 349     <digit> ',' <list> | <digit>
 350  }
 351
 352 Both of these two options make sure the left-most item in our rule is not
 353 a recursion, therefore preventing left recursion.
 354
 355 Here is a more tricky example where the left recursion is hidden from
 356 view:
 357
 358  rule term {
 359     <expression> '*' <term> | <digit>
 360  }
 361
 362  rule expression {
 363     <term> '+' <expression> | <term>
 364  }
 365
 366 This is a very limited subset of mathematical equations that we might like
 367 to write, and even in this small subset we have this same problem: To
 368 match a C<term>, the parser first tries to match an C<expression>, which
 369 in turn matches a C<term> and then an C<expression> ...
 370
 371 Left recursion is not the only problem you can run into with a recursive
 372 descent grammar, but it's one that's likely going to come up relatively
 373 often for new language designers, and one that is not always likely to
 374 generate useful error messages.
 375
 376 =head2 Operator Precidence Parser
 377
 378 =head1 Actions and NQP
 379
 380 =head2 NQP Basics
 381
 382 =head2 The Match Object C<$/>
 383
 384 =head2 Making Trees
 385
 386
 387 =cut