docs/pct/pct_optable_guide.pod

   1 # Copyright (C) 2008, Parrot Foundation.
   2 # $Id$
   3
   4 =head1 NAME
   5
   6 pct_optable_guide.pod - A Guide to Using an Operator Parsing Table in PGE-based grammars.
   7
   8 =head1 DESCRIPTION
   9
  10 This document describes how to use an operator parsing table in grammars
  11 written for the Parrot Grammar Engine. The C<mk_language_shell.pl> script
  12 will generate a very simple grammar that already includes an operator
  13 table. This document may help you to understand the syntax of defining
  14 an optable.
  15
  16 =head1 WHY AN OPTABLE?
  17
  18 Even in simple languages such as C<languages/abc>, parse trees can become very
  19 big for even very simple expressions.
  20
  21 Consider the following example grammar:
  22
  23  token addop { '+' | '-' }
  24  token mulop { '*' | '/' }
  25
  26  rule expression {
  27     <mulexpr> [<addop> <mulexpr>]*
  28  }
  29
  30  rule mulexpr {
  31     <primary> [<mulop> <primary>]*
  32  }
  33
  34  rule primary {
  35     | <integer>
  36     ...
  37  }
  38
  39
  40 An expression such as C<1 + 2> will result in a parse tree like this:
  41
  42  expression
  43    mulexpr
  44       primary
  45          integer
  46             1
  47    addop
  48       +
  49    mulexpr
  50       primary
  51          integer
  52             2
  53
  54 This may not seem very big, but remember it's only for C<1 + 2>!
  55 Also note that you have to add at least one rule per precedence level,
  56 so the rules for parsing simple statements can increase the size of
  57 your grammar dramatically.
  58
  59 In order to prevent very big parse trees (which is rather inefficient),
  60 it's much better to do bottom-up parsing for expressions such as these.
  61
  62
  63 =head1 HOW TO USE AN OPTABLE?
  64
  65 Perl 6 rules are run as ordinary subroutines (well, in fact, they are
  66 C<methods> of the class representing the grammar you're writing), and
  67 parsing will be done by invoking a rule's subrules. A subrule can also
  68 contain subrules. Therefore, the generated parser is a top-down parser,
  69 starting at the C<TOP> rule, and invoking rules that represent I<lower>
  70 parts of the grammar.
  71
  72 =head2 Switching Top-down and Bottom-up parsing
  73
  74 An optable parses operators bottom-up, which is the other well-known
  75 parsing approach, implemented popular parser generators such as Yacc.
  76 At some point, when using an optable, your parser must switch from
  77 top-down parsing to bottom-up, and after parsing the operators (and
  78 related operands), the parser switches back from bottom-up to top-down
  79 again.
  80
  81 Bottom up parsing is very well suited for expressions that consist of
  82 terms, unary and binary operators with no two terms in a row.
  83
  84 In order to define the entry point of the bottom-up, operator parsing
  85 table, you should define a rule. Below is an example that states that
  86 an C<expression> is parsed by the operator parser:
  87
  88  rule expression is optable { ... }
  89
  90 The C<{ ... }> is legal Perl 6 syntax. See C<Synopsis 6: subroutines>
  91 for details.
  92
  93 In order to define what an operand is, a special rule is define, named
  94 C<term:>.
  95
  96  proto 'term:' is tighter('infix:*')
  97                is parsed(&simple_expression) { ... }
  98
  99 This rule states that whenever an operand is expected, it is parsed by
 100 the rule named C<simple_expression>. In other words, this is the point
 101 where the parser switches back from bottom-up to top-down.
 102
 103 Be sure to add the C<tighter> clause, to make sure that your I<terms>
 104 are parsed correctly. The grammar compiler (PGE) will not notify you
 105 if you don't do this.
 106
 107 =head2 Defining operators
 108
 109 In between these two rules defining the entry point and the I<exit> point
 110 of the operator parser, operators are declared, along with their
 111 precedence. This typically look as follows:
 112
 113  proto 'infix:+' is looser('infix:*') { ... }
 114
 115 This defines the symbol C<+> as an C<infix> operator. The C<is looser>
 116 specification defines the precedence of the C<+> operator in relation
 117 to the C<*> operator.
 118
 119 =head3 Operator precedence
 120
 121 Operator precedence can be expressed using the following traits:
 122
 123 =over 4
 124
 125 =item * looser( OP )
 126
 127 States that the precedence of the current operator being defined is looser
 128 than C<op>.
 129
 130 =item * equiv( OP )
 131
 132 States that the precedence of the current operator being defined is equal
 133 to the precedence of operator C<OP>.
 134
 135 =item * tighter( OP )
 136
 137 States that the precedence of the current operator being defined is
 138 tighter than the precedence of the operator C<OP>.
 139
 140 =back
 141
 142 Be sure that C<OP> is defined I<before> referencing it in a operator
 143 precedence statement as discussed in this section.
 144
 145 =head3 Where's the operator?
 146
 147 In Perl 6, operators are just subroutines with special names and scoping
 148 (see C<Synopsis 6: subroutines>).
 149
 150 When operators such as C<infix:+> are defined, the bottom-up parser will
 151 generate a C<call> instruction to a subroutine with the same name, that is,
 152 C<infix:+>. Therefore, you need to write a subroutine for each operator
 153 you define. The subroutine for the C<infix:+> operator could look like this:
 154
 155 =begin PIR_INVALID
 156
 157  .sub 'infix:+'
 158     .param pmc a      # left operand
 159     .param pmc b      # right operand
 160     n_add $P0, a, b   # create a new PMC object with the value a+b
 161     .return ($P0)
 162  .end
 163
 164 =end PIR_INVALID
 165
 166 Whenever an expression such as C<42 + 1> is parsed, this will result in a
 167 call to C<infix:+> with the operands C<42> and C<1>. The C<infix:+> sub
 168 will create a new object, and assign the result of C<42 + 1> to it, after
 169 which this new object is returned. Note that the C<n_> prefix of the C<add>
 170 instruction implies that a new object is created.
 171
 172 You might think that it's somewhat overkill to write and call a subroutine
 173 for such a simple operation, and that this could be done simpler. Well,
 174 you're right. If the implementation of the C<+> operator is as simple as
 175 in this case, there's a simpler way to implement the exact same behavior.
 176 This can be accomplished using a the C<pirop> trait, discussed below.
 177
 178 =head3 Infix, prefix, postfix
 179
 180 Besides C<infix> operators, operators can be C<prefix> or C<postfix>
 181 Operators can even be declared to be both C<prefix> and C<infix>. A common
 182 example is the C<-> operator. This is fine, because each operator has
 183 a certain precedence, which is kept track of by the bottom-up parser.
 184
 185 In order to define a prefix C<++> operator, you could write this:
 186
 187  proto 'prefix:++' ... # specify precedence
 188
 189 Besides these well-known mathematical operators, some languages have
 190 built-in operators with C<normal> names. For instance, Ruby defines
 191 an operator called C<defined?>, which queries its operand whether it
 192 is defined. Defining such special operators are not special in any way,
 193 just declared them as any other operator:
 194
 195  proto 'prefix:defined?' ...
 196
 197 Note that C<defined?> is a prefix operator, which would look like this
 198 in code:
 199
 200  defined? $x
 201
 202 =head3 Ternary operators
 203
 204 Most languages only have one ternary operator: C<? :>. An example is shown
 205 below:
 206
 207  $income > 100000 ? print("I'm rich!") : print("I'll keep my job for now")
 208
 209 To declare a ternary operator, you write:
 210
 211  proto ternary:<? :> is looser('infix:+') { ... }
 212
 213 =head3 Circumfix operators
 214
 215 Brackets can be considered operators as well. There are two types:
 216
 217 =over 4
 218
 219 =item * circumfix
 220
 221 C<circumfix> operators can enclose other terms and operators. For instance
 222 many languages define a rule C<expression> similar to the following:
 223
 224  expression: literal
 225            | identifier
 226            | '(' expression ')'
 227            | expression ['+'|'-'|...] expression
 228
 229 When defining an operator table, the third alternative specifying a
 230 parenthesized expression can be expressed as follows:
 231
 232  proto circumfix:<( )> ...
 233
 234 This means that operands (which are parsed through the special C<term:> rule)
 235 can be enclosed in parenthesis, like so:
 236
 237  (1 + 2) * 3
 238
 239 This is legal, because the C<( )> is just another operator handled by the
 240 operator table.
 241
 242 =item * postcircumfix
 243
 244 Postcircumfix brackets are useful for subroutine invocation or indexing. Of
 245 course, this fully depends on the specification of your language. Sometimes,
 246 you need a different rule to define subroutine invocation syntax. This is the
 247 case when arguments can be other objects than operands of normal operators
 248 (which, again, are defined by the C<term:> rule).
 249
 250 An example to handle indexing (assuming the index is an operand as any other
 251 operator's operand) is this:
 252
 253  proto postcircumfix:<[ ]> ...
 254
 255 which, given the following rules:
 256
 257  rule expression is optable { ... }
 258
 259  rule 'term:' (...) is parsed(&simple_expression) { ... }
 260
 261  rule simple_expression {
 262     | <literal>
 263     | <ident>
 264  }
 265
 266 allows us to write this:
 267
 268  foo["hello"]
 269
 270 Here, C<"hello"> is a literal and C<foo> is an identifier.
 271
 272
 273 =back
 274
 275 =head3 The C<assoc> trait
 276
 277 Operators have a certain associacity. For instance, the C<+> operator
 278 is usually said to be C<left> associated, while the exponential operator
 279 (often written as C<^>) is usually C<right> associated. Consider this
 280 example:
 281
 282  10 - 5 - 2
 283
 284 Should this be parsed as:
 285
 286  ((10 - 5) - 2)  # 5-2=3
 287
 288 or as:
 289
 290  (10 - (5 - 2))  # 10-3=7
 291
 292 In other words, does the C<-> I<associate> with the left or right operand?
 293 According to standard rules, the C<-> operator is C<left> associated,
 294 which means the first option is correct.
 295
 296 This associecity is expressed using the C<assoc> trait, like so:
 297
 298  proto 'infix:-' is assoc('left') ...
 299
 300 If you don't specify the association, it defaults to C<left>. C<right>
 301 association works the other way around. In mathematics, the power operator
 302 is right associated.
 303
 304 There is a third associacity in Perl 6. Generally a list (using the comma
 305 operator) is not nested at all:
 306
 307  1, 2, 3, 4
 308
 309 should neither be parsed as
 310
 311  (((1, 2), 3), 4)
 312
 313 nor as
 314
 315  (1, (2, (3, 4)))
 316
 317 You can achieve that with the C<is assoc('none')> trait.
 318
 319 =head3 The C<pirop> trait
 320
 321 Some operators can be perfectly mapped to a specific Parrot instruction,
 322 for instance the C<n_add> op that we introduced earlier. By default, an
 323 operator is implemented as a subroutine call, which is obviously not as
 324 efficient as a single Parrot instruction. Therefore, you can specify the
 325 Parrot instruction using the C<pirop> trait. You can do this as follows:
 326
 327  proto 'infix:+' ... is pirop('n_add') { ... }
 328
 329 This will not work for all Parrot ops, though. Certain instructions
 330 leave their result in an C<I> register (one of the four types of Parrot
 331 registers). PCT currently only supports PMC registers (C<P> registers).
 332 Operators such as C<==> and C<!=> must therefore be implemented as
 333 subroutine calls.
 334
 335 =head3 The C<pasttype> trait
 336
 337 Some operators have behavior that can be implemented by certain C<PAST>
 338 nodes. For instance, many languages define the semantics of the C<and>
 339 operator to be the following:
 340
 341  A and B
 342
 343  1 evaluate A
 344  2 if result(1) is false return false
 345  3 evaluate B
 346  4 return result(3)
 347
 348 As soon as the result of A is found to be false, there is no need to
 349 evaluate B, as the final result can never be true (as this is the C<and>
 350 operator). So, C<B> is evaluated only if C<A> is true.
 351
 352 This is very similar to the PAST::Op node with the C<if> C<:pasttype>:
 353
 354  if (A) {
 355    B
 356  }
 357
 358 Therefore, an operator such as C<and> can be implemented as an C<if>
 359 statement. In order to specify that the C<and> operator must be handled
 360 as a C<PAST::Op( :pasttype('if') )> node, you can use the C<pasttype> trait,
 361 like so:
 362
 363  proto 'infix:and' ... is pasttype('if') { ... }
 364
 365 The C<or> operator is similar to the semantics of a
 366 PAST::Op( :pasttype('unless') node:
 367
 368  A or B
 369
 370  1 evaluate A
 371  2 if result(1) is true return true
 372  3 evaluate B
 373  4 return result(3)
 374
 375 So, C<unless> A is true, evaluate B. Hence, the C<or> operator could
 376 be implemented like so:
 377
 378  proto 'infix:or' ... is pasttype('unless') { ... }
 379
 380 =head3 Special characters
 381
 382 Some operators, such as the shift operators contain the "<" or ">" character.
 383 As operators can be specified as, for instance, infix:<+>, defining an operator
 384 such as "<<" will make the rule parser (the parser generator) confused.
 385
 386 You have two options. Either use quoted names for the operator subroutine names,
 387 like so:
 388
 389  proto 'infix:>>' is equiv('infix:<<') { ... }
 390
 391 Or, use so-called French quotes to do this. This looks as follows:
 392
 393  proto infix:«>>»  is equiv(infix:«<<») { ... }
 394
 395
 396 =head1 FAQ
 397
 398 =over 4
 399
 400 =item * Why are some operators quoted and some not?
 401
 402 That's a matter of taste. You can define the operator C<+> in various ways:
 403
 404  proto infix:+
 405  proto infix:<+>
 406  proto 'infix:+'
 407
 408 Note that C<< 'infix:<+>' >> is not allowed.
 409
 410 =item * I get an error message saying:
 411
 412  "Can only replace inside string or index after end of string"
 413
 414 This error occurs if the operator table contains a misspelling somewhere. Make sure
 415 that operators containing the character "<" or ">" are quoted using French quotes
 416 (« and »), or quote the proto name, as in C<< 'infix:<' >>.
 417 The best solution to solve this is to comment out the whole table except for a
 418 single operator, and try to find the error using a binary search. Or, better yet,
 419 set up your operator table incrementally, while keeping tests passing.
 420
 421 =item * Help! I defined my operator table, but some operators are not recognized.
 422
 423 Be sure that there are no spelling errors. For instance, on first sight the following
 424 looks correct:
 425
 426  proto 'infix:-' is equal('infix:+') { ... }
 427
 428 However, C<equal> is not a valid precedence specifier, it should be spelled as C<equiv>.
 429 The PGE will not warn you for such errors.
 430
 431 =item * How many operator tables can I use in my language?
 432
 433 {{ XXX I think one only? Anybody? }}
 434
 435 =item * How does an optable parser work?
 436
 437 If you really want to know, check out compiler literature.
 438 It's something with stacks and emitting instructions for operators with C<tighter>
 439 precedence (and their operands) first.
 440
 441 =back
 442
 443 =head1 SEE ALSO
 444
 445 =over 4
 446
 447 =item * L<docs\pct\gettingstarted.pod>
 448
 449 =item * L<docs\pct\past_building_blocks.pod>
 450
 451 =item * L<http://dev.perl.org/perl6/doc/design/syn/S06.html>
 452
 453 =back
 454
 455 =cut