docs/manual.txt

   1 Gazelle Manual
   2 ==============
   3 Joshua Haberman <joshua@reverberate.org>
   4 v0.3, October 2008
   5 :toc:
   6
   7 Gazelle is a system for parsing formal languages.  A formal language is any
   8 language with formally-specified, non-ambiguous syntax.  This includes but
   9 is not limited to:
  10
  11 * programming languages (C, C++, Java, etc.)
  12 * data languages (XML, JSON, YAML, etc.)
  13 * text-based protocols (HTTP, SMTP, IMAP, etc.)
  14 * configuration files
  15 * logfiles
  16
  17 Gazelle defines a grammar language that is intended to give you enough tools to
  18 parse real-world languages and all of their idiosynchracies, while still being
  19 efficient enough to use in time-sensitive and/or memory-constrained situations.
  20 Gazelle takes a multi-paradigm approach.  While Gazelle is based around
  21 context-free grammars, it also supports:
  22
  23 * repetition operators like `?` (optional), `*` (zero or more), and `+`
  24   (one or more).
  25 * ambiguity-resolution operators like `/` (prioritized choice, as in
  26   Parsing Expression Grammars) and greedy/non-greedy repetition.
  27 * '(planned)' operator-precedence parsing, so that you can define infix
  28   expressions in terms of their operators, precedences, and associativity.
  29 * '(planned)' an imperative language (Lua) to use as an "escape hatch" when
  30   none of the other formalisms address a language's idiosynchracy.
  31
  32 Gazelle also puts a primary focus on making its grammars reusable.  Once a
  33 grammar is written for a language, it should be possible to use it unmodified,
  34 from any host language that Gazelle supports, for anything from a compiler to
  35 the syntax highlighting component of a text editor.  That is Gazelle's single
  36 most fundamental design criterion.
  37
  38 An Introductory Tour
  39 --------------------
  40
  41 This section offers a quick tour of Gazelle and its capabilities.  It assumes
  42 you have already compiled Gazelle successfully; the instructions for doing so
  43 are in the README.
  44
  45 Hello, World
  46 ~~~~~~~~~~~~
  47
  48 For our "Hello, World" example, we'll write a grammar for a language that
  49 is simply an integer surrounded by an arbitrary number of balanced parentheses.
  50 To begin, use a text editor to write this grammar to `hello.gzl`.
  51
  52 ------------------------------------------------
  53 hello -> "(" hello ")" | .digits=/[0-9]+/;
  54 ------------------------------------------------
  55
  56 This text is a complete Gazelle grammar.  It defines a single rule called "hello",
  57 which expands to either another (recursive) hello surrounded by parentheses, or
  58 a string of one or more numeric digits (and this terminal is named "digits").
  59
  60 This language will match any of the following inputs:
  61
  62 ------------------------------------------------
  63 (5)
  64 (((((((((((100)))))))))))
  65 0
  66 ------------------------------------------------
  67
  68 But none of these inputs will match:
  69
  70 ------------------------------------------------
  71 (()
  72 )(
  73 0)
  74 ------------------------------------------------
  75
  76 The first step is to compile the text version of the grammar to bytecode.
  77 The Gazelle compiler analyzes the input grammar and builds a series of
  78 finite automata (state machines) that can do the real parsing.  It writes
  79 this compiled version of the grammar to bytecode.
  80
  81 ------------------------------------------------
  82 $ . lua_path
  83 $ ./compiler/gzlc --help
  84 gzlc -- Gazelle grammar compiler.
  85 Gazelle v0.3  http://www.reverberate.org/gazelle/
  86
  87 Usage: gzlc [options] input-file
  88
  89   -h, --help         you're looking at it.
  90
  91   -d,                dump detailed output about the grammar to
  92                      html/index.html.
  93
  94   -k <depth>         Maximum LL(k) to consider (by default, uses a
  95                      heuristic that attempts to determine if the
  96                      grammar is LL(k) for *any* k).
  97
  98   --no-minimize-rtns
  99                      skip the minimization step for RTNs.  This results
 100                      in larger RTNs, but may be necessary if minimization
 101                      is taking too long (this should only occur in
 102                      artificially-complicated grammars).
 103
 104   -o <file>          output filename.  Default is input filename
 105                      with extension replaced with .gzc
 106
 107   -v, --verbose      dump information about compilation process and
 108                      output statistics.
 109
 110   --version          dump Gazelle version
 111 $ ./compiler/gzlc hello.gzl
 112 $
 113 ------------------------------------------------
 114
 115 You won't see any output -- that means that the grammar was compiled
 116 successfully.  Just like `gcc`, `gzlc` is silent when things succeed.  The
 117 default output filename is the input filename with 'gzl' replaced with 'gzc'.
 118
 119 ------------------------------------------------
 120 Tallis:gazelle joshua$ ls -l hello*
 121 -rw-r--r--  1 joshua  joshua  212 Sep 30 21:11 hello.gzc
 122 -rw-r--r--  1 joshua  joshua   43 Sep 30 21:11 hello.gzl
 123 ------------------------------------------------
 124
 125 If you want to see a bit more verbose output, run `gzlc` with `-v`.
 126
 127 ------------------------------------------------
 128 $ ./compiler/gzlc -v hello.gzl
 129 Gazelle v0.3
 130 Opening input file 'hello.gzl'...
 131 Parsing grammar...
 132 Assigning RTN transition priorities...
 133 Convering RTN NFAs to DFAs...
 134 Minimizing RTN DFAs...
 135 Doing LL(*) lookahead calculations...
 136 Generating lexer DFAs...
 137 Writing to output file 'hello.gzc'...
 138 Writing 4 strings...
 139 Writing 1 IntFAs...
 140   4 states, 4 transitions
 141 Writing 0 GLAs...
 142 Writing 1 RTNs...
 143 ------------------------------------------------
 144
 145 Now, the magic moment where we do our first actual parsing.  We'll write some
 146 valid input text for this language, then use `gzlparse` to actually parse it.
 147
 148 ------------------------------------------------
 149 $ echo -n '((((500))))' > hello_text
 150 $ ./utilities/gzlparse --help
 151 gzlparse -- A command-line tool for parsing input text.
 152 Gazelle 0.3  http://www.reverberate.org/gazelle/.
 153
 154 Usage: gzlparse [OPTIONS] GRAMMAR.gzc INFILE
 155 Input file can be '-' for stdin.
 156
 157   --dump-json    Dump a parse tree in JSON as text is parsed.
 158   --dump-total   When parsing finishes, print the number of bytes parsed.
 159   --help         You're looking at it.
 160
 161 $ ./utilities/gzlparse hello.gzc hello_text
 162 $
 163 ------------------------------------------------
 164
 165 Again, by default Gazelle doesn't print any output when things were successful.
 166 If there had been a parse error, Gazelle would have printed an error and exited.
 167 If you want to see more output, the flags `--dump-json` and `--dump-total` will
 168 print a parse tree in JSON format and a total byte count, respectively.
 169
 170 ------------------------------------------------
 171 $ ./utilities/gzlparse --dump-json --dump-total hello.gzc hello_text
 172 {"parse_tree":
 173   {"rule":"hello", "start": 0, "children": [
 174     {"terminal": "(", "slotname": "(", "slotnum": 0, "offset": 0, "len": 1},
 175     {"rule":"hello", "start": 1, "slotname":"hello", "slotnum":1, "children": [
 176       {"terminal": "(", "slotname": "(", "slotnum": 0, "offset": 1, "len": 1},
 177       {"rule":"hello", "start": 2, "slotname":"hello", "slotnum":1, "children": [
 178         {"terminal": "(", "slotname": "(", "slotnum": 0, "offset": 2, "len": 1},
 179         {"rule":"hello", "start": 3, "slotname":"hello", "slotnum":1, "children": [
 180           {"terminal": "(", "slotname": "(", "slotnum": 0, "offset": 3, "len": 1},
 181           {"rule":"hello", "start": 4, "slotname":"hello", "slotnum":1, "children": [
 182             {"terminal": "digits", "slotname": "digits", "slotnum": 3, "offset": 4, "len": 3}
 183           ], "len": 3},
 184           {"terminal": ")", "slotname": ")", "slotnum": 2, "offset": 4, "len": 4}
 185         ], "len": 5},
 186         {"terminal": ")", "slotname": ")", "slotnum": 2, "offset": 8, "len": 1}
 187       ], "len": 7},
 188       {"terminal": ")", "slotname": ")", "slotnum": 2, "offset": 9, "len": 1}
 189     ], "len": 9},
 190     {"terminal": ")", "slotname": ")", "slotnum": 2, "offset": 10, "len": 1}
 191   ], "len": 11}
 192 }
 193 11 bytes parsed.
 194 $
 195 ------------------------------------------------
 196
 197 The next thing you will want to do is use `gzlc` to produce an HTML dump of
 198 the grammar as the compiler sees it.  This is an invaluable way to check
 199 and be sure that the compiler is seeing things the way you meant it to.
 200 If not, you might have misspecified the grammar, or you may have found a
 201 bug in the compiler (and these certainly exist).
 202
 203 Doing this HTML dump requires that Graphviz is installed.  If ImageMagick is
 204 installed, this will also be used to create thumbnails.
 205
 206 ------------------------------------------------
 207 $ ./compiler/gzlc -d hello.gzl
 208 ------------------------------------------------
 209
 210 Now open `html/index.html` in a web browser to view the HTML dump.  You will
 211 see an illustration of the "hello" rule, and you should see that it corresponds
 212 to your intuitive understanding of this rule.  There are two paths through the
 213 rule -- either `"(", hello, ")"` or `digits`.
 214
 215 You will also see the state machine that does the lexing: it recognizes three
 216 different tokens: a left parenthesis, a right parenthesis, and one called
 217 "digits" that is a string of repeated numbers.  Notice that Gazelle built a
 218 lexer for you based on the grammar you entered.  You don't have to give Gazelle
 219 a separate lexer specification.
 220
 221 Creating a lexer was trivial in the above example because none of the different
 222 token types conflict.  Because left parenthesis, right parenthesis, and
 223 "digits" are all distinct and do not conflict, Gazelle simply combined them all
 224 into a single DFA that breaks the input into tokens.  But consider a more
 225 complicated example: a grammar to describe files of comma-separated values,
 226 where the values can either be strings or numbers.  For example:
 227
 228 ------------------------------------------------
 229 1,5,"X","42","Hello, world!\n",31415926
 230 ------------------------------------------------
 231
 232 What makes this example more complicated is that outside of quotes, commas and
 233 digits mean one thing, but inside quotes they mean another.  Here is a grammar
 234 to describe this format:
 235
 236 ------------------------------------------------
 237 // A CSV line is any number of comma-separated values separated by commas,
 238 // ending with a newline.
 239 csv_line -> csv_value *(,);
 240
 241 // A comma-separated value is either a number or a quoted string.
 242 csv_value -> .number=/[0-9]+/ | '"' string_fragment* '"';
 243
 244 // Inside a string, there can be either regular characters or a
 245 // backslash-escaped character.
 246 string_fragment -> .chars=/[^\\"]+/ | .escaped_char=/\\./;
 247 ------------------------------------------------
 248
 249 If you compile this grammar with `gzlc -d` and view the HTML output, you will
 250 see that Gazelle has generated two different IntFAs (DFAs used by the lexer) --
 251 one that is used inside strings that one that is used the rest of the time.
 252 Inside a string the text "42" will be lexed as the token "chars", but outside
 253 a string it will be lexed as the token "number."
 254
 255 Where to go from here
 256 ~~~~~~~~~~~~~~~~~~~~~
 257
 258 This concludes the tour.  Next you can read more about Gazelle's input language
 259 and experiment with your own grammars.
 260
 261 Gazelle's major limitation at the moment is that it cannot support removing
 262 whitespace or comments from the input.  A feature to address this is coming,
 263 but for the moment please bear with me!
 264
 265 I hope you enjoy Gazelle!
 266
 267 Writing Grammars for Gazelle
 268 ----------------------------
 269
 270 This chapter focuses on the syntax of Gazelle grammars, and the considerations
 271 of making your grammar easily parseable.
 272
 273 [[X1]]
 274 Rules
 275 ~~~~~
 276
 277 Rules are at the heart of every grammar; they describe
 278 the patterns of all the constructs in your language.  Here is a simple
 279 example that describes an assignment statement.
 280
 281 ------------------------------------------------
 282 assign -> ident "=" expr;
 283 ------------------------------------------------
 284
 285 This declares that an assignment statement is an identifier followed by the
 286 string ``='' followed by an expression.  We say that `assign` is the rule's
 287 'name'; `ident`, `"="`, and `expr` are its 'components'.  We assume here that
 288 ``ident'' and ``expr'' are defined elsewhere in the grammar.
 289
 290 For every rule you write, Gazelle builds a graph that describes the
 291 possible paths that the input can take through this rule.  Since this
 292 rule is very simple and doesn't have any kind of repetition, the
 293 graph is just a single path.
 294
 295 ["simplerule1.png", "graph 1: assign -> ident \"=\" expr;"]
 296 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 297 assign -> ident "=" expr;
 298 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 299
 300
 301 Alternation
 302 ^^^^^^^^^^^
 303
 304 Often you will have a rule that can be expanded in more than one way.
 305 For example, you might have a boolean expression that can be either ``true'' or
 306 ``false.''  You can specify alternation to Gazelle with the vertical
 307 bar (`|`).
 308
 309 ------------------------------------------------
 310 boolexpr -> "true" | "false";
 311 ------------------------------------------------
 312
 313 As you would expect, this produces a rule graph with two edges -- one
 314 for each option.
 315
 316 ["altrule1.png", "graph for: boolexpr -> \"true\" | \"false\";"]
 317 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 318 boolexpr -> "true" | "false";
 319 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 320
 321 You can use alternation freely throughout your rules.  Alternation has
 322 the lowest precedence, but you can override that with parentheses. Here
 323 is a more complicated rule that demonstrates more extensive use of
 324 alternation.
 325
 326 ------------------------------------------------
 327 name -> (fname (mname | minitial) | nickname) surname;
 328 ------------------------------------------------
 329
 330 The graph for this rule is naturally a bit more complicated, but
 331 looking at it should convince you that the possible paths through this
 332 rule are what you intended.
 333
 334 ["altrule2.png", "graph for: name -> (fname (mname | minitial) | nickname) surname"]
 335 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 336 name -> (fname (mname | minitial) | nickname) surname;
 337 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 338
 339
 340 Repetition
 341 ^^^^^^^^^^
 342
 343 Repetition is our other main tool for rule-writing.  Repetition isn't
 344 strictly necessary for writing context-free grammars (it can always be
 345 expressed as recursion), but using it can make grammars easier to
 346 write, understand, and ultimately parse.
 347
 348 Gazelle offers these repetition modifiers; place them after the
 349 component you want to modify.
 350
 351 `?`::
 352 The `?` modifier specifies 0 or 1 occurrences of the previous component.
 353 It corresponds to square brackets (`[]`) in EBNF.
 354
 355 `*`::
 356 The `*` modifier specifies 0 or more occurrences of the previous component.
 357 It corresponds to curly brackets (`{}`) in EBNF.
 358
 359 `+`::
 360 The `+` modifier specifies 1 or more occurrences of the previous component.
 361
 362 `*(sep)`::
 363 The `\*(sep)` modifier specifies 0 or more occurrences of the previous
 364 component, where each occurrence is separated by `sep`.  It is a more
 365 straightforward way of writing `(component (sep component)*)?`.  `sep` can be any
 366 valid component (or in unusual cases, expression of components) that can
 367 appear on a right-hand-side of a rule.
 368
 369
 370 `+(sep)`::
 371 The `\+(sep)` modifier specifies 1 or more occurrences of the previous
 372 component, where each occurrence is separated by `sep`.  It is a more
 373 straightforward way of writing `component (sep component)*`.  `sep` can be any
 374 valid component (or in unusual cases, expression of components) that can appear
 375 on a right-hand-side of a rule.
 376
 377 Here is an example that uses repetition; it is the definition of an
 378 object in JSON:
 379
 380 ------------------------------------------------
 381 object   -> "{" (string ":" value) *(",") "}";
 382 ------------------------------------------------
 383
 384 ["reprule1.png", "Graph for the above JSON object expression."]
 385 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 386 object   -> "{" (string ":" value) *(",") "}";
 387 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 388
 389 [IMPORTANT]
 390 One important limitation of the repetition operators is that they
 391 cannot be nested within a single rule.
 392 For example, you cannot write `expr -> (mult_term \+("*")) \+("+");`
 393 This isn't a weakness --
 394 it is a design decision that makes the process of pulling a
 395 parse tree apart much more sane.  If you are tempted to nest
 396 repetitions, make the innermost repetition into its own rule.
 397
 398
 399 Components
 400 ~~~~~~~~~~
 401
 402 So far we have neglected to describe in detail what can appear as a
 403 component on the right-hand-side of a rule.  In the previous examples we
 404 have seen strings and nonterminals -- in this section we flesh out
 405 exactly what can appear there.
 406
 407 There are four fundamental kinds of components that can appear on a
 408 right-hand-side.
 409
 410 nonterminals::
 411 Nonterminals are symbols that are defined by rules, as described in the
 412 <<X1, Rules>> section.  Each nonterminal that appears on the right-hand-side
 413 of a rule must at some point be defined by appearing in the left-hand-side
 414 of a rule.  The definition need not precede the use.
 415
 416 strings::
 417 A literal string that we expect to see at this point in the input.
 418
 419 regular expressions::
 420 A 'pattern' of characters we expect to see at this point in the input.
 421
 422 named regular expressions::
 423 You can name a regex, and later refer to that regex by name.
 424
 425 Strings and regular expressions are referred to collectively as
 426 ``tokens''  or ``nonterminals'' because they represent actual strings of
 427 the input text.  Gazelle can take the tokens defined in the grammar and
 428 use them to create a lexer-like component of the parser automatically.
 429
 430 Component Names
 431 ^^^^^^^^^^^^^^^
 432
 433 All rule components must have a name that is unique within that rule.
 434 Names can be specified explicitly with the syntax:
 435
 436 ----------------------------------------------
 437 rule -> .name=other_rule;
 438 rule -> .name="string";
 439 rule -> .name=/regex/;
 440 rule -> .name = /regex/;  // the whitespace is insignificant.
 441 ----------------------------------------------
 442
 443 The component names are important when it comes to pulling apart the
 444 resulting parse tree, which is why every component must have one.
 445 Strings, rules, and named regular expressions have default names:
 446
 447 ----------------------------------------------
 448 rule -> other_rule;        // name defaults to "other_rule"
 449 rule -> "string";          // name defaults to "string"
 450 rule -> .name=/regex/;     // no default!  name must be specified!
 451 ----------------------------------------------
 452
 453 The last case is important: regular expressions must have explicit
 454 names.  This is for two reasons: one is that we don't have a good way to
 455 generate a reasonable default name for a regular expression.  The second
 456 is that if regular expressions could be specified without an explicit
 457 name, Gazelle's grammar would be ambiguous:
 458
 459 ----------------------------------------------
 460 s -> a / b;
 461 a -> c / d;
 462 ----------------------------------------------
 463
 464 Are these two rules, or one rule that has a long, multi-line regex in
 465 it?  To avoid this ambiguity, every regular expression inside a rule
 466 must have an explicit name.
 467
 468
 469 Strings
 470 ^^^^^^^
 471
 472 When a string appears on a right-hand-side, it specifies exactly what string
 473 we expect to see at this point in the input.  Strings can be either single
 474 or double quoted.  Backslashes can be used to quote the next character, but
 475 no special backslash sequences are currently recognized (this clearly needs
 476 to be fixed in future versions of Gazelle).
 477
 478 [[X2]]
 479 Regular Expressions
 480 ^^^^^^^^^^^^^^^^^^^
 481
 482 Regular expressions (regexes for short) describe patterns of text.  Users of
 483 languages like Perl, Python, or Ruby, or of tools like `grep`, will find the
 484 regular expressions in Gazelle very familiar.  The main difference that you
 485 will notice is that whitespace in Gazelle regular expressions is insignificant
 486 (as you would get in Perl with `/x`).
 487
 488 Regular expressions in Gazelle are delimited with `/` (forward slash):
 489
 490 ----------------------------------------------
 491 /<body of regex>/
 492 ----------------------------------------------
 493
 494 In a regular expression, characters match themselves with the following
 495 exceptions:
 496
 497 `.`::
 498 A single dot matches any character (including a newline -- as if doing
 499 a multiline match in Perl, Python, or Ruby).
 500
 501 `\`::
 502 A backslash always escapes the next character, causing it to lose any
 503 special meaning it would otherwise have.
 504
 505 spaces and newlines::
 506 To increase readability, whitespace inside regular expressions is
 507 insignificant .  You are encouraged to use whitespace
 508 to make your regular expressions more readable.
 509
 510 `[]`::
 511 A left square bracket begins a character class, and a right square
 512 bracket ends it.
 513
 514 `{}`::
 515 Curly brackets allow you to specify repetition with limits.  The high
 516 and low limits are separated by a comma, and either limit can be
 517 omitted to mean 0 and infinite, respectively.
 518
 519 `?`::
 520 A question mark specifies 0 or 1 repetition.
 521
 522 `*`::
 523 An asterisk specifies 0 or more repetition.
 524
 525 `+`::
 526 A plus sign specifies 1 or more repetition.
 527
 528 /////////////////////////////////////
 529 TODO: figure out and describe character classes (\w, :alpha:, etc)
 530 /////////////////////////////////////
 531
 532 Named Regular Expressions
 533 ^^^^^^^^^^^^^^^^^^^^^^^^^
 534
 535 A regular expression can be named and referred to later by using the
 536 'named regular expression' syntax.
 537
 538 -------------------------------------------
 539 regex: /<some regex>/;
 540 -------------------------------------------
 541
 542 This appears to be almost identical to declaring a rule with a single
 543 regex on the right-hand-side, like so:
 544
 545 -------------------------------------------
 546 rule -> .regex=/<some regex>/;
 547 -------------------------------------------
 548
 549 The primary difference is that the named regular expression syntax avoids creating a
 550 trivial and useless nonterminal graph.  While Gazelle could detect the case where
 551 a nonterminal graph is trivial, it is important that Gazelle preserves the structure
 552 of the grammar as specified, so that client programs can hook into this structure
 553 at parse time.
 554
 555 Special Commands
 556 ~~~~~~~~~~~~~~~~
 557
 558 Definitions of nonterminals and tokens will be the bulk of most syntax
 559 files; however there are several commands that give Gazelle important
 560 information about the grammar that isn't expressed in rules.
 561
 562 `@start`
 563 ^^^^^^^^
 564
 565 The `@start` command tells Gazelle what the top-level nonterminal for a grammar
 566 is.  If `@start` is not specified, Gazelle assumes that the first nonterminal
 567 in the grammar is the start symbol.  The syntax of this command is simply:
 568
 569 -------------------------------------------
 570 @start nonterm;
 571 -------------------------------------------
 572
 573 `@start` is not required, and may not appear more than once per grammar.
 574
 575 [[X3]]
 576 Ambiguity Resolution
 577 ~~~~~~~~~~~~~~~~~~~~
 578
 579 Some grammars are 'ambiguous', meaning that some input strings could be
 580 parsed in more than one way.  The most common example of this is the
 581 classic if/then/else case:
 582
 583 -------------------------------------------
 584 stmt -> "if" expr "then" stmt ("else" stmt)? | expr;
 585 expr -> "true" | "false";
 586 -------------------------------------------
 587
 588 If you try to compile this grammar, Gazelle will give you this error
 589 message:
 590
 591 -------------------------------------------
 592 Gazelle cannot handle this grammar.
 593 It is not Strong-LL or full-LL (and may be ambiguous, but we don't know).
 594 -------------------------------------------
 595
 596 Unfortunately, Gazelle cannot tell you with certainty that the problem is an
 597 ambiguous grammar -- detecting whether a grammar is ambiguous is undecidable
 598 (impossible to compute in all cases).  In the future a more detailed error
 599 message will make clear the root of the problem, which is that Gazelle cannot
 600 decide whether to match an "else" or not with input text like:
 601
 602 -------------------------------------------
 603 if true then
 604   if true then
 605     true
 606 else false
 607 -------------------------------------------
 608
 609 As the grammar is specified above, the "else" in this example could go
 610 with either if.  This ambiguity exists in many language's grammars, but
 611 is explicitly resolved by saying "an `else` goes with the most recent
 612 `if`".  You can express this ambiguity resolution directly to Gazelle by
 613 using its ambiguity resolution operators.
 614
 615 `/`::
 616 A slash indicates 'prioritized choice', as in Parsing Expression
 617 Grammars (PEGs).  Use it in place of `|` (non-prioritized choice) when
 618 you want to indicate that the first alternative should be used in cases
 619 where both alternatives match.
 620
 621 `+`::
 622 Greedy repetition: add to another repetition operator (`+`, `*`, or `?`)
 623 to make the repetition prefer to keep matching.
 624
 625 `-`::
 626 Non-greedy repetition: add to another repetition operator (`+`, `*`, or `?`)
 627 to make the repetition prefer to stop matching.
 628
 629 Here are examples of these operators in use:
 630
 631 -------------------------------------------
 632 // Match "else" to its most recent "if"
 633 stmt -> "if" expr "then" stmt ("else" stmt)?+ | expr;
 634
 635 // "X" "Y" matches the first alternative, but with more than one "X"
 636 // match the second alternative instead.
 637 example -> "X" "Y" / "X"+ "Y";
 638 -------------------------------------------
 639
 640 Gazelle can only support some resolutions of ambiguity.  For example, we
 641 can't use Gazelle to resolve the if/then/else problem the other way:
 642
 643 -------------------------------------------
 644 // WON'T WORK.  Throws the error below:
 645 stmt -> "if" expr "then" stmt ("else" stmt)?- | expr;
 646 expr -> "true" | "false";
 647
 648 Gazelle cannot support this resolution of the ambiguity in rule stmt.
 649 -------------------------------------------
 650
 651 Because of Gazelle's overriding goal to ensure that the resulting parser
 652 is both efficient and as "on-line" as possible, Gazelle can't support
 653 all grammars or ambiguity resolutions.  If you wanted to resolve the
 654 "else" problem the opposite way, you would need to write custom code
 655 that makes the decision of whether or not to match the "else", because
 656 the logic to do so is nontrivial.
 657
 658 Unfortunately, the same is true if you try to resolve if/then/else with
 659 prioritized choice:
 660
 661 -------------------------------------------
 662 // WON'T WORK.  Throws the error below:
 663 stmt -> "if" expr "then" stmt "else" stmt / "if" expr "then" stmt | expr;
 664 expr -> "true" | "false";
 665
 666 Language is probably not LL(k) or LL(*): when calculating lookahead for
 667 a state in stmt, one lookahead language was nonregular, others were not
 668 all fixed
 669 -------------------------------------------
 670
 671 The problem in this case is that you're trying to make Gazelle figure
 672 out before it's parsed an "if" whether or not it will have an "else",
 673 but that requires unbounded lookahead that has to count (it has to match
 674 the "ifs" that it sees with "elses").
 675
 676 The moral of the story is this: in most cases you should avoid ambiguities in
 677 your grammar.  In some cases you can't avoid them and these ambiguity
 678 resolution operators will be a satisfactory solution (if/then/else is the most
 679 likely case for this).  In other cases ambiguity resolution won't suffice and
 680 you'll have to get down and dirty with some imperative Lua code.
 681
 682 Regular Expressions vs. Repetition Within Rules
 683 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 684
 685 You will notice that regular expressions and rules both give you the
 686 `?`, `*`, and `+` operators.  Given this, there are are many situations where you
 687 will find that you can use either regular expressions or rules.  For example,
 688 consider this regular expression that describes a number:
 689
 690 -----------------------------------------------
 691 number:  /(-)?  [0-9]*/
 692 -----------------------------------------------
 693
 694 You could express the same pattern using a rule instead:
 695
 696 ----------------------------------------------------------------------
 697 number -> "-"? ("0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9")*;
 698 ----------------------------------------------------------------------
 699
 700 In this case, the rule version is much longer and more verbose, because the
 701 regular expression offers a useful tool that strings in rules do not offer:
 702 the character class.  In addition, the second form could make the grammar
 703 as a whole more expensive to parse by requiring more tokens of lookahead.
 704 Since a number is no longer a single token, it cannot be used as a
 705 single unit of lookahead.
 706
 707 Does this mean that we should prefer regular expressions whenever we can?
 708 Not always.  Consider this example of a quoted string that contain
 709 backslash escapes:
 710
 711 -----------------------------------------------------
 712 "Hello, world!\n\tThis string will be parsed by Gazelle!"
 713 -----------------------------------------------------
 714
 715 We could write a Gazelle grammar to parse this using a single regular
 716 expression.
 717
 718 -----------------------------------------------------
 719 string -> .string=/" ([^\\"]|\\.)* "/;
 720 -----------------------------------------------------
 721
 722 However, if we write it this way, we lose the information about where
 723 each backslash escape was in the string, because regular expressions in
 724 Gazelle do not have capture groups.  In Gazelle, you find boundaries for
 725 subexpressions by making those subexpressions their own components.  In
 726 this case, writing the pattern using rules is superior, because it
 727 preserves all the important information about where each sequence was
 728 seen.
 729
 730 -----------------------------------------------------
 731 // Written this way, each str_frag is either a backslash sequence
 732 // (like \n) or a string of characters.  You don't have to re-parse
 733 // the str_frag's later to find the backslash sequences.
 734 string   -> '"' str_frag* '"';
 735 str_frag -> /[^\\"]+/ | /\\./;
 736 -----------------------------------------------------
 737
 738 So what general rule can we extrapolate from this?  *In general, you
 739 should only use regular expressions to match strings that don't have
 740 any logical sub-parts.*  If you find that your regular expression has
 741 sub-parts that you might want to pull apart later, make each sub-part
 742 into its own component.
 743
 744 Dealing with Errors in your Grammars
 745 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 746
 747 Gazelle uses an algorithm known as LL(*) for parsing.  The benefit of this
 748 algorithm is that it is very fast and uses very little memory.  The drawback
 749 is that it cannot parse all grammars.  Using all the tools that Gazelle
 750 gives you, it should be able to parse all real-world 'languages', it just
 751 might require you to rearrange your grammar a bit.
 752
 753 For anyone with unpleasant memories of trying to debug LALR(1) parsers like
 754 yacc or Bison, it bears mentioning that debugging Gazelle grammars should
 755 eventually be significantly easier than debugging Bison grammars.  The LL(*)
 756 grammar analysis that Gazelle performs is far easier for humans to understand
 757 than LALR tables.  However currently some of Gazelle's error messages do not
 758 give you the amount of information you really need to debug the problem.  This
 759 will improve in future versions of Gazelle.
 760
 761 What follows is a list of errors your grammar can have, what they mean, and
 762 how to fix them.
 763
 764 Left Recursion
 765 ^^^^^^^^^^^^^^
 766
 767 As a top-down parser, Gazelle cannot handle left-recursion.  Left-recursion
 768 is simply a rule whose first component is a recursive invocation of the
 769 same rule.  Examples are:
 770
 771 -----------------------------------------------------
 772 // Right hand side begins with "list".
 773 list -> list "," "item" | "item";
 774
 775 // still left-recursive: even though "list" isn't the first component
 776 // on the right-hand side, it still can occur first in the rule's pattern.
 777 list -> "item" | list "," "item";
 778
 779 // still left-recursive: even though "prefix" can come before list, it
 780 // can also be omitted, leaving "list" as the first component in the rule.
 781 list -> "item" | "prefix"? list "," "item";
 782
 783 // Two or more rules can be left-recursive together.
 784 rule_a -> rule_b "foo";
 785 rule_b -> rule_a | "bar";
 786 -----------------------------------------------------
 787
 788 Trying to compile any of these grammars will cause Gazelle to throw the error:
 789
 790 -----------------------------------------------------
 791 Grammar is not LL(*): it is left-recursive!
 792 -----------------------------------------------------
 793
 794 In the future this message will give more information about which rule(s)
 795 were left-recursive.  For the moment you have to figure this out yourself.
 796
 797 Eliminating left-recursion is not difficult.  The most common reason for
 798 writing left-recursion is to express some kind of list:
 799
 800 -----------------------------------------------------
 801 list -> list "," "item" | "item";
 802 -----------------------------------------------------
 803
 804 The way to fix this is to express the list using repetition operators instead
 805 of left-recursion.
 806
 807 -----------------------------------------------------
 808 list -> "item" *(,);
 809 -----------------------------------------------------
 810
 811 Ambiguity
 812 ^^^^^^^^^
 813
 814 The earlier section on <<X3, ambiguity resolution>> discussed ambiguity some.
 815 A grammar is 'ambiguous' if there is input text that could be interpreted more
 816 than one way according to the grammar.  Examples are:
 817
 818 -----------------------------------------------------
 819 s -> "X" | "X";    // Does a single X match the first or the second component?
 820 s -> "X"? "X"?;    // Does a single X match the first or the second component?
 821
 822 // When X is matched, have we gone a -> b -> c, or just a -> c?
 823 a -> b | c;
 824 b -> c;
 825 c -> "X";
 826 -----------------------------------------------------
 827
 828 In these cases, Gazelle knows that the grammar is ambiguous and can tell
 829 you so in the error message:
 830
 831 -----------------------------------------------------
 832 Ambiguous grammar for state starting in rule s, paths={{"term", "X"}} AND {{"term", "X"}}
 833 -----------------------------------------------------
 834
 835 This is telling you simply that in matching a single terminal "X" Gazelle was
 836 able to take two different paths through the grammar and end up in the same
 837 place (the definition of an ambiguity).  Here is the error message from the
 838 last example above of three rules:
 839
 840 -----------------------------------------------------
 841 Ambiguous grammar for state starting in rule a,
 842 paths={{"enter", "b"}, {"enter", "c"}, {"term", "X"}} AND {{"enter", "c"}, {"term", "X"}}
 843 -----------------------------------------------------
 844
 845 As you can see, Gazelle saw that it could either enter `b` and then `c` or just
 846 enter `c`.
 847
 848 To fix problems of ambiguity, you need to rewrite the grammar such that
 849 the given input text cannot be interpreted in more than one way.  Another option
 850 is to use the <<X3, ambiguity resolution operators>>.
 851
 852 Note that while Gazelle can detect some cases where the grammar is ambiguous
 853 (such as the above examples), in other cases it may only detect that the grammar
 854 is not LL(*) but not realize that this is because of an ambiguity.  This is
 855 the case in the if/then/else example.
 856
 857 Rule has no non-recursive alternative
 858 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 859
 860 Every rule must have at least one non-recursive alternative.  For example
 861 this rule is invalid, because it can never stop recursing:
 862
 863 -----------------------------------------------------
 864 s -> "X" s;
 865 -----------------------------------------------------
 866
 867 Gazelle complains with this error message:
 868
 869 -----------------------------------------------------
 870 Invalid grammar: rule s had no non-recursive alternative
 871 -----------------------------------------------------
 872
 873 The solution is simple: every rule must have one alternative that stops
 874 recursing.  For example, you can fix the above rule like so:
 875
 876 -----------------------------------------------------
 877 s -> "X" s?;
 878 -----------------------------------------------------
 879
 880 Like left-recursion, two or more rules can work together to create this
 881 error:
 882
 883 -----------------------------------------------------
 884 a -> "X" b;
 885 b -> "Y" a;
 886 -----------------------------------------------------
 887
 888 Grammar is not LL(*)
 889 ^^^^^^^^^^^^^^^^^^^^
 890
 891 These can be the trickiest errors to solve -- more information about solving
 892 them will be in future versions of the manual.
 893
 894
 895 The C Runtime
 896 -----------
 897
 898 This section will document the API for the parsing runtime, and explained
 899 how it can be used for event-based parsing, syntax trees, ASTs, and
 900 whitespace-preserving transformations.  It will also discuss best
 901 practices for wrapping the C runtime in other languages.  It will
 902 'not' offer documentation about each language's wrappers -- I think
 903 that belongs in separate manuals, but I may change my mind about that.
 904
 905
 906 The Gazelle Algorithm
 907 ---------------------
 908
 909 This section describes Gazelle's parsing algorithm.  Unlike LALR(1) tables with
 910 their reputation for being mysterious and hard to understand, the top-down
 911 algorithm that Gazelle uses should be easy for the average programmer to
 912 understand.  Gazelle users are encouraged to read this section to understand
 913 the way your grammars are analyzed and parsed.
 914
 915 Gazelle's algorithm takes a great deal of inspiration from ANTLR, the LL(*)
 916 parser generator written by Terence Parr.  While Gazelle seeks to augment the
 917 theory that Terence developed, the foundations of Gazelle are strongly based in
 918 ANTLR's achievements.  Like ANTLR, Gazelle is a top-down parser that models its
 919 lookahead as possibly-cyclic state machines.  Gazelle's algorithm for
 920 generating this lookahead is based on ANTLR's algorithm.
 921
 922 Compared to ANTLR, Gazelle adds the following capabilities:
 923
 924 * Gazelle can handle tail recursion when generating lookahead.  This allows
 925   it to handle grammars that ANTLR currently cannot (though these grammars
 926   are probably better expressed using explicit repetition than tail recursion).
 927 * Gazelle integrates the lexer more closely to the parser than ANTLR does.
 928   In Gazelle, the parser essentially calls into the lexer, which allows it
 929   to pass on information about what tokens it is expecting to see.  This allows,
 930   context-sensitive lexing, such as languages that allow keywords as identifiers
 931   or languages that embed other languages inside them (where the two languages
 932   have different lexical vocabularies).
 933 * '(planned)' Gazelle will support operator-precedence expression parsing, which will make it
 934   much easier to describe infix operator expressions, especially if there
 935   are many levels of precedence.  It is already known (if not widely
 936   advertised) that it is possible to "sandwich" an operator precedence
 937   parsing algorithm such as the shunting yard algorithm between two
 938   top-down parsers -- indeed, this strategy is used in the current
 939   version of GCC's C/C++ parser -- but this feature has not been offered
 940   (to the author's knowledge) in a top-down parsing tool.
 941
 942 Compared to ANTLR, Gazelle does not yet support syntactic or semantic predicates,
 943 but these (or capabilities like them) are planned for the future.
 944
 945
 946 High Level Overview
 947 ~~~~~~~~~~~~~~~~~~~
 948
 949 Gazelle's parsing algorithm boils down to nothing more than a stack of state
 950 machines.  Each byte of the input transtions these state machines one or more
 951 times, and the entire state of the parse can be represented by the stack of
 952 state machines that are currently active and which state each one is in.
 953
 954 These are the four different kinds of state machines.  Each is explained in
 955 detail in the following sections:
 956
 957 Rule Graph or Recursive Transition Network (RTN)::
 958 a DFA is built representing each grammar rule, and this DFA transitions
 959 once for each token of input.  A runtime stack maintains the list of
 960 recursed nonterminals.
 961
 962 Grammar Lookahead Automaton (GLA)::
 963 The LL(k) lookahead information for any nontrivial RTN state is encoded
 964 in a DFA known as a GLA.  The final states of GLAs indicate what RTN
 965 transition should be taken.  We choose this approach rather than using
 966 a table because lookup information is typically extremely sparse, not
 967 even remotely approaching its theoretical `O(2^k)` upper limit.
 968
 969 Character-level Integer DFA (IntFA)::
 970 Serving the role usually played by separate lexers, IntFAs transition
 971 on individual characters of the input stream and hit a final state
 972 when a full token is recognized.  The appropriate IntFA is chosen based
 973 on the current state of the current GLA or RTN  For example, if the RTN
 974 state indicates we are expecting a variable name, we can select an IntFA
 975 that will not recognize keywords, thereby accommodating languages that
 976 allow variables named after keywords.
 977
 978 Character decoding DFA::
 979 For any multi-byte encoding, the lowest-level DFA is one that examines
 980 the input byte-by-byte and assembles the bytes into characters.  (TODO:
 981 is this the right mechanism for handling things like C trigraphs also?)
 982
 983 In addition, operator-precedence parsing is supported by sandwiching the
 984 shunting yard algorithm between layers of the RTN stack.
 985
 986 Language rules that cannot be represented using pure context-free grammars
 987 (and all but the simplest languages have such complications) are accommodated
 988 with a variety of strategies that let the grammar implementor "hook in"
 989 to the above process with imperative code.
 990
 991
 992
 993 Rules and Rule Graphs
 994 ~~~~~~~~~~~~~~~~~~~~~
 995
 996 Gazelle's algorithm is based around making each rule in a grammar into a graph
 997 that resembles a http://en.wikipedia.org/wiki/Syntax_diagram[syntax diagram].
 998 Syntax diagrams are a common way to describe syntax in a way that is easy
 999 for humans to understand visually.  Here are some examples of such diagrams:
1000
1001 .JSON object from `json.org`
1002 ====
1003 image::json_object.gif[JSON Object]
1004 ====
1005
1006 .UPDATE statement from the SQLite documentation
1007 ====
1008 image::sqlite-update-stmt.gif[SQLite UPDATE]
1009 ====
1010
1011 Syntax diagrams like the ones above visually specify what syntax patterns are
1012 allowed by the language.  You read them by following the the lines from one
1013 grammar element to the next.  If there is a fork in the road you can take
1014 either path, since both options are valid according to the language.  Syntax
1015 diagrams are also known as "railroad diagrams" because they are like a network
1016 of railroad tracks that have switches at all the decision points.
1017
1018 Gazelle turns each rule into a similar kind of diagram, but Gazelle's diagrams
1019 are represented a little bit differently.  Here is Gazelle's rendering of the
1020 same JSON rule as the first example:
1021
1022 ["json.png", "Gazelle's graph of the JSON rule above"]
1023 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1024 object -> "{" (string ":" value) *(,) "}";
1025 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1026
1027 Gazelle calls these "rule graphs" or RTNs (for "Recursive Transition Network").
1028 Rule graphs are state machines, and are one of the several kinds of state
1029 machines Gazelle uses to track its parsing state as a whole.  You can view
1030 the rule graphs for any Gazelle grammar by compiling the grammar with `gzlc -d`
1031 and viewing the HTML output.
1032
1033 In rule graphs the grammar elements (like "string", "value," etc.) are on the
1034 edges instead of being the nodes.  Also, in rule graphs the lines (the graph
1035 edges) don't have decisions or "forks in the road" -- the decisions come from
1036 deciding what edge to take out of a state.  Finally, in a rule graph you reach
1037 the "end" of the rule by transitioning to a final state, which is drawn as a
1038 state with a double circle.  When the parser reaches this state, the rule it
1039 was parsing can be considered finished (but need not be -- final states can
1040 also have transitions out of them).
1041
1042 While the rule graph may look more chaotic than the nicely-laid-out railroad
1043 diagram above, Gazelle's rule graph has one important advantage.  In Gazelle's
1044 rule graph, the states represent real states that the parser can be in.  The
1045 states in a railroad diagram are not as easy to identify.  At any time you can
1046 inspect Gazelle's parsing state and see what rule graph state it is in.  This
1047 is also the basis of how Gazelle can save and restore its parsing state.
1048
1049 The edges of rule graphs can be both terminals and nonterminals.  Terminals
1050 are physical tokens of input like `{` and `}`, while nonterminals are references
1051 to other rules like 'string' and 'value' above.
1052
1053 When the parser takes a nonterminal transition, it temporarily stops processing
1054 the current rule and moves to a sub-rule corresponding to that nonterminal.  To
1055 remember where it was, the parser pushes the destination state of the
1056 nonterminal transition onto a runtime stack.  This whole process is easy to
1057 understand when viewed in the context of an example, which will be included
1058 in future versions of the manual.
1059
1060
1061 Grammar Lookahead Automaton
1062 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1063
1064 You may have noticed that we glossed over one important detail in the previous
1065 section.  How do we decide what transition in the rule graph to take?  Rule graphs
1066 can be nondeterministic: there can be more than one transition out of a state for
1067 the same terminal.  There can also be final states that have transitions out of
1068 them.  This means that while Gazelle is parsing, there can be more that one
1069 option for what transition to take.  How does Gazelle pick the right one?
1070
1071 There are two situations where knowing what transition to take is difficult.
1072 First, if any of the transitions that lead out of our RTN state are
1073 nonterminals (references to other rules), we need a way to know what strings in
1074 the input imply what transitions.  For example, consider the rule:
1075
1076 ["hardrule1.png", "graph 3: word -> numbers | letters;"]
1077 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1078 word -> numbers | letters;
1079 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1080
1081 If we are in the start state for `word` and we see a "5" in the input, what
1082 transition should we take?  You might guess that we should take `numbers`,
1083 but that's entirely dependent on the definition of `numbers` and `letters`.
1084 We need a way to "see into" those definitions, so we know what transition
1085 to take based on what we see in the input.
1086
1087 The second difficulty is similar, but slightly different.  What if we are
1088 in a final state that has transitions out of it?  Consider this rule:
1089
1090 ["hardrule2.png", "graph 3: sentence -> word*;"]
1091 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1092 sentence -> word*;
1093 gzl-rtn-graph~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1094
1095 If we are in the start state for `sentence`, should we return from the
1096 current rule since we are in a final state, or should we stay and match
1097 more words?  To answer this question we need not only the ability to "see
1098 into" `word`, but also the ability to "see up from" whatever set of
1099 rules we are currently recursed into.
1100
1101 This set of problems turns out to be somewhat difficult to solve -- in fact, it
1102 is exactly the problem that LL parsing aims to solve.  Part of what makes it
1103 tricky is that depending on how the grammar is written, we might not have
1104 enough information at this point in the text to possibly know what transition
1105 to take.  We always have to look at the next token of input to decide, but
1106 in some cases we might have to look ahead 2, 3, or even more tokens before we
1107 have enough information to make a decision.  It is even possible to write
1108 grammars where the token that tells us the difference is 'arbitrarily' far
1109 away.  These classes of grammars are known as `LL(k)` if we have to look
1110 `k` tokens ahead to make a decision, or `LL(*)` if that decision can be
1111 arbitrarily far away.
1112
1113 [NOTE]
1114 This description of `LL(k)` only tells part of the story.  LL algorithms
1115 are further divided into "full" LL, Strong-LL, and Simple-LL.  Nearly
1116 all practical "LL" algorithms are actually Strong-LL, and this is also
1117 the case for Gazelle.  Though Strong-LL parsers can only parse a subset
1118 of the grammars of a full LL parser, they are significantly simpler and
1119 more efficient, and in practice provide enough power.  For more
1120 information on the subject, I heartily recommend the book
1121 http://www.amazon.com/Parsing-Techniques-Practical-Monographs-Computer/dp/038720248X[Parsing
1122 Techniques: A Practical Guide] by Grune and Jacobs.
1123
1124 So how do we decide what transition to take in the rule graph?  The answer is
1125 that Gazelle analyzes the grammar at compile time, and figures out for each
1126 rule graph state what tokens we expect to see here, and what transition each
1127 series of tokens implies.  This information is encoded into a DFA known as a
1128 'GLA' -- a Grammar Lookahead Automaton.  If after analyzing the grammar Gazelle
1129 determines that it is possible to build a GLA for every rule graph state, it
1130 builds them and encodes it into the byte-code.  If it is 'not' possible to
1131 build the GLA (which implies that that the grammar is not `LL(k)`), Gazelle
1132 interrupts the compilation with an error.
1133
1134 When Gazelle is actually parsing, it stops at every rule graph state and tries
1135 to determine what transition to take next.  To make this decision, Gazelle
1136 finds the GLA for this rule graph state and runs it for the tokens of the input
1137 until the it hits a final state.  That final state will tell the rule graph
1138 what transition to take.
1139
1140 TODO: flesh out this section with more examples
1141
1142
1143
1144 Character-level Integer DFA
1145 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1146
1147 In the previous section, we assumed that we had a stream of tokens to feed
1148 into the GLA.  But where do these tokens come from?  The answer is that
1149 another layer of DFA's processes the characters of the input text and
1150 breaks them into tokens.  These DFAs are known as IntFAs, since the
1151 characters of the input text are represented as integers (this is not
1152 a particularly good name, and may change).  IntFAs in
1153 Gazelle fulfill the role traditionally performed by separate lexers.
1154
1155 Most programmers have some familiarity with regular expressions, because
1156 they are pervasive in modern programming.  With Gazelle you describe
1157 tokens using strings and regular expressions, and Gazelle compiles these
1158 into DFAs using well-known algorithms (unlike what most scripting
1159 languages do today -- they use an algorithm with much worse worst-case
1160 performance, as Russ Cox argues in his article
1161 http://swtch.com/~rsc/regexp/regexp1.html[Regular Expression Matching Can Be
1162 Simple And Fast]).
1163
1164 Lexing using Gazelle has one key advantage over independent lexers: the Gazelle
1165 lexing component knows what sorts of language constructs it is expecting to see
1166 at this point in the input.  A traditional lexer just breaks the input into
1167 words without having any idea what sort of construct it's in.  For example,
1168 suppose you are a traditional C lexer and you see the input:
1169
1170 ---------------------------------------------------------
1171 x = while;
1172 ---------------------------------------------------------
1173
1174 A traditional C lexer in this case will return the tokens:
1175
1176 * `IDENTIFIER` (x)
1177 * `EQUALS`
1178 * `WHILE`
1179 * `SEMICOLON`
1180
1181 It treats "while" as a keyword, even though it is in a context where
1182 "while" doesn't make sense.  Of course, this is required by the C standard,
1183 since keywords cannot be variable names, but suppose we were dealing with
1184 a language that did not have this restriction.  What do we do?  Our traditional
1185 lexer design is too weak to return the token `WHILE` in contexts where it makes
1186 sense and an `IDENTIFIER` named "while" in contexts where an identifier is
1187 expected.  Another strategy will be required, adding complexity to the parser.
1188
1189 Gazelle, on the other hand, can handle this situation just fine, because
1190 parsing and lexing are combined.  The key is that every GLA state knows what
1191 tokens are valid at this point in the input, so it can choose an IntFA that
1192 only recognizes the valid tokens.  In the above example, when a Gazelle
1193 parser/lexer reached `while` in the input, the current GLA state would select
1194 an IntFA that recognizes `IDENTIFIER` tokens (because they make sense here) but
1195 not a `WHILE` keyword token (because it does not make sense here).  Therefore
1196 languages that allow keywords as variables names are easily and naturally
1197 supported.
1198
1199 Of course, Gazelle also needs to support languages where keywords are not
1200 allowed to be variable names.  This can be easily achieved by making the
1201 regular expression for identifiers explicitly exclude all keywords.  This more
1202 explicitly describes the rules of the language anyway -- when a language
1203 specifications includes text like "variable names may not be any of these
1204 reserved words," you can express this directly to Gazelle by explicitly listing
1205 the patterns an identifier is not allowed to have.
1206
1207
1208 Character decoding DFA
1209 ~~~~~~~~~~~~~~~~~~~~~~
1210
1211 There is one final detail that is not handled by any of the previous layers:
1212 where do the characters come from that are fed into the IntFAs?  In the case of
1213 ASCII or other single-byte encodings, the answer is simple: the bytes 'are' the
1214 characters.  However, more and more programming languages are moving towards
1215 allowing their source to be in encodings other than ASCII, most commonly
1216 Unicode.  We need a character decoding step to handle the intracacies of
1217 multi-byte encodings.
1218
1219 Many applications handle Unicode encodings such as UTF-8 without representing
1220 their character decoding stage as a DFA, so you may wonder why Gazelle takes
1221 this route.  The benefit of using a DFA is that it provides a way to sanely
1222 process partial characters.  If we are parsing text that is coming off the
1223 network, and the final byte of the buffer ends mid-character, what should we
1224 do?  We could drop it and re-process it later, but that requires either Gazelle
1225 or the application to store it for later re-processing.  Worse, it throws
1226 away the work we have already done making sense of this byte or bytes.
1227 By using a DFA, we can advance the state of the collective machine for every
1228 single byte of input, whether or not it is a full character.  When we
1229 are fed more bytes, we pick up exactly where we left off.
1230
1231 It is also possible that introducing custom processing at this stage could
1232 provide an easy and efficient way to process very difficult language corner
1233 cases like C's trigraphs.
1234