doc/site/syntactic_recognition.html

   1 <html><head><link type="text/css" href="./screen.css" rel="stylesheet" />
   2           <script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
   3           </script>
   4           <script type="text/javascript">
   5           _uacct = "UA-3418876-1";
   6           urchinTracker();
   7           </script>
   8         </head><body><div id="top"><div id="main_navigation"><ul><li>Documentation</li><li><a href="contribute.html">Contribute</a></li><li><a href="index.html">Home</a></li></ul></div></div><div id="middle"><div id="content"><div id="secondary_navigation"><ul><li>Syntax</li><li><a href="semantic_interpretation.html">Semantics</a></li><li><a href="using_in_ruby.html">Using In Ruby</a></li><li><a href="pitfalls_and_advanced_techniques.html">Advanced Techniques</a></li></ul></div><div id="documentation_content"><h1>Syntactic Recognition</h1>
   9
  10 <p>Treetop grammars are written in a custom language based on parsing expression grammars. Literature on the subject of <a href="http://en.wikipedia.org/wiki/Parsing_expression_grammar">parsing expression grammars</a> is useful in writing Treetop grammars.</p>
  11
  12 <h1>Grammar Structure</h1>
  13
  14 <p>Treetop grammars look like this:</p>
  15
  16 <pre><code>grammar GrammarName
  17   rule rule_name
  18     ...
  19   end
  20
  21   rule rule_name
  22     ...
  23   end
  24
  25   ...
  26 end
  27 </code></pre>
  28
  29 <p>The main keywords are:</p>
  30
  31 <ul>
  32 <li><p><code>grammar</code> : This introduces a new grammar. It is followed by a constant name to which the grammar will be bound when it is loaded.</p></li>
  33 <li><p><code>rule</code> : This defines a parsing rule within the grammar. It is followed by a name by which this rule can be referenced within other rules. It is then followed by a parsing expression defining the rule.</p></li>
  34 </ul>
  35
  36 <h1>Parsing Expressions</h1>
  37
  38 <p>Each rule associates a name with a <em>parsing expression</em>. Parsing expressions are a generalization of vanilla regular expressions. Their key feature is the ability to reference other expressions in the grammar by name.</p>
  39
  40 <h2>Terminal Symbols</h2>
  41
  42 <h3>Strings</h3>
  43
  44 <p>Strings are surrounded in double or single quotes and must be matched exactly.</p>
  45
  46 <ul>
  47 <li><code>"foo"</code></li>
  48 <li><code>'foo'</code></li>
  49 </ul>
  50
  51 <h3>Character Classes</h3>
  52
  53 <p>Character classes are surrounded by brackets. Their semantics are identical to those used in Ruby's regular expressions.</p>
  54
  55 <ul>
  56 <li><code>[a-zA-Z]</code></li>
  57 <li><code>[0-9]</code></li>
  58 </ul>
  59
  60 <h3>The Anything Symbol</h3>
  61
  62 <p>The anything symbol is represented by a dot (<code>.</code>) and matches any single character.</p>
  63
  64 <h2>Nonterminal Symbols</h2>
  65
  66 <p>Nonterminal symbols are unquoted references to other named rules. They are equivalent to an inline substitution of the named expression.</p>
  67
  68 <pre><code>rule foo
  69   "the dog " bar
  70 end
  71
  72 rule bar
  73   "jumped"
  74 end
  75 </code></pre>
  76
  77 <p>The above grammar is equivalent to:</p>
  78
  79 <pre><code>rule foo
  80   "the dog jumped"
  81 end
  82 </code></pre>
  83
  84 <h2>Ordered Choice</h2>
  85
  86 <p>Parsers attempt to match ordered choices in left-to-right order, and stop after the first successful match.</p>
  87
  88 <pre><code>"foobar" / "foo" / "bar"
  89 </code></pre>
  90
  91 <p>Note that if <code>"foo"</code> in the above expression came first, <code>"foobar"</code> would never be matched.</p>
  92
  93 <h2>Sequences</h2>
  94
  95 <p>Sequences are a space-separated list of parsing expressions. They have higher precedence than choices, so choices must be parenthesized to be used as the elements of a sequence. </p>
  96
  97 <pre><code>"foo" "bar" ("baz" / "bop")
  98 </code></pre>
  99
 100 <h2>Zero or More</h2>
 101
 102 <p>Parsers will greedily match an expression zero or more times if it is followed by the star (<code>*</code>) symbol.</p>
 103
 104 <ul>
 105 <li><code>'foo'*</code> matches the empty string, <code>"foo"</code>, <code>"foofoo"</code>, etc.</li>
 106 </ul>
 107
 108 <h2>One or More</h2>
 109
 110 <p>Parsers will greedily match an expression one or more times if it is followed by the star (<code>+</code>) symbol.</p>
 111
 112 <ul>
 113 <li><code>'foo'+</code> does not match the empty string, but matches <code>"foo"</code>, <code>"foofoo"</code>, etc.</li>
 114 </ul>
 115
 116 <h2>Optional Expressions</h2>
 117
 118 <p>An expression can be declared optional by following it with a question mark (<code>?</code>).</p>
 119
 120 <ul>
 121 <li><code>'foo'?</code> matches <code>"foo"</code> or the empty string.</li>
 122 </ul>
 123
 124 <h2>Lookahead Assertions</h2>
 125
 126 <p>Lookahead assertions can be used to give parsing expressions a limited degree of context-sensitivity. The parser will look ahead into the buffer and attempt to match an expression without consuming input.</p>
 127
 128 <h3>Positive Lookahead Assertion</h3>
 129
 130 <p>Preceding an expression with an ampersand <code>(&amp;)</code> indicates that it must match, but no input will be consumed in the process of determining whether this is true.</p>
 131
 132 <ul>
 133 <li><code>"foo" &amp;"bar"</code> matches <code>"foobar"</code> but only consumes up to the end <code>"foo"</code>. It will not match <code>"foobaz"</code>.</li>
 134 </ul>
 135
 136 <h3>Negative Lookahead Assertion</h3>
 137
 138 <p>Preceding an expression with a bang <code>(!)</code> indicates that the expression must not match, but no input will be consumed in the process of determining whether this is true.</p>
 139
 140 <ul>
 141 <li><code>"foo" !"bar"</code> matches <code>"foobaz"</code> but only consumes up to the end <code>"foo"</code>. It will not match <code>"foobar"</code>.</li>
 142 </ul></div></div></div><div id="bottom"></div></body></html>