Broke dependency on facets and trying to coexist with Rails.
[treetop.git] / doc / site / syntactic_recognition.html
blobb4c3611fe397052191283df07215690a9ea01c88
1 <html><head><link type="text/css" href="./screen.css" rel="stylesheet" />
2 <script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
3 </script>
4 <script type="text/javascript">
5 _uacct = "UA-3418876-1";
6 urchinTracker();
7 </script>
8 </head><body><div id="top"><div id="main_navigation"><ul><li>Documentation</li><li><a href="contribute.html">Contribute</a></li><li><a href="index.html">Home</a></li></ul></div></div><div id="middle"><div id="content"><div id="secondary_navigation"><ul><li>Syntax</li><li><a href="semantic_interpretation.html">Semantics</a></li><li><a href="using_in_ruby.html">Using In Ruby</a></li><li><a href="pitfalls_and_advanced_techniques.html">Advanced Techniques</a></li></ul></div><div id="documentation_content"><h1>Syntactic Recognition</h1>
10 <p>Treetop grammars are written in a custom language based on parsing expression grammars. Literature on the subject of <a href="http://en.wikipedia.org/wiki/Parsing_expression_grammar">parsing expression grammars</a> is useful in writing Treetop grammars.</p>
12 <h1>Grammar Structure</h1>
14 <p>Treetop grammars look like this:</p>
16 <pre><code>grammar GrammarName
17 rule rule_name
18 ...
19 end
21 rule rule_name
22 ...
23 end
25 ...
26 end
27 </code></pre>
29 <p>The main keywords are:</p>
31 <ul>
32 <li><p><code>grammar</code> : This introduces a new grammar. It is followed by a constant name to which the grammar will be bound when it is loaded.</p></li>
33 <li><p><code>rule</code> : This defines a parsing rule within the grammar. It is followed by a name by which this rule can be referenced within other rules. It is then followed by a parsing expression defining the rule.</p></li>
34 </ul>
36 <h1>Parsing Expressions</h1>
38 <p>Each rule associates a name with a <em>parsing expression</em>. Parsing expressions are a generalization of vanilla regular expressions. Their key feature is the ability to reference other expressions in the grammar by name.</p>
40 <h2>Terminal Symbols</h2>
42 <h3>Strings</h3>
44 <p>Strings are surrounded in double or single quotes and must be matched exactly.</p>
46 <ul>
47 <li><code>"foo"</code></li>
48 <li><code>'foo'</code></li>
49 </ul>
51 <h3>Character Classes</h3>
53 <p>Character classes are surrounded by brackets. Their semantics are identical to those used in Ruby's regular expressions.</p>
55 <ul>
56 <li><code>[a-zA-Z]</code></li>
57 <li><code>[0-9]</code></li>
58 </ul>
60 <h3>The Anything Symbol</h3>
62 <p>The anything symbol is represented by a dot (<code>.</code>) and matches any single character.</p>
64 <h2>Nonterminal Symbols</h2>
66 <p>Nonterminal symbols are unquoted references to other named rules. They are equivalent to an inline substitution of the named expression.</p>
68 <pre><code>rule foo
69 "the dog " bar
70 end
72 rule bar
73 "jumped"
74 end
75 </code></pre>
77 <p>The above grammar is equivalent to:</p>
79 <pre><code>rule foo
80 "the dog jumped"
81 end
82 </code></pre>
84 <h2>Ordered Choice</h2>
86 <p>Parsers attempt to match ordered choices in left-to-right order, and stop after the first successful match.</p>
88 <pre><code>"foobar" / "foo" / "bar"
89 </code></pre>
91 <p>Note that if <code>"foo"</code> in the above expression came first, <code>"foobar"</code> would never be matched.</p>
93 <h2>Sequences</h2>
95 <p>Sequences are a space-separated list of parsing expressions. They have higher precedence than choices, so choices must be parenthesized to be used as the elements of a sequence. </p>
97 <pre><code>"foo" "bar" ("baz" / "bop")
98 </code></pre>
100 <h2>Zero or More</h2>
102 <p>Parsers will greedily match an expression zero or more times if it is followed by the star (<code>*</code>) symbol.</p>
104 <ul>
105 <li><code>'foo'*</code> matches the empty string, <code>"foo"</code>, <code>"foofoo"</code>, etc.</li>
106 </ul>
108 <h2>One or More</h2>
110 <p>Parsers will greedily match an expression one or more times if it is followed by the star (<code>+</code>) symbol.</p>
112 <ul>
113 <li><code>'foo'+</code> does not match the empty string, but matches <code>"foo"</code>, <code>"foofoo"</code>, etc.</li>
114 </ul>
116 <h2>Optional Expressions</h2>
118 <p>An expression can be declared optional by following it with a question mark (<code>?</code>).</p>
120 <ul>
121 <li><code>'foo'?</code> matches <code>"foo"</code> or the empty string.</li>
122 </ul>
124 <h2>Lookahead Assertions</h2>
126 <p>Lookahead assertions can be used to give parsing expressions a limited degree of context-sensitivity. The parser will look ahead into the buffer and attempt to match an expression without consuming input.</p>
128 <h3>Positive Lookahead Assertion</h3>
130 <p>Preceding an expression with an ampersand <code>(&amp;)</code> indicates that it must match, but no input will be consumed in the process of determining whether this is true.</p>
132 <ul>
133 <li><code>"foo" &amp;"bar"</code> matches <code>"foobar"</code> but only consumes up to the end <code>"foo"</code>. It will not match <code>"foobaz"</code>.</li>
134 </ul>
136 <h3>Negative Lookahead Assertion</h3>
138 <p>Preceding an expression with a bang <code>(!)</code> indicates that the expression must not match, but no input will be consumed in the process of determining whether this is true.</p>
140 <ul>
141 <li><code>"foo" !"bar"</code> matches <code>"foobaz"</code> but only consumes up to the end <code>"foo"</code>. It will not match <code>"foobar"</code>.</li>
142 </ul></div></div></div><div id="bottom"></div></body></html>