7 <title|<TeXmacs> grammar tools>
9 <author|Joris van der Hoeven>
17 The aim of the <TeXmacs> grammar tools is to provide general functions
18 for parsing and pretty printing mathematical formulas or programs
19 according to classical LALR grammars. We provide a language for
23 <item>Lexical symbol groups.
25 <item>Typographical rules associated to lexical symbol groups.
27 <item>Rewriting rules.
29 <item>A module system for grammars.
33 <section|General architecture>
35 Creating a suite of <TeXmacs> grammar-tools has several aims:
38 <item>Provide tools for maintaining a correspondance between several
39 representations of mathematical expressions or computer programs
40 (typically a semantic one like a Scheme-expression and a physical
41 representation on the screen).
43 <item>Use this correspondance to enhance the rendering of editing
44 behaviour for mathematical expressions (copy and paste of semantic
45 subtrees, syntactic highlighting, <abbr|etc.>).
47 <item>Make it possible to locally enrich grammars with new rules
48 (priorities, extra notations, extra typographic rules, <abbr|etc.>).
50 <item>Associating precise typographic rules to the grammar being used.
52 <item>Make it easier for extern software to communicate with <TeXmacs>.
55 As to the grammars being used, one has basically two options:
58 <item>A fixed set of expression types (binary infix, prefix, <abbr|etc.>)
64 We have opted for the second approach. It is classical that this approach
65 covers the first approach, but in a verbose way. As we will see in section
66 <reference|param-grammar>, the concept of ``parameterized grammars'' will
67 allow us to cover the first approach in a more natural, less verbose way:
68 new sets of expression types will correspond to parameterized grammar
69 modules and inequalities between priorities to productions (thereby
70 providing a more fine-grained control over priorities, which are not
71 necessarily linearly ordered).
73 The grammars mainly provide us with rewriting algorithms, mainly parsers
74 and pretty printers. Although the grammars work on lists on a given
75 alphabet of terminal symbols, it is easy to cover trees as well, by
76 serialize them using special symbols.
78 As to the editing of mathematical formulas, the rewriting tools provide us
79 with the possibility to <em|choose> a preferred representation. When
80 choosing a representation close to rendering (as currently in <TeXmacs>)
81 editing is simple, but obtaining structured semantic information more
82 cumbersome. When choosing a semantic representation (like Philippe
83 Audebaud, and as planned by the <with|mode|math|\<Omega\>>-group; also used
84 for <TeXmacs> style-file editing), the editing is a bit more cumbersome,
85 but structured editing operations easier. I am not sure which choice is
86 best, but I <em|do> believe that the incorporation of powerful rewriting
87 tools into <TeXmacs> will make this issue less and less important.
89 <section|Lexical symbol groups>
91 The lexer produces lists of terminal symbols of user-definable types. New
92 types can be defined (or further enriched) using instructions of the form
95 (terminal <em|symbol> <em|regexpr-1> ... <em|regexpr-n>)
98 Here <verbatim|<em|symbol>> denotes the name of a terminal symbol and
99 <verbatim|<em|regexpr-1>> until <verbatim|<em|regexpr-n>> regular
100 expressions. For instance, one may define
107 (terminal id (repeat (or (range "a" "z") (range "A" "Z") "_")))
114 In extension grammars, these terminal symbols may be further enriched:
117 (terminal plus "\<less\>oplus\<gtr\>")
119 (terminal times "\<less\>otimes\<gtr\>")
122 For instance, the input string <verbatim|hop+x> would be lexed as
123 <verbatim|((hop id) (+ plus) (x id))>.\
125 In order to lex <TeXmacs>-trees, we enrich the alphabet with special
126 symbols of the form <verbatim|\\<em|tag>>, <verbatim|\|<em|tag>> and
127 <verbatim|/<em|tag>> for encoding trees. For instance, the tree
130 <tree|concat|sin x+|<tree|frac|b|c>>
133 corresponding to <with|mode|math|sin x+<frac|b|c>> would be encoded as
134 <verbatim|(\\concat 's' 'i' 'n' ' ' 'x' \|concat \\frac 'b' \|frac 'c'
135 /frac /concat)>. It may be useful to add a facility like
138 (ignore \\concat \|concat /concat)
141 to the lexer language for ignoring certain tokens or comments. Similarly,
142 it may be useful to add a primitive <verbatim|accept> which specifies which
143 tags are accepted (all other being ignored).
145 Scheme trees may encoded using special symbols <verbatim|[> and
146 <verbatim|]> in a similar way. For instance the scheme expression
147 <verbatim|(+ (sin x) (/ b c))> corresponding to the tree
150 <tree|+|<tree|sin|x>|<tree|/|b|c>>
153 might be encoded by <verbatim|[ "+" [ "sin" "x" ] [ "b" "c" ] ]>.
155 <section|Typographical rules>
157 Typographical rules are associated to terminal symbols. For instance:
160 (penalty plus (invalid 30))
162 (spacing plus (default default))
165 It might also be interesting to allow grammars to export additional
166 <TeXmacs> macro definitions:
169 (macro my-notation (macro "x" "y" ...))
172 <section|Rewriting rules>
174 Context-free grammars are both used for parsing purposes and pretty
175 printing. A context-free rewriting rule is of the form
178 (rule <em|symbol> <em|source> <em|destination>)
181 Here <verbatim|<em|symbol>> is a non-terminal symbol, <verbatim|<em|rule>>
182 a list of terminal or non-terminal symbols and <verbatim|<em|production>> a
183 list of terminal symbols or symbols or numbers. For instance, the following
184 grammar can be used to parse simple arithmetic expressions:
187 (rule AddExpr (MultExpr plus AddExpr) ([ 2 1 3 ]))
189 (rule AddExpr (MultExpr) (1))
191 (rule MultExpr (RadExpr times MultExpr) ([ 2 1 3 ]))
193 (rule MultExpr (RadExpr) (1))
195 (rule RadExpr (id) (1))
197 (rule RadExpr (open AddExpr close) (2))
200 Here <verbatim|plus>, <verbatim|times>, <verbatim|id>, <verbatim|open> and
201 <verbatim|close> are terminal symbols. Typically a list like <verbatim|((a
202 id) (+ <no-break>plus) (b id) (+ plus) (c id))> would produce <verbatim|([
205 In a similar way, the following grammar may be used in order for pretty
206 printing trees produced using the above grammar:
209 (rule AddExpr ([ MultExpr plus AddExpr ]) (2 1 3))
211 (rule AddExpr (MultExpr) (1))
213 (rule MultExpr ([ RadExpr times MultExpr ]) (2 1 3))
215 (rule MultExpr (RadExpr) (1))
217 (rule RadExpr (id) (1))
219 (rule RadExpr (AddExpr) (open 2 close))
222 Notice that the last rule of this grammar introduces a circular ambiguity.
223 The correct way of parsing such ambiguous grammars is to use the shortest
227 The parsing tool should be should be able to partially cope with
228 ambiguity as follows:
231 <item>It should resolve circular ambiguities in the expected way.
233 <item>It should accept general ambiguous grammars (and produce
234 ambiguous parsing tables). When parsing, we may use a small look-ahead,
235 and chose an arbitrary production for highly ambiguous expression.
237 <item>When parsing an ambiguous expression, we should be able to return
238 a list of positions where we made arbitrary choices for productions.
242 <section|Grammar modules>
244 New lexer rules, typographical rules and rewriting rules may be grouped
245 together into grammar modules. For instance:
248 (grammar (Arith-terminals)
250 \ \ (terminal plus "+")
252 \ \ (terminal times "*")
254 \ \ (terminal id (repeat (or (range "a" "z") (range "A" "Z") "_")))
256 \ \ (terminal open "(")
258 \ \ (terminal close ")"))
261 Grammars may be inherited:
264 (grammar (Arith-parser)
266 \ \ (import (Arith-terminals))
268 \ \ (rule AddExpr (MultExpr plus AddExpr) ([ 2 1 3 ]))
270 \ \ (rule AddExpr (MultExpr) (1))
272 \ \ (rule MultExpr (RadExpr times MultExpr) ([ 2 1 3 ]))
274 \ \ (rule MultExpr (RadExpr) (1))
276 \ \ (rule RadExpr (id) (1))
278 \ \ (rule RadExpr (open AddExpr close) (2)))
281 <section|Paramerized grammars and meta-rules><label|param-grammar>
283 An interesting feature is that we allow grammar specifications to take
287 (grammar (Associative-infix which)
289 \ \ (rule (:: Expr which) ((:: Expr+ which) which (:: Expr which))
291 \ \ \ \ \ \ \ \ ([ 2 1 3]))
293 \ \ (rule (:: Expr which) (:: Expr+ which) (1))
295 \ \ (rule (:: Expr which) (RadExpr) (1))
297 \ \ (spacing which (default default)))
300 The <verbatim|::> is used to indicate scoping. For instance <verbatim|(::
301 Expr plus)> stands for <verbatim|Expr::plus>. Now consider the following
305 (grammar (Arith-primitive-parser)
307 \ \ (import (Arith-terminals))
309 \ \ (import (Associative-infix plus))
311 \ \ (import (Associative-infix times))
313 \ \ (rule RadExpr (id) (1))
315 \ \ (rule RadExpr (open AddExpr close) (2)))
318 Notice that we have not yet specified which operation among <verbatim|plus>
319 and <verbatim|times> has priority. This has been done on purpose, for it
320 allows us to deal with underspecification. The user may enrich the above
321 language in a local context in order to specify additional priorities:
324 (grammar (Arith-parser)
326 \ \ (import (Arith-primitive-parser))
328 \ \ (rule Expr+::plus (Expr::times) (1)))
331 Another use of scoping is the possibility to define different grammars (one
332 for parsing, two for pretty printing, <abbr|etc.>) and to selectively use
333 them. For instance, <verbatim|Arith-parser::Expr::plus> may be used as a
334 non-terminal symbol even if <verbatim|Arith-parser> has not been imported.