internal/reports/grammar.tm

   1 <TeXmacs|1.0.3.8>
   2
   3 <style|tmdoc>
   4
   5 <\body>
   6   <\make-title>
   7     <title|<TeXmacs> grammar tools>
   8
   9     <author|Joris van der Hoeven>
  10
  11     <\address>
  12       Home sweet home.
  13     </address>
  14   </make-title>
  15
  16   <\abstract>
  17     The aim of the <TeXmacs> grammar tools is to provide general functions
  18     for parsing and pretty printing mathematical formulas or programs
  19     according to classical LALR grammars. We provide a language for
  20     describing
  21
  22     <\enumerate>
  23       <item>Lexical symbol groups.
  24
  25       <item>Typographical rules associated to lexical symbol groups.
  26
  27       <item>Rewriting rules.
  28
  29       <item>A module system for grammars.
  30     </enumerate>
  31   </abstract>
  32
  33   <section|General architecture>
  34
  35   Creating a suite of <TeXmacs> grammar-tools has several aims:
  36
  37   <\itemize>
  38     <item>Provide tools for maintaining a correspondance between several
  39     representations of mathematical expressions or computer programs
  40     (typically a semantic one like a Scheme-expression and a physical
  41     representation on the screen).
  42
  43     <item>Use this correspondance to enhance the rendering of editing
  44     behaviour for mathematical expressions (copy and paste of semantic
  45     subtrees, syntactic highlighting, <abbr|etc.>).
  46
  47     <item>Make it possible to locally enrich grammars with new rules
  48     (priorities, extra notations, extra typographic rules, <abbr|etc.>).
  49
  50     <item>Associating precise typographic rules to the grammar being used.
  51
  52     <item>Make it easier for extern software to communicate with <TeXmacs>.
  53   </itemize>
  54
  55   As to the grammars being used, one has basically two options:
  56
  57   <\itemize>
  58     <item>A fixed set of expression types (binary infix, prefix, <abbr|etc.>)
  59     and priorities.
  60
  61     <item>LALR grammars.
  62   </itemize>
  63
  64   We have opted for the second approach. It is classical that this approach
  65   covers the first approach, but in a verbose way. As we will see in section
  66   <reference|param-grammar>, the concept of ``parameterized grammars'' will
  67   allow us to cover the first approach in a more natural, less verbose way:
  68   new sets of expression types will correspond to parameterized grammar
  69   modules and inequalities between priorities to productions (thereby
  70   providing a more fine-grained control over priorities, which are not
  71   necessarily linearly ordered).
  72
  73   The grammars mainly provide us with rewriting algorithms, mainly parsers
  74   and pretty printers. Although the grammars work on lists on a given
  75   alphabet of terminal symbols, it is easy to cover trees as well, by
  76   serialize them using special symbols.
  77
  78   As to the editing of mathematical formulas, the rewriting tools provide us
  79   with the possibility to <em|choose> a preferred representation. When
  80   choosing a representation close to rendering (as currently in <TeXmacs>)
  81   editing is simple, but obtaining structured semantic information more
  82   cumbersome. When choosing a semantic representation (like Philippe
  83   Audebaud, and as planned by the <with|mode|math|\<Omega\>>-group; also used
  84   for <TeXmacs> style-file editing), the editing is a bit more cumbersome,
  85   but structured editing operations easier. I am not sure which choice is
  86   best, but I <em|do> believe that the incorporation of powerful rewriting
  87   tools into <TeXmacs> will make this issue less and less important.
  88
  89   <section|Lexical symbol groups>
  90
  91   The lexer produces lists of terminal symbols of user-definable types. New
  92   types can be defined (or further enriched) using instructions of the form
  93
  94   <\scheme-fragment>
  95     (terminal <em|symbol> <em|regexpr-1> ... <em|regexpr-n>)
  96   </scheme-fragment>
  97
  98   Here <verbatim|<em|symbol>> denotes the name of a terminal symbol and
  99   <verbatim|<em|regexpr-1>> until <verbatim|<em|regexpr-n>> regular
 100   expressions. For instance, one may define
 101
 102   <\scheme-fragment>
 103     (terminal plus "+")
 104
 105     (terminal times "*")
 106
 107     (terminal id (repeat (or (range "a" "z") (range "A" "Z") "_")))
 108
 109     (terminal open "(")
 110
 111     (terminal close ")")
 112   </scheme-fragment>
 113
 114   In extension grammars, these terminal symbols may be further enriched:
 115
 116   <\scheme-fragment>
 117     (terminal plus "\<less\>oplus\<gtr\>")
 118
 119     (terminal times "\<less\>otimes\<gtr\>")
 120   </scheme-fragment>
 121
 122   For instance, the input string <verbatim|hop+x> would be lexed as
 123   <verbatim|((hop id) (+ plus) (x id))>.\
 124
 125   In order to lex <TeXmacs>-trees, we enrich the alphabet with special
 126   symbols of the form <verbatim|\\<em|tag>>, <verbatim|\|<em|tag>> and
 127   <verbatim|/<em|tag>> for encoding trees. For instance, the tree
 128
 129   <\equation*>
 130     <tree|concat|sin x+|<tree|frac|b|c>>
 131   </equation*>
 132
 133   corresponding to <with|mode|math|sin x+<frac|b|c>> would be encoded as
 134   <verbatim|(\\concat 's' 'i' 'n' ' ' 'x' \|concat \\frac 'b' \|frac 'c'
 135   /frac /concat)>. It may be useful to add a facility like
 136
 137   <\scheme-fragment>
 138     (ignore \\concat \|concat /concat)
 139   </scheme-fragment>
 140
 141   to the lexer language for ignoring certain tokens or comments. Similarly,
 142   it may be useful to add a primitive <verbatim|accept> which specifies which
 143   tags are accepted (all other being ignored).
 144
 145   Scheme trees may encoded using special symbols <verbatim|[> and
 146   <verbatim|]> in a similar way. For instance the scheme expression
 147   <verbatim|(+ (sin x) (/ b c))> corresponding to the tree
 148
 149   <\equation*>
 150     <tree|+|<tree|sin|x>|<tree|/|b|c>>
 151   </equation*>
 152
 153   might be encoded by <verbatim|[ "+" [ "sin" "x" ] [ "b" "c" ] ]>.
 154
 155   <section|Typographical rules>
 156
 157   Typographical rules are associated to terminal symbols. For instance:
 158
 159   <\scheme-fragment>
 160     (penalty plus (invalid 30))
 161
 162     (spacing plus (default default))
 163   </scheme-fragment>
 164
 165   It might also be interesting to allow grammars to export additional
 166   <TeXmacs> macro definitions:
 167
 168   <\scheme-fragment>
 169     (macro my-notation (macro "x" "y" ...))
 170   </scheme-fragment>
 171
 172   <section|Rewriting rules>
 173
 174   Context-free grammars are both used for parsing purposes and pretty
 175   printing. A context-free rewriting rule is of the form
 176
 177   <\scheme-fragment>
 178     (rule <em|symbol> <em|source> <em|destination>)
 179   </scheme-fragment>
 180
 181   Here <verbatim|<em|symbol>> is a non-terminal symbol, <verbatim|<em|rule>>
 182   a list of terminal or non-terminal symbols and <verbatim|<em|production>> a
 183   list of terminal symbols or symbols or numbers. For instance, the following
 184   grammar can be used to parse simple arithmetic expressions:
 185
 186   <\scheme-fragment>
 187     (rule AddExpr (MultExpr plus AddExpr) ([ 2 1 3 ]))
 188
 189     (rule AddExpr (MultExpr) (1))
 190
 191     (rule MultExpr (RadExpr times MultExpr) ([ 2 1 3 ]))
 192
 193     (rule MultExpr (RadExpr) (1))
 194
 195     (rule RadExpr (id) (1))
 196
 197     (rule RadExpr (open AddExpr close) (2))
 198   </scheme-fragment>
 199
 200   Here <verbatim|plus>, <verbatim|times>, <verbatim|id>, <verbatim|open> and
 201   <verbatim|close> are terminal symbols. Typically a list like <verbatim|((a
 202   id) (+ <no-break>plus) (b id) (+ plus) (c id))> would produce <verbatim|([
 203   + a [ + b c ] ])>.
 204
 205   In a similar way, the following grammar may be used in order for pretty
 206   printing trees produced using the above grammar:
 207
 208   <\scheme-fragment>
 209     (rule AddExpr ([ MultExpr plus AddExpr ]) (2 1 3))
 210
 211     (rule AddExpr (MultExpr) (1))
 212
 213     (rule MultExpr ([ RadExpr times MultExpr ]) (2 1 3))
 214
 215     (rule MultExpr (RadExpr) (1))
 216
 217     (rule RadExpr (id) (1))
 218
 219     (rule RadExpr (AddExpr) (open 2 close))
 220   </scheme-fragment>
 221
 222   Notice that the last rule of this grammar introduces a circular ambiguity.
 223   The correct way of parsing such ambiguous grammars is to use the shortest
 224   derivation tree.
 225
 226   <\remark>
 227     The parsing tool should be should be able to partially cope with
 228     ambiguity as follows:
 229
 230     <\enumerate>
 231       <item>It should resolve circular ambiguities in the expected way.
 232
 233       <item>It should accept general ambiguous grammars (and produce
 234       ambiguous parsing tables). When parsing, we may use a small look-ahead,
 235       and chose an arbitrary production for highly ambiguous expression.
 236
 237       <item>When parsing an ambiguous expression, we should be able to return
 238       a list of positions where we made arbitrary choices for productions.
 239     </enumerate>
 240   </remark>
 241
 242   <section|Grammar modules>
 243
 244   New lexer rules, typographical rules and rewriting rules may be grouped
 245   together into grammar modules. For instance:
 246
 247   <\scheme-fragment>
 248     (grammar (Arith-terminals)
 249
 250     \ \ (terminal plus "+")
 251
 252     \ \ (terminal times "*")
 253
 254     \ \ (terminal id (repeat (or (range "a" "z") (range "A" "Z") "_")))
 255
 256     \ \ (terminal open "(")
 257
 258     \ \ (terminal close ")"))
 259   </scheme-fragment>
 260
 261   Grammars may be inherited:
 262
 263   <\scheme-fragment>
 264     (grammar (Arith-parser)
 265
 266     \ \ (import (Arith-terminals))
 267
 268     \ \ (rule AddExpr (MultExpr plus AddExpr) ([ 2 1 3 ]))
 269
 270     \ \ (rule AddExpr (MultExpr) (1))
 271
 272     \ \ (rule MultExpr (RadExpr times MultExpr) ([ 2 1 3 ]))
 273
 274     \ \ (rule MultExpr (RadExpr) (1))
 275
 276     \ \ (rule RadExpr (id) (1))
 277
 278     \ \ (rule RadExpr (open AddExpr close) (2)))
 279   </scheme-fragment>
 280
 281   <section|Paramerized grammars and meta-rules><label|param-grammar>
 282
 283   An interesting feature is that we allow grammar specifications to take
 284   parameters:
 285
 286   <\scheme-fragment>
 287     (grammar (Associative-infix which)
 288
 289     \ \ (rule (:: Expr which) ((:: Expr+ which) which (:: Expr which))
 290
 291     \ \ \ \ \ \ \ \ ([ 2 1 3]))
 292
 293     \ \ (rule (:: Expr which) (:: Expr+ which) (1))
 294
 295     \ \ (rule (:: Expr which) (RadExpr) (1))
 296
 297     \ \ (spacing which (default default)))
 298   </scheme-fragment>
 299
 300   The <verbatim|::> is used to indicate scoping. For instance <verbatim|(::
 301   Expr plus)> stands for <verbatim|Expr::plus>. Now consider the following
 302   grammar
 303
 304   <\scheme-fragment>
 305     (grammar (Arith-primitive-parser)
 306
 307     \ \ (import (Arith-terminals))
 308
 309     \ \ (import (Associative-infix plus))
 310
 311     \ \ (import (Associative-infix times))
 312
 313     \ \ (rule RadExpr (id) (1))
 314
 315     \ \ (rule RadExpr (open AddExpr close) (2)))
 316   </scheme-fragment>
 317
 318   Notice that we have not yet specified which operation among <verbatim|plus>
 319   and <verbatim|times> has priority. This has been done on purpose, for it
 320   allows us to deal with underspecification. The user may enrich the above
 321   language in a local context in order to specify additional priorities:
 322
 323   <\scheme-fragment>
 324     (grammar (Arith-parser)
 325
 326     \ \ (import (Arith-primitive-parser))
 327
 328     \ \ (rule Expr+::plus (Expr::times) (1)))
 329   </scheme-fragment>
 330
 331   Another use of scoping is the possibility to define different grammars (one
 332   for parsing, two for pretty printing, <abbr|etc.>) and to selectively use
 333   them. For instance, <verbatim|Arith-parser::Expr::plus> may be used as a
 334   non-terminal symbol even if <verbatim|Arith-parser> has not been imported.
 335 </body>