added a try...catch...with extension
[metalua.git] / doc / manual / sample-match.tex
blob7b939931a9645a873b0a82fccd648dbcc1588b9d
1 \subsection{Structural pattern matching}
2 \subsubsection{Basic principles}
3 In many languages, including Lua, ``pattern matching'' generally
4 refers to the ability to:
5 \begin{itemize}
6 \item Analyse the general shape of a string;
7 \item capture specified parts of it into temporary variables
8 \item do some computation when the string matches a certain pattern,
9 generally using the temporary captured variables.
10 \end{itemize}
12 In languages derived from ML\footnote{for instance OCaml, SML, Haskell
13 or Erlang.}, pattern matching can be done on arbitrary data, not
14 only strings. One can write patterns which describe the shape of a
15 data structure, bind some interesting sub-parts of it to variables,
16 and execute some statements associated with a given pattern whenever
17 the data matches.
19 This sample aims to implement this capability into metalua. It will
20 discuss:
21 \begin{itemize}
22 \item How to describe data patterns (it turns out that we'll hijack
23 metalua's expression parser, which is a great time-saving trick when
24 it can be pulled).
25 \item How to compile them into equivalent code.
26 \item How to wrap all this up into a working compiled statement
27 \item What improvements can be done to the proposed design.
28 \end{itemize}
30 \subsubsection{Patterns}
31 A match statement consists of a tested term, and a series of (pattern,
32 block) pairs. At execution, the first pair whose pattern accurately
33 describes the tested term is selected, and its corresponding block is
34 evaluated. Other blocks are ignored.
36 We will keep the implementation of patterns as simple as possible. To
37 do that, we will put a first constraint: the AST of a pattern must be
38 a legal expression AST. This way, we save most of the syntax handling
39 trouble. More specifically:
41 \begin{itemize}
42 \item numbers are valid patterns, which only match identical numbers;
43 \item strings are valid patterns, which only match identical strings;
44 \item tables are valid patterns if all of their keys are integers or
45 strings, and all of their values are valid patterns. A table pattern
46 matches a tested term if:
47 \begin{itemize}
48 \item the tested term is a table;
49 \item every key that appears in the pattern also appears in the
50 tested term;
51 \item for every key, the value associated to this key in the tested
52 term is matched by the sub-pattern associated to this key in the
53 pattern;
54 \end{itemize}
55 \item variables are valid patterns, and match everything. Moreover,
56 the term matched by the variable captures it, i.e. in the
57 corresponding block, that variable is set to the
58 matched term.
59 \end{itemize}
61 Notice that since \verb|tag| is a regular field in metalua (albeit
62 with an optional dedicated syntax), it doesn't need a special case in
63 our pattern definition.
65 \paragraph{Example} Let's consider implementing an evaluator for
66 Metalua expressions. We will restrict ourselves to binary operators,
67 variables and numbers; we will also consider that we have a
68 \verb|values| table mapping variable names to their value, and
69 \verb|binopfuncs| which associate an operator name with the corresponding
70 function. The evaluator would be written:
72 \begin{verbatim}
73 function eval(t)
74 match t with
75 | `Op{ op, a, b } -> return binopfuncs[op](eval(a), eval(b))
76 | `Number{ n } -> return n
77 | `Variable{ v } -> return values[v]
78 end
79 end
80 \end{verbatim}
82 \subsubsection{Pattern compilation}
84 A pattern in a case will translate into:
85 \begin{itemize}
86 \item tests: testing that the type of the tested case is the same as
87 the pattern's type for numbers, strings and tables;
88 \item local declarations: a variable must be bound to the
89 tested term's value;
90 \item nested tests for every pattern key/value pair, when the pattern
91 is a table. Moreover, since there might be multiple tests to run
92 against the tested term's field, it should be put in a temporary
93 variable.
94 \end{itemize}
96 For instance, consider the following pattern:
98 \begin{verbatim}
99 { tag = "Foo", 10, { 20 }, { x = a }, b }
100 \end{verbatim}
102 It corresponds to the series of tests and assignments on the left:
104 \begin{minipage}{6cm}
105 \begin{verbatim}
106 let v1 = <tested_term>
107 type(v1) == "table"
108 let v2 = v1.tag
109 v2 ~= nil
110 type(v2) == "string"
111 v2 == "Foo"
112 let v2 = v1[1]
113 v2 ~= nil
114 type(v2) == "number"
115 v2 == 10
116 let v2 = v1[2]
117 v2 ~= nil
118 type(v2) == "table"
119 let v3 = v2[1]
120 v3 ~= nil
121 type(v3) == "number"
122 v3 == 20
123 let v2 = v1[3]
124 v2 ~= nil
125 type(v2) == "table"
126 let v3 = v2.x
127 v3 ~= nil
128 local a = v3
129 let v2 = v1[4]
130 v2 ~= nil
131 local b = v2
134 \end{verbatim}
135 \end{minipage}
136 \begin{minipage}{6cm}
137 \begin{verbatim}
138 let v1 = tested_term
139 if type(v1) == "table" then
140 let v2 = v1.tag
141 if v2 ~= nil then
142 if type(v2) == "string" then
143 if v2 == "Foo" then
144 let v2 = v1[1]
145 if v2 ~= nil then
146 if type(v2) == "number" then
147 if v2 == 10 then
148 let v2 = v1[2]
149 if v2 ~= nil then
150 if type(v2) == "table" then
151 let v3 = v2[1]
152 if v3 ~= nil then
153 if type(v3) == "number" then
154 if v3 == 20 then
155 let v2 = v1[3]
156 if v2 ~= nil then
157 if type(v2) == "table" then
158 let v3 = v2.x
159 if v3 ~= nil then
160 local a = v3
161 let v2 = v1[4]
162 if v2 ~= nil then
163 local b = v2
164 <inner_term>
165 end ... end -- (16 times)
166 \end{verbatim}
167 \end{minipage}
168 ~\\~\\
170 Notice that the relative order of tests and assignments is meaningful:
171 we cannot put all assignments on one side, and all tests on an
172 other, e.g. \verb|v2 = v1.tag| on line 3 doesn't make sense if
173 \verb|type(v1) == table| on line 2 fails.
175 We will compile patterns in several steps: first, accumulate all the
176 tests and assignments in order; then, we will collapse them in a
177 single big nesting of ``if'' statements. At first, we won't try
178 to optimize the form of the final term. The list above left would be
179 collapsed into the single compound statement above on the right.
181 \paragraph{Accumulating constraints and tests}
182 This is done by a \verb|parse_pattern()| function, which does just what
183 is described in the bullet list above:
185 \begin{verbatim}
186 -------------------------------------------------------------------
187 -- Turn a pattern into a list of conditions and assignations,
188 -- stored into [acc]. [n] is the depth of the subpattern into the
189 -- toplevel pattern; [tested_term] is the AST of the term to be
190 -- tested; [pattern] is the AST of a pattern, or a subtree of that
191 -- pattern when [n>0].
192 -------------------------------------------------------------------
193 local function parse_pattern (n, pattern)
194 local v = var(n)
195 if "Number" == pattern.tag or "String" == pattern.tag then
196 accumulate (+{ -{v} == -{pattern} })
197 elseif "Id" == pattern.tag then
198 accumulate (+{stat: local -{pattern} = -{v} })
199 elseif "Table" == pattern.tag then
200 accumulate (+{ type( -{v} ) == "table" } )
201 local idx = 1
202 for _, x in ipairs (pattern) do
203 local w = var(n+1)
204 local key, sub_pattern
205 if x.tag=="Key"
206 then key = x[1]; sub_pattern = x[2]
207 else key = `Number{ idx }; sub_pattern = x; idx=idx+1 end
208 accumulate (+{stat: (-{w}) = -{v} [-{key}] })
209 accumulate (+{ -{w} ~= nil })
210 parse_pattern (n+1, sub_pattern)
212 else error "Invalid pattern type" end
214 -------------------------------------------------------------------
215 \end{verbatim}
217 This function relies on the following helper functions:
218 \begin{itemize}
219 \item {\tt var($n$)} generates the variable name ``\verb|v|$n$'', which
220 is used to store the tested term at depth level $n$. Indeed,
221 sub-patterns in table fields are matched against sub-terms of the
222 tested term. It also remembers of the biggest $n$ it ever received,
223 and stores it into \verb|max_n| (this will be used to know which
224 local vars have to be generated, see below);
225 \item {\tt accumulate()} just stores an additional code snippet in a
226 dedicated list;
227 \end{itemize}
229 In the quotes, you might notice the parentheses around the variable in
230 ``\verb|(-{w}) = -{v} [-{key}]|'': they're here to let the compiler
231 know that what's in \verb|-{...}| is an expression, not a
232 statement. Without them, it would expect {\tt w} to be the AST of a
233 statement (since we're in a statement context at this point), then
234 choke on the unexpected ``='' symbol. This is a common idiom, which
235 you should think about everytime you generate an assignment to an
236 anti-quoted identifier.
238 \paragraph{Collapsing the accumulated quotes}
239 As written above, our collapsing function will be kept as simple
240 as possible, and will not try to minimize the amount of generated
241 code. It takes as parameters {\tt n}, the index of the quote currently
242 collapsed in the accumulator, and {\tt inner\_term} the statement
243 block to put inside the innermost part of the test. It calls itself
244 recursively, so that the collapsed term is built inside out (generally
245 speaking, working with trees, including AST, involves a lot of
246 recursive functions). \verb|acc| is the list filled by the
247 \verb|accumulate()| function:
249 \begin{verbatim}
250 -------------------------------------------------------------------
251 -- Turn a list of tests and assignations into [acc] into a
252 -- single term of nested conditionals and assignments.
253 -- [inner_term] is the AST of a term to be put into the innermost
254 -- conditionnal, after all assignments. [n] is the index in [acc]
255 -- of the term currently parsed.
257 -- This is a recursive function, which builds the inner part of
258 -- the statement first, then surrounds it with nested
259 -- [if ... then ... end], [local ... = ...] and [let ... = ...]
260 -- statements.
261 -------------------------------------------------------------------
262 local function collapse (n, inner_term)
263 assert (not inner_term.tag, "collapse inner term must be a block")
264 if n > #acc then return inner_term end
265 local it = acc[n]
266 local inside = collapse (n+1, inner_term)
267 assert (not inside.tag, "collapse must produce a block")
268 if "Op" == it.tag then
269 -- [it] is a test, put it in an [if ... then .. end] statement
270 return +{block: if -{it} then -{inside} end }
271 else
272 -- [it] is a statement, just add it at the result's beginning.
273 assert ("Let" == it.tag or "Local" == it.tag)
274 return { it, unpack (inside) }
277 \end{verbatim}
279 To fully understand this function, one must remember that test
280 operations are translated into {\tt`Op\{ <opname>, <arg1>, <arg2>\}}
281 AST. That's why we didn't have to put an explicit reminder in the
282 accumulator, remembering whether a quote was a test or a statement:
283 we'll just check for \verb|`Op|'s presence.
285 \subsubsection{Match statement compilation}
286 We already know how to translate a single pattern into a Lua test
287 statement. This statement will execute a given block, with all pattern
288 variables bound appropriately, if and only if the tested term matches
289 the pattern.
291 To build a complete match statement, we need to test all patterns in
292 sequence. When one of them succeeds, we must skip all the following
293 cases. We could do that with some complex ``\verb|else|'' parts in the
294 tests, but there is a simpler way to jump out of some nested blocks:
295 the ``\verb|break|'' statement. Therefore, we will enclose the cases
296 into a loop that executes exactly once, so that a ``\verb|break|''
297 will jump just after the last case. Then, we will add a
298 ``\verb|break|'' at the end of every statement that doesn't already
299 finish with a ``\verb|return|'', so that other cases are skipped upon
300 successful matching. To sum up, we will translate this:
302 \begin{verbatim}
303 match <foo> with
304 | <pattern_1> -> <x1>
305 | <pattern_2> -> <x2>
307 | <pattern_n> -> <xn>
309 \end{verbatim}
311 \noindent into this:
313 \begin{verbatim}
314 repeat
315 local v1, v2, ... vx -- variables used to store subpatterns
316 let v1 = <tested_term>
317 if <compilation of pattern_1> ... then x1; break end
318 if <compilation of pattern_2> ... then x2; break end
320 if <compilation of pattern_n> ... then xn; break end
321 until true
322 \end{verbatim}
324 First, we add final \verb|break| statements where required, and we
325 compile all (pattern, block) pairs:
327 \begin{verbatim}
328 -------------------------------------------------------------------
329 -- parse all [pattern ==> block] pairs. Result goes in [body].
330 -------------------------------------------------------------------
331 local body = { }
332 for _, case in ipairs (cases) do
333 acc = { } -- reset the accumulator
334 parse_pattern (1, case[1], var(1)) -- fill [acc] with conds and lets
335 local last_stat = case[2][#case[2]]
336 if last_stat and last_stat.tag ~= "Break" and
337 last_stat.tag ~= "Return" then
338 table.insert (case[2], `Break) -- to skip other cases on success
340 local compiled_case = collapse (1, case[2])
341 for _, x in ipairs (compiled_case) do table.insert (body, x) end
343 \end{verbatim}
345 \noindent Then, we can just splice it into the appropriate quote:
347 \begin{verbatim}
348 -------------------------------------------------------------------
349 local local_vars = { }
350 for i = 1, max_n do table.insert (local_vars, var(i)) end
352 -------------------------------------------------------------------
353 -- cases are put inside a [repeat until true], so that the [break]
354 -- inserted after the value will jump after the last case on success.
355 -------------------------------------------------------------------
356 local result = +{ stat:
357 repeat
358 -{ `Local{ local_vars, { } } }
359 (-{var(1)}) = -{tested_term}
360 -{ body }
361 until true }
362 return result
363 -------------------------------------------------------------------
364 \end{verbatim}
366 There is one point to notice in this quote: \verb|body| is used where
367 a statement is expected, although it contains a {\em list} if
368 statements rather than a single statement. Metalua is designed to
369 accept this, i.e. if {\tt a, b, c, d} are statements, AST {\tt
370 `Do\{ a, b, c, d \}} and {\tt`Do\{ a, \{ b, c \}, d\} } are
371 equivalent. This feature partially replaces Lisp's \verb|@(...)|
372 operator.
374 \subsubsection{Syntax extension}
375 To use this, we provide a syntax inspired by OCaml\footnote{It is
376 actually the same syntax as OCaml's, except that we introduced an
377 explicit {\tt end} terminator, to stay homogeneous with Lua.}:
379 \begin{verbatim}
380 match <foo> with
381 <pattern> -> block
382 | <pattern> -> block
384 | <pattern> -> block
386 \end{verbatim}
388 For this, we need to declare new keywords \verb|match|,
389 \verb|with| and \verb|->|. Then, we build the (pattern, block) parser
390 with \verb|gg.sequence{ }|, and read a list of them, separated with
391 ``\verb+|+'', thanks to \verb|gg.list{ }|. Moreover, we accept an
392 optional ``\verb+|+'' before the first case, so that all cases line up
393 nicely:
395 \begin{verbatim}
396 ----------------------------------------------------------------------
397 mlp.lexer:add{ "match", "with", "->" }
399 mlp.stat:add{ "match", mlp.expr, "with", gg.optkeyword "|",
400 gg.list{ gg.sequence{ mlp.expr, "->", mlp.block },
401 separators = "|",
402 terminators = "end" },
403 "end",
404 builder = |x| match_parser (x[1], x[3]) }
405 ----------------------------------------------------------------------
406 \end{verbatim}
408 \noindent Now, if you try this\ldots\ it won't work! Indeed, Metalua
409 needs to know what keywords might terminate a block of statements. In
410 this case, it doesn't know that ``\verb+|+'' can terminate a block. We
411 need therefore to add the following statement:
413 \begin{verbatim}
414 mlp.block.terminators:add "|"
415 \end{verbatim}
417 \noindent Finally that's it, we have implemented a working pattern
418 matching system in 75 lines of code!
420 \subsubsection{Possible improvements}
421 Here are a couple of suggestions to further improve the pattern
422 matching system presented above. Some of these proposals can be
423 implemented very quickly, some others more complex; all of them
424 present some practical interest.
426 The code of the basic version, as presented here, is available at
427 \url{http://metalua.luaforge.net/FIXME}.
429 \paragraph{Booleans} Boolean constants aren't handled in the system
430 above, neither as table keys nor as patterns. They're easy to add, and
431 doing it will help you get gently into the code.
433 \paragraph{Gotos considered beneficial} Gotos might be harmful in
434 hand-written programs, but they're a bliss for machine-generated
435 code. They would slightly simplify the code of pattern matching as
436 presented above; but for many extension proposals listed below, they
437 will make reasonnably easy some things which would otherwise be
438 awfully contrived. Exercice: simplify the implementation above as much
439 as possible by using gotos.
441 Labels and gotos in metalua ASTs are represented as {\tt`Label\{ id
442 \}} and {\tt`Goto\{ id \}} respectively, with {\tt id} an
443 identifier, typically generated by {\tt mlp.gensym()}. It is always
444 safe to jump out of a block; jumping into a block is not guaranteed
445 against weird interactions with local variables and upvalues.
447 \paragraph{{\tt collapse()} optimization} Instead of nesting if
448 statements systematically, two nested {\tt if}s without {\tt else}
449 branches can be simplified in a single branch with an {\tt and}
450 operator. Not sure it would change the bytecode's efficiency, but
451 that's a good exercice of AST manipulation.
453 \paragraph{Superfluous assignments} When parsing a table entry, we
454 assign it to a variable, then recursively call {\tt parse\_pattern()}
455 on it; in the frequent case where the entry was simply a variable, it
456 re-assigns it again. This could be optimized away.
458 \paragraph{Checking table arity} In many cases, it is practical to
459 check the number of elements in the array-part of the table. Here is a
460 simple modification proposal: by default, a table pattern matches a
461 table term only if they have the same number of array-part
462 elements. However, if the last element of the pattern is {\tt`Dots}
463 (a.k.a. {\tt+\{...\}}), then the term simply has to have
464 {\it at least} as many array-part elements as the pattern.
466 \paragraph{Adding guards}
467 It is sometimes desirable to add arbitrary conditions for a pattern to
468 match, conditions which might no be expressed by a pattern. OCaml
469 allows to add them with a ``\verb|when|'' keyword:
470 \begin{verbatim}
471 match n with
472 | 0 -> print "zero"
473 | n when n%2 == 0 -> print "even number"
474 | _ -> print "odd number"
476 \end{verbatim}
477 I'd advise you to prefer {\tt if} as a dedicated keyword, rather than
478 {\tt when}: it's unambiguous in this context, and saves a keyword
479 reservation.
481 \paragraph{More bindings}
482 The way pattern matching is currently implemented, one can either bind
483 a subterm to a variable, or check its structure against a sub-pattern,
484 not both simultaneously. OCaml provides an ``\verb|as|'' operator,
485 which allows to do both (Haskell calls it ``\verb|@|''). For instance,
486 in the following example, any ADT whose tag is \verb|"RepeatMe"| will
487 be replaced by two occurrences of itself, while others will remain
488 unchanged:
489 \begin{verbatim}
490 match something with
491 | `RepeatMe{ ... } as r -> { r, r }
492 | x -> x
494 \end{verbatim}
495 ``\verb|as|'' will have to be declared as an infix operator, whose
496 meaning will remain undefined in expressions which are not patterns.
498 As an alternative, you can reuse an existing infix operator, thus
499 avoiding to mess the expression parser. For instance, use {\tt *}
500 instead of {\tt as}. You can go further, and implement {\tt +} as an
501 ``or'' operator ({\tt pattern1 + pattern2} would match if either
502 of the patterns matches), although this would significantly complicate
503 the implementation of {\tt parse\_pattern()}.
505 The {\tt+} operator might prove tricky to implement, if you don't
506 convert your code generator to gotos and labels first.
508 \paragraph{Linear bindings}
509 We should check, when compiling a pattern, that it is left-linear,
510 i.e. that variables don't appear more than once in the pattern. People
511 might be tempted to write things like this to check whether a tree is
512 symmetric:
513 \begin{verbatim}
514 match t with
515 | `Tree{ x, x } -> print "Symmetric!"
516 | `Tree{ x, y } -> print "Not symmetric"
517 | `Leaf{ _ } -> print "Symmetric!"
519 \end{verbatim}
520 However, this would work in Prolog but not with our pattern matching,
521 as two occurences of the same variable in the pattern don't cause an
522 equality test to be added. We should detect such non-linear variables,
523 and implement a suitable reaction:
524 \begin{itemize}
525 \item throw an error, or at least a warning;
526 \item or add an equality test between the terms matched by the
527 non-linear variable;
528 \item or offer a syntax extension that lets the user provide his own
529 equality predicate.
530 \end{itemize}
532 Notice that the latter choice would drive us towards a Prolog
533 unification algorithm, which opens interesting opportunities.
535 You might offer an exception for variable ``{\tt\_}'', which is often
536 intended as a dummy, unused variable. Non-linear occurences of it
537 should probably be silently accepted, without even performing the
538 corresponding binding.
540 \paragraph{Generalized assignments}
541 Yet another OCaml-inspired feature: assignments such as
542 ``\verb|foo = bar|'', is almost a special
543 case of pattern matching with only one case: the left-hand side is
544 the pattern, and the right-hand side is the ``raw'' ``\verb|foo=bar|''
545 assignment. Seen this way, it allows to write things such as
546 ``\verb|`If{ cond, block } = some_ast }|'' to assign \verb|cond| and
547 \verb|block| to the subparts of \verb|some_ast| (if we know that
548 \verb|some_ast| is the AST of an \verb|if| statement).
550 If you go this way, however, make sure that the code generated for
551 simple {\tt let}s is as efficient as before! Moreover, there is an (easy)
552 scoping issue: the variables assigned belong to the scope of the
553 surrounding block.
555 \paragraph{Pattern matchings as expressions}
556 Pattern matching are currently statements, and take statements as
557 right-hand sides of cases. We could allow pattern matchings where
558 expressions are expected: these would take expressions instead of
559 statements as right-hand sides. Two ways to implement this: the dirty
560 one (hack with functions to change match statements into expressions),
561 and the clean one (refactoring existing code, so that it is agnostic
562 about its right-hand side type, and provide two specialized
563 versions of it for statements and expressions).
565 \paragraph{Bootstrap it}
566 That's something language designers love to do, for largely mystic
567 reasons: writing a language's compiler in the language itself. Here,
568 the idea is to re-implement the pattern matching extension by using
569 pattern matching, and compile it with the older version. Comparing the
570 firsrt and second versions of the code will give you an idea of how
571 much code clarification is brought to you by the pattern matching
572 extension.
574 \paragraph{Pattern conjunction} Another feature to take from OCaml is
575 multiple patterns for a single block. Instead of associating one
576 block with one pattern, cases associate a block with a (non-empty)
577 list of patterns. All of these patterns have to bond the same
578 variables, except for {\tt\_}. The first pattern in the list to match
579 the tested term does the binding. Patterns are separated by
580 ``\verb+|+''. Example:
581 \begin{verbatim}
582 match x with
583 | 1 | 2 | 3 -> print(x)
584 | n -> print "more than 3"
586 \end{verbatim}
587 (Hint: put the block in a local function. $2^{\mathrm{nd}}$ hint: sort
588 bound variables, e.g. by lexicographic order. Or much simpler and
589 more effective: convert your code generator to gotos+labels first).
591 \paragraph{XML munching} Ever tried to transform some XML document through
592 XSLT? Did you feel that this was even more kludgy than XML itself? Here
593 is a challenging proposal:
594 \begin{itemize}
595 \item Realize, if you didn't already, that Metalua's ADT are
596 isomorphic to XML, if you identify string-keys in tables with
597 attributes, and limit there content to strings and number. For
598 instance, ``{\tt <foo bar=3><baz/>eek</foo>}'' easily maps to ``{\tt
599 `foo\{ bar=3, `baz, "eek" \}}'';
600 \item compare what ML-style pattern matching does with what XSLT
601 does (and with what you'd like it to do);
602 \item design, implement, publish. You might want to google
603 ``CDuce''\footnote{\url{http://www.cduce.org}} for neat ideas.
604 \end{itemize}
606 If you do this, I'd be really interested to put back your contribution
607 in the next version of Metalua!
609 \subsubsection{Correction}
610 Most of the improvements proposed here are actually implemented in the
611 {\tt match} library provided with metalua. Check its (commented)
612 sources!