doc/manual/walk-ref.tex

   1 \section{{\tt walk}, the code walker}
   2
   3 When you write advanced macros, or when you're analyzing some code to check for
   4 some property, you often need to design a function that walks through an
   5 arbitrary piece of code, and does complex stuff on it. Such a function is called
   6 a code walker. Code walkers can be used for some punctual adjustments, e.g.
   7 changing a function's {\tt return} statements into something else, or checking
   8 that a loop performs no {\tt break}, up to pretty advanced transformations, such
   9 as CPS transformation (a way to encode full continuations into a language taht
  10 doesn't support them natively; see Paul Graham's On Lisp for an accessible
  11 description of how it works), lazy semantics...
  12
  13 Anyway, code walkers are tricky to write, can involve a lot of boilerplate code,
  14 and are generally brittle. To ease things as much as possible, Metalua comes
  15 with a walk library, which intends to accelerate code walker implementation.
  16 Since code walking is intrinsically tricky, the lib won't magically make it
  17 trivial, but at least it will save you a lot of time and code, when compared to
  18 writing all walkers from scratch. Moreover, other people who took the time to
  19 learn the walker generator's API will enter into your code much faster.
  20
  21 \subsection{Principles}
  22
  23 Code walking is about traversing a tree, first from root to leaves, then from
  24 leaves back to the root. This tree is not uniform: some nodes are expressions,
  25 some others statements, some others blocks; and each of these node kinds is
  26 subdivided in several sub-cases (addition, numeric for loop...). The basic code
  27 walker just goes from root to leaves and back to root without doing anything.
  28 Then it's up to you to plug some action callbacks in that walker, so that it
  29 does interesting things for you.
  30
  31 Without entering into the details of AST structure, here is a simplified version
  32 of the walking algorithm, pretending to work on a generic tree:
  33
  34 \begin{verbatim}
  35 function traverse(node)
  36    local down_result = down(node)
  37    if down_result ~= 'break' then
  38       for c in children(node) do
  39          traverse(node)
  40       end
  41    end
  42    up(node)
  43 end
  44 \end{verbatim}
  45
  46 The algorithm behind 'walk' is almost as simple as the one above, except that
  47 it's specialized for AST trees. You can essentially specialize it by providing
  48 your own up() and down() functions. These visitor functions perform whatever
  49 action you want on the AST; moreover, down() has the option to return 'break':
  50 in that case, the sub-nodes are not traversed. It allows the user to shun parts
  51 of the tree, or to provide his own special traversal method in the down()
  52 visitor.
  53
  54 The real walker generator is only marginally more complex than that:
  55 \begin{itemize}
  56 \item It lets you define your up() and down(), and down() can return 'break' to
  57   cut the tree traversal; however, these up() and down() functions are
  58   specialized by node kind: there can be separate callbacks for expr nodes, stat
  59   nodes, block nodes.
  60 \item There's also a binder() visitor for identifier binders. Binders are
  61   variables which declare a new local variable; you'll find them in nodes
  62   `Local, `Localrec, `Forin, `Fornum, `Function. The binders are visited just
  63   before the variable's scope begins, i.e. before traversing a loop or a
  64   function's body, after traversing a `Local's right-hand side, before
  65   traversing a `Localrec's right-hand side. \\
  66   Notice that separate up() and down() visitors wouldn't make sense for
  67   binders, since they're leave nodes.
  68 \item Visitor functions don't only take the visited node as parameter: they also
  69   take the list of all expr, stat and block nodes above it, up to the AST's
  70   root. This allows a child node to act on its parent nodes.
  71 \end{itemize}
  72
  73 \subsection{API}
  74 There are 3 main tree walkers: {\tt walk.expr()}, {\tt walk.stat()} and {\tt
  75   walk.block()}, to walk through the corresponding kinds of ASTs. Each of these
  76 walker take as parameters a table {\tt cfg} containing the various visitor
  77 functions, and the AST to walk throuhg. the configuration table {\tt cfg} can
  78 contain fields:
  79 \begin{itemize}
  80 \item {\tt cfg.stat.down(node, parent, grandparent...)}, which applies when
  81   traversing a statement down, i.e. before its children nodes are parsed, and
  82   can modify the tree, and return {\tt nil} or {\tt'break'}. The way children
  83   are traversed is decided {\em after} the {\tt down()} visitor has been run:
  84   this point matters when the visitor modifies its children nodes.
  85 \item {\tt cfg.stat.up(node, parent, grandparent...)}, which is applies on the
  86   way back up. It is applied even if {\tt cfg.stat.down()} returned
  87   {\tt'break'}, but in that case, the children have not been (and will not be)
  88   traversed.
  89 \item {\tt cfg.expr.down()} and {\tt cfg.expr.up()}, which work just as their
  90   {\tt stat} equivalent, but apply to expression nodes.\\
  91   Notice that in Lua, function calls and method invocations can be used as
  92   statements as well as as espressions: in such cases, they are visited only by
  93   the statement visitor, not by the expression visitor.
  94 \item {\tt cfg.block.down()} and {\tt cfg.block.up()} do the same for statements
  95   blocks: loops, conditional and function bodies.
  96 \item {\tt cfg.binder(identifier, id\_parent, id\_grandparent...)}: this
  97   is run on identifiers which create a new local variable, jsut before that
  98   variable's scope begins.
  99 \end{itemize}
 100
 101 Moreover, there is a {\tt walk.guess(cfg, ast)} walker which tries to guess the
 102 type of the AST it receives, and applies the appropriate walker. When an AST can
 103 be either an expression or a statement (nodes {\tt`Call} and {\tt`Invoke}), it
 104 is interpreted as an expression.
 105
 106 \subsection{Examples}
 107
 108 A bit of practice now. Let's build the simplest walker possible, that does
 109 nothing:
 110
 111 \begin{verbatim}
 112 cfg = { }
 113 walker = |ast| walk.block(cfg, ast)
 114 \end{verbatim}
 115
 116 Now, let's say we want to catch and remove all statement calls to function
 117 assert(). This can be done by removing its tag and content: an empty list is
 118 simply ignored in an AST. So we're only interested by `Call nodes, and within
 119 these nodes, we want the function to be `Id 'assert'. All of this is only
 120 relevant to stat nodes:
 121
 122 \begin{verbatim}
 123 function cfg.stat.down (x)
 124    match x with
 125    | `Call{ `Id 'assert', ... } -> x.tag=nil; x <- { }
 126    | _ -> -- not interested by this node, do nothing
 127    end
 128 end
 129 \end{verbatim}
 130
 131 You'll almost always want to use the 'match' extension to implement visitors.
 132 The imperative table overrider ({\tt x <- y} a.k.a. {\tt table.override(x, y)}
 133 also often comes handy to modify an AST.
 134
 135 We'll now remove {\tt assert()} calls in non-statement; we cannot replace an
 136 expression by nothing, so we'll replace these nodes by these will simply be
 137 replaced by {\tt nil}:
 138
 139 \begin{verbatim}
 140 function cfg.expr.down (x)
 141    match x with
 142    | `Call{ `Id 'assert', ... } -> x <- `Nil
 143    | _ -> -- not interested by this node, do nothing
 144    end
 145 end
 146 \end{verbatim}
 147
 148
 149 Here's a remark for functional programmers: this API is very imperative; you
 150 might cringe at seeing the `Call nodes transformed in-place. Well, I tend to
 151 agree but it's generally counter-productive to work against the grain of the
 152 wood: Lua is imperative at heart, and design attempts at doing this functionally
 153 sucked more than approaches that embraced imperativeness.
 154
 155 \paragraph{Cuts}
 156 By making down() return 'break', you can prevent the traversal to go further
 157 down. This might be either because you're not interested by the subtrees, or
 158 because you want to traverse them in a special way. In that later case, just do
 159 the traversal by yourself in the down() function, and cut the walking by
 160 returning 'break', so that nodes aren't re-traversed by the default walking
 161 algorithm. We'll see that in the next, more complex example, listing of free
 162 variables.
 163
 164 This example is exclusively there for demonstration purposes. For actual work on
 165 identifiers that require awareness of an identifier's binder of freedom, there
 166 is a dedicated {\tt walk.id} library.
 167
 168 We'll progressively build a walker that gathers all global variables used in a
 169 given AST. This involves keeping, at all times, a set of the identifiers
 170 currently bound by a "local" declaration, by function parameters, as for loop
 171 variables etc. Then, every time an identifier is found in the AST, its presence
 172 is checked in the current set of bound variables. If it isn't in it, then it's a
 173 free (global) identifier.
 174
 175 The first thing we'll need is a scope handling system: something that keeps
 176 track of what identifiers are currently in scope. It must also allow to save the
 177 current scope (e.g. before we enter a new block) and restore it afterwards (e.g.
 178 after we leave the block). This is quite straightforward and unrelated to code
 179 walking; here is the code:
 180
 181 \begin{Verbatim}[fontsize=\scriptsize]
 182 require 'std'
 183 require 'walk'
 184
 185 -{ extension 'match' }
 186
 187 --------------------------------------------------------------------------------
 188 -- Scope handling: ':push()' saves the current scope, ':pop()'
 189 -- restores the previously saved one. ':add(identifiers_list)' adds
 190 -- identifiers to the current scope. Current scope is stored in
 191 -- '.current', as a string->boolean hashtable.
 192 --------------------------------------------------------------------------------
 193
 194 local scope = { }
 195 scope.__index = scope
 196
 197 function scope:new()
 198    local ret = { current = { } }
 199    ret.stack = { ret.current }
 200    setmetatable (ret, self)
 201    return ret
 202 end
 203
 204 function scope:push()
 205    table.insert (self.stack, table.shallow_copy (self.current))
 206 end
 207
 208 function scope:pop()
 209    self.current = table.remove (self.stack)
 210 end
 211
 212 function scope:add (vars)
 213    for id in values (vars) do
 214       match id with `Id{ x } -> self.current[x] = true end
 215    end
 216 end
 217 \end{Verbatim}
 218
 219 (There is an improved version of that class in library {\tt walk.scope}; cf.
 220 its documentation for details).
 221
 222 Now let's start designing the walker. We'll keep a scope object up to date, as
 223 well as a set of found free identifiers, gathered every time we find an `Id{ }
 224 node. To slightly simplify matter, we'll consider that the AST represent a
 225 block.
 226
 227 \begin{Verbatim}[fontsize=\scriptsize]
 228 local function fv (term)
 229    local freevars = { }
 230    local scope    = scope:new()
 231    local cfg      = { expr  = { } }
 232
 233    function cfg.expr.down(x)
 234       match x with
 235       | `Id{ name } -> if not scope.current[name] then freevars[name] = true end
 236       | _ -> -- pass
 237       end
 238    end
 239
 240    walk.guess(cfg, term)
 241    return freevars
 242 end
 243 \end{Verbatim}
 244
 245 Since we don't ever add any identifier to the scope, this will just list all the
 246 identifiers used in the AST, bound or not. Now let's start doing more
 247 interesting things:
 248
 249 \begin{itemize}
 250 \item We'll save the scope's state before entering a new block, and restore it
 251   when we leave it. That will be done by providing functions {\tt cfg.block.down()}
 252   and {\tt cfg.block.up()}. Saving and restoring will be performed by methods
 253   {\tt :push()} and {\tt :pop()}.
 254 \item Whenever we find a {\tt local} declaration, we'll add the list of
 255   identifiers to the current scope, so that they won't be gathered when parsing
 256   the {\tt `Id} expression nodes. Notice that the identifiers declared by the
 257   'local' statement only start being valid after the statement, so we'll add
 258   them in the {\tt cfg.stat.up()} function rather than {\tt cfg.stat.down()}.
 259 \end{itemize}
 260
 261 \begin{Verbatim}[fontsize=\scriptsize]
 262 local cfg = { expr  = { },
 263               stat  = { },
 264               block = { } }
 265    [...]
 266    function cfg.stat.up(x)
 267       match x with
 268       | `Local{ vars, ... } -> scope:add(vars)
 269       | _ -> -- pass
 270       end
 271    end
 272
 273    -----------------------------------------------------------------------------
 274    -- Create a separate scope for each block, close it when leaving.
 275    -----------------------------------------------------------------------------
 276    function cfg.block.down() scope:push() end
 277    function cfg.block.up()   scope:pop()  end
 278 \end{Verbatim}
 279
 280 This starts to be useful. We can also easily add the case for `Localrec{ } nodes
 281 (the ones generated by {\tt "local function foo() ... end"}), where the variable
 282 is already bound in the {\tt`Localrec} statement's right-hand side; so we do the
 283 same as for {\tt`Local}, but we do it in the {\tt down()} function rather than
 284 in the up() one.
 285
 286 We'll also take care of {\tt`Function}, {\tt`Forin} and {\tt`Fornum} nodes,
 287 which introduce new bound identifiers as function parameters or loop variables.
 288 This is quite straightforward; the only thing to take care of is to save the
 289 scope before entering the function/loop body (in {\tt down()}), and restore it
 290 when leaving (in {\tt up()}). The code becomes:
 291
 292
 293 \begin{Verbatim}[fontsize=\scriptsize]
 294 local function fv (term)
 295    local freevars = { }
 296    local scope    = scope:new()
 297    local cfg      = { expr  = { },
 298                       stat  = { },
 299                       block = { } }
 300
 301    -----------------------------------------------------------------------------
 302    -- Check identifiers; add functions parameters to newly created scope.
 303    -----------------------------------------------------------------------------
 304    function cfg.expr.down(x)
 305       match x with
 306       | `Id{ name } -> if not scope.current[name] then freevars[name] = true end
 307       | `Function{ params, _ } -> scope:push(); scope:add (params)
 308       | _ -> -- pass
 309       end
 310    end
 311
 312    -----------------------------------------------------------------------------
 313    -- Close the function scope opened by 'down()'.
 314    -----------------------------------------------------------------------------
 315    function cfg.expr.up(x)
 316       match x with
 317       | `Function{ ... } -> scope:pop()
 318       | _ -> -- pass
 319       end
 320    end
 321
 322    -----------------------------------------------------------------------------
 323    -- Create a new scope and register loop variable[s] in it
 324    -----------------------------------------------------------------------------
 325    function cfg.stat.down(x)
 326       match x with
 327       | `Forin{ vars, ... }    -> scope:push(); scope:add(vars)
 328       | `Fornum{ var, ... }    -> scope:push(); scope:add{var}
 329       | `Localrec{ vars, ... } -> scope:add(vars)
 330       | `Local{ ... }          -> -- pass
 331       | _ -> -- pass
 332       end
 333    end
 334
 335    -----------------------------------------------------------------------------
 336    -- Close the scopes opened by 'up()'
 337    -----------------------------------------------------------------------------
 338    function cfg.stat.up(x)
 339       match x with
 340       | `Forin{ ... } | `Fornum{ ... } -> scope:pop()
 341       | `Local{ vars, ... }            -> scope:add(vars)
 342       | _ -> -- pass
 343       end
 344    end
 345
 346    -----------------------------------------------------------------------------
 347    -- Create a separate scope for each block, close it when leaving.
 348    -----------------------------------------------------------------------------
 349    function cfg.block.down() scope:push() end
 350    function cfg.block.up()   scope:pop()  end
 351
 352    walk.guess(cfg, term)
 353    return freevars
 354 end
 355 \end{Verbatim}
 356
 357 This is almost correct now. There's one last tricky point of Lua's semantics
 358 that we need to address: in {\tt repeat foo until bar} loops, "bar" is included
 359 in "foo"'s scope. For instance, if we write {\tt repeat local x=foo() until
 360   x>3}, the "x" in the condition is the local variable "x" declared inside the
 361 body. This violates our way of handling block scopes: the scope must be kept
 362 alive after the block is finished. We'll fix this by providing a custom walking
 363 for the block inside `Repeat, and preventing the normal walking to happen by
 364 returning 'break':
 365
 366 \begin{Verbatim}[fontsize=\scriptsize]
 367    -----------------------------------------------------------------------------
 368    -- Create a new scope and register loop variable[s] in it
 369    -----------------------------------------------------------------------------
 370    function cfg.stat.down(x)
 371       match x with
 372       | `Forin{ vars, ... }    -> scope:push(); scope:add(vars)
 373       | `Fornum{ var, ... }    -> scope:push(); scope:add{var}
 374       | `Localrec{ vars, ... } -> scope:add(vars)
 375       | `Local{ ... }          -> -- pass
 376       | `Repeat{ block, cond } -> -- 'cond' is in the scope of 'block'
 377          scope:push()
 378          for s in values (block) do walk.stat(cfg)(s) end -- no new scope
 379          walk.expr(cfg)(cond)
 380          scope:pop()
 381          return 'break' -- No automatic walking of subparts 'cond' and 'body'
 382       | _ -> -- pass
 383       end
 384    end
 385 \end{Verbatim}
 386
 387 That's it, we've now got a full free variables lister, and have demonstrated
 388 most APIs offered by the basic 'walk' library. If you want to walk through
 389 identifiers in a scope-aware way, though, you'll want to look at the {\tt
 390   walk.id} library.
 391
 392 \subsection{Library {\tt walk.id}, the scope-aware walker}
 393
 394 This library walks AST to gather information about the identifiers in it. It
 395 call distinct visitor functions depending on whether an identifier is bound or
 396 free; moreover, when an identifier is bound, the visitor also receives its
 397 binder node as a parameter. For instance, in {\tt +\{function(x) print(x)
 398   end\}}, the bound identifier walker will be called on the \verb|+{x}| in the
 399 \verb|print| call, and the visitor's second parameter will be the {\tt`Function}
 400 node which created the local variable {\tt x}.
 401
 402 \paragraph{API}
 403 The library is loaded with \verb|require 'walk.id'|. The walkers provided are:
 404 \begin{itemize}
 405 \item {\tt walk\_id.expr()};
 406 \item {\tt walk\_id.stat()};
 407 \item {\tt walk\_id.block()};
 408 \item {\tt walk\_id.guess()}.
 409 \end{itemize}
 410
 411 They take the same config tables as regular walkers, except that they also
 412 recognize the following entries:
 413 \begin{itemize}
 414 \item {\tt cfg.id.free(identifier, parent, grandparent...)}, which is run on
 415   free variables;
 416 \item {\tt cfg.id.bound(identifier, binder, parent, grandparent...)}, which is
 417   run on bound variables. The statement or expression which created this bound
 418   veriable's scope is passed as a second parameter, before the parent nodes.
 419 \end{itemize}
 420
 421 \paragraph{Examples}
 422 Let's rewrite the free variables walker above, with the id walker:
 423
 424 \begin{Verbatim}[fontsize=\scriptsize]
 425 function fv (term)
 426    local cfg = { id = { } }
 427    local freevars = { }
 428    function cfg.id.free(id)
 429       local id_name = id[1]
 430       freevars[id_name] = true
 431    end
 432    walk_id.guess (cfg, term)
 433    return freevars
 434 end
 435 \end{Verbatim}
 436
 437 Now, let's $\alpha$-rename all bound variables in a term. This is slightly
 438 trickier than one could think: we need to first walk the whole tree, then
 439 perform all the replacement. If we renamed binders as we went, they would stop
 440 binding their variables, and something messy would happen. For instance, if we
 441 took {\tt +\{function(x) print(x) end\}} and renamed the binder {\tt x} into
 442 {\tt foo}, we'd then start walking the function body on the tree {\tt
 443   +\{function(foo) print(x) end\}}, where {\tt x} isn't bound anymore.
 444
 445 \begin{Verbatim}[fontsize=\scriptsize]
 446 --------------------------------------------------------------------------------
 447 -- bound_vars keeps a binder node -> old_name -> new_name dictionary.
 448 -- It allows to associate all instances of an identifier together,
 449 -- with the binder that created them
 450 --------------------------------------------------------------------------------
 451 local bound_vars    = { }
 452
 453 --------------------------------------------------------------------------------
 454 -- local_renames will keep all identifier nodes to rename as keys,
 455 -- and their new name as values. The renaming must happen after
 456 -- the whole tree has been visited, in order to avoid breaking captures.
 457 --------------------------------------------------------------------------------
 458 local local_renames = { }
 459
 460 --------------------------------------------------------------------------------
 461 -- Associate a new name in bound_vars when a local variable is created.
 462 --------------------------------------------------------------------------------
 463 function cfg.binder (id, binder)
 464    local old_name         = id[1]
 465    local binder_table     = bound_vars[binder]
 466    if not binder_table then
 467       -- Create a new entry for this binder:
 468       binder_table        = { }
 469       bound_vars[binder]  = binder_table
 470    end
 471    local new_name         = mlp.gensym(old_name)[1] -- generate a new name
 472    binder_table[old_name] = new_name -- remember name for next instances
 473    local_renames[id]      = new_name -- add to the rename todo-list
 474 end
 475
 476 --------------------------------------------------------------------------------
 477 -- Add a bound variable the the rename todo-list
 478 --------------------------------------------------------------------------------
 479 function cfg.id.bound (id, binder)
 480    local old_name    = id[1]
 481    local new_name    = bound_vars[binder][old_name]
 482    local_renames[id] = new_name
 483 end
 484
 485 -- walk the tree and fill laocal_renames:
 486 walk_id.guess(cfg, ast)
 487
 488 -- perform the renaming actions:
 489 for id, new_name in pairs(local_renames) do id[1] = new_name end
 490 \end{Verbatim}
 491
 492 \subsection{Library {\tt walk.scope}, the scope helper}
 493 This library allows to easily store data, in an AST walking function, in a scope
 494 aware way. Cf. comments in the sources for details.