js/src/vm/PortableBaselineInterpret.h

   1 /* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*-
   2  * vim: set ts=8 sts=2 et sw=2 tw=80:
   3  * This Source Code Form is subject to the terms of the Mozilla Public
   4  * License, v. 2.0. If a copy of the MPL was not distributed with this
   5  * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
   6
   7 #ifndef vm_PortableBaselineInterpret_h
   8 #define vm_PortableBaselineInterpret_h
   9
  10 /*
  11  * [SMDOC] Portable Baseline Interpreter
  12  * =====================================
  13  *
  14  * The Portable Baseline Interpreter (PBL) is a portable interpreter
  15  * that supports executing ICs by directly interpreting CacheIR.
  16  *
  17  * This interpreter tier fits into the hierarchy between the C++
  18  * interpreter, which is fully generic and does not specialize with
  19  * ICs, and the native baseline interpreter, which does attach and
  20  * execute ICs but requires native codegen (JIT). The distinguishing
  21  * feature of PBL is that it *does not require codegen*: it can run on
  22  * any platform for which SpiderMonkey supports an interpreter-only
  23  * build. This is useful both for platforms that do not support
  24  * runtime addition of new code (e.g., running within a WebAssembly
  25  * module with a `wasm32-wasi` build) or may disallow it for security
  26  * reasons.
  27  *
  28  * The main idea of PBL is to emulate, as much as possible, how the
  29  * native baseline interpreter works, so that the rest of the engine
  30  * can work the same either way. The main aspect of this "emulation"
  31  * comes with stack frames: unlike the native blinterp and JIT tiers,
  32  * we cannot use the machine stack, because we are still executing in
  33  * portable C++ code and the platform's C++ compiler controls the
  34  * machine stack's layout. Instead, we use an auxiliary stack.
  35  *
  36  * Auxiliary Stack
  37  * ---------------
  38  *
  39  * PBL creates baseline stack frames (see `BaselineFrame` and related
  40  * structs) on an *auxiliary stack*, contiguous memory allocated and
  41  * owned by the JSRuntime.
  42  *
  43  * This stack operates nearly identically to the machine stack: it
  44  * grows downward, we push stack frames, we maintain a linked list of
  45  * frame pointers, and a series of contiguous frames form a
  46  * `JitActivation`, with the most recent activation reachable from the
  47  * `JSContext`. The only actual difference is that the return address
  48  * slots in the frame layouts are always null pointers, because there
  49  * is no need to save return addresses: we always know where we are
  50  * going to return to (either post-IC code -- the return point of
  51  * which is known because we actually do a C++-level call from the
  52  * JSOp interpreter to the IC interpreter -- or to dispatch the next
  53  * JSOp).
  54  *
  55  * The same invariants as for native baseline code apply here: when we
  56  * are in `PortableBaselineInterpret` (the PBL interpreter body) or
  57  * `ICInterpretOps` (the IC interpreter) or related helpers, it is as
  58  * if we are in JIT code, and local state determines the top-of-stack
  59  * and innermost frame. The activation is not "finished" and cannot be
  60  * traversed. When we need to call into the rest of SpiderMonkey, we
  61  * emulate how that would work in JIT code, via an exit frame (that
  62  * would ordinarily be pushed by a trampoline) and saving that frame
  63  * as the exit-frame pointer in the VM state.
  64  *
  65  * To add a little compile-time enforcement of this strategy, and
  66  * ensure that we don't accidentally call something that will want to
  67  * traverse the (in-progress and not-completed) JIT activation, we use
  68  * a helper class `VMFrame` that pushes and pops the exit frame,
  69  * wrapping the callsite into the rest of SM with an RAII idiom. Then,
  70  * we *hide the `JSContext`*, and rely on the idiom that `cx` is
  71  * passed to anything that can GC or otherwise observe the JIT
  72  * state. The `JSContext` is passed in as `cx_`, and we name the
  73  * `VMFrame` local `cx` in the macro that invokes it; this `cx` then
  74  * has an implicit conversion to a `JSContext*` value and reveals the
  75  * real context.
  76  *
  77  * Interpreter Loops
  78  * -----------------
  79  *
  80  * There are two interpreter loops: the JSOp interpreter and the
  81  * CacheIR interpreter. These closely correspond to (i) the blinterp
  82  * body that is generated at startup for the native baseline
  83  * interpreter, and (ii) an interpreter version of the code generated
  84  * by the `BaselineCacheIRCompiler`, respectively.
  85  *
  86  * Execution begins in the JSOp interpreter, and for any op(*) that
  87  * has an IC site (`JOF_IC` flag), we invoke the IC interpreter. The
  88  * IC interpreter runs a loop that traverses the IC stub chain, either
  89  * reaching CacheIR bytecode and executing it in a virtual machine, or
  90  * reaching the fallback stub and executing that (likely pushing an
  91  * exit frame and calling into the rest of SpiderMonkey).
  92  *
  93  * (*) As an optimization, some opcodes that would have IC sites in
  94  * native baseline skip their IC chains and run generic code instead
  95  * in PBL. See "Hybrid IC mode" below for more details.
  96  *
  97  * IC Interpreter State
  98  * --------------------
  99  *
 100  * While the JS opcode interpreter's abstract machine model and its
 101  * mapping of those abstract semantics to real machine state are
 102  * well-defined (by the other baseline tiers), the IC interpreter's
 103  * mapping is less so. When executing in native baseline tiers,
 104  * CacheIR is compiled to machine code that undergoes register
 105  * allocation and several optimizations (e.g., handling constants
 106  * specially, and eliding type-checks on values when we know their
 107  * actual types). No other interpreter for CacheIR exists, so we get
 108  * to define how we map the semantics to interpreter state.
 109  *
 110  * We choose to keep an array of uint64_t values as "virtual
 111  * registers", each corresponding to a particular OperandId, and we
 112  * store the same values that would exist in the native machine
 113  * registers. In other words, we do not do any sort of register
 114  * allocation or reclamation of storage slots, because we don't have
 115  * any lookahead in the interpreter. We rely on the typesafe writer
 116  * API, with newtype'd wrappers for different kinds of values
 117  * (`ValOperandId`, `ObjOperandId`, `Int32OperandId`, etc.), producing
 118  * typesafe CacheIR bytecode, in order to properly store and interpret
 119  * unboxed values in the virtual registers.
 120  *
 121  * There are several subtle details usually handled by register
 122  * allocation in the CacheIR compilers that need to be handled here
 123  * too, mainly around input arguments and restoring state when
 124  * chaining to the next IC stub. IC callsites place inputs into the
 125  * first N OperandId registers directly, corresponding to what the
 126  * CacheIR expects. There are some CacheIR opcodes that mutate their
 127  * argument in-place (e.g., guarding that a Value is an Object strips
 128  * the tag-bits from the Value and turns it into a raw pointer), so we
 129  * cannot rely on these remaining unmodified if we need to invoke the
 130  * next IC in the chain; instead, we save and restore the first N
 131  * values in the chain-walking loop (according to the arity of the IC
 132  * kind).
 133  *
 134  * Optimizations
 135  * ------------
 136  *
 137  * There are several implementation details that are critical for
 138  * performance, and thus should be carefully maintained or verified
 139  * with any changes:
 140  *
 141  * - Caching values in locals: in order to be competitive with "native
 142  *   baseline interpreter", which has the advantage of using machine
 143  *   registers for commonly-accessed values such as the
 144  *   top-of-operand-stack and the JS opcode PC, we are careful to
 145  *   ensure that the C++ compiler can keep these values in registers
 146  *   in PBL as well. One might naively store `pc`, `sp`, `fp`, and the
 147  *   like in a context struct (of "virtual CPU registers") that is
 148  *   passed to e.g. the IC interpreter. This would be a mistake: if
 149  *   the values exist in memory, the compiler cannot "lift" them to
 150  *   locals that can live in registers, and so every push and pop (for
 151  *   example) performs a store. This overhead is significant,
 152  *   especially when executing more "lightweight" opcodes.
 153  *
 154  *   We make use of an important property -- the balanced-stack
 155  *   invariant -- so that we can pass SP *into* calls but not take an
 156  *   updated SP *from* them. When invoking an IC, we expect that when
 157  *   it returns, SP will be at the same location (one could think of
 158  *   SP as a "callee-saved register", though it's not usually
 159  *   described that way). Thus, we can avoid a dependency on a value
 160  *   that would have to be passed back through memory.
 161  *
 162  * - Hybrid IC mode: the fact that we *interpret* ICs now means that
 163  *   they are more expensive to invoke. Whereas a small IC that guards
 164  *   two int32 arguments, performs an int32 add, and returns might
 165  *   have been a handful of instructions before, and the call/ret pair
 166  *   would have been very fast (and easy to predict) instructions at
 167  *   the machine level, the setup and context transition and the
 168  *   CacheIR opcode dispatch overhead would likely be much slower than
 169  *   a generic "if both int32, add" fastpath in the interpreter case
 170  *   for `JSOp::Add`.
 171  *
 172  *   We thus take a hybrid approach, and include these static
 173  *   fastpaths for what would have been ICs in "native
 174  *   baseline". These are enabled by the `kHybridICs` global and may
 175  *   be removed in the future (transitioning back to ICs) if/when we
 176  *   can reduce the cost of interpreted ICs further.
 177  *
 178  *   Right now, calls and property accesses use ICs:
 179  *
 180  *   - Calls can often be special-cased with CacheIR when intrinsics
 181  *     are invoked. For example, a call to `String.length` can turn
 182  *     into a CacheIR opcode that directly reads a `JSString`'s length
 183  *     field.
 184  *   - Property accesses are so frequent, and the shape-lookup path
 185  *     is slow enough, that it still makes sense to guard on shape
 186  *     and quickly return a particular slot.
 187  *
 188  * - Static branch prediction for opcode dispatch: we adopt an
 189  *   interpreter optimization we call "static branch prediction": when
 190  *   one opcode is often followed by another, it is often more
 191  *   efficient to check for those specific cases first and branch
 192  *   directly to the case for the following opcode, doing the full
 193  *   switch otherwise. This is especially true when the indirect
 194  *   branches used by `switch` statements or computed gotos are
 195  *   expensive on a given platform, such as Wasm.
 196  *
 197  * - Inlining: on some platforms, calls are expensive, and we want to
 198  *   avoid them whenever possible. We have found that it is quite
 199  *   important for performance to inline the IC interpreter into the
 200  *   JSOp interpreter at IC sites: both functions are quite large,
 201  *   with significant local state, and so otherwise, each IC call
 202  *   involves a lot of "context switching" as the code generated by
 203  *   the C++ compiler saves registers and constructs a new native
 204  *   frame. This is certainly a code-size tradeoff, but we have
 205  *   optimized for speed here.
 206  *
 207  * - Amortized stack checks: a naive interpreter implementation would
 208  *   check for auxiliary stack overflow on every push. We instead do
 209  *   this once when we enter a new JS function frame, using the
 210  *   script's precomputed "maximum stack depth" value. We keep a small
 211  *   stack margin always available, so that we have enough space to
 212  *   push an exit frame and invoke the "over-recursed" helper (which
 213  *   throws an exception) when we would otherwise overflow. The stack
 214  *   checks take this margin into account, failing if there would be
 215  *   less than the margin available at any point in the called
 216  *   function.
 217  *
 218  * - Fastpaths for calls and returns: we are able to push and pop JS
 219  *   stack frames while remaining in one native (C++ interpreter
 220  *   function) frame, just as the C++ interpreter does. This means
 221  *   that there is a one-to-many mapping from native stack frame to JS
 222  *   stack frame. This does create some complications at points that
 223  *   pop frames: we might remain in the same C++ frame, or we might
 224  *   return at the C++ level. We handle this in a unified way for
 225  *   returns and exception unwinding as described below.
 226  *
 227  * Unwinding
 228  * ---------
 229  *
 230  * Because one C++ interpreter frame can correspond to multiple JS
 231  * frames, we need to disambiguate the two cases whenever leaving a
 232  * frame: we may need to return, or we may stay in the current
 233  * function and dispatch the next opcode at the caller's next PC.
 234  *
 235  * Exception unwinding compilcates this further. PBL uses the same
 236  * exception-handling code that native baseline does, and this code
 237  * computes a `ResumeFromException` struct that tells us what our new
 238  * stack pointer and frame pointer must be. These values could be
 239  * arbitrarily far "up" the stack in the current activation. It thus
 240  * wouldn't be sufficient to count how many JS frames we have, and
 241  * return at the C++ level when this reaches zero: we need to "unwind"
 242  * the C++ frames until we reach the appropriate JS frame.
 243  *
 244  * To solve both issues, we remember the "entry frame" when we enter a
 245  * new invocation of `PortableBaselineInterpret()`, and when returning
 246  * or unwinding, if the new frame is *above* this entry frame, we
 247  * return. We have an enum `PBIResult` that can encode, when
 248  * unwinding, *which* kind of unwinding we are doing, because when we
 249  * do eventually reach the C++ frame that owns the newly active JS
 250  * frame, we may resume into a different action depending on this
 251  * information.
 252  *
 253  * Completeness
 254  * ------------
 255  *
 256  * Whenever a new JSOp is added, the opcode needs to be added to
 257  * PBL. The compiler should enforce this: if no case is implemented
 258  * for an opcode, then the label in the computed-goto table will be
 259  * missing and PBL will not compile.
 260  *
 261  * In contrast, CacheIR opcodes need not be implemented right away,
 262  * and in fact right now most of the less-common ones are not
 263  * implemented by PBL. If the IC interpreter hits an unimplemented
 264  * opcode, it acts as if a guard had failed, and transfers to the next
 265  * stub in the chain. Every chain ends with a fallback stub that can
 266  * handle every case (it does not execute CacheIR at all, but instead
 267  * calls into the runtime), so this will always give the correct
 268  * result, albeit more slowly. Implementing the remainder of the
 269  * CacheIR opcodes, and new ones as they are added, is thus purely a
 270  * performance concern.
 271  *
 272  * PBL currently does not implement async resume into a suspended
 273  * generator. There is no particular reason that this cannot be
 274  * implemented; it just has not been done yet. Such an action will
 275  * currently call back into the C++ interpreter to run the resumed
 276  * generator body. Execution up to the first yield-point can still
 277  * occur in PBL, and PBL can successfully save the suspended state.
 278  */
 279
 280 #include "jspubtd.h"
 281
 282 #include "jit/BaselineFrame.h"
 283 #include "jit/BaselineIC.h"
 284 #include "jit/JitContext.h"
 285 #include "jit/JitScript.h"
 286 #include "vm/Interpreter.h"
 287 #include "vm/Stack.h"
 288
 289 namespace js {
 290 namespace pbl {
 291
 292 // Trampoline invoked by EnterJit that sets up PBL state and invokes
 293 // the main interpreter loop.
 294 bool PortableBaselineTrampoline(JSContext* cx, size_t argc, Value* argv,
 295                                 size_t numActuals, size_t numFormals,
 296                                 jit::CalleeToken calleeToken,
 297                                 JSObject* envChain, Value* result);
 298
 299 // Predicate: are all conditions satisfied to allow execution within
 300 // PBL? This depends only on properties of the function to be invoked,
 301 // and not on other runtime state, like the current stack depth, so if
 302 // it returns `true` once, it can be assumed to always return `true`
 303 // for that function. See `PortableBaselineInterpreterStackCheck`
 304 // below for a complimentary check that does not have this property.
 305 jit::MethodStatus CanEnterPortableBaselineInterpreter(JSContext* cx,
 306                                                       RunState& state);
 307
 308 // A check for availbale stack space on the PBL auxiliary stack that
 309 // is invoked before the main trampoline. This is required for entry
 310 // into PBL and should be checked before invoking the trampoline
 311 // above. Unlike `CanEnterPortableBaselineInterpreter`, the result of
 312 // this check cannot be cached: it must be checked on each potential
 313 // entry.
 314 bool PortablebaselineInterpreterStackCheck(JSContext* cx, RunState& state,
 315                                            size_t numActualArgs);
 316
 317 struct State;
 318 struct Stack;
 319 struct StackVal;
 320 struct StackValNative;
 321 struct ICRegs;
 322 class VMFrameManager;
 323
 324 enum class PBIResult {
 325   Ok,
 326   Error,
 327   Unwind,
 328   UnwindError,
 329   UnwindRet,
 330 };
 331
 332 PBIResult PortableBaselineInterpret(JSContext* cx_, State& state, Stack& stack,
 333                                     StackVal* sp, JSObject* envChain,
 334                                     Value* ret);
 335
 336 enum class ICInterpretOpResult {
 337   NextIC,
 338   Return,
 339   Error,
 340   Unwind,
 341   UnwindError,
 342   UnwindRet,
 343 };
 344
 345 ICInterpretOpResult MOZ_ALWAYS_INLINE
 346 ICInterpretOps(jit::BaselineFrame* frame, VMFrameManager& frameMgr,
 347                State& state, ICRegs& icregs, Stack& stack, StackVal* sp,
 348                jit::ICCacheIRStub* cstub, jsbytecode* pc);
 349
 350 } /* namespace pbl */
 351 } /* namespace js */
 352
 353 #endif /* vm_PortableBaselineInterpret_h */