origdox/perf-notes.txt

   1 I tried caching the hashcode in the naStr object, to avoid the
   2 recomputation each time a symbol lookup is made.  The objperf.nas
   3 benchmark reported a speed up of 13.59s vs. 13.78s without
   4 optimization, and 6.71s vs. 7.06s with full optimization.  Or about 5%
   5 for 5 lines of code.  Probably not worth it.
   6
   7 I thought the depth-first behavior of the closure objects would be an
   8 issue, but it really isn't.  Getting and setting local variables does
   9 not require traversing the closure list.  Access to variables in outer
  10 scopes is generally less common, for obvious reasons.  Traversing the
  11 whole list when setting a local variable for the first time occurs
  12 only once.  It is hard to find a real-world use case for code that
  13 wants to iterate repeatedly over variables in an outer scope, and/or
  14 sets many local variables without doing significant later computation
  15 with them.
  16
  17 Combining naFunc and naClosure made sense, and was a good optimization
  18 both for code size and runtime speed.
  19
  20 Examining objperf some more: The inner loop contains 14 simple
  21 statements, each of which (in C code) might be expected to take two
  22 instructions (maybe 1.5 cycles on a modern processor).  Measured on my
  23 1.8GHz Athlon64, each of these statements takes just about 200 clock
  24 cycles.  So on the whole the Nasal code is running maybe 130x slower
  25 than the equivalent C code.  Or (very) roughly at the speed the
  26 equivalent C code would exhibit when executed on a 33MHz i486.
  27
  28 Consider tail recursion.  The separate call and operand stacks should
  29 make this easy...
  30
  31 Allocating behaviors.  The following syntax constructions require
  32 allocation of garbage collected objects:
  33  + Creation of new strings with the "~" operator or substr() function
  34  + Creating new hashes with the {} construction
  35  + Creating new vectors with the [] construction
  36  + Binding a function.  Any use of the func keyword allocates an
  37    naFunc object for the binding.
  38  + All function calls allocate a hash for the namespace, and an arg
  39    vector.  Strictly, the call() function does not allocate the arg
  40    vector itself...
  41
  42 Maximum speed of a mark/sweep collector: any collection of a large
  43 data set is obviously going to thrash the CPU cache and be limited by
  44 main memory bandwidth (corrolary: parallel collection threads on
  45 different CPUs are not likely to help unless the CPUs have their own
  46 dedicated memory).  Modern DDR400 memory uses 8 cycles to read 4 64
  47 bit values into a cache line, and typically comes in dual channel
  48 configurations (which I *think* are used to make a single 512 bit
  49 cache line, and not two parallel channels for loading independent
  50 lines...).  That comes out to 64 byte blocks read at a rate of 50MHz.
  51 If the in-memory object layout exhibits good locality of reference,
  52 then that whole cache line will be "used up" before it is spilled out
  53 to make room for another read and we will be limited by the maximum
  54 read bandwidth of 64*50 = 3.2 gigabytes per second.  If the objects
  55 are scattered pessimally, then every 8 byte naRef will require a full
  56 64 byte read, and the rate drops by a factor of 8 to 400 megabytes per
  57 second.  Finally, note that the full GC requires two passes over the
  58 live data; the second pass is done in-order and will definitely be
  59 able to use the full memory bandwidth.  So a 1 gig set of data would
  60 take somewhere between 0.6 and 2.8 seconds to collect on a modern
  61 system.  Multi-cpu systems with parallel memory architectures
  62 (Opteron) will scale linearly with the number of CPUs.
  63
  64 Fun surprise: changing from the "naive" C stack mark algorithm to the
  65 "optimized" non-recursive one slowed things down in gcperf.nas by
  66 about 15%.  It was supposed to use 5x less space for the
  67 hand-maintained stack, but instead used thousands of times MORE space.
  68 Why?  The gcperf.nas script creates very "broad" hashes of hashes.  So
  69 before recursing into the second level, the GC needs to push thousands
  70 of objects onto the stack; it's maximum size is the maximum reference
  71 depth times the size of the average object in the reference chain!
  72 The naive algorithm was "true" depth first, so it used only one entry
  73 on the stack per object in the chain.  The solution is to keep "state"
  74 with the mark stack to prevent needlessly pushing many items onto the
  75 stack at once.
  76
  77 Consider adding an explicit freeObject() interface, so that in
  78 situations where we know that a reference to an object will not be
  79 saved it can be added directly to the free lists.  Possibilities would
  80 include: the local namespace in CODE objects where there are CODE
  81 constants which would bind to them; the results of ~ operators which
  82 exist as the left or right side of another binary operator (knowable
  83 from the parse tree).  Note: you can't collect local arguments without
  84 doing data flow analysis to make sure they aren't saved off by passing
  85 the arg vector to a sub function, etc...
  86
  87 Since strings will always be at least MINLEN (16) bytes long, put that
  88 buffer in the string object.  This saves memory for short strings, and
  89 wastes only a little for long ones.  And it avoids the allocation
  90 overhead involved in creating a string with ~.
  91
  92 Optimising local variable and object field lookup times: Interning
  93 symbols and using a pointer equality test first in hash.c.
  94 Pre-optimization objperf.nas: 8.622s; post 8.325 (3.6%).  Using an
  95 immediate index argument to OP_LOCAL instead of a prior OP_PUSHCONST:
  96 7.12s (14%).  Doing the same for OP_MEMBER gets to 6.38s (+10%).  So
  97 the total looks to be a 25% speedup.  Not huge, but not bad.  And it
  98 adds comparatively little code. [Measuring later, more rigorously,
  99 showed only a 15% speedup.]
 100
 101 Eliminating OP_LINE instructions seemed to get 0.85% speedup in the
 102 tests above.
 103
 104 Commenting out the stack underflow checks (which are really debugging
 105 code) gets to 6.01s, or another 6%.
 106
 107 I tried adding an array of 4 naRef objects to the naVec struct as a
 108 "base" array of minimal size (the idea being to speed up function
 109 calls by preventing the allocation of the argument lists from hitting
 110 the heap every time), but that reduced (!) performance by 14%,
 111 presumably due to cache effects.  Hitting the heap isn't so bad when
 112 all that data was used just last call.  Leaving all the buffers in
 113 place means leaving them all on the heap, too.
 114
 115 For compile-time vector and hash generation, instead of pushing each
 116 element on the stack and doing OP_xAPPEND for each one, use an
 117 immediate argument for the operator and slurp them all up at once.
 118 This will be especially good for multi-argument function calls.
 119
 120 In addition to reducing the number of runtime operators, try
 121 simplifying code.c:run1().  Every cycle in this function adds up.
 122 ...
 123 INDEED!  Simply moving the contents of run1() into the body of run()
 124 eliminates a ton of needless pointer indirections and gets us down to
 125 5.13s, or another 15%.  We're now just 33% slower than perl on
 126 objperf.
 127
 128 Revisiting the "store hash codes in strings" issue, after all the
 129 other optimizations, worked GREAT!  We are now down to 4.5s, for
 130 another 15% speed increase.  This was apparently always a good idea,
 131 it was just drowned in the overhead uncovered by other optimizations.
 132
 133 Writing a (comparatively small) special case naHash_sym() lookup
 134 function got another huge increase.  This mechanism is tried first,
 135 and is allowed to assume that the key is always a string, always has
 136 its hashcode precomputed, and that object identity is the only
 137 equality that is needed.  This gets us from 4.5s to 3.6s on objperf,
 138 so we're now beating perl (!) with full optimizations.  Using only
 139 -O2, we're tying it.
 140
 141 The Frame object's args field is unused (another relic from past
 142 optimizations).  Yank it.  Ditto for the obj field.  The line field is
 143 already going away, so that shrinks a frame down to just
 144 func/locals/ip/bp.
 145
 146 Now it's on to fib.nas -- the test of function call overhead.  Perl is
 147 beating nasal by a about 2.2:1 here (15s vs. 6.8s).  My original
 148 thought was that the loss was caused by GC overhead in the allocation
 149 of all the arguments vectors and local variable hashes.  Nope.
 150 Changing GC parameters to make it run less often had almost no effect.
 151 Running gprof confirms it -- naGarbageCollect makes up only a tiny
 152 fraction of the run time of the program.  This is actually *good*
 153 news, because the bytecode interpreter is easier to optimize than the
 154 garbage collector (which is already about as good as it's going to
 155 get, IMHO).
 156
 157 The top culprits were numify, PUSH, naVec_append, (hash.c) find, and
 158 evalBinaryNumeric, followed by setupFuncall.  The PUSH and
 159 naVec_append calls all have to do with the arguments list, so perhaps
 160 I should try changing OP_CALL and OP_MCALL to take an immediate
 161 representing the number of arguments, and copy these directly off of
 162 the stack.
 163
 164 Also, the naVec_append() calls in setupFuncall are not necessary.  Try
 165 saving those off in the context object...  Not sure how important this
 166 is.  Update: Not very.  Changing the "save" operation to a pair of
 167 naObj* in the context object (hard to imaging a faster way to save
 168 something than storing two pointers) got only 1% (or 166ms) in
 169 fib.nas.
 170
 171 The gprof run also produced a weird result: the naHash_get calls
 172 outnumbered the naHash_sym calls.  Why???  Figured it out: the "fib"
 173 symbol is not local, it's in the outer namespace.  Which means that
 174 the _sym call fails, then the first (!) _get call needs to fail before
 175 the second _get call on the outer scope succeeds.  This is actually a
 176 performance regression -- it would be faster if we were not using
 177 naHash_sym().
 178
 179 Changing the calling conventions to push the object, function, and
 180 then all the arguments on the stack and then build the frame from
 181 there (with no interrim VECAPPENDs) was good for 10% in fib.nas, and
 182 even more when calling functions with more than 1 argument.  Still at
 183 11.397s vs. perl's 6.773s though.
 184
 185 Go through the bytecode inner loop and strip out all the needless POP
 186 and PUSH calls, using stuff on the stack where possible.
 187
 188 Changing the objperf benchmark to use 7x more reads than writes showed
 189 perl with a pretty huge lead, but that turns out to be due to loop
 190 overhead.  Unrolling the read loop 7 times brings the two interpreters
 191 back into a tie.  Maybe should consider doing ++/-- operators for
 192 performance reasons...