\documentclass[10pt]{article}
\setcounter{secnumdepth}{5}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{verbatim}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{caption}
\usepackage{hyperref}
\usepackage{fancyheadings}
\usepackage{longtable}
\usepackage[letterpaper]{geometry}
\newcommand{\argmax}{\operatornamewithlimits{argmax}}
\newcommand{\argmin}{\operatornamewithlimits{argmin}}
\title{CLEM Matrix Performance}
\author{Cyrus L. Harmon}
\geometry{verbose,tmargin=1in,bmargin=1in,lmargin=1in,rmargin=1in}
\setcounter{topnumber}{2}
\setcounter{bottomnumber}{2}
\setcounter{totalnumber}{4}     % 2 may work better
\setcounter{dbltopnumber}{2}    % for 2-column pages
\renewcommand{\dbltopfraction}{0.9}   % fit big float above 2-col. text
\renewcommand{\textfraction}{0.07}    % allow minimal text w. figs
% Parameters for FLOAT pages (not text pages):
\renewcommand{\floatpagefraction}{0.7}   % require fuller float pages
% N.B.: floatpagefraction MUST be less than topfraction !!
\renewcommand{\dblfloatpagefraction}{0.7}   % require fuller float pages
\setlength{\captionmargin}{10pt}

\begin{document}
\maketitle
\let\mypdfximage\pdfximage
\def\pdfximage{\immediate\mypdfximage}

\baselineskip14pt

\clearpage
\section{CLEM Performance}

\baselineskip12pt
\subsection{Introduction}
\baselineskip14pt
Common Lisp is a high-level language that is a modern-day member of the
LISP family of languages. Common Lisp can be either interpreted or
compiled, or both, depending on the implementation, and modern compilers
offer the promise of performance on the order of that achieved by C and
Fortran compilers. CLEM is a matrix math package for Common Lisp that
strives to offer high-performance matrix math routines, written in
Common Lisp. Common Lisp has a sophisticated type system, and it is a
goal of CLEM to offer matrix representations and operations that exploit
the features of this type system. Furthermore, Common Lisp has a
sophisticated object model, the Common Lisp Object System (CLOS), and
CLEM uses features of CLOS and its companion metaobject system, the
Metaobject Protocol (MOP), to define its classes, objects, and methods.
Common Lisp implementations vary greatly in their performance, but the
Common Lisp standard provides an infrastructure that allows
high-performance-capable implementations to generate efficient code
through the use of compiler declarations of types and optimization
settings. Lisp implementations are free to compile lisp code to native
machine code, rather than interpreting the lisp code on the fly or
compiling the code to a byte-code representation that is then executed
by a virtual machine. This suggests that, subject to the limits placed
on the output by the Common Lisp language and to the implementation
details of the particular lisp system, a lisp environment should be able
to produce code that approaches the speed and efficiency of other
compiled languages. SBCL is a Common Lisp implementation that compiles
to native code and has a sophisticated compiler called, somewhat
confusingly, Python, although the Python compiler predates the Python
interpreted language by many years. SBCL's compiler uses type and
optimization declarations to generate efficient code, producing
optimized routines that are specific to the declared types and often
representing data in an ``unboxed'' format, ideally one that matches the
type representation handled directly by the CPU. One of CLEM's main
goals is to provide matrix operations that exploit these capabilities of
the lisp compiler to generate efficient code.
\baselineskip12pt
\subsubsection{Boxed and Unboxed Representation of Lisp Data Objects}
\baselineskip14pt
Boxed representations of lisp values are generally stored as tagged data
objects, where the tags serve to identify the type of the particular
object. Unboxed representations, on the other hand, are represented by
the data values themselves, without any identifying metadata being
stored with the value. Obviously, the language environment needs to
know, or to be able to determine, the type of a particular data
value. For boxed values, the language environment can rely on the fact
that the type information is stored with the data value directly. For
unboxed values, the language system must keep track of the type of the
value stored in a particular location by other means. The main advantage
of using unboxed datatypes is that the language environment, or the
compiled code it produces, does not have to extract the type information
from the boxed value and then extract the data accordingly. However,
this has the disadvantage that the type of the data is then fixed to be
that represented by the unboxed type. Often, hybrid representations are
used in structures such as an array, where the array itself will be a
dynamically typed, boxed object while the values of the array will be
unboxed values placed in a particular region of memory. By using unboxed
values for the elements of the array, the code can directly access these
values without the overhead of ``unboxing'' the data. Access to boxed
data also often allocates memory, or ``conses'' in lisp parlance,
because a storage area is needed to hold a freshly boxed value. In
summary, using boxed datatypes introduces (at least) two important areas
where work has to be done by the language environment to access the
data: the allocation of temporary storage to hold the values unboxed
from the data, and the additional work on the part of the code to unbox
the data in the first place.
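
As a concrete illustration (a hypothetical sketch, not taken from the
CLEM sources), the element type requested when an array is created
determines whether an implementation may store its elements unboxed:

\baselineskip12pt
\begin{verbatim}
;; A general array: each element is a boxed lisp object.
(defparameter *boxed* (make-array 4 :initial-element 1.0d0))
(array-element-type *boxed*)    ; => T

;; A specialized array: elements may be stored as raw, unboxed
;; double-floats, depending on the implementation (SBCL does so).
(defparameter *unboxed*
  (make-array 4 :element-type 'double-float :initial-element 1.0d0))
(array-element-type *unboxed*)  ; => DOUBLE-FLOAT
\end{verbatim}
\baselineskip14pt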

For these reasons, it is highly desirable for CLEM to use unboxed data
where possible. In the inner loop of a matrix operation, for instance,
accessing a boxed data type can introduce substantial performance
penalties. Therefore, CLEM has gone to great lengths to provide matrix
representations that yield unboxed values and operations that can work
on these values while allocating a minimum amount of memory.
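
The following sketch (hypothetical code, not part of CLEM) shows the
kind of type and optimization declarations that allow SBCL to keep such
an inner loop free of boxing; only the final return value needs to be
boxed:

\baselineskip12pt
\begin{verbatim}
;; Hypothetical sketch: summing a specialized vector without consing
;; in the loop body.  The declarations let the compiler keep I and SUM
;; in unboxed machine representations.
(defun sum-vector (v)
  (declare (type (simple-array double-float (*)) v)
           (optimize (speed 3)))
  (let ((sum 0.0d0))
    (declare (type double-float sum))
    (dotimes (i (length v) sum)
      (incf sum (aref v i)))))
\end{verbatim}
\baselineskip14pt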

\baselineskip12pt
\subsubsection{Avoiding Unnecessary Memory Allocation}
\baselineskip14pt

\clearpage
\section{CLEM Design Goals}

\baselineskip12pt
\subsection{CLEM Implementation Choices}
\baselineskip14pt

\clearpage
\section{Matrix Data Representation}

What options do we have for storing matrix data? The main choices are
lisp arrays or an external block of memory. Are there other options
here?

\baselineskip12pt
\subsection{Reification of Matrix Data Objects}
\baselineskip14pt
Deferring, for a moment, the question of what form the actual matrix
data will take, let us consider the form of the matrix object itself. It
could be the object that represents the data directly, or it could be an
object, such as an instance of a class or struct, that contains a
reference to the object that holds the data. In a sense, the simplest
approach to providing matrix arithmetic operations is just to use Common
Lisp arrays both to hold the data and to be the direct representation of
the matrix object. The CLEM design assumes that additional data, besides
the data values stored in the array, or what have you, will be needed,
and that just using a lisp array as the matrix itself is insufficient.
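
A reified matrix object along these lines might look like the following
sketch (hypothetical code; CLEM's actual classes are built with CLOS and
the MOP and carry more information than this):

\baselineskip12pt
\begin{verbatim}
;; Sketch only: a CLOS class wrapping the underlying array together
;; with whatever additional bookkeeping data is required.
(defclass sketch-matrix ()
  ((data :initarg :data :accessor matrix-data)
   (rows :initarg :rows :reader matrix-rows)
   (cols :initarg :cols :reader matrix-cols)))

(defun make-sketch-matrix (rows cols &key (element-type 'double-float))
  (make-instance 'sketch-matrix
                 :rows rows :cols cols
                 :data (make-array (list rows cols)
                                   :element-type element-type
                                   :initial-element
                                   (coerce 0 element-type))))
\end{verbatim}
\baselineskip14pt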

\baselineskip12pt
\subsection{Common Lisp Arrays}
\baselineskip14pt
Lisp arrays have the advantage that they can take advantage of the lisp
type system. Yes, an implementation may choose to ignore this
information, or to produce the same code as it would for untyped arrays,
but a sufficiently smart compiler, such as Python, should use the array
type information to produce efficient code. Of course, this also means
that in order to get this efficiency we have to provide the type
information to the compiler. As we try to modularize various pieces, it
is often the case that one would like generic code that can work on
matrices of any type. In these cases, additional measures may be needed
to coax the compiler into generating efficient code while doing so in a
generic manner.
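
One such measure is to generate the type-specialized code with a macro,
so that a single generic template yields a declared, specialized
function for each element type. The following is a sketch of that
technique, not CLEM's actual macrology:

\baselineskip12pt
\begin{verbatim}
;; Sketch: expand a type-specialized addition loop for a given
;; element type, so the compiler sees full declarations.
(defmacro def-add-loop (name element-type)
  `(defun ,name (a b c)
     (declare (type (simple-array ,element-type (* *)) a b c)
              (optimize (speed 3)))
     (destructuring-bind (rows cols) (array-dimensions a)
       (dotimes (i rows c)
         (dotimes (j cols)
           (setf (aref c i j) (+ (aref a i j) (aref b i j))))))))

(def-add-loop add/double-float double-float)
(def-add-loop add/single-float single-float)
\end{verbatim}
\baselineskip14pt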

\baselineskip12pt
\subsubsection{One-dimensional or Multi-dimensional Arrays?}
\baselineskip14pt
One issue in dealing with lisp arrays is whether to use the lisp
facility for multi-dimensional arrays. One argument in favor of native
multi-dimensional arrays is that the compiler can generate efficient
code to access data in certain multi-dimensional arrays, provided that
this information is known and passed to the compiler at compile
time. On the other hand, using one-dimensional arrays puts both the
burden of and the flexibility of computing array indices on the matrix
package.
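
With a one-dimensional backing store, the row-major index arithmetic
moves into the matrix package, roughly as in this hypothetical sketch:

\baselineskip12pt
\begin{verbatim}
;; Sketch: row-major indexing into a one-dimensional backing store.
(defun ref-1d (data cols i j)
  (declare (type (simple-array double-float (*)) data)
           (type fixnum cols i j)
           (optimize (speed 3)))
  (aref data (+ (* i cols) j)))
\end{verbatim}
\baselineskip14pt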

\baselineskip12pt
\subsubsection{Extensible arrays?}
\baselineskip14pt

\baselineskip12pt
\subsubsection{What About Lists?}
\baselineskip14pt
Lists are convenient for representing matrices in that the iteration
functions can be used to traverse the elements of the matrix, yielding
the famous trivial transpose operation using mapcar and list. However,
lists aren't designed for efficient random access and are a poor choice
for representing anything but trivially small matrices.
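
The trivial transpose referred to above looks like this for a matrix
represented as a list of rows:

\baselineskip12pt
\begin{verbatim}
;; Transposing a list-of-lists "matrix" with mapcar and list.
(apply #'mapcar #'list '((1 2 3)
                         (4 5 6)))
;; => ((1 4) (2 5) (3 6))
\end{verbatim}
\baselineskip14pt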

It is worth mentioning the possibility of other approaches, such as an
external block of memory, perhaps allocated either with non-standard
routines of the lisp system or via a foreign-function interface, where
the matrix package computes the offsets into this block of memory,
coerces the contents to a given lisp type, and obtains the results. This
approach is used by some libraries that use CLEM, such as ch-image, to
access matrix/array data stored in memory in a non-native-lisp form,
such as matlisp matrices (which are really BLAS/LAPACK matrices), fftw
matrices, and arrays of data from TIFF images. While this is nice to be
able to do, it is unlikely to be practical for storing matrix data,
given the alternative of using lisp arrays. Coercion of the contents of
the matrix to lisp types is done for us by optimized code in the
compiler, and it is unlikely that we would be able to do a better job of
this than the compiler. This approach is useful for converting matrix
data to other in-memory formats, but unlikely to be useful for the typed
lisp matrices for which CLEM is designed. If we were to go this route,
it would make sense to use other, optimized code libraries for operating
on the data, and this is what matlisp does, handing these blocks of
memory off to BLAS/LAPACK for processing in highly optimized routines
written in C, Fortran, or assembly language.
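
For completeness, a foreign block of memory might be handled along these
lines (a sketch using the CFFI library; this is not what CLEM itself
does):

\baselineskip12pt
\begin{verbatim}
;; Sketch only: allocate a foreign buffer of doubles, fill it,
;; and read one element back.  Requires CFFI.
(cffi:with-foreign-object (buf :double 16)
  (dotimes (i 16)
    (setf (cffi:mem-aref buf :double i) (float i 1.0d0)))
  (cffi:mem-aref buf :double 5))
;; => 5.0d0
\end{verbatim}
\baselineskip14pt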

\clearpage
\section{Matrix Data Access}

\baselineskip12pt
\subsection{Now that we have chosen a matrix representation, how do we
  access the data in it?}
\baselineskip14pt

\baselineskip12pt
\subsection{Slow and Fast Data Access Paths}
\baselineskip14pt

\baselineskip12pt
\subsection{Flexibility}
\baselineskip14pt

\baselineskip12pt
\subsubsection{Resizing Matrices}
\baselineskip14pt

\baselineskip12pt
\subsubsection{Matrix Index ``Recycling''}
\baselineskip14pt

\baselineskip12pt
\subsection{Macros}
\baselineskip14pt

\baselineskip12pt
\subsection{Compiler Macros}
\baselineskip14pt

\baselineskip12pt
\subsection{SBCL-specific Compiler Features}
\baselineskip14pt

\baselineskip12pt
\subsubsection{defknown/deftransform/defoptimizer}
\baselineskip14pt
\clearpage
\section{Benchmarks}

We'll start with some simple benchmarks of matrix operations and examine
the effect of the various implementation strategies for representing
matrices on these operations.

\baselineskip12pt
\subsection{2-Dimensional Lisp Arrays}
\baselineskip14pt

\baselineskip12pt
\begin{verbatim}
(defparameter b1
  (make-array '(1024 1024) :element-type 'double-float :initial-element 1.0d0
              :adjustable nil :fill-pointer nil))

(defparameter b2
  (make-array '(1024 1024) :element-type 'double-float :initial-element 1.0d0
              :adjustable nil :fill-pointer nil))

(defparameter b3
  (make-array '(1024 1024) :element-type 'double-float :initial-element 1.0d0
              :adjustable nil :fill-pointer nil))
\end{verbatim}
\baselineskip14pt
Now, our function to add two arrays, storing the result in a third:

\baselineskip12pt
\begin{verbatim}
(defun bench/add-matrix/aref (a1 a2 a3)
  (destructuring-bind (rows cols)
      (array-dimensions a1)
    (dotimes (i rows)
      (dotimes (j cols)
        (setf (aref a3 i j) (+ (aref a1 i j) (aref a2 i j)))))))
BENCH/ADD-MATRIX/AREF
\end{verbatim}
\baselineskip14pt
Now, we time how long it takes to run bench/add-matrix/aref:

\baselineskip12pt
\begin{verbatim}
(ch-util:time-to-string (bench/add-matrix/aref b1 b2 b3))
"Evaluation took:
  0.215 seconds of real time
  0.1958 seconds of user run time
  0.018307 seconds of system run time
  [Run times include 0.083 seconds GC run time.]
  0 calls to %EVAL
  0 page faults and
  50,331,032 bytes consed.
"
\end{verbatim}
\baselineskip14pt

\baselineskip12pt
\subsection{1-Dimensional Lisp Arrays}
\baselineskip14pt

\baselineskip12pt
\subsection{A CLOS object holding a reference to a 2-Dimensional Lisp
  Array}
\baselineskip14pt

\baselineskip12pt
\subsection{A CLOS object holding a reference to a 1-Dimensional Lisp
  Array}
\baselineskip14pt

\end{document}