\setbeamercovered{transparent}

\usepackage[english]{babel}
\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}

\title[CLS]{Back to the Future: Life beyond R, by going back to Lisp}
\subtitle{Using History to Design better data analysis systems}
\author[Rossini]{Anthony~(Tony)~Rossini}
\institute[Novartis and University of Washington]{
  Group Head, Modeling and Simulation Statistics\\
  Novartis Pharma AG, Switzerland
  \and
  Affiliate Assoc Prof, Biomedical and Health Informatics\\
  University of Washington, USA
}
\date[ISS, Sinica]{Institute of Statistical Science, Sept 2009, Taipei, Taiwan}
\subject{Statistical Computing Environments}
\begin{frame}{Historical Computing Languages}
  \begin{itemize}
  \item FORTRAN: FORmula TRANslator. The original numerical computing
    language, designed for clear implementation of numerical
    algorithms.
  \item LISP: LISt Processor. Associated with symbolic manipulation,
    AI, and knowledge-based approaches.
  \end{itemize}
  They represent the 2 generalized needs of statistical computing,
  which could be summarized as:
  \begin{enumerate}
  \item algorithms/numerics (``computation''),
  \item extraction, communication, and generation of knowledge
    (``data analysis'' principles and application).
  \end{enumerate}
  Most statistical computing work is closer to (1), and only
  indirectly supports (2). How do we support (2)?
\end{frame}
\begin{frame}[fragile]{Introducing Lisp notation}
\begin{verbatim}
;; This is a Lisp comment

'(a list of things to become data)
(list a list of things to become data)
(what-I-execute with-one-thing with-two-thing)

(my-fcn-name input1
             input2) ; and to create input1:
(my-fcn-name (my-fcn-name input3 input4)
             input2)
\end{verbatim}
\end{frame}
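% Added worked example of the quoting idea just introduced; this is
% standard ANSI Common Lisp, runnable in any conforming implementation.
\begin{frame}[fragile]{Quote vs.\ evaluation: a small worked example}
  To make the code/data duality concrete:
\begin{verbatim}
(+ 1 2)           ; => 3, the form is executed
'(+ 1 2)          ; => (+ 1 2), the same form as data
(first '(+ 1 2))  ; => +, data can be taken apart...
(eval '(+ 1 2))   ; => 3, ...and executed after all
\end{verbatim}
\end{frame}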
\begin{frame}{Current State}
  The current approaches to how statisticians interface with computers
  to perform data analysis can be put into 2 camps:
  \begin{enumerate}
  \item GUI with a spreadsheet (Excel/Minitab/Rcmdr without the
    console),
  \item applications programming (with an approach that Fortran
    programmers would not find strange).
  \end{enumerate}
  There are different levels of sophistication, and some merging.
\end{frame}
\begin{frame}{Human-Computer Interaction}
  Principle: Computers should eliminate tasks which are tedious,
  simple, computationally intensive, non-intellectual, and repetitive.

  What might those tasks be, from a statistician's perspective?
  \begin{itemize}
  \item searching for and discovering associated work (``Googling'')
  \item drafting documents (papers, WWW pages, computer programs)
  \item constructing the references for papers based on associated work
  \item computation-intensive tasks
  \item data-intensive tasks
  \end{itemize}
\end{frame}
\begin{frame}{What do you do?}
  The primary point of this talk:

  \emph{When you begin a statistical activity (methodological,
    theoretical, application/substantive) and go to the keyboard to
    work on a computer,}

  \centerline{\textbf{How should the computer react to you?}}
\end{frame}
\section{Computable Statistics}

\begin{frame}{Can we compute with them?}
  \begin{enumerate}
  \item Statistical Research (``Annals work'')
  \item Consulting, Applied Statistics, Scientific Honesty
  \item Re-implementation
  \end{enumerate}
  Can we ``compute'' with the information given? That is:
  \begin{itemize}
  \item do we have sufficient information to communicate enough for
    the right person to understand or recreate the effort?
  \item have we sufficient clarity to prevent misunderstandings about
    intentions and claims?
  \end{itemize}
\end{frame}
\begin{frame}[fragile]{Example 1: Theory\ldots}
  Let $f(x;\theta)$ describe the likelihood of XX under the following
  assumptions: \ldots

  Then if we use the following algorithm: \ldots

  then $\hat{\theta}$ should be $N(0,\hat\sigma^2)$ with the following
  characteristics\ldots
\end{frame}
\begin{frame}
  \frametitle{Can we compute, using this description?}
  Given the information at hand:
  \begin{itemize}
  \item we ought to have a framework for initial coding of the
    actual simulations (test-first!)
  \item the implementation is somewhat clear
  \item we should ask: what theorems have similar assumptions?
  \item we should ask: what theorems have similar conclusions but
    different assumptions?
  \end{itemize}
\end{frame}
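% Added sketch of the ``test-first'' idea above; deftest,
% simulate-estimator, and assert-distribution are imagined helpers,
% not part of any existing package.
\begin{frame}[fragile]{Test-first: what it might look like}
  A sketch of a test written before the simulation code exists
  (all helper names here are hypothetical):
\begin{verbatim}
(deftest normality-of-thetahat
  ;; simulate under assumption-1 and assumption-2, then
  ;; check the claimed limiting distribution of thetahat
  (let ((estimates (simulate-estimator #'likelihood
                                       :reps 1000)))
    (assert-distribution estimates 'normal
                         :mean 0
                         :variance sigma-hat-sq)))
\end{verbatim}
\end{frame}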
\begin{frame}[fragile]{Realizing Theory}
\begin{verbatim}
(define-theorem my-proposed-theorem
  (:theorem-type '(distribution-properties
                   frequentist likelihood))
  (:assumes '(assumption-1 assumption-2))
  ...
  (defun likelihood (data theta gamma)
    (exponential-family theta gamma)))
  ...
  (compute-start-values thetahat gammahat)
  ...
  (or (step-1 thetahat)
      (step-2 gammahat))))))
  (:claim (equal-distr '(thetahat gammahat) 'normal))))
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{It would be nice to have}
\begin{verbatim}
(theorem-veracity 'my-proposed-theorem)
\end{verbatim}
  returning some indication of how well it met the given computable
  claims, modulo what proportion of those claims could be tested,
  \begin{itemize}
  \item and have it run some illustrative simulations which suggest
    settings which might be problematic in real situations, and real
    situations for which there are no problems,
  \item and work through some of the logic based on related claims
    using identical assumptions, to confirm some of the results.
  \end{itemize}
\end{frame}
\begin{frame}[fragile]{and why not...?}
\begin{verbatim}
(when (> (theorem-veracity
          'my-proposed-theorem)
         ...)
  (make-draft-paper 'my-proposed-theorem
                    ...))
\end{verbatim}
\end{frame}
\begin{frame}{Comments}
  \begin{itemize}
  \item Of course the general problem is very difficult, but one must
    start somewhere.
  \item Key requirement: a statistical ontology (assumptions, \ldots)
  \item Current activity: basic statistical proofs of concept (not
    finished): t-test, linear regression (LS-based, Normal-Normal
    Bayesian).
  \item Goal: results, with a reminder of the assumptions and how well
    the situation meets them.
    \emph{(metadata for both data and procedure: how well do they match
      in describing requirements, and how well are the requirements
      met?)}
  \item Areas targeted for the medium-term future: resampling methods
    and \ldots
  \end{itemize}
\end{frame}
\begin{frame}
  \frametitle{Example 2: Practice\ldots}
  The dataset comes from a series of clinical trials, some with active
  control and others using placebo control. We model the primary
  endpoint, ``relief'', as a binary random variable. There is a random
  trial effect on relief, as well as on severity, due to differences
  in recruitment and inclusion/exclusion criteria from 2 different
  networks.
\end{frame}
\begin{frame}
  \frametitle{Can we compute, using this description?}
  \begin{itemize}
  \item With a real such description, it is clear what some of the
    potential models might be for this dataset.
  \item It should be clear how to start thinking of a data dictionary.
  \end{itemize}
\end{frame}
\begin{frame}[fragile]{Can we compute?}
\begin{verbatim}
(dataset-metadata paper-1
  :context 'clinical-trial 'randomized
           'active-ctrl 'placebo-ctrl 'metaanalysis
  :variables '((relief :model-type dependent ...)
               (trial :model-type independent ...))
  :metadata '(incl-crit-net1 excl-crit-net1
              incl-crit-net2 excl-crit-net2
              recr-rate-net1 recr-rate-net2))

(propose-analysis paper-1)
; => (list 'tables '(logistic-regression))
\end{verbatim}
\end{frame}
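% Added sketch of how the propose-analysis call above might dispatch
% on the declared metadata; every helper here (find-if predicate,
% variables, variable-type) is hypothetical, not an existing CLS API.
\begin{frame}[fragile]{How propose-analysis might work}
  A sketch, dispatching on the declared type of the dependent
  variable (all helper names hypothetical):
\begin{verbatim}
(defun propose-analysis (paper)
  ;; inspect declared variables, pick analyses by type
  (let ((dep (find-if #'dependent-variable-p
                      (variables paper))))
    (list 'tables
          (case (variable-type dep)
            (binary     '(logistic-regression))
            (count      '(poisson-regression))
            (continuous '(linear-regression))))))
\end{verbatim}
\end{frame}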
\begin{frame}{Example 3: The Round-trip\ldots}
  The first examples describe ``ideas $\rightarrow$ code''.

  Consider the last time you read someone else's implementation of a
  statistical procedure (i.e.\ R package code). When you read the
  code, could you determine:
  \begin{itemize}
  \item the assumptions used?
  \item the algorithm implemented?
  \item practical guidance for when you might select the algorithm
    over others?
  \item practical guidance for when you might select the
    implementation over others?
  \end{itemize}
  These are usually components of any reasonable journal article.
  \textit{(Q: have you actually read an R package that wasn't yours?)}
\end{frame}
\begin{frame}{Exercise left to the reader!}
  % (aside: I have been looking at the \textbf{stats} and \textbf{lme4}
  % packages recently -- \textit{for me}, very clear numerically, much
  % less so statistically)
\end{frame}
\begin{frame}{Point of Examples}
  \begin{itemize}
  \item Few statistical concepts are ``computable'' with existing
    systems.
  \item Some of this work is computable; let the computer do it.
  \item There is little point in having people re-do the basics.
  \item Computing environments for statistical work have been stable
    for far too long, and limit the development and implementation of
    better, more efficient, and more appropriate methods by allowing
    people to be lazy (the classic example being papers published on
    changes which are minimal from a methodological/theoretical view,
    but very difficult from an implementation/practical view).
  \end{itemize}
\end{frame}
\begin{frame}{Issues which arise when computing\ldots}
  \begin{itemize}
  \item relevant substantive issues (causality, variable independence,
    design issues such as sampling strategies) are not incorporated.
  \item irrelevant substantive issues (coding, wide vs.\ long data
    collection, other non-statistical data management issues) become
    statistically relevant.
  \item little support for encoding theoretical considerations
    (``expert systems'' for guidance); these must be hard-coded in and
    hard-coded away (``stars for p-values as evil''). It is nearly
    impossible to construct and apply philosophical opinions to ensure
    appropriate use (and training through use) of singular or
    personalized mixtures of statistical philosophies when doing data
    analysis (or methodological or theoretical development).
  \end{itemize}
\end{frame}
\begin{frame}{Problem Statement}
  How can statistical computing environments support appropriate use
  of statistical concepts (algorithmic, knowledge-centric,
  knowledge-managing, philosophical discipline), so that the computing
  structure doesn't rely on data-munging or numerical skill?
\end{frame}
\begin{frame}{Goals for this Talk}{(define, strategic approach, \ldots)}
  \begin{itemize}
  \item To describe the concept of \alert{computable statistics},
    placing it in a historical context.
  \item To demonstrate that \alert{a research program} implemented
    through simple steps can increase the efficiency of statistical
    computing approaches by clearly describing both:
    \begin{itemize}
    \item numerical characteristics of procedures,
    \item statistical concepts driving them.
    \end{itemize}
  \item To justify that the \alert{approach is worthwhile} and
    represents a staged effort towards \alert{increased use of best
      practices} and efficient tech transfer of modern statistical
    theory (i.e.\ why must we wait 10 years for Robins' estimation
    methods?).
  \end{itemize}
  (unfortunately, the last is still incomplete)
\end{frame}
\begin{frame}[fragile]{Why not use R?}
  \begin{itemize}
  \item the R programming language is incomplete and constantly being
    redefined; Common Lisp is an old formal standard, proven
    through \ldots
  \item R isn't compiled; standalone application delivery is difficult.
  \item Without parens, Common Lisp could be R (interactive, or batch,
    or through ``compiled applications'').
  \item R is the Microsoft of statistical computing.
  \item R has problems which can't be fixed due to its sizeable user
    base (\verb+library(nlme)+ vs.\
    \verb+nlme<-''lme4'' , library(nlme)+)
  \item Evolutionary development requires strawmen to challenge \ldots
  \end{itemize}
\end{frame}
\section{CLS: Current Status}

\begin{frame}{Is it Vaporware? Not quite}
  The following is possible with the help of the open-source Common
  Lisp community, who provided most of the packages, tools, and glue
  (Tamas Papp, Raymond Toy, Mark Hoemmen, and many, many others).
  Most of the underlying code was written by others, and
  ``composed'' \ldots
\end{frame}
\subsection{Graphics}
\label{sec:work:graphics}

\begin{frame}{Silly Visualization Example}
  \includegraphics[width=2.8in,height=2.8in]{./test1.png}
\end{frame}
\begin{frame}[fragile]{Defining Plot Structure}
\begin{verbatim}
(defparameter *frame2*
  (as-frame (create-xlib-image-context 200 200)
            :background-color +white+))

(bind ((#2A((f1 f2) (f3 f4))
        (split-frame *frame2* ...)))
  (defparameter *f1* f1)   ; lower left
  (defparameter *f2* f2)   ; lower right   f3 f4
  (defparameter *f3* f3)   ; top left      f1 f2
  (defparameter *f4* f4))  ; top right
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{Plotting Functions}
\begin{verbatim}
(plot-function *f1* #'sin (interval-of 0 2)
               :x-title "x" :y-title "sin(x)")
(plot-function *f2* #'cos (interval-of 0 2)
               :x-title "x" :y-title "cos(x)")
(plot-function *f3* #'tan (interval-of 0 2)
               :x-title "x" :y-title "tan(x)")
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{Plotting Data}
\begin{verbatim}
(... :from 0 :to 10 :length n))
(... #'(lambda (x) (+ x 8 (random 4.0)))
     ...)
(... (replicate #'(lambda () (1+ (random 10)))
     ...))
(da (plot-simple *f4* ...
                 :x-title "x" :y-title "y")))
(draw-symbols da xs ys :weights weights))
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{Copying existing graphics}
  And we generated the figure on the first page by:
\begin{verbatim}
(xlib-image-context-to-png ...
                           "/home/tony/test1.png")
\end{verbatim}
\end{frame}
\subsection{Statistical Models}
\label{sec:work:statmod}

\begin{frame}[fragile]{Linear Regression}
\begin{verbatim}
;; Worse than LispStat, wrapping LAPACK's dgelsy:
(defparameter *result1*
  (lm (list->vector-like iron)
      (list->vector-like absorbtion)))
;; =>
((#<LA-SIMPLE-VECTOR-DOUBLE (2 x 1)
  ...
  #<LA-SIMPLE-MATRIX-DOUBLE 2 x 2
    9.730392177126686e-6   -0.001513787114206932
   -0.001513787114206932    0.30357851215706255>
  ...)
\end{verbatim}
\end{frame}
\subsection{Data Manip/Mgmt}
\label{sec:work:data}

\begin{frame}[fragile]{DataFrames}
\begin{verbatim}
(defparameter *my-df-1*
  (make-instance 'dataframe-array
     :storage #2A((1 2 3 4 5) (10 20 30 40 50))
     :doc "This is a boring dataframe-array"
     :case-labels (list "x" "y")
     :var-labels (list "a" "b" "c" "d" "e")))

(xref *my-df-1* 0 0)  ; API change in progress
(setf (xref *my-df-1* 0 0) -1d0)
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{Numerical Matrices}
\begin{verbatim}
(defparameter *mat-1*
  (... :initial-contents #2A((2d0 3d0 -4d0)
                             ...)))

(xref *mat-1* 2 0)  ; => 4d0  ; API change
(setf (xref *mat-1* 2 0) -4d0)

(... (make-vector 4 :type :row
       :initial-contents '((1d0 3d0 2d0 4d0))))
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{Macros make the above tolerable}
\begin{verbatim}
(... (make-vector 4 :type :row
       :initial-contents '((1d0 3d0 2d0 4d0))))

; can use defmacro for the following syntax =>
(make-row-vector *xv* '((1d0 3d0 2d0 4d0)))

; or reader macros for the following:
#mrv(*xv* '((1d0 3d0 2d0 4d0)))
\end{verbatim}
\end{frame}
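% Added sketch of one way the make-row-vector macro above could be
% written; assumes the make-vector signature shown on the previous
% slide, and the expansion target is illustrative only.
\begin{frame}[fragile]{One way to write such a macro}
  A minimal sketch (assuming the \texttt{make-vector} call shown
  above):
\begin{verbatim}
(defmacro make-row-vector (name contents)
  ;; contents is a quoted nested list,
  ;; e.g. '((1d0 3d0 2d0 4d0))
  `(defparameter ,name
     (make-vector ,(length (first (second contents)))
                  :type :row
                  :initial-contents ,contents)))

;; (make-row-vector *xv* '((1d0 3d0 2d0 4d0)))
;; expands to the defparameter/make-vector form above
\end{verbatim}
\end{frame}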
\begin{frame}{Why CLS?}
  \begin{itemize}
  \item a component-based structure for statistical computing,
    allowing for small and specific specification.
  \item a means to drive philosophically customized data analysis, and
    to enforce a structure allowing simple comparisons between
    approaches.
  \item This is a ``customization'' through packages to support
    statistical computing, not an independent language. ``A la
    carte'', \ldots
  \end{itemize}
\end{frame}
\subsection{Implementation Plans}

\begin{frame}{Current Functionality}
  \begin{itemize}
  \item basic dataframes (similar to R); subsetting API under
    development
  \item basic regression (similar to XLispStat)
  \item matrix storage in both foreign and Lisp-centric memory areas
  \item LAPACK (small percentage, increasing), working with both
    storage types
  \item static graphics (X11) including preliminary grid functionality
    based on CAIRO; generation of PNG files from graphics windows
  \item CSV file support
  \end{itemize}
\end{frame}
\begin{frame}[fragile]{Computational Environment Supported}
  \begin{itemize}
  \item works on Linux, with recent SBCL versions
  \item definitely works on bleeding-edge Debian (unstable)
  \item has worked, for weak definitions of ``work'', on 4 different
    people's computers (not quite, but it sort of requires a
    \verb+/home/tony/+ !)
  \end{itemize}
\end{frame}
\begin{frame}{Implementation Plans}
  \begin{itemize}
  \item Better integration of data structures with statistical routines
    (auto-handling with dataframes, rather than manual parsing)
  \item dataframe to model-matrix tools (leveraging old XLispStat
    GEE \ldots)
  \item Support for other Common Lisps
  \item Cleaner front-end API to matrices and numerical algorithms
  \item constraint system for different statistical algorithm
    development, as well as for interactive GUIs and graphics
  \item LispStat compatible (object system in progress, GUI to do)
  \item Integrated invisible parallelization when more efficient
    (multi-core, threading, and user-space systems)
  \end{itemize}
\end{frame}
\begin{frame}[fragile]{What does NOT work?}
  Primarily, the reason that we are doing this:

  \textbf{Computable and Executable Statistics}
\begin{verbatim}
<car brand="honda" engine="4cyl">accord</car>

; data follows keywords...
(car :brand 'honda :engine "4cyl" accord)
\end{verbatim}
\end{frame}
\begin{frame}{Why use Common Lisp?}
  \begin{itemize}
  \item Parens provide clear delineation of a \textbf{Complete
      Thought} (functional programming with side effects).
  \item Lisp-2 (a symbol can name both a function and a variable).
  \item ANSI standard (built by committee, but the committee
    was \ldots).
  \item Many implementations.
  \item Most implementations are interactive \textbf{compiled}
    languages (few are interpreted; nearly all are byte-compiled).
  \item The original \emph{Programming with Data} language
    (\emph{Programs are Data} and \emph{Data are Executable} apply).
  \item Advanced, powerful, first-class macros (macros functionally
    rewrite code, allowing for structural clarity and complete
    destruction of syntax, should that be reasonable).
  \end{itemize}
\end{frame}
\begin{frame}{Available Common Lisp Packages}
  (They are packages and called packages, not libraries. Some
  people \ldots)
  \begin{itemize}
  \item infrastructure \emph{enhancements}: infix notation, data
    structures, control and flow structures
  \item numerics, graphics, GUIs
  \item primitive R-to-CL compiler (which could also be considered an
    object-code compiler for R); 3 interfaces which embed R within CL
  \item Web 2.0 support and TeX-like reporting facilities for PDF
    output
  \end{itemize}
  See \url{http://www.common-lisp.net/} and
  \url{http://www.cliki.org/}. CLS sources can be found at
  \url{http://github.com/blindglobe/}.
\end{frame}
\begin{frame}{Conclusion}
  This slowly developing research program aims at a statistical
  computing system which enables sophisticated statistical research
  that can be readily transferred to applications, and which is
  supportable.

  Related numerical/statistical projects:
  \begin{itemize}
  \item Incanter: an R/LispStat/Omegahat-like system for Clojure (a
    Lisp on the JVM)
  \item FEMLisp: system/workshop for finite-element analysis modeling
  \item matlisp/LispLab: LAPACK-based numerical linear algebra packages
  \item GSLL: GNU Scientific Library, Lisp interface
  \item RCL, RCLG, CLSR (embedding R within Common Lisp)
  \end{itemize}
  Bill Gates, 2004: ``the next great innovation will be data
  integration''. Perhaps this will be followed by statistics.
\end{frame}
\begin{frame}{What can you do to follow up?}
  \begin{itemize}
  \item Read: for an introduction to Common Lisp, Paul Graham's
    \emph{ANSI Common Lisp} is an enjoyable book with a boring title
    (and the best intro to S4 classes around); also \emph{Practical
      Common Lisp}, by Peter Seibel.
  \item Consider: how a computing environment could better support
    features in the research you do (event-time data, design,
    longitudinal data modeling, missing and coarsened data, multiple
    comparisons, feature selection).
  \end{itemize}
  The next stage of reproducible research will require computable
  statistics (code that explains itself and can be parsed to generate
  knowledge about its claims; ``XML's promise'').
\end{frame}