\documentclass{beamer}

\mode<presentation>

\usetheme{Warsaw}
\setbeamercovered{transparent}

\usepackage[english]{babel}
\usepackage[latin1]{inputenc}
\usepackage{times}
\usepackage[T1]{fontenc}
\usepackage{url}

\title[CLS]{Back to the Future: Life beyond R, by going back to Lisp}
\subtitle{Using History to Design Better Data Analysis Systems}
\author[Rossini]{Anthony~(Tony)~Rossini}
\institute[Novartis and University of Washington]{
  Group Head, Modeling and Simulation Statistics\\
  Novartis Pharma AG, Switzerland
  \and
  Affiliate Assoc Prof, Biomedical and Health Informatics\\
  University of Washington, USA}
\date[ISS, Sinica]{Institute of Statistical Science, Sept 2009, Taipei, Taiwan}
\subject{Statistical Computing Environments}

\begin{document}

\begin{frame}
  \titlepage
\end{frame}
\section{Orientation}

\begin{frame}{Historical Computing Languages}
  \begin{itemize}
  \item FORTRAN : FORmula TRANslator.  The original numerical computing
    language, designed for clear implementation of numerical
    algorithms.
  \item LISP : LISt Processor.  Associated with symbolic manipulation,
    AI, and knowledge-based approaches.
  \end{itemize}
  These represent the two general needs of statistical computing,
  which can be summarized as
  \begin{enumerate}
  \item algorithms/numerics (``computation''), and
  \item extraction, communication, and generation of knowledge (``data
    analysis'' principles and applications).
  \end{enumerate}
  Most statistical computing work is closer to (1), and only
  indirectly supports (2).  How do we support (2)?
\end{frame}
\begin{frame}[fragile]{Introducing Lisp notation}
\begin{verbatim}
;; This is a Lisp comment
;; and so is this

'(a list of things to become data)
(list a list of things to become data)
(what-I-execute with-one-thing with-two-thing)
;; that is:
(my-fcn-name input1
             input2) ; and to create input1:
(my-fcn-name (my-fcn-name input3 input4)
             input2)
\end{verbatim}
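  A minimal, standard Common Lisp illustration of the
  code-versus-data distinction sketched above (nothing CLS-specific):
\begin{verbatim}
(+ 1 (* 2 3))           ; a form to execute  => 7
'(+ 1 (* 2 3))          ; the same form, quoted: just data
                        ; => (+ 1 (* 2 3))
(first '(+ 1 (* 2 3)))  ; => +
\end{verbatim}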
\end{frame}
\begin{frame}{Current State}
  The current approaches to how statisticians interface with computers
  to perform data analysis fall into two camps:
  \begin{enumerate}
  \item GUI with a spreadsheet (Excel/Minitab/Rcmdr without the
    script), and
  \item applications programming (with an approach that Fortran
    programmers would not find strange).
  \end{enumerate}
  There are different levels of sophistication, and some merging.
\end{frame}
\begin{frame}{Human-Computer Interaction}
  Principle: computers should eliminate tasks which are tedious, simple,
  computationally intensive, non-intellectual, and repetitive.

  What might those tasks be, from a statistician's perspective?
  \begin{itemize}
  \item searching for and discovering associated work (``Googling'')
  \item drafting documents (papers, WWW pages, computer programs) in
    proper formats
  \item constructing the references for papers based on associated work
  \item computation-intensive tasks
  \item data-intensive tasks
  \end{itemize}
\end{frame}
\begin{frame}{What do you do?}
  The primary point of this talk:

  \vspace*{1cm}

  \emph{When you begin a statistical activity (methodological,
    theoretical, applied/substantive) and go to the keyboard to work
    on a computer,}

  \vspace*{1cm}

  \centerline{\textbf{How should the computer react to you?}}
\end{frame}
\section{Computable Statistics}

\begin{frame}{Can we compute with them?}
  Three toy examples:
  \begin{itemize}
  \item Statistical research (``Annals work'')
  \item Consulting, applied statistics, scientific honesty
  \item Re-implementation
  \end{itemize}
  Can we ``compute'' with the information given?  That is:
  \begin{itemize}
  \item do we have sufficient information to communicate enough for
    the right person to understand or recreate the effort?
  \item have we sufficient clarity to prevent misunderstandings about
    intentions and claims?
  \end{itemize}
\end{frame}
\begin{frame}[fragile]{Example 1: Theory\ldots}
  \label{example1}
  Let $f(x;\theta)$ describe the likelihood of XX under the following
  assumptions:
  \begin{enumerate}
  \item assumption-1
  \item assumption-2
  \end{enumerate}
  Then, if we use the following algorithm:
  \begin{enumerate}
  \item step-1
  \item step-2
  \end{enumerate}
  then $\hat{\theta}$ should be $N(0,\hat\sigma^2)$ with the following
  characteristics\ldots
\end{frame}
\begin{frame}
  \frametitle{Can we compute, using this description?}
  Given the information at hand:
  \begin{itemize}
  \item we ought to have a framework for initial coding of the
    actual simulations (test-first!)
  \item the implementation is somewhat clear
  \item we should ask: what theorems have similar assumptions?
  \item we should ask: what theorems have similar conclusions but
    different assumptions?
  \end{itemize}
\end{frame}
\begin{frame}[fragile]{Realizing Theory}
{\small
\begin{verbatim}
(define-theorem my-proposed-theorem
  (:theorem-type '(distribution-properties
                   frequentist likelihood))
  (:assumes '(assumption-1 assumption-2))
  (:likelihood-form
   (defun likelihood (data theta gamma)
     (exponential-family theta gamma)))
  (:compute-by
   '(progn
      (compute-start-values thetahat gammahat)
      (until (convergence)
        (setf convergence
              (or (step-1 thetahat)
                  (step-2 gammahat))))))
  (:claim (equal-distr '(thetahat gammahat) 'normal)))
\end{verbatim}
}
\end{frame}
\begin{frame}[fragile]{It would be nice to have}
\begin{verbatim}
(theorem-veracity 'my-proposed-theorem)
\end{verbatim}
  returning some indication of how well the theorem meets its
  computable claims, modulo the proportion of claims that can be
  tested, and to have it:
  \begin{itemize}
  \item run some illustrative simulations which suggest real
    situations where the claims might be problematic, and real
    situations for which there are no problems; and
  \item work through some of the logic, based on related claims with
    identical assumptions, to confirm some of the results (see the
    sketch below).
  \end{itemize}
\end{frame}
\begin{frame}[fragile]{and why not...?}
\begin{verbatim}
(when (> (theorem-veracity
          'my-proposed-theorem)
         0.8)
  (make-draft-paper 'my-proposed-theorem
                    :style :JASA
                    :output-formats
                    '(LaTeX MSWord)))
\end{verbatim}
\end{frame}
\begin{frame}{Comments}
  \begin{itemize}
  \item Of course the general problem is very difficult, but one must
    start somewhere.
  \item Key requirement: a statistical ontology (assumptions,
    conditions); a sketch follows on the next slide.
  \item Current activity: basic statistical proofs of concept (not
    finished): t-test, linear regression (LS-based, Normal-Normal
    Bayesian).
  \item Goal: results, with a reminder of the assumptions and how well
    the situation meets them
    \emph{(metadata for both data and procedure: how well do they match
      in describing requirements, and how well are the requirements
      met?)}
  \item Areas targeted for the medium-term future: resampling methods
    and similar algorithms.
  \end{itemize}
\end{frame}
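\begin{frame}[fragile]{Sketching an ontology entry}
  Purely illustrative: the macro name and keywords below are
  assumptions in the spirit of \texttt{define-theorem} above, not an
  existing CLS interface.
\begin{verbatim}
(define-assumption iid-observations
  (:applies-to 'dataset)
  (:checkable-by '(exchangeability-diagnostics))
  (:required-by '(t-test linear-regression)))
\end{verbatim}
\end{frame}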
\begin{frame}
  \frametitle{Example 2: Practice\ldots}
  \label{example2}
  The dataset comes from a series of clinical trials, some with active
  control and others using placebo control.  We model the primary
  endpoint, ``relief'', as a binary random variable.  There is a
  random trial effect on relief, as well as on severity, due to
  differences in recruitment and inclusion/exclusion criteria between
  two different trial networks.
\end{frame}
\begin{frame}
  \frametitle{Can we compute, using this description?}
  \begin{itemize}
  \item With a real description of this kind, it is clear what some of
    the potential models for this dataset might be.
  \item It should be clear how to start thinking of a data dictionary
    for this problem.
  \end{itemize}
\end{frame}
\begin{frame}[fragile]{Can we compute?}
\begin{verbatim}
(dataset-metadata paper-1
  :context '(clinical-trial randomized
             active-ctrl placebo-ctrl metaanalysis)
  :variables '((relief :model-type dependent
                       :distr binary)
               (trial :model-type independent
                      :distr categorical)
               (disease-severity))
  :metadata '(incl-crit-net1 excl-crit-net1
              incl-crit-net2 excl-crit-net2
              recr-rate-net1 recr-rate-net2))
(propose-analysis paper-1)
; => (list 'tables '(logistic-regression))
\end{verbatim}
\end{frame}
\begin{frame}{Example 3: The Round-trip\ldots}
  \label{example3}
  The first examples describe ``ideas $\rightarrow$ code''.

  Consider the last time you read someone else's implementation of a
  statistical procedure (e.g.\ R package code).  When you read the
  code, could you see:
  \begin{itemize}
  \item the assumptions used?
  \item the algorithm implemented?
  \item practical guidance for when you might select the algorithm
    over others?
  \item practical guidance for when you might select the
    implementation over others?
  \end{itemize}
  These are usually components of any reasonable journal article.
  \textit{(Q: have you actually read an R package that wasn't yours?)}
\end{frame}
\begin{frame}{Exercise left to the reader!}
  % (aside: I have been looking at the \textbf{stats} and \textbf{lme4}
  % packages recently -- \textit{for me}, very clear numerically, much
  % less so statistically)
\end{frame}
\begin{frame}{Point of Examples}
  \begin{itemize}
  \item Few statistical concepts are ``computable'' with existing
    systems.
  \item Some of this work is computable; let the computer do it.
  \item There is little point in having people re-do the basics.
  \item Computing environments for statistical work have been stable
    for far too long, and they limit the development and
    implementation of better, more efficient, and more appropriate
    methods by allowing people to be lazy (the classic example being
    papers on changes which are minimal from a
    methodological/theoretical view, but very difficult from an
    implementation/practical view).
  \end{itemize}
\end{frame}
\begin{frame}{Issues which arise when computing...}
  \begin{enumerate}
  \item Relevant substantive issues (causality, variable independence,
    design issues such as sampling strategies) are not incorporated.
  \item Irrelevant substantive issues (coding, wide vs.\ long
    collection, other non-statistical data management issues) become
    statistically relevant.
  \item There is little support for encoding theoretical considerations
    (``expert systems'' for guidance).  They must be hard-coded in and
    hard-coded away (``stars for p-values are evil'').  It is nearly
    impossible to construct and apply philosophical opinions to ensure
    appropriate use (and training through use) of singular or
    personalized mixtures of statistical philosophies when doing data
    analysis (or methodological or theoretical development).
  \end{enumerate}
\end{frame}
\begin{frame}{Problem Statement}
  How can statistical computing environments support appropriate use
  of statistical concepts (algorithmic, knowledge-centric,
  knowledge-managing, philosophical discipline), so that the computing
  structure doesn't rely on data-munging or numerical skill?
\end{frame}
\begin{frame}{Goals for this Talk}{(define, strategic approach, justify)}
  \begin{itemize}
  \item To describe the concept of \alert{computable statistics},
    placing it in a historical context.
  \item To demonstrate that \alert{a research program}
    implemented through simple steps can increase the efficiency of
    statistical computing approaches by clearly describing both:
    \begin{itemize}
    \item the numerical characteristics of procedures, and
    \item the statistical concepts driving them.
    \end{itemize}
  \item To justify that the \alert{approach is worthwhile} and
    represents a staged effort towards \alert{increased use of best
      practices} and efficient technology transfer of modern
    statistical theory (i.e.\ why must we wait 10 years for Robins'
    estimation approaches?).
  \end{itemize}
  (Unfortunately, the last is still incomplete.)
\end{frame}
\begin{frame}[fragile]{Why not use R?}
  \begin{itemize}
  \item The R programming language is incomplete and constantly being
    redefined.  Common Lisp is an old formal standard, proven through
    many implementations.
  \item R isn't compiled; standalone application delivery is difficult.
  \item Without the parens, Common Lisp could be R (interactive, or
    batch, or through ``compiled applications'').
  \item R is the Microsoft of statistical computing.
  \item R has problems which can't be fixed due to sizeable user
    populations
    (\verb+library(nlme)+ vs.\ \verb+nlme <- "lme4"; library(nlme)+).
  \item Evolutionary development requires strawmen to challenge.
  \end{itemize}
\end{frame}
\section{CLS: Current Status}
\label{sec:work}

\begin{frame}{Is it Vaporware? Not quite}
  The following is possible with the help of the open source Common
  Lisp community, who provided most of the packages, tools, and glue
  (Tamas Papp, Raymond Toy, Mark Hoemmomem, and many, many others).
  Most of the underlying code was written by others and ``composed''
  by me.
\end{frame}
\subsection{Graphics}
\label{sec:work:graphics}

\begin{frame}{Silly Visualization Example}
  \includegraphics[width=2.8in,height=2.8in]{./test1.png}
\end{frame}
\begin{frame}[fragile]{Defining Plot Structure}
\begin{verbatim}
(defparameter *frame2*
  (as-frame (create-xlib-image-context 200 200)
            :background-color +white+))

(bind ((#2A((f1 f2) (f3 f4))
        (split-frame *frame2*
                     (percent 50)
                     (percent 50))))
  (defparameter *f1* f1)   ; lower left
  (defparameter *f2* f2)   ; lower right     f3 f4
  (defparameter *f3* f3)   ; top left        f1 f2
  (defparameter *f4* f4))  ; top right
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{Plotting Functions}
\begin{verbatim}
(plot-function *f1* #'sin (interval-of 0 2)
               :x-title "x" :y-title "sin(x)")
(plot-function *f2* #'cos (interval-of 0 2)
               :x-title "x" :y-title "cos(x)")
(plot-function *f3* #'tan (interval-of 0 2)
               :x-title "x" :y-title "tan(x)")
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{Plotting Data}
{\small
\begin{verbatim}
(let* ((n 500)
       (xs (num-sequence
            :from 0 :to 10 :length n))
       (ys (map 'vector
                #'(lambda (x) (+ x 8 (random 4.0)))
                xs))
       (weights
        (replicate #'(lambda () (1+ (random 10)))
                   n 'fixnum))
       (da (plot-simple *f4*
                        (interval-of 0 10)
                        (interval-of 10 20)
                        :x-title "x" :y-title "y")))
  (draw-symbols da xs ys :weights weights))
\end{verbatim}
}
\end{frame}
\begin{frame}[fragile]{Copying existing graphics}
  And we generated the figure on the first page by:
\begin{verbatim}
(xlib-image-context-to-png
 (context *f1*)
 "/home/tony/test1.png")
\end{verbatim}
\end{frame}
\subsection{Statistical Models}
\label{sec:work:statmod}

\begin{frame}[fragile]{Linear Regression}
{\small
\begin{verbatim}
;; Worse than LispStat, wrapping LAPACK's dgelsy:
(defparameter *result1*
  (lm (list->vector-like iron)
      (list->vector-like absorbtion)))
*result1* =>
((#<LA-SIMPLE-VECTOR-DOUBLE (2 x 1)
   -11.504913191235342
   0.23525771181009483>
  #<LA-SIMPLE-MATRIX-DOUBLE 2 x 2
   9.730392177126686e-6 -0.001513787114206932
   -0.001513787114206932 0.30357851215706255>
  13 2)
\end{verbatim}
}
\end{frame}
\subsection{Data Manip/Mgmt}
\label{sec:work:data}

\begin{frame}[fragile]{DataFrames}
{\small
\begin{verbatim}
(defparameter *my-df-1*
  (make-instance 'dataframe-array
     :storage #2A((1 2 3 4 5) (10 20 30 40 50))
     :doc "This is a boring dataframe-array"
     :case-labels (list "x" "y")
     :var-labels (list "a" "b" "c" "d" "e")))

(xref *my-df-1* 0 0)             ; API change in progress
(setf (xref *my-df-1* 0 0) -1d0)
\end{verbatim}
}
\end{frame}
\begin{frame}[fragile]{Numerical Matrices}
{\small
\begin{verbatim}
(defparameter *mat-1*
  (make-matrix 3 3
     :initial-contents #2A((2d0 3d0 -4d0)
                           (3d0 2d0 -4d0)
                           (4d0 4d0 -5d0))))

(xref *mat-1* 2 0)              ; => 4d0 ; API change
(setf (xref *mat-1* 2 0) -4d0)

(defparameter *xv*
  (make-vector 4 :type :row
     :initial-contents '((1d0 3d0 2d0 4d0))))
\end{verbatim}
}
\end{frame}
\begin{frame}[fragile]{Macros make the above tolerable}
\begin{verbatim}
(defparameter *xv*
  (make-vector 4 :type :row
     :initial-contents '((1d0 3d0 2d0 4d0))))

;; can use defmacro for the following syntax =>
(make-row-vector *xv* '((1d0 3d0 2d0 4d0)))

;; or reader macros for the following:
#mrv(*xv* '((1d0 3d0 2d0 4d0)))
\end{verbatim}
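  One possible definition, as a minimal sketch assuming the
  \texttt{make-vector} call shown above (not the CLS implementation):
\begin{verbatim}
(defmacro make-row-vector (name contents)
  ;; expands into the defparameter/make-vector
  ;; boilerplate written out by hand above
  `(defparameter ,name
     (make-vector (length (first ,contents))
                  :type :row
                  :initial-contents ,contents)))
\end{verbatim}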
\end{frame}
\subsection{Why?}

\begin{frame}{Why CLS?}
  \begin{itemize}
  \item a component-based structure for statistical computing,
    allowing for small and specific specification.
  \item a means to drive philosophically customized data analysis, and
    to enforce a structure that allows simple comparisons between
    methodologies.
  \item This is a ``customization'' through packages to support
    statistical computing, not an independent language.  ``A la
    carte'', not ``menu''.
  \end{itemize}
\end{frame}
\subsection{Implementation Plans}
\label{sec:CLS:impl}

\begin{frame}{Current Functionality}
  \begin{itemize}
  \item basic dataframes (similar to R); subsetting API under
    development
  \item basic regression (similar to XLispStat)
  \item matrix storage in both foreign and Lisp-centric memory areas
  \item LAPACK (a small but increasing percentage), working with both
    matrix storage types
  \item static graphics (X11), including preliminary grid functionality
    based on CAIRO; generation of PNG files from graphics windows
  \item CSV file support
  \item Common Lisp!
  \end{itemize}
\end{frame}
\begin{frame}[fragile]{Computational Environment Supported}
  \begin{itemize}
  \item works on Linux, with recent SBCL versions
  \item definitely works on bleeding-edge Debian (unstable)
  \item has worked, for weak definitions of ``work'', on 4 different
    people's computers (not quite, but it sort of requires a
    \verb+/home/tony/+!)
  \end{itemize}
\end{frame}
\begin{frame}{Goals}
  Short term:
  \begin{itemize}
  \item better integration of data structures with statistical routines
    (auto-handling with dataframes, rather than manual parsing)
  \item dataframe to model-matrix tools (leveraging the old XLispStat
    GEE package)
  \end{itemize}
  Medium/long term:
  \begin{itemize}
  \item support for other Common Lisps
  \item cleaner front-end API to matrices and numerical algorithms
  \item a constraint system for developing different statistical
    algorithms, as well as for interactive GUIs and graphics
  \item LispStat compatibility (object system in progress, GUI to do)
  \item integrated, invisible parallelization where more efficient
    (multi-core, threading, and user-space systems)
  \end{itemize}
\end{frame}
\begin{frame}[fragile]{What does NOT work?}
  Primarily, the reason that we are doing this:

  \textbf{Computable and Executable Statistics}

  but consider XML:
\begin{verbatim}
<car brand="honda" engine="4cyl">accord</car>
\end{verbatim}
  becomes
\begin{verbatim}
;; data follows keywords...
(car :brand 'honda :engine "4cyl" accord)
\end{verbatim}
\end{frame}
\section{Discussion}

\begin{frame}{Why use Common Lisp?}
  \begin{itemize}
  \item Parens provide clear delineation of a \textbf{Complete
      Thought} (functional programming with side effects).
  \item Lisp-2 (a symbol can name a function and a variable,
    separately); see the illustration on the next slide.
  \item ANSI standard (built by committee, but the committee was
    reasonably smart).
  \item Many implementations.
  \item Most implementations are interactive \textbf{compiled}
    languages (few are interpreted; nearly all are at least
    byte-compiled).
  \item The original \emph{Programming with Data} language
    (\emph{programs are data} and \emph{data are executable} apply).
  \item Advanced, powerful, first-class macros (macros functionally
    re-write code, allowing for structural clarity and complete
    destruction of syntax, should that be reasonable).
  \end{itemize}
\end{frame}
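\begin{frame}[fragile]{Two of those points, illustrated}
  A minimal illustration, in standard Common Lisp only (nothing
  CLS-specific), of the Lisp-2 and ``programs are data'' points:
\begin{verbatim}
;; Lisp-2: SIZE names a function and, in the LET,
;; a variable at the same time.
(defun size (x) (length x))
(let ((size '(1 2 3)))
  (size size))            ; => 3

;; programs are data: a form is just a list,
;; which we can inspect and then execute.
(first '(+ 1 (* 2 3)))    ; => +
(eval  '(+ 1 (* 2 3)))    ; => 7
\end{verbatim}
\end{frame}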
\begin{frame}{Available Common Lisp Packages}
  (They are packages, and are called packages, not libraries.  Some
  people can rejoice!)
  \begin{itemize}
  \item infrastructure \emph{enhancements}: infix notation, data
    structures, control and flow structures
  \item numerics, graphics, GUIs
  \item a primitive R-to-CL compiler (which could also be considered an
    object-code compiler for R); 3 interfaces which embed R within CL
  \item Web 2.0 support and TeX-like reporting facilities for PDF
    output
  \end{itemize}
  See \url{http://www.common-lisp.net/} and
  \url{http://www.cliki.org/}.  CLS sources can be found at
  \url{http://github.com/blindglobe/}.
\end{frame}
\begin{frame}{Conclusion}
  This slowly developing research program aims at a statistical
  computing system which enables sophisticated statistical research
  that can be readily transferred to applications, and which is
  supportable.

  Related numerical/statistical projects:
  \begin{itemize}
  \item Incanter: an R/LispStat/Omegahat-like system for Clojure (Lisp
    on the JVM)
  \item FEMLisp: a system/workshop for finite-element analysis
    modeling using Lisp
  \item matlisp/LispLab: LAPACK-based numerical linear algebra packages
  \item GSLL: a Lisp interface to the GNU Scientific Library
  \item RCL, RCLG, CLSR (embedding R within Common Lisp)
  \end{itemize}

  Bill Gates, 2004: ``the next great innovation will be data
  integration''.  Perhaps this will be followed by statistics.
\end{frame}
\begin{frame}{What can you do to follow up?}
  \begin{itemize}
  \item Read, as an introduction to Common Lisp: Paul Graham's
    \emph{ANSI Common Lisp} (an enjoyable book with a boring title,
    and the best intro to S4 classes around), and \emph{Practical
      Common Lisp} by Peter Seibel.
  \item Consider: how could a computing environment better support
    features in the research you do (event-time data, design,
    longitudinal data modeling, missing and coarsened data, multiple
    comparisons, feature selection)?
  \end{itemize}
  The next stage of reproducible research will require computable
  statistics (code that explains itself and can be parsed to generate
  knowledge about its claims; ``XML's promise'').
\end{frame}

\end{document}