\setbeamercovered{transparent}
\usepackage[english]{babel}
\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\title[CLS]{Common Lisp Statistics}
\subtitle{Using History to design better data analysis environments}
\author[Rossini]{Anthony~(Tony)~Rossini}
\institute[Novartis and University of Washington] % (optional, but mostly needed)
{
  Group Head, Modeling and Simulation\\
  Novartis Pharma AG, Switzerland
  \and
  Affiliate Assoc Prof, Biomedical and Health Informatics\\
  University of Washington, USA
}
\date[Rice 09]{Rice, Mar 2009}
\subject{Statistical Computing Environments}
\begin{frame}{Outline}
\section{Preliminaries}
\begin{frame}{Goals for this Talk}{(define, strategic approach,
  \item To describe the concept of \alert{computable and executable
    statistics}, placing it in a historical context.
  \item To demonstrate that \alert{a research program}
    implemented through simple steps can increase the efficiency of
    statistical computing approaches by clearly describing both:
  \item numerical characteristics of procedures,
  \item statistical concepts driving them.
  \item To justify that the \alert{approach is worthwhile} and
    represents a staged effort towards \alert{increased use of best
  (unfortunately, the last is still incomplete)
\begin{frame}{Historical Computing Languages}
  \item FORTRAN: FORmula TRANslator. Original numerical computing
    language, designed for clean implementation of numerical
  \item LISP: LISt Processor. Associated with symbolic
    manipulation, AI, and knowledge approaches
  They represent the two general needs of statistical computing,
  which could be summarized as
  \item algorithms/numerics,
  \item elicitation, communication, and generation of knowledge (``data
\begin{frame}{Statistical Computing Environments}
  \item SPSS / BMDP / SAS
  \item S (S, S-PLUS, R)
  \item LispStat (XLispStat, ViSta, ARC, CommonLispStat); QUAIL
  \item XGobi (Orca / GGobi / Statistical Reality Engine)
  \item Augsburg Impressionist series (MANET,
\begin{frame}{How many are left?}
  \item very few others...
  ``R is the Microsoft of the statistical computing world'' -- anonymous.
\begin{frame}{Selection Pressure}
  \item the R user population is growing rapidly, fueled by critical
    mass, quality, and value
  \item R is a great system for applied data analysis
  \item R is not such a great system for research into statistical
    computing (backwards compatibility, inertia due to user population)
  There is a need for alternative experiments for developing new
  approaches/ideas/concepts.
\begin{frame}{Philosophically, why Common Lisp?}
  \item Lisp can cleanly present computational intentions, both
    symbolically and numerically.
  \item Semantics and context are important: well supported by Lisp
  \item Lisp's parentheses describe singular, multi-scale,
    \alert{complete thoughts}.
\begin{frame}{Technically, why Common Lisp?}
  \item interactive COMPILED language (``R with a compiler'')
  \item CLOS is R's S4 object system ``done right''.
  \item clean semantics: modality, typing, can be expressed the way
  \item programs are data, data are programs, leading to
  \item Most modern computing tools available (XML, WWW technologies)
  \item ``executable XML''
  Common Lisp is very close in usage to how people currently use R
  (mostly interactive, some batch, and a wish for compilation efficiency).
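\begin{frame}[fragile]{Sketch: generic functions in CLOS}
  A minimal sketch of the CLOS comparison, not taken from
  CommonLispStat. The classes \texttt{model} and
  \texttt{linear-model} and the generic function
  \texttt{describe-model} are hypothetical names chosen for
  illustration.
\begin{verbatim}
;; Hypothetical classes; not part of CommonLispStat.
(defclass model ()
  ((name :initarg :name :reader model-name)))

(defclass linear-model (model)
  ((coefficients :initarg :coefficients
                 :reader coefficients)))

;; A generic function dispatches on the class of its
;; argument, much like an S4 generic in R.
(defgeneric describe-model (m)
  (:documentation "One-line description of model M."))

(defmethod describe-model ((m model))
  (format nil "model ~A" (model-name m)))

(defmethod describe-model ((m linear-model))
  ;; CALL-NEXT-METHOD reuses the parent method's result.
  (format nil "~A with ~D coefficients"
          (call-next-method)
          (length (coefficients m))))
\end{verbatim}
  Calling \texttt{describe-model} on a \texttt{linear-model} named
  \texttt{fit1} with two coefficients should give
  \texttt{"model fit1 with 2 coefficients"}.
\end{frame}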
\subsection{Background}
\frametitle{Desire: Semantics and Statistics}
  \item The semantic web (content which is self-descriptive) is an
    interesting and potentially useful idea.
  Biological informatics support (GO, Entrez) has allowed for
  precise definitions of concepts in biology.
  \item It is a shame that a field like statistics, requiring such
    precision, has less such support than an imprecise and temporally
    unstable field such as biology\ldots
  How can we express statistical work (research, applied work) in a
  form that is both human and computer readable (perhaps subject to
  transformations first)?
% \subsection{Context}
% \begin{frame}{Context}{(where I'm coming from, my ``priors'')}
% \item Pharmaceutical Industry
% \item Modeling and Simulation uses mathematical models/constructs to
% record beliefs (biology, pharmacology, clinical science) for
% explication, clinical team alignment, decision support, and
% \item My work at Novartis is at the intersection of biomedical
% informatics, statistics, and mathematical modeling.
% \item As manager: I need a mix of applications and novel research development to
% solve our challenges better, faster, more efficiently.
% \item Data analysis is a specialized approach to computer
% programming, \alert{different} than applications programming or
% systems programming.
\section{Computable and Executable Statistics}
\begin{frame}{Can we compute with them?}
  \item Reimplementation
  Consider whether one can ``compute'' with the information given?
\begin{frame}[fragile]{Example 1: Theory\ldots}
  Let $f(x;\theta)$ describe the likelihood of XX under the following
  Then if we use the following algorithm:
  then $\hat{\theta}$ should be $N(0,\hat\sigma^2)$ with the following
  characteristics\ldots
\frametitle{Can we compute, using this description?}
  Given the information at hand:
  \item we ought to have a framework for initial coding for the
    actual simulations (test-first!)
  \item the implementation is somewhat clear
  \item We should ask: what theorems have similar assumptions?
  \item We should ask: what theorems have similar conclusions but
    different assumptions?
\begin{frame}[fragile]{Realizing Theory}
\begin{verbatim}
(define-theorem my-proposed-theorem
  (:theorem-type '(distribution-properties
  (:assumes '(assumption-1 assumption-2))
  (defun likelihood (data theta gamma)
    (exponential-family theta gamma)))
  (compute-starting-values thetahat gammahat)
  (or (step-1 thetahat)
      (step-2 gammahat))))))
  (and (equal-distribution thetahat 'normal)
       (equal-distribution gammahat 'normal)))))
\end{verbatim}
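\begin{frame}[fragile]{Sketch: theorems stored as data}
  One way such a form could be made computable is to store each
  theorem's clauses as plain data that other programs can query. This
  is only a sketch under that assumption: the registry
  \texttt{*theorems*}, the accessor \texttt{theorem-clause}, the
  \texttt{:concludes} clause, and this definition of
  \texttt{define-theorem} are all hypothetical and are not the
  CommonLispStat implementation.
\begin{verbatim}
;; Hypothetical registry: theorem name -> list of clauses.
(defvar *theorems* (make-hash-table :test #'eq))

(defmacro define-theorem (name &body clauses)
  "Record CLAUSES, unevaluated, under the symbol NAME."
  `(setf (gethash ',name *theorems*) ',clauses))

(defun theorem-clause (name key)
  "Contents of the clause of theorem NAME tagged KEY."
  (rest (assoc key (gethash name *theorems*))))

(define-theorem demo-theorem
  (:assumes assumption-1 assumption-2)
  (:concludes (equal-distribution thetahat normal)))

(theorem-clause 'demo-theorem :assumes)
;; => (ASSUMPTION-1 ASSUMPTION-2)
\end{verbatim}
  With the clauses available as lists, questions such as ``which
  theorems share an assumption?'' reduce to ordinary list processing.
\end{frame}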
\begin{frame}[fragile]{It would be nice to have}
\begin{verbatim}
(theorem-veracity 'my-proposed-theorem)
\end{verbatim}
\begin{frame}[fragile]{and why not...?}
\begin{verbatim}
(when (theorem-veracity
        'my-proposed-theorem)
  (write-paper 'my-proposed-theorem
\end{verbatim}
\begin{frame}{Comments}
  \item The general problem is very difficult
  \item Some progress has been made in small areas of basic
    statistics: currently working on linear regression (LS-based,
    Normal-Bayesian) and the t-test.
  \item Areas targeted for medium-term future: resampling methods and
\frametitle{Example 2: Practice\ldots}
  The dataset comes from a series of clinical trials. We model the
  primary endpoint, ``relief'', as a binary random variable. There is
  a random trial effect on relief as well as severity due to
  differences in recruitment and inclusion/exclusion criteria.
\frametitle{Can we compute, using this description?}
  \item With a real description of this kind, it is clear what some of the
    potential models might be for this dataset
  \item It should be clear how to start thinking of a data dictionary
\begin{frame}[fragile]{Can we compute?}
\begin{verbatim}
(dataset-metadata paper-1
  :context 'clinical-trials
  :variables '((relief :model-type dependent
                       :distribution binary)
               (trial :model-type independent
                      :distribution categorical)
  :metadata '(inclusion-criteria
(propose-analysis paper-1)
; (logistic regression))
\end{verbatim}
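\begin{frame}[fragile]{Sketch: proposing an analysis from metadata}
  A minimal sketch of how variable metadata could drive a proposal.
  The store \texttt{*datasets*}, the functions
  \texttt{register-dataset} and \texttt{dependent-distribution}, and
  the single binary-to-logistic rule are assumptions made for
  illustration. \texttt{register-dataset} stands in for the slide's
  \texttt{dataset-metadata} so that the dataset name can be passed as
  a quoted symbol.
\begin{verbatim}
;; Hypothetical metadata store: dataset name -> plist.
(defvar *datasets* (make-hash-table :test #'eq))

(defun register-dataset (name &key context variables)
  (setf (gethash name *datasets*)
        (list :context context :variables variables)))

(defun dependent-distribution (name)
  "Distribution of the first dependent variable of NAME."
  (let* ((vars (getf (gethash name *datasets*) :variables))
         (dep (find 'dependent vars
                    :key (lambda (v)
                           (getf (rest v) :model-type)))))
    (getf (rest dep) :distribution)))

(defun propose-analysis (name)
  "Map the dependent distribution to a candidate model."
  (case (dependent-distribution name)
    (binary '(logistic regression))
    (t '(no proposal))))

(register-dataset 'paper-1
  :context 'clinical-trials
  :variables '((relief :model-type dependent
                       :distribution binary)
               (trial :model-type independent
                      :distribution categorical)))

(propose-analysis 'paper-1)
;; => (LOGISTIC REGRESSION)
\end{verbatim}
\end{frame}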
\begin{frame}{Example 3: The Round-trip\ldots}
  The first examples describe ``ideas $\rightarrow$ code''
  Consider the last time you read someone else's implementation of a
  statistical procedure (e.g., R package code). When you read the
  \item the assumptions used?
  \item the algorithm implemented?
  \item practical guidance for when you might select the algorithm
  \item practical guidance for when you might select the
    implementation over others?
  These are usually components of any reasonable journal article.
  \textit{(Q: have you actually read an R package that wasn't yours?)}
\begin{frame}{Exercise left to the reader!}
  (aside: I have been looking at the \textbf{stats} and \textbf{lme4}
  packages recently -- \textit{for me}, very clear numerically, much
  less so statistically)
\subsection{Literate Programming is insufficient}
\begin{frame}{Literate Statistical Practice.}
  \item Literate Programming applied to data analysis (Rossini, 1997/2001)
  \item among the \alert{most annoying} techniques to integrate into
    work-flow if one is not perfectly methodical.
  \item ESS: supports interactive creation of literate programs.
  \item Sweave: tool which exemplifies reporting context; odfWeave
    primarily simplifies reporting.
  \item Roxygen: primarily supports a literate programming
    documentation style, not a literate data analysis programming
  \item ROI demonstrated in specialized cases: BioConductor.
  \item \alert{usually done after the fact} (final step of work-flow)
    as a documentation/computational reproducibility technique, rarely
    integrated into work-flow.
  Knuth, Claerbout, Carey, de Leeuw, Leisch, Gentleman, Temple-Lang,
\frametitle{Literate Programming}
\framesubtitle{Why isn't it enough for Data Analysis?}
  Only two contexts: (executable) code and documentation. Fine for
  application programming, but for data analysis, we could benefit
  \item classification of statistical procedures
  \item descriptions of assumptions
  \item pragmatic recommendations
  \item inheritance of structure through the work-flow of a
    statistical methodology or data analysis project
  \item datasets and metadata
  Concept: ontologies describing mathematical assumptions, applications
  of methods, work-flow, and statistical data structures can enable
  machine communication.
  (i.e., an informatics framework \`a la biology)
\begin{frame}{Communication in Statistical Practice}{\ldots is essential for\ldots}
  \item receiving information
  \alert{``machine-readable'' communication/computation lets the
  Semantic Web is about ``machine-enabled computability''.
\begin{frame} \frametitle{Semantics}
\framesubtitle{One definition: description and context}
  Interoperability is the key, with respect to
  \item ``Finding things''
  \item Applications and activities with related functionality
  \item moving information from one state to another (paper, journal
    article, computer program)
  \item computer programs which implement solutions to similar tasks
\begin{frame}{Statistical Practice is somewhat restricted}
  {...but in a good sense, enabling potential for semantics...}
  There is a restrictable set of intended actions for what can be done
  -- the critical goal is to be able to make a difference by
  accelerating activities that should be ``computable'':
  \item restricted natural language processing
  \item mathematical translation
  \item common description of activities for simpler programming/data
    analysis (S approach to objects and methods)
  R is a good basic start (model formulation approach, simple
  ``programming with data'' paradigm); we should see if we can do
\begin{frame}{Computable and Executable Statistics requires}
  \item approaches to describe data and metadata (``data'')
  \item metadata management and integration, driving
  \item data integration
  \item approaches to describe data analysis methods (``models'')
  \item quantitatively: many ontologies (AMS, etc), few meeting
  \item many substantive fields have implementations
    (bioinformatics, etc) but not well focused.
  \item approaches to describe the specific form of interaction
    (``instances of models'')
  \item Original idea behind ``Literate Statistical Analysis''.
  \item That idea is suboptimal, more structure needed (not
    necessarily built upon existing...).
\subsection{Common Lisp Statistics}
\frametitle{Interactive Programming}
\framesubtitle{Everything goes back to being Lisp-like}
  \item Interactive programming (as originating with Lisp): works
    extremely well for data analysis (Lisp being the original
    ``programming with data'' language).
  \item Theories/methods for how to do this are reflected in styles
\begin{frame}[fragile]
  Lisp (LISt Processor) is different from most high-level computing
  languages, and is very old (1958). Lisp is built on lists of things
  which can be evaluated.
\begin{verbatim}
(functionName data1 data2 data3)
\end{verbatim}
\begin{verbatim}
'(functionName data1 data2 data3)
\end{verbatim}
  which is shorthand for
\begin{verbatim}
(quote (functionName data1 data2 data3))
\end{verbatim}
  The difference is important -- lists of data (the second/third) are
  not (yet?!) functions applied to (unencapsulated lists of) data (the first).
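\begin{frame}[fragile]{Sketch: code versus data}
  The distinction between an evaluated form and a quoted list can be
  tried directly at a Common Lisp prompt. The example below uses only
  standard operators; the choice of \texttt{+} with small integers is
  purely illustrative.
\begin{verbatim}
;; An unquoted form is evaluated: + is applied to 1 and 2.
(+ 1 2)                ; => 3

;; A quoted form is data: a list of three elements.
'(+ 1 2)               ; => (+ 1 2)
(first '(+ 1 2))       ; => +

;; Data built at run time can later be evaluated as code.
(eval (list '+ 1 2))   ; => 3
\end{verbatim}
  The last line is the ``not yet'' of the previous slide: a list of
  data becomes a function application only when it is evaluated.
\end{frame}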
\frametitle{Features}
  \item Data and Functions semantically the same
  \item Natural interactive use through functional programming with
  \item Batch is a simplification of interactive -- not a special mode!
\begin{frame}[fragile]{Representation: XML and Lisp}{executing your data}
  Many people are familiar with XML:
\begin{verbatim}
<name phone="+41793674557">Tony Rossini</name>
\end{verbatim}
  which is shorter in Lisp:
\begin{verbatim}
(name "Tony Rossini" :phone "+41613674557")
\end{verbatim}
  \item Lisp ``parens'', universally hated by unbelievers, are
    wonderful for denoting when a ``concept is complete''.
  \item Why can't your data self-execute?
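\begin{frame}[fragile]{Sketch: data that executes}
  A record written as an S-expression becomes executable as soon as
  its tag names a function. The definition of \texttt{name} below is
  a hypothetical illustration; returning a property list is just one
  possible choice of representation.
\begin{verbatim}
;; Hypothetical: make the record's tag, NAME, a function,
;; so evaluating the record yields a property list.
(defun name (full-name &key phone)
  (list :name full-name :phone phone))

(name "Tony Rossini" :phone "+41613674557")
;; => (:NAME "Tony Rossini" :PHONE "+41613674557")
\end{verbatim}
  Redefining \texttt{name} changes what every such record ``does''
  without touching the data itself.
\end{frame}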
\begin{frame}[fragile]{Numerics with Lisp}
  \item addition of rational numbers and arithmetic
  \item example for mean
\begin{verbatim}
(defun mean (x)
  (check-type x vector-like)
  (/ (loop for i from 0 to (- (nelts x) 1)
           summing (vref x i))
     (nelts x)))
\end{verbatim}
  \item example for variance
\begin{verbatim}
(defun variance (x)
  (let ((meanx (mean x))
        (nm1 (1- (nelts x))))
    (/ (loop for i from 0 to nm1
             summing (power (- (vref x i) meanx) 2))
       nm1)))
\end{verbatim}
  \item But through macros, \verb+(vref x i)+ could be
    \verb+#V(X[i])+ or your favorite syntax.
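\begin{frame}[fragile]{Sketch: mean and variance in standard Common Lisp}
  The same two computations can be sketched with standard operators
  only, so they run in any Common Lisp without the vector-like
  interface. This is an illustrative variant, not the CommonLispStat
  code: it assumes plain vectors, uses \texttt{length},
  \texttt{reduce}, and \texttt{expt}, and takes the variance to be
  the sample variance with an $n-1$ denominator.
\begin{verbatim}
(defun mean (x)
  "Arithmetic mean of the vector X."
  (check-type x vector)
  (/ (reduce #'+ x) (length x)))

(defun variance (x)
  "Sample variance of X, using the n - 1 denominator."
  (let ((meanx (mean x))
        (nm1 (1- (length x))))
    (/ (loop for xi across x
             summing (expt (- xi meanx) 2))
       nm1)))

(mean #(1 2 3 4))      ; => 5/2
(variance #(1 2 3 4))  ; => 5/3
\end{verbatim}
  With integer input the results are exact rationals, which is the
  rational arithmetic mentioned on the previous slide.
\end{frame}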
\begin{frame}{Common Lisp Statistics 1}
  \item Originally based on LispStat (reusability)
  \item Re-factored structure (some numerics worked with a 1990-era code base).
  \item Current activities:
  \item numerics redone using CFFI-based BLAS/LAPACK (cl-blapack)
  \item matrix interface based on MatLisp
  \item starting design of a user interface system (interfaces,
  \item general framework for model specification (regression,
  \item general framework for algorithm specification (bootstrap,
    MLE, algorithmic data analysis methods).
\begin{frame}{Common Lisp Statistics 2}
  \item Implemented using SBCL. Contributed fixes for
    Clozure/OpenMCL. Goal to target CLISP
  \item Supports LispStat prototype object system
  \item Package-based design -- only use the components you need, or
    the components whose API you like.
  \item Semantics and Computability have captured a great deal of
    attention in the informatics and business computing R\&D worlds
  \item Statistically-driven Decision Making and Knowledge Discovery
    is, with high likelihood, the next challenging stage after data
  \item Statistical practice (theory and application) can be enhanced,
    made more efficient, providing increased benefit to organizations
    and groups using appropriate methods.
  \item Lisp, as a language, shares characteristics of both Latin
    (difficult dead language useful for classical training) and German
    (difficult living language useful for general life). Of course,
    some people find neither difficult.
  The research program described in this talk is currently driving the
  design of CommonLispStat, which leverages concepts and approaches
  from the dead and moribund LispStat project.
  \item \url{http://repo.or.cz/w/CommonLispStat.git/}
  \item \url{http://www.github.com/blindglobe/}
\begin{frame}{Final Comment}
  \item In the Pharma industry, it is all about getting the right
    drugs to the patient faster. Data analysis systems seriously
    impact this process, being potentially an impediment or an
  \item \alert{Information technologies can increase the efficiency
    of statistical practice}, though innovation change management
    must be taken into account. (i.e., statistical practice, while
    considered by some an ``art form'', can benefit from
  \item \alert{Lisp's features match the basic requirements we need}
    (dichotomy: programs as data, data as programs). Sales pitch,
  \item Outlook: Lots of work and experimentation to do!
  \item {\tiny Gratuitous Advert: We are hiring, have student
    internships (undergrad, grad students), and a visiting faculty
    program. Talk with me if possibly interested.}
% \begin{frame}{Explicating the Work-flow}{QA/QC-based improvements.}
% \section{Motivation}
% \subsection{IT Can Speed up Deliverables in Statistical Practice}
% \begin{frame}{Our Generic Work-flow and Life-cycle}
% {describing most data analytic activities}
% \item Scope out the problem
% \item Sketch out a potential solution
% \item Implement until road-blocks appear
% \item Deliver results
% \item 1st e-draft of text/code/date (iterate to \#1, discarding)
% \item cycle through work
% \item ``throw-away''
% but there is valuable information that could enable the next
% \begin{frame}[fragile]{Paper $\rightarrow$ Computer $\rightarrow$ Article $\rightarrow$ Computer}{Cut and Paste makes for large errors.}
% \item Problems in a regulatory setting
% \item Regulatory issues are just ``best practices''
% Why do we ``copy/paste'', or analogously, restart our work?
% \item every time we repeat, we reinforce the idea in our brain
% \item review of ideas can help improve them
% \item introduction of mistakes
% \item loss of historical context
% \item changes to earlier work (on a different development branch)
% \section{Semantics and Statistical Practice}
% \frametitle{Statistical Activity Leads to Reports}
% \framesubtitle{You read what you know, do you understand it?}
% How can we improve the communication of the ideas we have?
% Precision of communication?
% \begin{frame} \frametitle{Communication Requires Context}
% \framesubtitle{Intentions imply more than one might like...}
% \item Consideration of what we might do
% \item Applications with related functionality
% \frametitle{Design Patterns}
% \framesubtitle{Supporting Work-flow Transitions}
% (joint work with H Wickham): The point of this research program is
% not to describe what to do at any particular stage of work, but to
% encourage researchers and practitioners to consider how the
% translation and transfer of information between stages so that work
% Examples of stages in a work-flow:
% \item planning, execution, reporting;
% \item scoping, illustrative examples or counter examples, algorithmic construction,
% \item descriptive statistics, preliminary inferential analysis,
% model/assumption checking, final inferential analysis,
% communication of scientific results
% Description of work-flows is essential to initiating discussions on
% quality/efficiency of approaches to work.
% \section{Design Challenges}
% \frametitle{Activities are enhanced by support}
% \item Mathematical manipulation can be enhanced by symbolic
% \item Statistical programming can be enabled by examples and related
% algorithm implementation
% \item Datasets, to a limited extent, can self-describe.
% \frametitle{Executable and Computable Science}
% Use of algorithms and construction to describe how things work.
% Support for agent-based approaches
% \frametitle{What is Data? Metadata?}
% Data: what we've observed
% MetaData: context for observations, enables semantics.