6 \setbeamercovered{transparent}
9 \usepackage[english]{babel}
10 \usepackage[latin1]{inputenc}
12 \usepackage[T1]{fontenc}
14 \title[CLS]{Common Lisp Statistics}
15 \subtitle{Using History to design better data analysis environments}
16 \author[Rossini]{Anthony~(Tony)~Rossini}
18 \institute[Novartis and University of Washington] % (optional, but mostly needed)
20 Group Head, Modeling and Simulation\\
21 Novartis Pharma AG, Switzerland
23 Affiliate Assoc Prof, Biomedical and Health Informatics\\
24 University of Washington, USA}
26 \date[Rice 09]{Rice, Mar 2009}
27 \subject{Statistical Computing Environments}
35 \begin{frame}{Outline}
56 \section{Motivation for CLS}
65 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
68 \section{Preliminaries}
72 \begin{frame}{Goals for this Talk}{(define, strategic approach,
76 \item To describe the concept of \alert{computable and executable
77 statistics}, placing it in a historical context.
79 \item To demonstrate that \alert{a research program}
80 implemented through simple steps can increase the efficiency of
81 statistical computing approaches by clearly describing both:
83 \item numerical characteristics of procedures,
84 \item statistical concepts driving them.
87 \item To justify that the \alert{approach is worthwhile} and
88 represents a staged effort towards \alert{increased use of best
91 (unfortunately, the last is still incomplete)
95 \begin{frame}{Historical Computing Languages}
\item FORTRAN: FORmula TRANslator. Original numerical computing
98 language, designed for clean implementation of numerical
\item LISP: LISt Processor. Associated with symbolic
101 manipulation, AI, and knowledge approaches
They represent the two general needs of statistical computing,
which could be summarized as
107 \item algorithms/numerics,
108 \item elicitation, communication, and generation of knowledge (``data
113 \begin{frame}{Statistical Computing Environments}
117 \item SPSS / BMDP / SAS
\item S (S, S-PLUS, R)
\item LispStat (XLispStat, ViSta, ARC, CommonLispStat); QUAIL
120 \item XGobi (Orca / GGobi / Statistical Reality Engine)
124 \item Augsburg Impressionist series (MANET,
131 \begin{frame}{How many are left?}
139 \item very few others...
141 ``R is the Microsoft of the statistical computing world'' -- anonymous.
144 \begin{frame}{Selection Pressure}
146 \item the R user population is growing rapidly, fueled by critical
147 mass, quality, and value
148 \item R is a great system for applied data analysis
149 \item R is not such a great system for research into statistical
150 computing (backwards compatibility, inertia due to user population)
152 There is a need for alternative experiments for developing new
153 approaches/ideas/concepts.
156 \begin{frame}{Philosophically, why Common Lisp?}
159 \item Lisp can cleanly present computational intentions, both
160 symbolically and numerically.
161 \item Semantics and context are important: well supported by Lisp
163 \item Lisp's parentheses describe singular, multi-scale,
164 \alert{complete thoughts}.
169 \begin{frame}{Technically, why Common Lisp?}
171 \item interactive COMPILED language (``R with a compiler'')
172 \item CLOS is R's S4 object system ``done right''.
173 \item clean semantics: modality, typing, can be expressed the way
175 \item programs are data, data are programs, leading to
176 \item Most modern computing tools available (XML, WWW technologies)
177 \item ``executable XML''
179 Common Lisp is very close in usage to how people currently use R
180 (mostly interactive, some batch, and a wish for compilation efficiency).
183 \subsection{Background}
186 \frametitle{Desire: Semantics and Statistics}
188 \item The semantic web (content which is self-descriptive) is an
189 interesting and potentially useful idea.
192 Biological informatics support (GO, Entrez) has allowed for
193 precise definitions of concepts in biology.
\item It is a shame that a field like statistics, which requires such
precision, has less of this support than an imprecise and temporally
unstable field such as biology\ldots
200 How can we express statistical work (research, applied work) which
201 is both human and computer readable (perhaps subject to
202 transformations first)?
206 % \subsection{Context}
208 % \begin{frame}{Context}{(where I'm coming from, my ``priors'')}
210 % \item Pharmaceutical Industry
211 % \item Modeling and Simulation uses mathematical models/constructs to
212 % record beliefs (biology, pharmacology, clinical science) for
213 % explication, clinical team alignment, decision support, and
215 % \item My work at Novartis is at the intersection of biomedical
216 % informatics, statistics, and mathematical modeling.
217 % \item As manager: I need a mix of applications and novel research development to
218 % solve our challenges better, faster, more efficiently.
219 % \item Data analysis is a specialized approach to computer
220 % programming, \alert{different} than applications programming or
221 % systems programming.
225 \section{Computable and Executable Statistics}
227 \begin{frame}{Can we compute with them?}
232 \item Reimplementation
234 Consider whether one can ``compute'' with the information given?
237 \begin{frame}[fragile]{Example 1: Theory\ldots}
239 Let $f(x;\theta)$ describe the likelihood of XX under the following
245 Then if we use the following algorithm:
250 then $\hat{\theta}$ should be $N(0,\hat\sigma^2)$ with the following
251 characteristics\ldots
255 \frametitle{Can we compute, using this description?}
256 Given the information at hand:
258 \item we ought to have a framework for initial coding for the
259 actual simulations (test-first!)
260 \item the implementation is somewhat clear
261 \item We should ask: what theorems have similar assumptions?
262 \item We should ask: what theorems have similar conclusions but
263 different assumptions?
267 \begin{frame}[fragile]{Realizing Theory}
270 (define-theorem my-proposed-theorem
271 (:theorem-type '(distribution-properties
274 (:assumes '(assumption-1 assumption-2))
276 (defun likelihood (data theta gamma)
277 (exponential-family theta gamma)))
280 (compute-starting-values thetahat gammahat)
283 (or (step-1 thetahat)
284 (step-2 gammahat))))))
286 (and (equal-distribution thetahat 'normal)
287 (equal-distribution gammahat 'normal)))))
292 \begin{frame}[fragile]{It would be nice to have}
294 (theorem-veracity 'my-proposed-theorem)
298 \begin{frame}[fragile]{and why not...?}
300 (when (theorem-veracity
301 'my-proposed-theorem)
302 (write-paper 'my-proposed-theorem
309 \begin{frame}{Comments}
311 \item The general problem is very difficult
312 \item Some progress has been made in small areas of basic
313 statistics: currently working on linear regression (LS-based,
Normal-Bayesian) and the $t$-test.
\item Areas targeted for medium-term future: resampling methods and
322 \frametitle{Example 2: Practice\ldots}
324 The dataset comes from a series of clinical trials. We model the
325 primary endpoint, ``relief'', as a binary random variable. There is
326 a random trial effect on relief as well as severity due to
327 differences in recruitment and inclusion/exclusion criteria.
331 \frametitle{Can we compute, using this description?}
333 \item With a real such description, it is clear what some of the
334 potential models might be for this dataset
335 \item It should be clear how to start thinking of a data dictionary
340 \begin{frame}[fragile]{Can we compute?}
342 (dataset-metadata paper-1
343 :context 'clinical-trials
344 :variables '((relief :model-type dependent
345 :distribution binary)
346 (trial :model-type independent
347 :distribution categorical)
349 :metadata '(inclusion-criteria
352 (propose-analysis paper-1)
354 ; (logistic regression))
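
One way such a round trip could work is sketched below. This is a toy illustration only: \verb|*datasets*|, \verb|propose-from|, and the macro definitions are hypothetical and are not part of CLS. The single rule shown (binary dependent variable suggests logistic regression) is an assumption chosen to match the example.

\begin{verbatim}
;; Hypothetical sketch: none of these definitions exist in CLS.
(defvar *datasets* (make-hash-table))

(defmacro dataset-metadata (name &rest plist)
  ;; Store the metadata plist under the (unevaluated) name.
  `(setf (gethash ',name *datasets*) (list ,@plist)))

(defun propose-from (meta)
  ;; Find the dependent variable, then pick by distribution.
  (let* ((vars (getf meta :variables))
         (dep (find 'dependent vars
                    :key (lambda (v)
                           (getf (rest v) :model-type)))))
    (case (getf (rest dep) :distribution)
      (binary '(logistic regression))
      (t '(no proposal yet)))))

(defmacro propose-analysis (name)
  `(propose-from (gethash ',name *datasets*)))

(dataset-metadata paper-1
  :context 'clinical-trials
  :variables '((relief :model-type dependent
                       :distribution binary)
               (trial :model-type independent
                      :distribution categorical)))

(propose-analysis paper-1)  ; => (LOGISTIC REGRESSION)
\end{verbatim}

A real system would need far richer rules (random effects, repeated measures); the point is only that the metadata is already in a form a program can inspect.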
358 \begin{frame}{Example 3: The Round-trip\ldots}
360 The first examples describe ``ideas $\rightarrow$ code''
362 Consider the last time you read someone else's implementation of a
statistical procedure (e.g., R package code). When you read the
366 \item the assumptions used?
367 \item the algorithm implemented?
368 \item practical guidance for when you might select the algorithm
370 \item practical guidance for when you might select the
371 implementation over others?
373 These are usually components of any reasonable journal article.
374 \textit{(Q: have you actually read an R package that wasn't yours?)}
377 \begin{frame}{Exercise left to the reader!}
379 (aside: I have been looking at the \textbf{stats} and \textbf{lme4}
380 packages recently -- \textit{for me}, very clear numerically, much
381 less so statistically)
386 \subsection{Literate Programming is insufficient}
388 \begin{frame}{Literate Statistical Practice.}
390 \item Literate Programming applied to data analysis (Rossini, 1997/2001)
391 \item among the \alert{most annoying} techniques to integrate into
work-flow if one is not perfectly methodical.
395 \item ESS: supports interactive creation of literate programs.
396 \item Sweave: tool which exemplifies reporting context; odfWeave
397 primarily simplifies reporting.
398 \item Roxygen: primarily supports a literate programming
399 documentation style, not a literate data analysis programming
402 \item ROI demonstrated in specialized cases: BioConductor.
403 \item \alert{usually done after the fact} (final step of work-flow)
404 as a documentation/computational reproducibility technique, rarely
405 integrated into work-flow.
408 Knuth, Claerbout, Carey, de Leeuw, Leisch, Gentleman, Temple-Lang,
413 \frametitle{Literate Programming}
414 \framesubtitle{Why isn't it enough for Data Analysis?}
416 Only 2 contexts: (executable) code and documentation. Fine for
417 application programming, but for data analysis, we could benefit
420 \item classification of statistical procedures
421 \item descriptions of assumptions
422 \item pragmatic recommendations
423 \item inheritance of structure through the work-flow of a
424 statistical methodology or data analysis project
425 \item datasets and metadata
427 Concept: ontologies describing mathematical assumptions, applications
428 of methods, work-flow, and statistical data structures can enable
429 machine communication.
(i.e., an informatics framework \`a la biology)
435 \begin{frame}{Communication in Statistical Practice}{\ldots is essential for \ldots}
440 \item receiving information
442 \alert{``machine-readable'' communication/computation lets the
444 Semantic Web is about ``machine-enabled computability''.
447 \begin{frame} \frametitle{Semantics}
448 \framesubtitle{One definition: description and context}
450 Interoperability is the key, with respect to
452 \item ``Finding things''
453 \item Applications and activities with related functionality
455 \item moving information from one state to another (paper, journal
456 article, computer program)
457 \item computer programs which implement solutions to similar tasks
463 \begin{frame}{Statistical Practice is somewhat restricted}
464 {...but in a good sense, enabling potential for semantics...}
466 There is a restrictable set of intended actions for what can be done
467 -- the critical goal is to be able to make a difference by
468 accelerating activities that should be ``computable'':
470 \item restricted natural language processing
471 \item mathematical translation
472 \item common description of activities for simpler programming/data
473 analysis (S approach to objects and methods)
475 R is a good basic start (model formulation approach, simple
476 ``programming with data'' paradigm); we should see if we can do
480 \begin{frame}{Computable and Executable Statistics requires}
483 \item approaches to describe data and metadata (``data'')
486 \item metadata management and integration, driving
487 \item data integration
489 \item approaches to describe data analysis methods (``models'')
491 \item quantitatively: many ontologies (AMS, etc), few meeting
493 \item many substantive fields have implementations
494 (bioinformatics, etc) but not well focused.
496 \item approaches to describe the specific form of interaction
497 (``instances of models'')
499 \item Original idea behind ``Literate Statistical Analysis''.
500 \item That idea is suboptimal, more structure needed (not
501 necessarily built upon existing...).
506 \subsection{Common Lisp Statistics}
509 \frametitle{Interactive Programming}
510 \framesubtitle{Everything goes back to being Lisp-like}
512 \item Interactive programming (as originating with Lisp): works
513 extremely well for data analysis (Lisp being the original
514 ``programming with data'' language).
515 \item Theories/methods for how to do this are reflected in styles
520 \begin{frame}[fragile]
Lisp (LISt Processor) is different from most high-level computing
languages, and is very old (1958). Lisp is built on lists of things
which can be evaluated.
527 (functionName data1 data2 data3)
531 '(functionName data1 data2 data3)
533 which is shorthand for
(quote (functionName data1 data2 data3))
537 The difference is important -- lists of data (the second/third) are
538 not (yet?!) functions applied to (unencapsulated lists of) data (the first).
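
A minimal sketch of this distinction at a Common Lisp prompt, with \verb|+| standing in for \verb|functionName| and the local names \verb|a|, \verb|b|, \verb|c| introduced only for illustration:

\begin{verbatim}
(let ((a 1) (b 2) (c 3))
  (list (+ a b c)          ; function applied to data
        '(+ a b c)         ; quoted: a list of symbols
        (list '+ a b c)))  ; list built from the values
;; => (6 (+ A B C) (+ 1 2 3))

(eval (list '+ 1 2 3))     ; data run as a program => 6
\end{verbatim}

The quoted form and the \verb|list| form are both data; only \verb|eval| (or the prompt itself) turns such a list into a function call.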
542 \frametitle{Features}
544 \item Data and Functions semantically the same
545 \item Natural interactive use through functional programming with
547 \item Batch is a simplification of interactive -- not a special mode!
553 \begin{frame}[fragile]{Representation: XML and Lisp}{executing your data}
554 Many people are familiar with XML:
556 <name phone="+41793674557">Tony Rossini</name>
558 which is shorter in Lisp:
560 (name "Tony Rossini" :phone "+41613674557")
563 \item Lisp ``parens'', universally hated by unbelievers, are
564 wonderful for denoting when a ``concept is complete''.
565 \item Why can't your data self-execute?
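
As a hedged sketch of what ``executing your data'' could mean, suppose a function called \verb|name| is defined. It is hypothetical, written only for this example, and not taken from any library. The record above is then both a list that can be inspected and a program that can be run:

\begin{verbatim}
;; Hypothetical constructor, defined only for this example.
(defun name (full-name &key phone)
  (list :name full-name :phone phone))

(defvar *record*
  '(name "Tony Rossini" :phone "+41613674557"))

(first *record*)   ; read as data     => NAME
(eval *record*)    ; run as a program
;; => (:NAME "Tony Rossini" :PHONE "+41613674557")
\end{verbatim}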
569 \begin{frame}[fragile]{Numerics with Lisp}
571 \item addition of rational numbers and arithmetic
572 \item example for mean
  (check-type x vector-like)
  (/ (loop for i from 0 to (- (nelts x) 1)
           summing (vref x i))
\item example for variance
  (let ((meanx (mean x))
        (nm1 (1- (nelts x))))
    (/ (loop for i from 0 to nm1
             summing (power (- (vref x i) meanx) 2)
\item But through macros, \verb+(vref x i)+ could be
\verb+#V(X[i])+ or your favorite syntax.
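
As a self-contained sketch of the two functions above, the following version substitutes the standard \verb|length| and \verb|aref| for \verb|nelts| and \verb|vref|, and \verb|expt| for \verb|power|. These substitutions are an assumption made so that the code runs in any Common Lisp without extra libraries. Because Lisp keeps exact rationals, integer input yields exact fractions rather than rounded floats.

\begin{verbatim}
;; Sketch only: AREF/LENGTH stand in for VREF/NELTS.
(defun mean (x)
  (check-type x vector)
  (/ (loop for i from 0 below (length x)
           summing (aref x i))
     (length x)))

(defun variance (x)
  ;; Sample variance: divides by n - 1.
  (let ((meanx (mean x))
        (nm1 (1- (length x))))
    (/ (loop for i from 0 to nm1
             summing (expt (- (aref x i) meanx) 2))
       nm1)))

(mean #(1 2 3 4))      ; => 5/2
(variance #(1 2 3 4))  ; => 5/3
\end{verbatim}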
596 \begin{frame}{Common Lisp Statistics 1}
598 \item Originally based on LispStat (reusability)
599 \item Re-factored structure (some numerics worked with a 1990-era code base).
600 \item Current activities:
\item numerics redone using CFFI-based BLAS/LAPACK (cl-blapack)
603 \item matrix interface based on MatLisp
604 \item starting design of a user interface system (interfaces,
606 \item general framework for model specification (regression,
608 \item general framework for algorithm specification (bootstrap,
MLE, algorithmic data analysis methods).
614 \begin{frame}{Common Lisp Statistics 2}
617 \item Implemented using SBCL. Contributed fixes for
618 Clozure/OpenMCL. Goal to target CLISP
619 \item Supports LispStat prototype object system
620 \item Package-based design -- only use the components you need, or
621 the components whose API you like.
630 \item Semantics and Computability have captured a great deal of
631 attention in the informatics and business computing R\&D worlds
632 \item Statistically-driven Decision Making and Knowledge Discovery
633 is, with high likelihood, the next challenging stage after data
635 \item Statistical practice (theory and application) can be enhanced,
636 made more efficient, providing increased benefit to organizations
637 and groups using appropriate methods.
\item Lisp, as a language, shares characteristics of both Latin
639 (difficult dead language useful for classical training) and German
640 (difficult living language useful for general life). Of course,
641 for some people, they are not difficult.
647 The research program described in this talk is currently driving the
648 design of CommonLisp Stat, which leverages concepts and approaches
649 from the dead and moribund LispStat project.
652 \item \url{http://repo.or.cz/w/CommonLispStat.git/}
653 \item \url{http://www.github.com/blindglobe/}
657 \begin{frame}{Final Comment}
660 \item In the Pharma industry, it is all about getting the right
661 drugs to the patient faster. Data analysis systems seriously
662 impact this process, being potentially an impediment or an
666 \item \alert{Information technologies can increase the efficiency
667 of statistical practice}, though innovation change management
must be taken into account. (i.e. Statistical practice, while
669 considered by some an ``art form'', can benefit from
671 \item \alert{Lisp's features match the basic requirements we need}
672 (dichotomy: programs as data, data as programs). Sales pitch,
674 \item Outlook: Lots of work and experimentation to do!
676 \item {\tiny Gratuitous Advert: We are hiring, have student
677 internships (undergrad, grad students), and a visiting faculty
678 program. Talk with me if possibly interested.}
683 % % All of the following is optional and typically not needed.
687 % \section<presentation>*{\appendixname}
690 % \begin{frame} \frametitle{Complements and Backup}
691 % No more, stop here. Questions? (now or later).
694 % \begin{frame}{The Industrial Challenge.}{Getting the Consulting Right.}
699 % \item Recording assumptions for the next data analyst, reviewer.
700 % Use \texttt{itemize} a lot.
702 % Use very short sentences or short phrases.
707 % \begin{frame}{The Industrial Challenge.}{Getting the Right Research Fast.}
713 % Use \texttt{itemize} a lot.
715 % Use very short sentences or short phrases.
720 % \begin{frame}{Explicating the Work-flow}{QA/QC-based improvements.}
725 % \section{Motivation}
727 % \subsection{IT Can Speed up Deliverables in Statistical Practice}
729 % \begin{frame}{Our Generic Work-flow and Life-cycle}
730 % {describing most data analytic activities}
733 % \item Scope out the problem
734 % \item Sketch out a potential solution
735 % \item Implement until road-blocks appear
736 % \item Deliver results
742 % \item 1st e-draft of text/code/date (iterate to \#1, discarding)
743 % \item cycle through work
745 % \item ``throw-away''
% but there is valuable information that could enable the next
751 % \begin{frame}[fragile]{Paper $\rightarrow$ Computer $\rightarrow$ Article $\rightarrow$ Computer}{Cut and Paste makes for large errors.}
753 % \item Problems in a regulatory setting
754 % \item Regulatory issues are just ``best practices''
757 % Why do we ``copy/paste'', or analogously, restart our work?
761 % \item every time we repeat, we reinforce the idea in our brain
762 % \item review of ideas can help improve them
767 % \item introduction of mistakes
768 % \item loss of historical context
769 % \item changes to earlier work (on a different development branch)
774 % \section{Semantics and Statistical Practice}
778 % \frametitle{Statistical Activity Leads to Reports}
779 % \framesubtitle{You read what you know, do you understand it?}
781 % How can we improve the communication of the ideas we have?
783 % Precision of communication?
789 % \begin{frame} \frametitle{Communication Requires Context}
790 % \framesubtitle{Intentions imply more than one might like...}
793 % \item Consideration of what we might do
794 % \item Applications with related functionality
801 % \frametitle{Design Patterns}
802 % \framesubtitle{Supporting Work-flow Transitions}
804 % (joint work with H Wickham): The point of this research program is
805 % not to describe what to do at any particular stage of work, but to
806 % encourage researchers and practitioners to consider how the
807 % translation and transfer of information between stages so that work
810 % Examples of stages in a work-flow:
812 % \item planning, execution, reporting;
813 % \item scoping, illustrative examples or counter examples, algorithmic construction,
815 % \item descriptive statistics, preliminary inferential analysis,
816 % model/assumption checking, final inferential analysis,
817 % communication of scientific results
819 % Description of work-flows is essential to initiating discussions on
820 % quality/efficiency of approaches to work.
823 % \section{Design Challenges}
826 % \frametitle{Activities are enhanced by support}
829 % \item Mathematical manipulation can be enhanced by symbolic
831 % \item Statistical programming can be enabled by examples and related
832 % algorithm implementation
833 % \item Datasets, to a limited extent, can self-describe.
838 % \frametitle{Executable and Computable Science}
840 % Use of algorithms and construction to describe how things work.
842 % Support for agent-based approaches
847 % \frametitle{What is Data? Metadata?}
849 % Data: what we've observed
851 % MetaData: context for observations, enables semantics.
857 % % \begin{frame}[fragile]
858 % % \frametitle{Defining Variables}
859 % % \framesubtitle{Setting variables}
861 % % (setq <variable> <value>)
865 % % (setq ess-source-directory
866 % % "/home/rossini/R-src")
870 % % \begin{frame}[fragile]
871 % % \frametitle{Defining on the fly}
873 % % (setq ess-source-directory
874 % % (lambda () (file-name-as-directory
875 % % (expand-file-name
876 % % (concat (default-directory)
877 % % ess-suffix "-src")))))
879 % % (Lambda-expressions are anonymous functions, i.e. ``instant-functions'')
883 % % \begin{frame}[fragile]
884 % % \frametitle{Function Reuse}
885 % % By naming the function, we could make the previous example reusable
888 % % (defun my-src-directory ()
889 % % (file-name-as-directory
890 % % (expand-file-name
891 % % (concat (default-directory)
892 % % ess-suffix "-src"))))
896 % % (setq ess-source-directory (my-src-directory))
902 % % \frametitle{Equality Among Packages}
904 % % \item more/less equal can be described specifically through
905 % % overriding imports.
910 % \subsection<presentation>*{For Further Reading}
912 % \begin{frame}[allowframebreaks]
913 % \frametitle<presentation>{Related Material}
915 % \begin{thebibliography}{10}
917 % \beamertemplatebookbibitems
918 % % Start with overview books.
920 % \bibitem{LispStat1990}
922 % \newblock {\em LispStat}.
924 % \beamertemplatearticlebibitems
925 % % Followed by interesting articles. Keep the list short.
927 % \bibitem{Rossini2001}
929 % \newblock Literate Statistical Practice
930 % \newblock {\em Proceedings of the Conference on Distributed
931 % Statistical Computing}, 2001.
933 % \bibitem{RossiniLeisch2003}
934 % AJ.~Rossini and F.~Leisch
935 % \newblock Literate Statistical Practice
936 % \newblock {\em Technical Report Series, University of Washington
937 % Department of Biostatistics}, 2003.
939 % \beamertemplatearrowbibitems
940 % % Followed by interesting articles. Keep the list short.
943 % Common Lisp Stat, 2008.
944 % \newblock \url{http://repo.or.cz/CommonLispStat.git/}
946 % \end{thebibliography}