\setbeamercovered{transparent}
\usepackage[english]{babel}
\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\title[CLS]{Common Lisp Statistics}
\subtitle{Using History to design better data analysis environments}
\author[Rossini]{Anthony~(Tony)~Rossini}
\institute[Novartis and University of Washington] % (optional, but mostly needed)
{
  Group Head, Modeling and Simulation\\
  Novartis Pharma AG, Switzerland
  \and
  Affiliate Assoc Prof, Biomedical and Health Informatics\\
  University of Washington, USA
}
\date[Rice 09]{Rice, Mar 2009}
\subject{Statistical Computing Environments}
\begin{frame}{Outline}
  \tableofcontents
\end{frame}
\section{Preliminaries}

\begin{frame}{Goals}{(define, strategic approach, justify)}
  \begin{itemize}
  \item To describe the concept of \alert{computable and executable}
  \item To demonstrate that \alert{there exists a research program}
    consisting of simple steps which is adaptable to a practitioner's
    work habits, which is feasible and introduces relatively minimal
  \item To justify that the \alert{approach is worthwhile} and
    represents a staged effort towards \alert{increased use of best}
  \end{itemize}
  (unfortunately, the last is still incomplete)
\end{frame}
\subsection{Background}

\begin{frame}{Many systems existed concurrently for statistical}
  \begin{itemize}
  \item LispStat (ViSta, ARC)
  \item XGobi (Orca, GGobi, Statistical Reality Engine)
  \end{itemize}
\end{frame}
\begin{frame}{Why Lisp?}
  \begin{enumerate}
  \item Lisp as an ancient ``AI'' language; Statistics as ``artificial
    intelligence'' (not real intelligence, \alert{humans are too
      flawed and inconsistent} for Bayesian work to be anything but
  \item Semantics and context are important: well supported by Lisp
  \item Lisp's parentheses describe single, multi-scale,
    \alert{complete thought}. See \#1 for why that could make it
  \end{enumerate}
  Aside: Common Lisp is the building block for all my current research
\end{frame}
\begin{frame}
  \frametitle{Semantics and Statistics}
  \begin{itemize}
  \item There have been many wonderful talks about the semantic web which \\
    \alert{demonstrated its coolness} \\
    but \\
    \alert{failed to demonstrate its usefulness}.\\
    This talk follows in the tradition of such giants\ldots{}
  \item Biological informatics support (GO, Entrez) has allowed for
    precise definitions of concepts in biology.
  \item It is a shame that a field like statistics, requiring such
    precision, has less such support than an imprecise and temporally
    unstable field such as biology\ldots
  \end{itemize}
\end{frame}
\begin{frame}{Context}{(where I'm coming from, my ``priors'')}
  \begin{itemize}
  \item Pharmaceutical Industry
  \item Modeling and Simulation uses mathematical models/constructs to
    record beliefs for explication, clinical team alignment, decision
    support, and quality management.
  \item My major role at Novartis is to work at the intersection of
    biomedical informatics, statistics, and mathematical modeling.
  \item I need a mix of applications and novel research development to
    solve challenges better, faster, more efficiently.
  \item Data analysis is a specialized approach to computer
    programming, \alert{different} from applications programming or
  \item \alert{Nearly all of the research challenges I face today
      existed for me in academia, and vice-versa.}
  \end{itemize}
\end{frame}
\subsection{Illustrating Computable / Executable Statistics}

\begin{frame}{Can we compute?}
  For the following examples, the critical question becomes:
  \centerline{\alert{Can we compute with it?}}
\end{frame}
\begin{frame}[fragile]{Example 1: Theory\ldots}
  Let $f(x; \theta)$ describe the likelihood of XX under the following

  Then if we use the following algorithm:

  then $\hat{\theta}$ should be $N(0, \hat\sigma^2)$ with the following
  characteristics\ldots
\end{frame}
\begin{frame}
  \frametitle{Can we compute, using this description?}
  Given the information at hand:
  \begin{itemize}
  \item we ought to have a framework for initial coding for the
    actual simulations (test-first!)
  \item the implementation is somewhat clear
  \end{itemize}
\end{frame}

\begin{frame}
  \frametitle{Example 2: Practice\ldots}
  The dataset comes from a series of clinical trials. We model the
  primary endpoint, ``relief'', as a binary random variable. There is a random
  trial effect on relief as well as severity due to differences in
  recruitment and inclusion/exclusion criteria.
\end{frame}

\begin{frame}
  \frametitle{Can we compute, using this description?}
  \begin{itemize}
  \item With a real description of this kind, it is clear what some of the
    potential models might be for this dataset
  \item It should be clear how to start thinking of a data dictionary
  \end{itemize}
\end{frame}
\begin{frame}{Example 3: The Round-trip\ldots}
  The first examples describe ``ideas $\rightarrow$ code''

  Consider the last time you read someone else's implementation of a
  statistical procedure (e.g., R package code). When you read the
  \begin{itemize}
  \item the assumptions used?
  \item the algorithm implemented?
  \item practical guidance for when you might select the algorithm
  \item practical guidance for when you might select the
    implementation over others?
  \end{itemize}
  These are usually components of any reasonable journal article.
  \textit{(Q: have you actually read an R package that wasn't yours?)}
\end{frame}
\subsection{IT Can Speed up Deliverables in Statistical Practice}

\begin{frame}{Our Generic Work-flow and Life-cycle}
  {describing most data analytic activities}
  \begin{enumerate}
  \item Scope out the problem
  \item Sketch out a potential solution
  \item Implement until road-blocks appear
  \item Deliver results
  \end{enumerate}
  \begin{itemize}
  \item 1st e-draft of text/code/date (iterate to \#1, discarding)
  \item cycle through work
  \end{itemize}
  but there is valuable information that could enable the next
\end{frame}
\begin{frame}[fragile]{Paper $\rightarrow$ Computer $\rightarrow$ Article $\rightarrow$ Computer}{Cut and Paste makes for large errors.}
  \begin{itemize}
  \item Problems in a regulatory setting
  \item Regulatory issues are just ``best practices''
  \end{itemize}
  Why do we ``copy/paste'', or analogously, restart our work?
  \begin{itemize}
  \item every time we repeat, we reinforce the idea in our brain
  \item review of ideas can help improve them
  \end{itemize}
  \begin{itemize}
  \item introduction of mistakes
  \item loss of historical context
  \item changes to earlier work (on a different development branch)
  \end{itemize}
\end{frame}
\subsection{Literate Programming is insufficient}

\begin{frame}{Literate Statistical Practice.}
  \begin{itemize}
  \item Literate Programming applied to data analysis
  \item among the \alert{most annoying} techniques to integrate into
    work-flow if one is not perfectly methodical.
  \item ESS: supports interactive creation of literate programs.
  \item Sweave: tool which exemplifies reporting context; odfWeave
    primarily simplifies reporting.
  \item Roxygen: primarily supports a literate programming
    documentation style, not a literate data analysis programming
  \item ROI demonstrated in specialized cases: BioConductor.
  \item \alert{usually done after the fact} (final step of work-flow)
    as a documentation/computational reproducibility technique, rarely
    integrated into work-flow.
  \end{itemize}
  Many contributors to this general theory/approach:
  Knuth, Claerbout, de Leeuw, Leisch, Gentleman, Temple-Lang,
\end{frame}
% \frametitle{Literate Programming}
% \framesubtitle{Why is it not enough?}
% \item used for statistics since mid 90s (Emacs/ESS support in 1997)
% \item active popular use with R (Leisch, 2001)
% but it provides a work-flow which is difficult and unnatural for many
% people (no perceived ROI).
\begin{frame}{Related work}
  Mathematica Workbooks for mathematics concepts
  \begin{itemize}
  \item Mathematical storage and reproducibility, what about Statistical
  \item Not open, but freely reproducible.
  \item Some semantics, hopefully this will improve.
  \end{itemize}
  Electronic Lab Notebooks for data and the data/data analytics
  interaction (but not quantitative methodological development).
\end{frame}
\section{Results/Contribution}

% \begin{frame}{Semantic Web}{How do we communicate "things"?}
% Recall Monday evening talk: What kinds of communication problems can we have?
% \item I say "reinigung", you say "waschen"
% \item I say "clean", you say "sauber"
% In the context of our work, how do we communicate what we've done?
\begin{frame}{Communication in Statistical Practice}{\ldots is essential for \ldots}
  \begin{itemize}
  \item receiving information
  \end{itemize}
  \alert{``machine-readable'' communication/computation lets the}

  Semantic Web is about ``machine-enabled computability''.
\end{frame}
\begin{frame}
  \frametitle{Literate Programming}
  \framesubtitle{Why isn't it enough for Data Analysis?}
  Only 2 contexts: (executable) code and documentation. Fine for
  application programming, but for data analysis, we could benefit
  \begin{itemize}
  \item classification of statistical procedures
  \item descriptions of assumptions
  \item pragmatic recommendations
  \item inheritance of structure through the work-flow of a
    statistical methodology or data analysis project
  \item datasets and metadata
  \end{itemize}
  Concept: ontologies describing mathematical assumptions, applications
  of methods, work-flow, and statistical data structures can enable
  machine communication.

  (i.e. informatics framework \`a la biology)
\end{frame}
\begin{frame}
  \frametitle{Semantics}
  \framesubtitle{One definition: description and context}
  Interoperability is the key, with respect to
  \begin{itemize}
  \item ``Finding things''
  \item Applications and activities with related functionality
  \item moving information from one state to another (paper, journal
    article, computer program)
  \item computer programs which implement solutions to similar tasks
  \end{itemize}
\end{frame}
\begin{frame}{Statistical Practice is somewhat restricted}
  {...but in a good sense, enabling potential for semantics...}
  There is a restrictable set of intended actions for what can be done
  -- the critical goal is to be able to make a difference by
  accelerating activities that should be ``computable'':
  \begin{itemize}
  \item restricted natural language processing
  \item mathematical translation
  \item common description of activities for simpler programming/data
    analysis (S approach to objects and methods)
  \end{itemize}
  R is a good primitive start (model formulation approach, simple
  ``programming with data'' paradigm); we should see if we can do
\end{frame}
% \begin{frame}{Semantics}{Capturing Ideas, Concepts, Proposals.}
% \item Capturing the historical state and corresponding decisions is
% essential for developing improved approaches. A common problem in
% ``product development'' (stat research, drug development) is
% cycling through the same issues repeatedly.
% \item These should be captured semantically
% \item Conversion of concepts to computable semantics is sensible
% when you need it, difficult without a compelling reason
% \begin{frame}{Lowering the bounds to interactive work.}
% \item Limitations of object-orientation and information-hiding
% routines: require context in order to keep the context.
% \item Statistical and Data analysis: context is central and obvious.
\subsection{Current Approach / Implementation}

\begin{frame}{Computable and Executable Statistics requires}
  \begin{itemize}
  \item approaches to describe data and metadata (``data'')
    \begin{itemize}
    \item metadata management and integration, driving
    \item data integration
    \end{itemize}
  \item approaches to describe data analysis methods (``models'')
    \begin{itemize}
    \item quantitatively: many ontologies (AMS, etc), few meeting
    \item many substantive fields have implementations
      (bioinformatics, etc) but not well focused.
    \end{itemize}
  \item approaches to describe the specific form of interaction
    (``instances of models'')
    \begin{itemize}
    \item Original idea behind ``Literate Statistical Analysis''.
    \item That idea is suboptimal, more structure needed (not
      necessarily built upon existing...).
    \end{itemize}
  \end{itemize}
\end{frame}
\begin{frame}[fragile]{Representation: XML and Lisp}{executing your data}
  Many people are familiar with XML:
\begin{verbatim}
<name phone="+41793674557">Tony Rossini</name>
\end{verbatim}
  which is shorter in Lisp:
\begin{verbatim}
(name "Tony Rossini" :phone "+41613674557")
\end{verbatim}
  \begin{itemize}
  \item Lisp ``parens'', universally hated by unbelievers, are
    wonderful for denoting when a ``concept is complete''.
  \item Why can't your data self-execute?
  \end{itemize}
\end{frame}
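
\begin{frame}[fragile]{Representation: a self-executing record}{a hedged sketch}
  A minimal sketch of what ``self-executing data'' could mean for the
  Lisp record above. The function \texttt{name} and the variable
  \texttt{*record*} are hypothetical, introduced only for this
  illustration; they are not part of Common Lisp Stat.
\begin{verbatim}
;; Hypothetical constructor, defined only so that the
;; record shown above can be evaluated.
(defun name (full-name &key phone)
  "Return a property list describing one person."
  (list :name full-name :phone phone))

;; Quoted, the record is plain data: a list we can inspect.
(defparameter *record*
  '(name "Tony Rossini" :phone "+41613674557"))

(first *record*)                      ; the symbol NAME
(getf (rest (rest *record*)) :phone)  ; look up a field in the data

;; Evaluated, the same list is a call to NAME.
(getf (eval *record*) :name)
\end{verbatim}
  The same list serves as a stored record and as a program; which one
  it is depends only on whether it is evaluated.
\end{frame}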
\begin{frame}{Common Lisp Stat.}
  Ross talked about Lisp. I generally agree. My current
  research program dates back over 3 years, and:
  \begin{itemize}
  \item Originally based on LispStat (reusability)
  \item Re-factored structure (some numerics worked with a 1990-era code base).
  \item Current activities:
    \begin{itemize}
    \item numerics redone using CFFI-based BLAS/LAPACK (cl-blapack)
    \item matrix interface based on MatLisp
    \item design of graphics system on-going; constraint system
      (Cells) supporting interactivity.
    \item general framework for model specification (regression,
    \end{itemize}
  \end{itemize}
\end{frame}
\begin{frame}{Common Lisp Stat}
  Source code available!

  (but it is ugly, works only in 10 cases, and changes with my moods).
\end{frame}

% \begin{frame}{Delivering Better Data Analyses Faster}
% Industrial settings:
% \item Pharmaceutical companies
% \item Academic departments
% \item Review-centric organizations (Health Authorities, Regulators)
\begin{frame}{Summary}
  \begin{itemize}
  \item In the Pharma industry, it is all about getting the right
    drugs to the patient faster. Data analysis systems seriously
    impact this process, being potentially an impediment or an
  \item \alert{Information technologies can increase the efficiency
      of statistical practice}, though innovation change management
    must be taken into account. (i.e. Statistical practice, while
    considered by some an ``art form'', can benefit from
  \item \alert{Lisp's features match the basic requirements we need}
    (dichotomy: programs as data, data as programs). Sales pitch,
  \item Outlook: Lots of work and experimentation to do!
  \item {\tiny Gratuitous Advert: We are hiring, have student
      internships (undergrad, grad students), and a visiting faculty
      program. Talk with me if possibly interested.}
  \end{itemize}
\end{frame}
% All of the following is optional and typically not needed.
\section<presentation>*{\appendixname}

\begin{frame}
  \frametitle{Complements and Backup}
  No more, stop here. Questions? (now or later).
\end{frame}
\begin{frame}{The Industrial Challenge.}{Getting the Consulting Right.}
  % - A title should summarize the slide in an understandable fashion
  %   for anyone who does not follow everything on the slide itself.
  \begin{itemize}
  \item Recording assumptions for the next data analyst, reviewer.
    Use \texttt{itemize} a lot.
  \item Use very short sentences or short phrases.
  \end{itemize}
\end{frame}

\begin{frame}{The Industrial Challenge.}{Getting the Right Research Fast.}
  % - A title should summarize the slide in an understandable fashion
  %   for anyone who does not follow everything on the slide itself.
  \begin{itemize}
  \item Use \texttt{itemize} a lot.
  \item Use very short sentences or short phrases.
  \end{itemize}
\end{frame}

\begin{frame}{Explicating the Work-flow}{QA/QC-based improvements.}
\end{frame}
\section{Semantics and Statistical Practice}

\begin{frame}
  \frametitle{Statistical Activity Leads to Reports}
  \framesubtitle{You read what you know, do you understand it?}
  How can we improve the communication of the ideas we have?

  Precision of communication?
\end{frame}

\begin{frame}
  \frametitle{Communication Requires Context}
  \framesubtitle{Intentions imply more than one might like...}
  \begin{itemize}
  \item Consideration of what we might do
  \item Applications with related functionality
  \end{itemize}
\end{frame}
\begin{frame}
  \frametitle{Design Patterns}
  \framesubtitle{Supporting Work-flow Transitions}
  (joint work with H Wickham): The point of this research program is
  not to describe what to do at any particular stage of work, but to
  encourage researchers and practitioners to consider the
  translation and transfer of information between stages so that work

  Examples of stages in a work-flow:
  \begin{itemize}
  \item planning, execution, reporting;
  \item scoping, illustrative examples or counter examples, algorithmic construction,
  \item descriptive statistics, preliminary inferential analysis,
    model/assumption checking, final inferential analysis,
    communication of scientific results
  \end{itemize}
  Description of work-flows is essential to initiating discussions on
  quality/efficiency of approaches to work.
\end{frame}
\section{Design Challenges}

\begin{frame}
  \frametitle{Activities are enhanced by support}
  \begin{itemize}
  \item Mathematical manipulation can be enhanced by symbolic
  \item Statistical programming can be enabled by examples and related
    algorithm implementation
  \item Datasets, to a limited extent, can self-describe.
  \end{itemize}
\end{frame}

\begin{frame}
  \frametitle{Executable and Computable Science}
  Use of algorithms and construction to describe how things work.

  Support for agent-based approaches
\end{frame}

\begin{frame}
  \frametitle{What is Data? Metadata?}
  Data: what we've observed

  MetaData: context for observations, enables semantics.
\end{frame}
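
\begin{frame}[fragile]{Data with metadata}{a hedged sketch}
  One way observations could carry their own context, assuming a plain
  property-list layout. The layout, the names \texttt{*relief*} and
  \texttt{decode-observations}, and the observed values are made up
  for illustration; this is not the Common Lisp Stat data structure.
\begin{verbatim}
;; Hypothetical layout: observations plus the context
;; needed to interpret them. Values are invented.
(defparameter *relief*
  (list :data '(1 0 1 1 0)
        :metadata '(:name "relief"
                    :type :binary
                    :levels ((0 . "no relief")
                             (1 . "relief")))))

(defun decode-observations (dataset)
  "Translate each observation into its label via the metadata."
  (let ((levels (getf (getf dataset :metadata) :levels)))
    (mapcar (lambda (x) (cdr (assoc x levels)))
            (getf dataset :data))))

(decode-observations *relief*)
\end{verbatim}
  Without the \texttt{:metadata} entry the zeros and ones are only
  numbers; with it, a program can report them as outcomes.
\end{frame}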
\begin{frame}
  \begin{itemize}
  \item Semantics and Computability have captured a great deal of
    attention in the informatics and business computing R\&D worlds
  \item Statistically-driven Decision Making and Knowledge Discovery
    is, with high likelihood, the next challenging stage after data
  \item Statistical practice (theory and application) can be enhanced,
    made more efficient, providing increased benefit to organizations
    and groups using appropriate methods.
    % \item Lisp as a language, shares characteristics of both Latin
    % (difficult dead language useful for classical training) and German
    % (difficult living language useful for general life).
    % Of course, for some people, they are not difficult.
  \end{itemize}
  The research program described in this talk is currently driving the
  design of CommonLisp Stat, which leverages concepts and approaches
  from the dead and moribund XLisp-Stat project.

  \url{http://repo.or.cz/w/CommonLispStat.git/}
\end{frame}
\section{Common Lisp Statistics}

\begin{frame}
  \frametitle{Interactive Programming}
  \framesubtitle{Everything goes back to being Lisp-like}
  \begin{itemize}
  \item Interactive programming (as originating with Lisp): works
    extremely well for data analysis (Lisp being the original
    ``programming with data'' language).
  \item Theories/methods for how to do this are reflected in styles
  \end{itemize}
\end{frame}
\begin{frame}[fragile]
  Lisp (LISt Processor) is different from most high-level computing
  languages, and is very old (1958). Lisp is built on lists of things
  which are evaluatable.
\begin{verbatim}
(functionName data1 data2 data3)
\end{verbatim}
  versus
\begin{verbatim}
'(functionName data1 data2 data3)
\end{verbatim}
  which builds the same list as
\begin{verbatim}
(list 'functionName 'data1 'data2 'data3)
\end{verbatim}
  The difference is important -- lists of data (the second/third) are
  not (yet?!) functions applied to (unencapsulated lists of) data (the first).
\end{frame}
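
\begin{frame}[fragile]{Quoted versus evaluated lists}{a hedged sketch}
  A small sketch of the distinction above, using the standard
  function \texttt{+} in place of the placeholder
  \texttt{functionName}. Only standard Common Lisp operators appear;
  the numbers are arbitrary.
\begin{verbatim}
;; Unquoted: the first element names a function to call.
(+ 1 2 3)                              ; => 6

;; Quoted: the list is data and nothing is called.
'(+ 1 2 3)                             ; => (+ 1 2 3)

;; ' abbreviates QUOTE; LIST builds an equal list
;; when the symbol is itself quoted.
(equal '(+ 1 2 3) (quote (+ 1 2 3)))   ; => T
(equal '(+ 1 2 3) (list '+ 1 2 3))     ; => T

;; Data becomes a computation again on request.
(eval '(+ 1 2 3))                      ; => 6
(apply #'+ (rest '(+ 1 2 3)))          ; => 6
\end{verbatim}
\end{frame}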
\begin{frame}
  \frametitle{Features}
  \begin{itemize}
  \item Data and Functions semantically the same
  \item Natural interactive use through functional programming with
  \item Batch is a simplification of interactive -- not a special mode!
  \end{itemize}
\end{frame}
% \begin{frame}[fragile]
% \frametitle{Defining Variables}
% \framesubtitle{Setting variables}
% (setq <variable> <value>)
% (setq ess-source-directory
% "/home/rossini/R-src")
% \begin{frame}[fragile]
% \frametitle{Defining on the fly}
% (setq ess-source-directory
% (lambda () (file-name-as-directory
% (concat (default-directory)
% ess-suffix "-src")))))
% (Lambda-expressions are anonymous functions, i.e. ``instant-functions'')
% \begin{frame}[fragile]
% \frametitle{Function Reuse}
% By naming the function, we could make the previous example reusable
% (defun my-src-directory ()
% (file-name-as-directory
% (concat (default-directory)
% ess-suffix "-src"))))
% (setq ess-source-directory (my-src-directory))
% \frametitle{Equality Among Packages}
% \item more/less equal can be described specifically through
% overriding imports.
\subsection<presentation>*{For Further Reading}

\begin{frame}[allowframebreaks]
  \frametitle<presentation>{Related Material}
  \begin{thebibliography}{10}
    \beamertemplatebookbibitems
    % Start with overview books.
  \bibitem{LispStat1990}
    \newblock {\em LispStat}.
    \beamertemplatearticlebibitems
    % Followed by interesting articles. Keep the list short.
  \bibitem{Rossini2001}
    \newblock Literate Statistical Practice
    \newblock {\em Proceedings of the Conference on Distributed
      Statistical Computing}, 2001.
  \bibitem{RossiniLeisch2003}
    AJ.~Rossini and F.~Leisch
    \newblock Literate Statistical Practice
    \newblock {\em Technical Report Series, University of Washington
      Department of Biostatistics}, 2003.
    \beamertemplatearrowbibitems
    % Followed by interesting articles. Keep the list short.
  \bibitem{CommonLispStat2008}
    Common Lisp Stat, 2008.
    \newblock \url{http://repo.or.cz/CommonLispStat.git/}
  \end{thebibliography}
\end{frame}