;;; -*- mode: lisp -*-

;;; Time-stamp: <2009-12-22 22:13:48 tony>
;;; Creation:   <2005-08-xx 21:34:07 rossini>
;;; File:       data.lisp
;;; Author:     AJ Rossini <blindglobe@gmail.com>
;;; Copyright:  (c)2005--2009, AJ Rossini.  GPLv2
;;; Purpose:    data package for lispstat

;;; What is this talk of 'release'?  Klingons do not make software
;;; 'releases'.  Our software 'escapes', leaving a bloody trail of
;;; designers and quality assurance people in its wake.

;;; This organization and structure is new to the 21st Century
;;; version.

(in-package :cls-data)

;;; The purpose of this package is to manage data which will be
;;; processed by LispStat.  In particular, it will be important to
;;; register variables, datasets, relational structures, and other
;;; objects which could be the target for statistical modeling and
;;; inference.

;;; Data management, in the context of this system, needs to consider
;;; the following:
;;; # multiscale metadata: describing at the variable, observation,
;;;   dataset, and collection-of-datasets scales
;;; # data import: push-based (ETL functionality (extract, transform,
;;;   load), externally driven); and pull-based (standard data import
;;;   functionality)
;;; # triggers or conditions, for automating situational events
;;; # data export functionality

;;; Consider that data has 3 genotypic characteristics.
;;; #1: storage form -- scalar, vector, array.
;;; #2: datarep ("computer science simplistic data") type, such as
;;;     integer, real, string, symbol.
;;; #3: statrep (statistical type), "usually handled by computer
;;;     science approaches via metadata", augmenting the datarep type
;;;     with its use in a statistical context, i.e. that would include
;;;     nominal, ordinal, integer, continuous, interval (orderable
;;;     subtypes).
;;;
;;; Clearly, the statistical type can be inherited, and likewise the
;;; numerical type.  The form can be pushed up or simplified as
;;; necessary, but this can be challenging.
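
;;; For example, an age-group variable might be stored as a vector
;;; (storage form) of symbols (datarep) and analysed as ordinal
;;; (statrep).  A hypothetical sketch in terms of the classes defined
;;; further below in this file:
;;
;; (make-instance 'ordinal-statistical-variable
;;                :factor-levels '(child adult senior)
;;                :ordering '(child adult senior))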

;;; The first approach considered is for CLS to handle this as
;;; lisp-only structures.  When we realize an "abstract" model, the
;;; data should be able to be pushed by appropriate triggers (either
;;; "en masse", or "on-demand") into an appropriate linear algebra
;;; framework.

;;; There is some excellent material on this by John Chambers in one
;;; of his earlier books.  The specific reference is omitted to
;;; encourage people to read them all.  With all due respect to John,
;;; they've lasted quite well, but need to be updated.

;;; Data (storage) Types, dt-{.*}
;;;
;;; Data types are the representation of data from a computer-science
;;; perspective, i.e. what it is that they contain, in the sense of
;;; scalars, arrays, networks, but not the actual values or the
;;; statistical behaviour of the values.  These types include
;;; particular forms of compound types (i.e. a dataframe is
;;; array-like, but the types can differ, with the difference being
;;; row-wise, while an array is a compound of elements of the same
;;; type).
;;;
;;; This is completely subject to change, AND HAS.  We use a class
;;; hierarchy to generate the types, deriving from the virtual
;;; dataframe-like and matrix-like classes to construct what we think
;;; we might need, in terms of variables, observations, datasets, and
;;; collections-of-datasets.
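
;;; A sketch of such a derivation (DATAFRAME-LIKE is assumed to be the
;;; virtual class from dataframe.lisp; the class and slot names below
;;; are illustrative only, not settled API):
;;
;; (defclass clinical-trial-dataset (dataframe-like)
;;   ((arm-labels :initarg :arm-labels
;;                :documentation "per-column treatment-arm metadata")))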

;;; Statistical Variable Types, sv-{.*} or statistical-variable-{.*}
;;;
;;; Statistical variable types work to represent the statistical
;;; category represented by the variable, i.e. nominal, ordinal,
;;; integral, continuous, ratio.  This metadata can be used to hint at
;;; appropriate analysis methods -- or perhaps more critically, to
;;; define how these methods will fail in the final interpretation.

;;; Originally, these were considered to be types, but now we
;;; consider this in terms of abstract classes and mix-ins.

;;; STATISTICAL VARIABLES SHOULD BE XARRAY'd

;; Need to distinguish between empirical and model-based realizations
;; of variables.  Do we balance by API design, or should we ensure that
;; one or the other is more critical (via the naming convention of
;; adding a description to the class name)?

(defclass empirical-statistical-variable ()
  ((number-of-observations :initform 0
                           :initarg :nobs
                           :accessor nobs
                           ;; :type generalized-sequence ; sequence or array
                           :documentation "number of statistically
independent observations in the current context (assuming design,
marginalization, and conditioning to create the current dataset from
which this variable came)."))
  (:documentation "basic class indicating that we are working with a
statistical variable (arising from an actual set of observations or a
virtual / hypothesized set)."))
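
;;; Minimal construction sketch (initarg as defined above):
;;
;; (defparameter *heights*
;;   (make-instance 'empirical-statistical-variable :nobs 25))
;; (nobs *heights*) ; => 25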

(defclass modelbased-statistical-variable ()
  ((density/mass-function :initform nil
                          :initarg :pdmf
                          :accessor pdmf
                          :type function
                          :documentation "core function indicating the
probability of a set of observations")
   (draw-function :initform nil
                  :initarg :drawf
                  :accessor draw ; must match cl-random API
                  :type function
                  :documentation "function for drawing an observation;
should take an optional RV arg for selecting the stream to draw
from."))
  (:documentation "model-based statistical variables have observations
which come from a model.  Core information is simply how to compute
probabilities, and how to draw a new realization.  All else should be
derivable from these two.  Possibly we need additional metadata for
working with these?"))
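
;;; Construction sketch for a model-based variable; the unit-uniform
;;; density and draw function below are illustrative stand-ins only:
;;
;; (defparameter *unit-uniform*
;;   (make-instance 'modelbased-statistical-variable
;;                  :pdmf (lambda (x) (if (<= 0d0 x 1d0) 1d0 0d0))
;;                  :drawf (lambda (&optional rv)
;;                           (declare (ignore rv))
;;                           (random 1d0))))
;; (funcall (pdmf *unit-uniform*) 0.3d0) ; => 1.0d0
;; (funcall (draw *unit-uniform*))       ; => a draw from [0,1)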

(defclass categorical-statistical-variable
    (statistical-variable)
  ((factor-levels :initform nil
                  :initarg :factor-levels
                  :accessor factor-levels
                  :type sequence
                  :documentation "the possible levels which the
variable may take.  These should be a (possibly proper) superset of
the actual current levels observed in the variable.")))
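
;;; Construction sketch (assumes the STATISTICAL-VARIABLE superclass
;;; named above is defined elsewhere; levels may be a superset of
;;; those actually observed):
;;
;; (make-instance 'categorical-statistical-variable
;;                :factor-levels '(apoe2 apoe3 apoe4))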

(defclass nominal-statistical-variable
    (categorical-statistical-variable)
  ()
  (:documentation "currently identical to the categorical variable; no
true difference from the most general state."))

(defclass ordinal-statistical-variable
    (nominal-statistical-variable)
  ((ordering :initform nil
             :initarg :ordering
             :accessor ordering
             :type sequence
             :documentation "levels are completely ordered, and this
should be an ordered sequence (prefer array/vector?) of unique
levels.  (do we need a partially ordered variant?)"))
  (:documentation "categorical variable whose levels are completely ordered."))
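
;;; Construction sketch; ORDERING repeats the levels in their complete
;;; order:
;;
;; (make-instance 'ordinal-statistical-variable
;;                :factor-levels '(low medium high)
;;                :ordering '(low medium high))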

(defclass continuous-statistical-variable
    (statistical-variable)
  ((support :initform nil
            :accessor support
            :type sequence
            :documentation "Support is used in the sense of
probability support, and should be a range, list of ranges, t
(indicating whole space), or nil (indicating measure-0 space)."))
  (:documentation "empirical characteristics for a continuous
statistical variable"))
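
;;; Construction sketch; SUPPORT has no initarg above, so it is set
;;; through the accessor after construction (the cons-pair range
;;; representation here is purely illustrative):
;;
;; (let ((resting-heart-rate
;;         (make-instance 'continuous-statistical-variable)))
;;   (setf (support resting-heart-rate) '((30 . 220)))
;;   resting-heart-rate)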

(defmethod print-object ((object statistical-variable) stream)
  "Need to work through how to print various objects.  Statvars don't
necessarily have data yet!"
  (print-unreadable-object (object stream :type t)
    (format stream "nobs=~d" (nobs object))))

(defmethod print-object ((object categorical-statistical-variable) stream)
  "Need to work through how to print various objects.  Statvars don't
necessarily have data yet!  Here, we should print out the stat-var
information (pass to the superclass) and then print out the factor
levels if short enough (exact class).  Useful to review methods-mixing
for this; the first bit should be identical to stat-var."
  (print-unreadable-object (object stream :type t)
    (format stream "nobs=~d" (nobs object))
    (format stream " levels=~A" (factor-levels object))))
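
;;; Intended shape of the printed representation (illustrative only;
;;; assumes an instance for which both NOBS and FACTOR-LEVELS are
;;; available):
;;
;;   #<CATEGORICAL-STATISTICAL-VARIABLE nobs=10 levels=(LOW MEDIUM HIGH)>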

;;; Observations

(defclass statistical-observation ()
  ((measurement-types :initform nil
                      :initarg :measurement-types
                      :accessor measurement-types
                      :type sequence
                      :documentation "sequence of types corresponding
to the classes of the entries which have been measured/recorded to
form the observation.")
   (record :initform nil
           :initarg :record
           :accessor record
           :type sequence
           :documentation "the sequence of data which is a realization
of the corresponding measurement type"))
  (:documentation "denotes a vector of measurements, not necessarily
simple (i.e. entries could be scalar, array, network) which can be
assumed to be independent or at least conditionally independent given
measurements external to the collected dataset.  Failure of this
condition implies a single observation, not multiple observations."))
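
;;; Construction sketch for a single observation (a subject's age and
;;; genotype; the type list is illustrative):
;;
;; (make-instance 'statistical-observation
;;                :measurement-types (list 'continuous-statistical-variable
;;                                         'nominal-statistical-variable)
;;                :record (list 43 'apoe3))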

;;; At this point, from a dataframe, which is just a simple holding
;;; structure, we should be able to extract variables and
;;; observations, which ought to be coherent, atomic, complex objects.
;;; (To create a wonderful contradiction: consider the time-series of
;;; the Dow Jones Industrial Average -- in this case, we would have a
;;; dataset consisting of 1 observation and 1 variable -- which would
;;; be the singular time series (at whatever temporal resolution was
;;; desired).)

;;; For now, we need to have a means of extracting components of the
;;; dataframe into corresponding variables and observations as
;;; needed.  We don't build up the dataframe directly from variables
;;; (yet -- this could change as we consider the workflow/API
;;; approach) but rather we tear down the dataframe through
;;; consideration of variables and observations.
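
;;; A hypothetical sketch of such a tear-down operation (the generic
;;; function name and protocol below are illustrative only, not
;;; settled API):
;;
;; (defgeneric dataframe->variable (df column)
;;   (:documentation "Extract COLUMN of the dataframe DF as a
;;   statistical-variable instance, using column metadata to select
;;   the appropriate variable class."))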

;;; This naturally means that this is metadata on top of the
;;; dataframe, rather than building the dataframe on top of metadata.
;;; For pragmatic reasons, it isn't always clear that the dataframe
;;; MUST correspond to the particular instance of the practical
;;; statistical philosophy espoused in this system.  But at some more
;;; mature point, it should be.