;;; Time-stamp: <2009-08-31 17:52:26 tony>
;;; Creation:   <2005-08-xx 21:34:07 rossini>
;;; Author:     AJ Rossini <blindglobe@gmail.com>
;;; Copyright:  (c)2005--2009, AJ Rossini.  GPLv2
;;; Purpose:    data package for lispstat

;;; What is this talk of 'release'?  Klingons do not make software
;;; 'releases'.  Our software 'escapes', leaving a bloody trail of
;;; designers and quality assurance people in its wake.

;;; This organization and structure is new to the 21st Century

(in-package :lisp-stat-data)

;;; The purpose of this package is to manage data which will be
;;; processed by LispStat.  In particular, it will be important to
;;; register variables, datasets, relational structures, and other
;;; objects which could be the target for statistical modeling and
;;; analysis.

;;; Data management, in the context of this system, needs to consider:
;;; # multiscale metadata: describing at the variable, observation,
;;;   dataset, and collection-of-datasets scales
;;; # data import: push-based (ETL functionality (extract, transform,
;;;   load), externally driven) and pull-based (standard data import)
;;; # triggers or conditions, for automating situational events
;;; # data export functionality

;;; Consider that data has 3 genotypic characteristics:
;;; #1: storage form -- scalar, vector, array.
;;; #2: datarep ("computer science simplistic data") type, such as
;;;     integer, real, string, symbol.
;;; #3: statrep (statistical type), "usually handled by computer
;;;     science approaches via metadata", augmenting datarep type with
;;;     use in a statistical context, i.e. that would include nominal,
;;;     ordinal, integer, continuous, interval (orderable subtypes).

;;; Clearly, the statistical type can be inherited, and likewise the
;;; numerical type.  The form can be pushed up or simplified as
;;; necessary, but this can be challenging.

;;; The first approach considered is for CLS to handle this as
;;; lisp-only structures.  When we realize an "abstract" model, the
;;; data should be able to be pushed by appropriate triggers (either
;;; "en masse", or "on-demand") into an appropriate linear algebra

;;; There is some excellent material on this by John Chambers in one
;;; of his earlier books.  The reference is deliberately omitted to
;;; encourage people to read them all.  With all due respect to John,
;;; they've lasted quite well, but need to be updated.

;;; Data (storage) Types, dt-{.*}

;;; Data types are the representation of data from a computer-science
;;; perspective, i.e. what it is that they contain, in the sense of
;;; scalars, arrays, networks, but not the actual values or
;;; statistical behavior of the values.  These types include
;;; particular forms of compound types (e.g. a dataframe is
;;; array-like but its element types differ, the difference being
;;; row-wise, while an array is a compound of elements of the same
;;; type).

;;; This is completely subject to change, AND HAS.  We use a class
;;; hierarchy to generate the types, deriving from the virtual
;;; dataframe-like and matrix-like classes to construct what we think
;;; we might need, in terms of variables, observations, datasets, and
;;; collections-of-datasets.

;;; Statistical Variable Types, sv-{.*} or statistical-variable-{.*}

;;; Statistical variable types work to represent the statistical
;;; category represented by the variable, i.e. nominal, ordinal,
;;; integral, continuous, ratio.  This metadata can be used to hint at
;;; appropriate analysis methods -- or perhaps more critically, to
;;; define how these methods will fail in the final interpretation.

;;; Originally, these were considered to be types, but now we
;;; consider this in terms of abstract classes and mix-ins.

;;; STATISTICAL VARIABLES SHOULD BE XARRAY'd

;; Need to distinguish between empirical and model-based realizations
;; of variables.  Do we balance by API design, or should we ensure that
;; one or the other is more critical (via naming convention of adding
;; description to class name)?

(defclass empirical-statistical-variable
    ()  ; superclass list is absent from this copy; () is a placeholder
  ((number-of-observations
    :initform 0
    ;; :type generalized-sequence ; sequence or
    :documentation "number of statistically
    independent observations in the current context (assuming design,
    marginalization, and conditioning to create the current dataset
    from which this variable came)."))
  (:documentation "basic class indicating that we are working with a
    statistical variable (arising from an actual set of observations or
    a virtual / hypothesized set)."))

(defclass modelbased-statistical-variable
    ()  ; superclass list is absent from this copy; () is a placeholder
  ((density/mass-function
    :initform nil
    :documentation "core function indicating
    probability of a set of observations")
   (draw-function
    :initform nil
    :accessor draw  ; must match cl-random API
    :documentation "function for drawing an observation,
    should take an optional RV arg for selecting the stream to draw
    from."))
  (:documentation "model-based statistical variables have observations
    which come from a model.  Core information is simply how to
    compute probabilities, and how to draw a new realization.  All
    else should be derivable from these two.  Possibly we need
    additional metadata for working with these?"))
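
;;; Illustration only (not part of the package): a minimal sketch of
;;; the two functions a model-based variable is described as needing,
;;; here for a standard uniform variable.  The names UNIFORM-DENSITY
;;; and UNIFORM-DRAW are hypothetical, and treating the optional
;;; "stream" argument as a CL RANDOM-STATE is an assumption; cl-random
;;; may expect something different.  The block is commented out so it
;;; has no effect when the file is loaded.
#|
(defun uniform-density (x)
  "Density of the uniform distribution on [0,1) evaluated at X."
  (if (and (<= 0 x) (< x 1)) 1 0))

(defun uniform-draw (&optional (rng *random-state*))
  "Draw one realization from the uniform distribution on [0,1)."
  (random 1.0 rng))

(uniform-density 0.25)  ; density at a point inside the support
(uniform-density 2)     ; density at a point outside the support
(uniform-draw)          ; one draw using the default random state
|#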

(defclass categorical-statistical-variable
    (statistical-variable)
  ((factor-levels
    :initform nil
    :initarg :factor-levels
    :accessor factor-levels
    :documentation "the possible levels which the
    variable may take.  These should be a (possibly proper) superset
    of the actual current levels observed in the variable.")))

(defclass nominal-statistical-variable
    (categorical-statistical-variable)
  ()
  (:documentation "currently identical to categorical variable, no
    true difference from the most general state."))

(defclass ordinal-statistical-variable
    (nominal-statistical-variable)
  ((ordering
    :initform nil
    :documentation "levels are completely ordered, and this
    should be an ordered sequence (prefer array/vector?) of unique
    levels. (do we need a partially ordered variant?)"))
  (:documentation "categorical variable whose levels are completely ordered."))

(defclass continuous-statistical-variable
    (statistical-variable)
  ((support
    :initform nil
    :documentation "Support is used in the sense of
    probability support, and should be a range, list of ranges, t
    (indicating whole space), or nil (indicating measure-0 space)."))
  (:documentation "empirical characteristics for a continuous
    statistical variable"))
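
;;; Illustration only (not part of the package): a sketch of how the
;;; four kinds of SUPPORT value described for the continuous class
;;; could be interpreted.  The function IN-SUPPORT-P is hypothetical,
;;; and representing a range as a cons (LOW . HIGH) with closed
;;; endpoints is an assumption; the source does not say how ranges
;;; are stored.  The block is commented out so it has no effect when
;;; the file is loaded.
#|
(defun in-support-p (x support)
  "True when the real number X lies in SUPPORT."
  (cond ((eq support t) t)              ; t: the whole space
        ((null support) nil)            ; nil: measure-0 space
        ((numberp (car support))        ; a single range (low . high)
         (<= (car support) x (cdr support)))
        (t                              ; a list of ranges
         (some (lambda (range) (in-support-p x range)) support))))

(in-support-p 0.5 '(0 . 1))             ; single range
(in-support-p 3 '((0 . 1) (2 . 4)))     ; list of ranges
(in-support-p 42 t)                     ; whole space
|#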

(defmethod print-object ((object statistical-variable) stream)
  "Need to work through how to print various objects.  Statvars don't
  necessarily have data yet!"
  (print-unreadable-object (object stream :type t)
    (format stream "nobs=~d" (nobs object))))

(defmethod print-object ((object categorical-statistical-variable) stream)
  "Need to work through how to print various objects.  Statvars don't
  necessarily have data yet!  Here, we should print out the stat-var
  information (pass to superclass) and then print out factor levels if
  short enough (exact class).  Reviewing methods-mixing would be
  useful here."
  (print-unreadable-object (object stream :type t)
    (format stream "nobs=~d" (nobs object))
    ;; leading space keeps the two fields from running together
    (format stream " levels=~A" (factor-levels object))))
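
;;; Illustration only: a sketch of how the categorical class and its
;;; PRINT-OBJECT method might be exercised at the REPL.  It assumes
;;; that a STATISTICAL-VARIABLE base class responding to NOBS is
;;; available (it is not defined in this part of the file); the level
;;; names are made up.  The block is commented out so it has no effect
;;; when the file is loaded.
#|
(let ((v (make-instance 'categorical-statistical-variable
                        :factor-levels '("low" "mid" "high"))))
  (factor-levels v)  ; the declared levels, via the slot accessor
  (print v))         ; dispatches to the PRINT-OBJECT method above
|#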