;;; -*- mode: lisp -*-

;;; Time-stamp: <2009-08-31 17:52:26 tony>
;;; Creation: <2005-08-xx 21:34:07 rossini>
;;; File: data.lisp
;;; Author: AJ Rossini <blindglobe@gmail.com>
;;; Copyright: (c)2005--2009, AJ Rossini. GPLv2
;;; Purpose: data package for lispstat

;;; What is this talk of 'release'? Klingons do not make software
;;; 'releases'. Our software 'escapes', leaving a bloody trail of
;;; designers and quality assurance people in its wake.

;;; This organization and structure is new to the 21st Century
;;; version.

(in-package :lisp-stat-data)

;;; The purpose of this package is to manage data which will be
;;; processed by LispStat. In particular, it will be important to
;;; register variables, datasets, relational structures, and other
;;; objects which could be the target for statistical modeling and
;;; inference.

;;; Data management, in the context of this system, needs to consider
;;; the following:
;;; # multiscale metadata: describing at the variable, observation,
;;;   dataset, and collection-of-datasets scales
;;; # data import: push-based (ETL functionality (extract, transform,
;;;   load), externally driven); and pull-based (standard data import
;;;   functionality)
;;; # triggers or conditions, for automating situational events
;;; # data export functionality

;;; Consider that data has 3 genotypic characteristics:
;;; #1: storage form -- scalar, vector, array.
;;; #2: datarep ("computer science simplistic data") type, such as
;;;     integer, real, string, symbol.
;;; #3: statrep (statistical type), "usually handled by computer
;;;     science approaches via metadata", augmenting the datarep type
;;;     with its use in a statistical context, e.g. nominal, ordinal,
;;;     integer, continuous, interval (orderable subtypes).
;;;
;;; Clearly, the statistical type can be inherited, and likewise the
;;; numerical type. The form can be pushed up or simplified as
;;; necessary, but this can be challenging.
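
;;; A minimal sketch of the three characteristics above, not part of
;;; the package API: EXAMPLE-STORAGE-FORM and EXAMPLE-DATAREP are
;;; hypothetical helpers which recover the storage form and the
;;; datarep from a Lisp value. The statrep cannot be recovered this
;;; way; it has to be supplied as metadata.

(defun example-storage-form (x)
  "Return :scalar, :vector, or :array for the Lisp value X."
  (cond ((stringp x) :scalar)   ; assumption: a string is one datum
        ((vectorp x) :vector)
        ((arrayp x) :array)
        (t :scalar)))

(defun example-datarep (x)
  "Return a keyword naming the datarep of the scalar X."
  (typecase x
    (integer :integer)          ; tested before REAL, its supertype
    (real :real)
    (string :string)
    (symbol :symbol)
    (t :unknown)))

;; e.g. (example-storage-form #(1 2 3)) is :vector and
;; (example-datarep 1.5) is :real, while the statrep of 1.5
;; (continuous? interval?) must come from metadata.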

;;; The first approach considered is for CLS to handle this as
;;; lisp-only structures. When we realize an "abstract" model, the
;;; data should be able to be pushed by appropriate triggers (either
;;; "en masse", or "on-demand") into an appropriate linear algebra
;;; framework.

;;; There is some excellent material on this by John Chambers in one
;;; of his earlier books. The reference is omitted to encourage
;;; people to read them all. With all due respect to John, they've
;;; lasted quite well, but need to be updated.


;;; Data (storage) Types, dt-{.*}
;;;
;;; Data types are the representation of data from a computer-science
;;; perspective, i.e. what it is that they contain, in the sense of
;;; scalars, arrays, networks, but not the actual values or
;;; statistical behavior of the values. These types include
;;; particular forms of compound types (e.g. a dataframe is array-like
;;; but its types differ, the difference being row-wise, while an
;;; array is a compound of elements of the same type).
;;;
;;; This is completely subject to change, AND HAS. We use a class
;;; hierarchy to generate the types, deriving from the virtual
;;; dataframe-like and matrix-like classes to construct what we think
;;; we might need, in terms of variables, observations, datasets, and
;;; collections-of-datasets.

;;; Statistical Variable Types, sv-{.*} or statistical-variable-{.*}
;;;
;;; Statistical variable types work to represent the statistical
;;; category represented by the variable, i.e. nominal, ordinal,
;;; integral, continuous, ratio. This metadata can be used to hint at
;;; appropriate analysis methods -- or perhaps more critically, to
;;; define how these methods will fail in the final interpretation.

;;; Originally, these were considered to be types, but now we
;;; consider this in terms of abstract classes and mix-ins.
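
;;; A minimal sketch of the class-and-mix-in approach, under assumed
;;; names: the EXAMPLE- classes below are hypothetical and only
;;; illustrate how a statistical category can be mixed into a class
;;; that carries other information.

(defclass example-variable ()
  ((name :initarg :name :accessor example-name))
  (:documentation "Hypothetical base class: a named variable."))

(defclass example-ordered-mixin ()
  ((ordering :initarg :ordering :initform nil
             :accessor example-ordering))
  (:documentation "Hypothetical mix-in: adds an ordering of levels."))

(defclass example-ordinal-variable
    (example-ordered-mixin example-variable)
  ()
  (:documentation "Hypothetical class combining base and mix-in."))

;; Usage:
;; (example-ordering
;;  (make-instance 'example-ordinal-variable
;;                 :name "dose" :ordering '(low mid high)))
;; returns (LOW MID HIGH)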

;;; STATISTICAL VARIABLES SHOULD BE XARRAY'd

;; Need to distinguish between empirical and model-based realizations
;; of variables. Do we balance by API design, or should we ensure that
;; one or the other is more critical (via naming convention of adding
;; description to class name)?


(defclass empirical-statistical-variable
    ()
  ((number-of-observations :initform 0
                           :initarg :nobs
                           :accessor nobs
                           ;; :type generalized-sequence ; sequence or
                           ;; array
                           :documentation "number of statistically
  independent observations in the current context (assuming design,
  marginalization, and conditioning to create the current dataset
  from which this variable came)."))
  (:documentation "basic class indicating that we are working with a
  statistical variable (arising from an actual set of observations or
  a virtual / hypothesized set)."))

(defclass modelbased-statistical-variable
    ()
  ((density/mass-function :initform nil
                          :initarg :pdmf
                          :accessor pdmf
                          :type (or null function) ; initform is nil
                          :documentation "core function indicating
  probability of a set of observations")
   (draw-function :initform nil
                  :initarg :drawf
                  :accessor draw ; must match cl-random API
                  :type (or null function) ; initform is nil
                  :documentation "function for drawing an observation,
  should take an optional RV arg for selecting the stream to draw
  from."))
  (:documentation "model-based statistical variables have observations
  which come from a model. Core information is simply how to
  compute probabilities, and how to draw a new realization. All
  else should be derivable from these two. Possibly we need
  additional metadata for working with these?"))
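
;;; A minimal sketch of the two functions such a variable carries,
;;; for a Bernoulli(p) model. EXAMPLE-BERNOULLI-PDMF and
;;; EXAMPLE-BERNOULLI-DRAW are hypothetical helpers, not part of
;;; cl-random; the optional argument follows the slot documentation
;;; above and is assumed here to be a standard CL random-state.

(defun example-bernoulli-pdmf (p)
  "Return a function giving the Bernoulli(P) probability of a
sequence of independent 0/1 observations."
  (lambda (observations)
    ;; product over observations of p (for 1) or 1-p (for 0)
    (reduce #'* observations
            :key (lambda (x) (if (= x 1) p (- 1 p))))))

(defun example-bernoulli-draw (p)
  "Return a function drawing one 0/1 observation from Bernoulli(P)."
  (lambda (&optional (rv *random-state*))
    (if (< (random 1.0 rv) p) 1 0)))

;; Usage, with the class defined above:
;; (make-instance 'modelbased-statistical-variable
;;                :pdmf (example-bernoulli-pdmf 1/4)
;;                :drawf (example-bernoulli-draw 1/4))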

(defclass categorical-statistical-variable
    (statistical-variable)
  ((factor-levels :initform nil
                  :initarg :factor-levels
                  :accessor factor-levels
                  :type sequence
                  :documentation "the possible levels which the
  variable may take. These should be a (possibly proper) superset
  of the actual current levels observed in the variable.")))

(defclass nominal-statistical-variable
    (categorical-statistical-variable)
  ()
  (:documentation "currently identical to categorical variable, no
  true difference from the most general state."))

(defclass ordinal-statistical-variable
    (nominal-statistical-variable)
  ((ordering :initform nil
             :initarg :ordering
             :accessor ordering
             :type sequence
             :documentation "levels are completely ordered, and this
  should be an ordered sequence (prefer array/vector?) of unique
  levels. (do we need a partially ordered variant?)"))
  (:documentation "categorical variable whose levels are completely ordered."))
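
;;; A minimal sketch of how a complete ordering can be used, assuming
;;; the ordering is stored lowest level first. EXAMPLE-LEVEL< is a
;;; hypothetical helper, not part of the package.

(defun example-level< (a b ordering &key (test #'eql))
  "True if level A comes strictly before level B in ORDERING.
Signals an error if either level is not in ORDERING."
  (let ((ia (position a ordering :test test))
        (ib (position b ordering :test test)))
    (unless (and ia ib)
      (error "Level ~S or ~S is not in ordering ~S." a b ordering))
    (< ia ib)))

;; e.g. (example-level< 'low 'high '(low mid high)) is true, and
;; (sort (list 'high 'low 'mid)
;;       (lambda (a b) (example-level< a b '(low mid high))))
;; returns (LOW MID HIGH)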

(defclass continuous-statistical-variable
    (statistical-variable)
  ((support :initform nil
            :accessor support
            :type (or sequence (eql t)) ; t is a documented value
            :documentation "Support is used in the sense of
  probability support, and should be a range, list of ranges, t
  (indicating whole space), or nil (indicating measure-0 space)."))
  (:documentation "empirical characteristics for a continuous
  statistical variable"))
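
;;; A minimal sketch of a membership test for the four kinds of
;;; support named in the slot documentation above. The representation
;;; is an assumption: a range is taken to be a two-element list
;;; (LOW HIGH) with closed endpoints, and a list of ranges is a list
;;; of such lists. EXAMPLE-IN-SUPPORT-P is a hypothetical helper.

(defun example-in-support-p (x support)
  "True if the real number X lies in SUPPORT."
  (cond ((eq support t) t)              ; whole space
        ((null support) nil)            ; measure-0 space
        ((realp (first support))        ; a single range (low high)
         (<= (first support) x (second support)))
        (t                              ; a list of ranges
         (some (lambda (range) (example-in-support-p x range))
               support))))

;; e.g. (example-in-support-p 1/2 '(0 1)) is true, and
;; (example-in-support-p 3/2 '((0 1) (2 3))) is false.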


(defmethod print-object ((object statistical-variable) stream)
  "Need to work through how to print various objects. Statvars don't
  necessarily have data yet!"
  (print-unreadable-object (object stream :type t)
    (format stream "nobs=~d" (nobs object))))

(defmethod print-object ((object categorical-statistical-variable) stream)
  "Need to work through how to print various objects. Statvars don't
  necessarily have data yet! Here, we should print out the stat-var
  information (pass to superclass), and then print out factor levels if
  short enough (exact class). Useful to review methods-mixing for
  this."
  (print-unreadable-object (object stream :type t)
    (format stream "nobs=~d" (nobs object))
    ;; leading space keeps the two fields apart in the output
    (format stream " levels=~A" (factor-levels object))))