From 953ae3b42df8d65bc9f471a5cc9a4d132729fec8 Mon Sep 17 00:00:00 2001
From: Petr Baudis
Date: Thu, 11 Mar 2010 21:31:11 +0100
Subject: [PATCH] tex: Data Mining cleanups all over the map

 tex/gostyle.tex | 132 ++++++++++++++++++++++++++++----------------------------
 1 file changed, 66 insertions(+), 66 deletions(-)
diff --git a/tex/gostyle.tex b/tex/gostyle.tex
index c89e231..e6bd684 100644
--- a/tex/gostyle.tex
+++ b/tex/gostyle.tex
@@ -450,65 +450,69 @@ only moves played by the appropriate color in the game are collected.
\section{Data Mining}
\label{datamining}
To assess the properties of the gathered pattern vectors
and their influence on playing styles,
we process the data by several basic data mining techniques.
The first two methods {\em (analytic)} rely purely on data gathered
from the game collection
and serve to show internal structure and correlations within the data set.
Principal Component Analysis finds orthogonal vector components that
have the largest variance.
Reversing the process can indicate which patterns correlate with each component.
Additionally, PCA can be used as a vector preprocessing step for methods
that are negatively sensitive to pattern vector component correlations.
The second method, Kohonen Maps,
is based on the theory of self-organizing maps of abstract units (neurons) that
compete against each other for the representation of the input space.
Because neurons in the network are organized in a two-dimensional plane,
the trained network spreads the vectors on a 2D plane,
allowing for visualization of clusters of players with similar properties.
Furthermore, we use two \emph{classification} methods that assign
each pattern vector $\vec P$ an \emph{output vector} $\vec O$,
representing e.g.~information about styles, player's strength or even
meta-information like the player's era or country of origin.
Initially, the methods must be calibrated (trained) on some prior knowledge,
usually in the form of \emph{reference pairs} of pattern vectors
and the associated output vectors.
Moreover, the reference set can be divided into training and testing pairs
and the methods can be compared by the mean square error on the testing data set
(the difference between the output vectors approximated by the method and their real desired values).
%\footnote{However, note that discrete characteristics such as country of origin are
%not very feasible to use here, since WHAT??? is that even true?? }
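To illustrate this comparison in practice, the following minimal sketch (in Python with numpy;
the \texttt{fit}/\texttt{predict} interface and the helper name are only illustrative, not the
actual gostyle code) splits the reference pairs and measures the mean square error on the held-out part:
\begin{verbatim}
import numpy as np

def mse_on_testing_set(classifier, P, O, train_fraction=0.8, seed=0):
    # P: reference pattern vectors (one per row), O: their known output vectors
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(P))
    cut = int(train_fraction * len(P))
    train, test = idx[:cut], idx[cut:]
    classifier.fit(P[train], O[train])        # calibration on the training pairs
    approx = classifier.predict(P[test])      # approximated output vectors
    return np.mean((approx - O[test]) ** 2)   # mean square error on the testing set
\end{verbatim}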
The $k$-Nearest Neighbors \cite{CoverHart1967} classifier
approximates $\vec O$ by composing the output vectors
of the $k$ reference pattern vectors closest to $\vec P$.
The other classifier is a~multilayer feedforward Artificial Neural Network:
the neural network can learn correlations between input and output vectors
and generalize the ``knowledge'' to unknown vectors; it can be more flexible
in the interpretation of different pattern vector elements and discern more
complex relations than the kNN classifier,
but may not be as stable and requires a larger training sample.
\subsection{Principal Component Analysis}
\label{pca}
We use Principal Component Analysis (\emph{PCA}) \cite{Jolliffe1986}
to reduce the dimensions of the pattern vectors while preserving
as much information as possible, assuming that the interdependencies between
pattern vector dimensions are linear.
Briefly, PCA is an eigenvalue decomposition of a~covariance matrix of centered pattern vectors,
producing a~linear mapping $o$ from the $n$-dimensional vector space
to a~reduced $m$-dimensional vector space.
The $m$ eigenvectors of the original vectors' covariance matrix
with the largest eigenvalues are used as the base of the reduced vector space;
the eigenvectors form the projection matrix $W$.
For each original pattern vector $\vec p_i$,
we obtain its new representation $\vec r_i$ in the PCA base
as shown in the following equation:
\begin{equation}
@@ -521,7 +525,7 @@ The whole process is described in the Algorithm \ref{alg:pca}.
\caption{PCA -- Principal Component Analysis}
\begin{algorithmic}
\label{alg:pca}
\REQUIRE{$m > 0$, set of players $R$ with pattern vectors $p_r$}
\STATE $\vec \mu \leftarrow 1/|R| \cdot \sum_{r \in R}{\vec p_r}$
\FOR{ $r \in R$}
\STATE $\vec p_r \leftarrow \vec p_r - \vec \mu$
@@ -539,10 +543,11 @@ The whole process is described in the Algorithm \ref{alg:pca}.
\end{algorithm}
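For illustration, the centering and eigendecomposition steps of Algorithm \ref{alg:pca}
can be sketched in a few lines of numpy (an illustrative re-implementation,
not the code of our framework):
\begin{verbatim}
import numpy as np

def pca_projection(P, m):
    # P: one pattern vector per row; m: number of dimensions to keep
    mu = P.mean(axis=0)
    C = np.cov((P - mu).T)                # covariance matrix of the centered vectors
    eigval, eigvec = np.linalg.eigh(C)    # eigendecomposition (C is symmetric)
    order = np.argsort(eigval)[::-1][:m]  # indices of the m largest eigenvalues
    W = eigvec[:, order]                  # projection matrix W
    R = (P - mu) @ W                      # new representations r_i in the PCA base
    return W, mu, R
\end{verbatim}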
\label{pearson}
We want to find correlations between PCA dimensions and
some prior knowledge (player rank, style vector).
For this purpose, we compute the well-known
{\em Pearson product-moment correlation coefficient} \cite{Pearson},
measuring the strength of the linear dependence%
\footnote{A desirable property of PMCC is that it is invariant to translations and rescaling
of the vectors.}
between the dimensions:
@@ -551,26 +556,26 @@ $$ r_{X,Y} = {{\rm cov}(X,Y) \over \sigma_X \sigma_Y} $$
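As a small concrete check of this coefficient (here between one PCA dimension and the
players' ranks; the numbers are made up for illustration), numpy's \texttt{corrcoef}
gives the same value:
\begin{verbatim}
import numpy as np

pca_dim = np.array([0.8, -0.1, 0.4, -0.6])  # one PCA dimension, one value per player
ranks   = np.array([7.0,  2.0, 5.0,  1.0])  # prior knowledge, e.g. numeric ranks
r = np.corrcoef(pca_dim, ranks)[0, 1]       # Pearson product-moment correlation
\end{verbatim}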
\subsection{Kohonen Maps}
\label{koh}
A Kohonen map is a self-organizing network with neurons spread evenly over a~two-dimensional plane.
Neurons $\vec n$ in the map compete for representation of portions of the input vector space,
each vector being represented by some neuron.
The network is trained so that the neurons
that are topologically close tend to represent vectors that are close in a suitable metric as well.
First, a~randomly initialized network is sequentially trained;
in each iteration, we choose a~random training vector $\vec t$
and find the {\em winner neuron} $\vec w$ that is closest to $\vec t$ in the Euclidean metric.
We then adapt the neurons $\vec n$ from the neighborhood of $\vec w$ employing the equation
\begin{equation}
\vec n = \vec n + \alpha \cdot \mathit{Influence}(\vec w, \vec n) \cdot (\vec t - \vec n)
\end{equation}
where $\alpha$ is a learning parameter, usually decreasing in time.
$\mathit{Influence}()$ is a function that forces the neurons to spread.
Such a function is usually realised using a Mexican hat function or a difference-of-Gaussians
\cite{TODO}.
The state of the network can be evaluated by calculating the mean square difference
between each $\vec t \in T$ and its corresponding winner neuron $\vec w_t$:
\begin{equation}
\mathit{Error}(N,T) = \sum_{\vec t \in T}{||\vec w_t - \vec t||^2}
\end{equation}
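A minimal sketch of one training iteration and of this error evaluation follows
(numpy; the Gaussian neighbourhood is just one possible choice of the
$\mathit{Influence}$ function, not necessarily the one we use):
\begin{verbatim}
import numpy as np

def som_iteration(neurons, grid, t, alpha, radius):
    # neurons: (units x dim) weight vectors, grid: (units x 2) positions in the plane
    w = np.argmin(np.linalg.norm(neurons - t, axis=1))     # winner neuron (Euclidean)
    d2 = np.sum((grid - grid[w]) ** 2, axis=1)             # squared distance on the map
    influence = np.exp(-d2 / (2.0 * radius ** 2))          # one possible Influence()
    neurons += alpha * influence[:, None] * (t - neurons)  # n <- n + alpha*Infl*(t - n)
    return neurons

def som_error(neurons, T):
    # squared distance of each training vector to its winner neuron, summed over T
    return sum(np.min(np.sum((neurons - t) ** 2, axis=1)) for t in T)
\end{verbatim}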
@@ -600,9 +605,10 @@ between each $\vec t \in T$ and its corresponding \emph{winner neuron} $\vec w_t
\subsection{k-Nearest Neighbors Classifier}
\label{knn}
Our goal is to approximate the player's output vector $\vec O$,
knowing his pattern vector $\vec P$.
We further assume that similarities in players' pattern vectors
uniformly correlate with similarities in players' output vectors.
We require a set of reference players $R$ with known \emph{pattern vectors} $\vec p_r$
and \emph{output vectors} $\vec o_r$.
@@ -610,7 +616,7 @@ and \emph{output vectors} $\vec o_r$.
$\vec O$ is approximated as a~weighted average of \emph{output vectors}
$\vec o_i$ of $k$ players with \emph{pattern vectors} $\vec p_i$ closest to $\vec P$.
This is illustrated in Algorithm \ref{alg:knn}.
Note that the weight is a function of distance and is not explicitly defined in Algorithm \ref{alg:knn}.
During our research, an exponentially decreasing weight has proven to be sufficient.
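Besides the pseudocode in Algorithm \ref{alg:knn}, the approximation can be sketched
in numpy as follows (the exact weighting function is a design choice; here we use an
exponentially decreasing weight, which is an illustrative form, not necessarily the exact one we use):
\begin{verbatim}
import numpy as np

def knn_estimate(P, ref_patterns, ref_outputs, k=5):
    # ref_patterns: pattern vectors p_r of the reference players (one per row),
    # ref_outputs:  the corresponding known output vectors o_r
    d = np.linalg.norm(ref_patterns - P, axis=1)  # distances to the queried vector P
    nearest = np.argsort(d)[:k]                   # k closest reference players
    w = np.exp(-d[nearest])                       # exponentially decreasing weights
    w /= w.sum()
    return w @ ref_outputs[nearest]               # weighted average of output vectors
\end{verbatim}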
\begin{algorithm}
@@ -632,11 +638,9 @@ During our research, exponentially decreasing weight has proven to be sufficient
\subsection{Neural Network Classifier}
\label{neuralnet}
Feedforward neural networks \cite{ANN} are known for their ability to generalize
and find correlations between input patterns and output classifications.
Before use, the network is iteratively trained on the training data
(again consisting of pairs of \emph{pattern vectors} as input and \emph{output vectors})
until the error on the training set is reasonably small.
%Neural network is an adaptive system that must undergo a training
@@ -644,7 +648,8 @@ until the error on the training set is reasonably small.
%of reference vectors for the k-Nearest Neighbors algorithm above.
\subsubsection{Computation and activation of the NN}
Technically, the neural network is a network of interconnected
computational units called neurons.
A feedforward neural network has a layered topology;
it usually has one \emph{input layer}, one \emph{output layer}
and an arbitrary number of \emph{hidden layers} between.
@@ -671,19 +676,17 @@ $\sigma(x)=\frac{1}{1+e^{-(rx+k)}}$; parameters control the growth rate ($r$)
and the x-position ($k$).}
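To make the layered computation concrete, a toy forward pass with the sigmoid activation
might look as follows (purely illustrative, not the actual implementation):
\begin{verbatim}
import numpy as np

def sigmoid(x, r=1.0, k=0.0):
    return 1.0 / (1.0 + np.exp(-(r * x + k)))  # growth rate r, x-position k

def forward(p, layers):
    # layers: list of (W, b) weight matrices and biases, one per non-input layer
    a = p
    for W, b in layers:
        a = sigmoid(W @ a + b)  # each neuron: sigmoid of a weighted sum of its inputs
    return a                    # activation of the output layer ~ output vector O
\end{verbatim}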
\subsubsection{Training}
Training of the feedforward neural network usually involves some
modification of the supervised Backpropagation learning algorithm.
We use a~first-order optimization algorithm called RPROP \cite{Riedmiller1993}.
%Because the \emph{reference set} is usually not very large,
%we have devised a simple method for its extension.
%This enhancement is based upon adding random linear combinations
%of \emph{style and pattern vectors} to the training set.
As outlined above, the training set $T$ consists of
$(\vec p_i, \vec o_i)$ pairs.
The training algorithm is shown in Algorithm \ref{alg:tnn}.
\begin{algorithm}
@@ -711,20 +714,17 @@ The training algorithm is shown in Algorithm \ref{alg:tnn}.
\end{algorithmic}
\end{algorithm}
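For orientation, the core of the RPROP weight update mentioned above (sign-based adaptation of a
separate step size per weight) can be sketched as follows; this is a schematic illustration of
the rule with the commonly recommended constants, not the libfann implementation we actually use:
\begin{verbatim}
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    # w, grad, prev_grad, step: arrays of the same shape, one entry per weight
    s = grad * prev_grad
    step = np.where(s > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(s < 0, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(s < 0, 0.0, grad)  # suppress the update after a sign change
    w = w - np.sign(grad) * step       # only the sign of the gradient is used
    return w, grad, step
\end{verbatim}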
\subsection{Implementation}
We have implemented the data mining methods as the
``gostyle'' open-source framework \cite{GoStyle},
made available under the GNU GPL licence.
The majority of our basic processing and the analysis parts
are implemented in the Python \cite{Python2005} programming language.
We use several external libraries, most notably the MDP library \cite{MDP} (used for the PCA analysis)
and the Kohonen library \cite{KohonenPy}.
The neural network part of the project is written using the libfann C library \cite{Nissen2003}.
\section{Strength Estimator}

--
2.10.5.GIT