From 953ae3b42df8d65bc9f471a5cc9a4d132729fec8 Mon Sep 17 00:00:00 2001
From: Petr Baudis
Date: Thu, 11 Mar 2010 21:31:11 +0100
Subject: [PATCH] tex: Data Mining cleanups all over the map

 tex/gostyle.tex | 132 ++++++++++++++++++++++++++++----------------------------
 1 file changed, 66 insertions(+), 66 deletions(-)
diff --git a/tex/gostyle.tex b/tex/gostyle.tex
index c89e231..e6bd684 100644
--- a/tex/gostyle.tex
+++ b/tex/gostyle.tex
@@ -450,65 +450,69 @@ only moves played by the appropriate color in the game are collected.
\section{Data Mining}
\label{datamining}
To assess the properties of the gathered pattern vectors
and their influence on playing styles,
we process the data by several basic data mining techniques.
The first two methods {\em (analytic)} rely purely on data gathered
from the game collection
and serve to show internal structure and correlations within the data set.
Principal Component Analysis finds orthogonal vector components that
have the largest variance.
Reversing the process can indicate which patterns correlate with each component.
Additionally, PCA can be used as a vector preprocessing step for methods
that are negatively sensitive to pattern vector component correlations.
The second method, Kohonen Maps,
is based on the theory of self-organizing maps of abstract units (neurons) that
compete against each other for the representation of the input space.
Because neurons in the network are organized in a two-dimensional plane,
the trained network spreads the vectors on a 2D plane,
allowing for visualization of clusters of players with similar properties.
Furthermore, we use two \emph{classification} methods that assign
each pattern vector $\vec P$ an \emph{output vector} $\vec O$,
representing e.g.~information about styles, player's strength or even
meta-information like the player's era or country of origin.
Initially, the methods must be calibrated (trained) on some prior knowledge,
usually in the form of \emph{reference pairs} of pattern vectors
and the associated output vectors.
Moreover, the reference set can be divided into training and testing pairs
and the methods can be compared by the mean square error on the testing data set
(the difference between the output vectors approximated by the method and their real desired values).
%\footnote{However, note that discrete characteristics such as country of origin are
%not very feasible to use here, since WHAT??? is that even true?? }
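To illustrate this comparison in practice, the following minimal sketch (in Python with numpy;
the \texttt{fit}/\texttt{predict} interface and the helper name are only illustrative, not the
actual gostyle code) splits the reference pairs and measures the mean square error on the held-out part:
\begin{verbatim}
import numpy as np

def mse_on_testing_set(classifier, P, O, train_fraction=0.8, seed=0):
    # P: reference pattern vectors (one per row), O: their known output vectors
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(P))
    cut = int(train_fraction * len(P))
    train, test = idx[:cut], idx[cut:]
    classifier.fit(P[train], O[train])        # calibration on the training pairs
    approx = classifier.predict(P[test])      # approximated output vectors
    return np.mean((approx - O[test]) ** 2)   # mean square error on the testing set
\end{verbatim}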
The $k$-Nearest Neighbors \cite{CoverHart1967} classifier
approximates $\vec O$ by composing the output vectors
of the $k$ reference pattern vectors closest to $\vec P$.
The other classifier is a~multilayer feedforward Artificial Neural Network:
the neural network can learn correlations between input and output vectors
and generalize the ``knowledge'' to unknown vectors; it can be more flexible
in the interpretation of different pattern vector elements and discern more
complex relations than the kNN classifier,
but may not be as stable and requires a larger training sample.
\subsection{Principal Component Analysis}
\label{pca}
We use Principal Component Analysis (\emph{PCA}) \cite{Jolliffe1986}
to reduce the dimensions of the pattern vectors while preserving
as much information as possible, assuming that the interdependencies between
pattern vector dimensions are linear.
Briefly, PCA is an eigenvalue decomposition of a~covariance matrix of centered pattern vectors,
producing a~linear mapping $o$ from the $n$-dimensional vector space
to a~reduced $m$-dimensional vector space.
The $m$ eigenvectors of the original vectors' covariance matrix
with the largest eigenvalues are used as the base of the reduced vector space;
the eigenvectors form the projection matrix $W$.
For each original pattern vector $\vec p_i$,
we obtain its new representation $\vec r_i$ in the PCA base
as shown in the following equation:
\begin{equation}
@@ -521,7 +525,7 @@ The whole process is described in the Algorithm \ref{alg:pca}.
\caption{PCA -- Principal Component Analysis}
\begin{algorithmic}
\label{alg:pca}
\REQUIRE{$m > 0$, set of players $R$ with pattern vectors $p_r$}
\STATE $\vec \mu \leftarrow 1/|R| \cdot \sum_{r \in R}{\vec p_r}$
\FOR{ $r \in R$}
\STATE $\vec p_r \leftarrow \vec p_r - \vec \mu$
@@ -539,10 +543,11 @@ The whole process is described in the Algorithm \ref{alg:pca}.
\end{algorithm}
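For illustration, the centering and eigendecomposition steps of Algorithm \ref{alg:pca}
can be sketched in a few lines of numpy (an illustrative re-implementation,
not the code of our framework):
\begin{verbatim}
import numpy as np

def pca_projection(P, m):
    # P: one pattern vector per row; m: number of dimensions to keep
    mu = P.mean(axis=0)
    C = np.cov((P - mu).T)                # covariance matrix of the centered vectors
    eigval, eigvec = np.linalg.eigh(C)    # eigendecomposition (C is symmetric)
    order = np.argsort(eigval)[::-1][:m]  # indices of the m largest eigenvalues
    W = eigvec[:, order]                  # projection matrix W
    R = (P - mu) @ W                      # new representations r_i in the PCA base
    return W, mu, R
\end{verbatim}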
\label{pearson}
We want to find correlations between PCA dimensions and
some prior knowledge (player rank, style vector).
For this purpose, we compute the well-known
{\em Pearson product-moment correlation coefficient} \cite{Pearson},
measuring the strength of the linear dependence%
\footnote{A desirable property of PMCC is that it is invariant to translations and rescaling
of the vectors.}
between the dimensions:
@@ -551,26 +556,26 @@ $$ r_{X,Y} = {{\rm cov}(X,Y) \over \sigma_X \sigma_Y} $$
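As a small concrete check of this coefficient (here between one PCA dimension and the
players' ranks; the numbers are made up for illustration), numpy's \texttt{corrcoef}
gives the same value:
\begin{verbatim}
import numpy as np

pca_dim = np.array([0.8, -0.1, 0.4, -0.6])  # one PCA dimension, one value per player
ranks   = np.array([7.0,  2.0, 5.0,  1.0])  # prior knowledge, e.g. numeric ranks
r = np.corrcoef(pca_dim, ranks)[0, 1]       # Pearson product-moment correlation
\end{verbatim}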
\subsection{Kohonen Maps}
\label{koh}
A Kohonen map is a self-organizing network with neurons spread evenly over a~two-dimensional plane.
Neurons $\vec n$ in the map compete for representation of portions of the input vector space,
each vector being represented by some neuron.
The network is trained so that the neurons
that are topologically close tend to represent vectors that are close in a suitable metric as well.
First, a~randomly initialized network is sequentially trained;
in each iteration, we choose a~random training vector $\vec t$
and find the {\em winner neuron} $\vec w$ that is closest to $\vec t$ in the Euclidean metric.
We then adapt the neurons $\vec n$ from the neighborhood of $\vec w$ employing the equation
\begin{equation}
\vec n = \vec n + \alpha \cdot \mathit{Influence}(\vec w, \vec n) \cdot (\vec t - \vec n)
\end{equation}
where $\alpha$ is a learning parameter, usually decreasing in time.
$\mathit{Influence}()$ is a function that forces the neurons to spread.
Such a function is usually realised using a Mexican hat function or a difference-of-Gaussians
\cite{TODO}.
The state of the network can be evaluated by calculating the mean square difference
between each $\vec t \in T$ and its corresponding winner neuron $\vec w_t$:
\begin{equation}
\mathit{Error}(N,T) = \sum_{\vec t \in T}{||\vec w_t - \vec t||^2}
\end{equation}
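A minimal sketch of one training iteration and of this error evaluation follows
(numpy; the Gaussian neighbourhood is just one possible choice of the
$\mathit{Influence}$ function, not necessarily the one we use):
\begin{verbatim}
import numpy as np

def som_iteration(neurons, grid, t, alpha, radius):
    # neurons: (units x dim) weight vectors, grid: (units x 2) positions in the plane
    w = np.argmin(np.linalg.norm(neurons - t, axis=1))     # winner neuron (Euclidean)
    d2 = np.sum((grid - grid[w]) ** 2, axis=1)             # squared distance on the map
    influence = np.exp(-d2 / (2.0 * radius ** 2))          # one possible Influence()
    neurons += alpha * influence[:, None] * (t - neurons)  # n <- n + alpha*Infl*(t - n)
    return neurons

def som_error(neurons, T):
    # squared distance of each training vector to its winner neuron, summed over T
    return sum(np.min(np.sum((neurons - t) ** 2, axis=1)) for t in T)
\end{verbatim}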
@@ -600,9 +605,10 @@ between each $\vec t \in T$ and its corresponding \emph{winner neuron} $\vec w_t
\subsection{k-Nearest Neighbors Classifier}
\label{knn}
Our goal is to approximate the player's output vector $\vec O$,
knowing his pattern vector $\vec P$.
We further assume that similarities in players' pattern vectors
uniformly correlate with similarities in players' output vectors.
We require a set of reference players $R$ with known \emph{pattern vectors} $\vec p_r$
and \emph{output vectors} $\vec o_r$.
@@ -610,7 +616,7 @@ and \emph{output vectors} $\vec o_r$.
$\vec O$ is approximated as a~weighted average of \emph{output vectors}
$\vec o_i$ of $k$ players with \emph{pattern vectors} $\vec p_i$ closest to $\vec P$.
This is illustrated in Algorithm \ref{alg:knn}.
Note that the weight is a function of distance and is not explicitly defined in Algorithm \ref{alg:knn}.
During our research, an exponentially decreasing weight has proven to be sufficient.
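Besides the pseudocode in Algorithm \ref{alg:knn}, the approximation can be sketched
in numpy as follows (the exact weighting function is a design choice; here we use an
exponentially decreasing weight, which is an illustrative form, not necessarily the exact one we use):
\begin{verbatim}
import numpy as np

def knn_estimate(P, ref_patterns, ref_outputs, k=5):
    # ref_patterns: pattern vectors p_r of the reference players (one per row),
    # ref_outputs:  the corresponding known output vectors o_r
    d = np.linalg.norm(ref_patterns - P, axis=1)  # distances to the queried vector P
    nearest = np.argsort(d)[:k]                   # k closest reference players
    w = np.exp(-d[nearest])                       # exponentially decreasing weights
    w /= w.sum()
    return w @ ref_outputs[nearest]               # weighted average of output vectors
\end{verbatim}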
\begin{algorithm}
@@ -632,11 +638,9 @@ During our research, exponentially decreasing weight has proven to be sufficient
\subsection{Neural Network Classifier}
\label{neuralnet}
Feedforward neural networks \cite{ANN} are known for their ability to generalize
and find correlations between input patterns and output classifications.
Before use, the network is iteratively trained on the training data
(again consisting of pairs of \emph{pattern vectors} as input and \emph{output vectors})
until the error on the training set is reasonably small.
%Neural network is an adaptive system that must undergo a training
@@ -644,7 +648,8 @@ until the error on the training set is reasonably small.
%of reference vectors for the k-Nearest Neighbors algorithm above.
\subsubsection{Computation and activation of the NN}
Technically, the neural network is a network of interconnected
computational units called neurons.
A feedforward neural network has a layered topology;
it usually has one \emph{input layer}, one \emph{output layer}
and an arbitrary number of \emph{hidden layers} between.
@@ -671,19 +676,17 @@ $\sigma(x)=\frac{1}{1+e^{-(rx+k)}}$; parameters control the growth rate ($r$)
and the x-position ($k$).}
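To make the layered computation concrete, a toy forward pass with the sigmoid activation
might look as follows (purely illustrative, not the actual implementation):
\begin{verbatim}
import numpy as np

def sigmoid(x, r=1.0, k=0.0):
    return 1.0 / (1.0 + np.exp(-(r * x + k)))  # growth rate r, x-position k

def forward(p, layers):
    # layers: list of (W, b) weight matrices and biases, one per non-input layer
    a = p
    for W, b in layers:
        a = sigmoid(W @ a + b)  # each neuron: sigmoid of a weighted sum of its inputs
    return a                    # activation of the output layer ~ output vector O
\end{verbatim}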
\subsubsection{Training}
Training of the feedforward neural network usually involves some
modification of the supervised Backpropagation learning algorithm.
We use a~first-order optimization algorithm called RPROP \cite{Riedmiller1993}.
%Because the \emph{reference set} is usually not very large,
%we have devised a simple method for its extension.
%This enhancement is based upon adding random linear combinations
%of \emph{style and pattern vectors} to the training set.
As outlined above, the training set $T$ consists of
$(\vec p_i, \vec o_i)$ pairs.
The training algorithm is shown in Algorithm \ref{alg:tnn}.
\begin{algorithm}
@@ -711,20 +714,17 @@ The training algorithm is shown in Algorithm \ref{alg:tnn}.
\end{algorithmic}
\end{algorithm}
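For orientation, the core of the RPROP weight update mentioned above (sign-based adaptation of a
separate step size per weight) can be sketched as follows; this is a schematic illustration of
the rule with the commonly recommended constants, not the libfann implementation we actually use:
\begin{verbatim}
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    # w, grad, prev_grad, step: arrays of the same shape, one entry per weight
    s = grad * prev_grad
    step = np.where(s > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(s < 0, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(s < 0, 0.0, grad)  # suppress the update after a sign change
    w = w - np.sign(grad) * step       # only the sign of the gradient is used
    return w, grad, step
\end{verbatim}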
\subsection{Implementation}
We have implemented the data mining methods as the
``gostyle'' open-source framework \cite{GoStyle},
made available under the GNU GPL licence.
The majority of our basic processing and the analysis parts
are implemented in the Python \cite{Python2005} programming language.
We use several external libraries, most notably the MDP library \cite{MDP} (used for the PCA analysis)
and the Kohonen library \cite{KohonenPy}.
The neural network part of the project is written using the libfann C library \cite{Nissen2003}.
\section{Strength Estimator}

--
2.10.5.GIT