From 953ae3b42df8d65bc9f471a5cc9a4d132729fec8 Mon Sep 17 00:00:00 2001
From: Petr Baudis
Date: Thu, 11 Mar 2010 21:31:11 +0100
Subject: [PATCH] tex: Data Mining cleanups all over the map

---
 tex/gostyle.tex | 132 ++++++++++++++++++++++++++++----------------------------
 1 file changed, 66 insertions(+), 66 deletions(-)

diff --git a/tex/gostyle.tex b/tex/gostyle.tex
index c89e231..e6bd684 100644
--- a/tex/gostyle.tex
+++ b/tex/gostyle.tex
@@ -450,65 +450,69 @@ only moves played by the appropriate color in the game are collected.
 
 \section{Data Mining}
 \label{data-mining}
-To assess the properties of gathered \emph{pattern vectors}
+To assess the properties of gathered pattern vectors
 and their influence on playing styles,
-we have processes the data using a~few basic data minining techniques.
+we process the data by several basic data mining techniques.
 
-The first two methods ({\em analytic}) rely purely on data gathered
+The first two methods {\em (analytic)} rely purely on data gathered
 from the game collection
 and serve to show internal structure and correlations within the data set.
 
-Principal component analysis finds orthogonal vector components that
+Principal Component Analysis finds orthogonal vector components that
 have the largest variance.
 Reversing the process can indicate which patterns correlate with each component.
-Additionally, PCA can be used as a vector-preprocessing for methods
-that are (negatively) sensitive to \emph{pattern vector} component correlations.
+Additionally, PCA can be used as vector preprocessing for methods
+that are negatively sensitive to pattern vector component correlations.
 
-A~second method -- Kohonen maps -- is based on the theory of self-organizing maps
-of abstract units (neurons) that
+The~second method of Kohonen Maps
+is based on the theory of self-organizing maps of abstract units (neurons) that
 compete against each other for the representation of the input space.
 Because neurons in the network are organized in a two-dimensional plane,
-the trained network virtually spreads vectors to the 2D plane,
-allowing for simple visualization of clusters of players with ``similar properties''.
+the trained network spreads the vectors on a 2D plane,
+allowing for visualization of clusters of players with similar properties.
 
-Furthermore, we have used two \emph{classification} methods that assign
-each \emph{pattern vector} $\vec P$ some additional data (\emph{output vector} $\vec O$),
-representing e.g.~information about styles, player's strength or even a country of origin.
-Initially, the methods must be nonetheless calibrated (trained) on some expert or prior knowledge,
-usually in the form of pairs of \emph{reference pattern vectors} and their \emph{output vectors}.
+Furthermore, we use two \emph{classification} methods that assign
+each pattern vector $\vec P$ an \emph{output vector} $\vec O$,
+representing e.g.~information about styles, player's strength or even
+meta-information like the player's era or a country of origin.
+Initially, the methods must be calibrated (trained) on some prior knowledge,
+usually in the form of \emph{reference pairs} of pattern vectors
+and the associated output vectors.
 Moreover, the reference set can be divided into training and testing pairs
-and the methods can be compared by the square error on testing data set (difference of
-\emph{output vectors} approximated by the method and their real desired value).
+and the methods can be compared by the mean square error on the testing data set
+(difference of output vectors approximated by the method and their real desired value).
 
 %\footnote{However, note that dicrete characteristics such as country of origin are
 %not very feasible to use here, since WHAT??? is that even true??
 }
 
-$k$-Nearest Neighbor \cite{CoverHart1967} classifier (the first method)
-approximates $\vec O$ by composing the \emph{output vectors}
-of $k$ \emph{reference pattern vectors} closest to $\vec P$.
+The $k$-Nearest Neighbors \cite{CoverHart1967} classifier
+approximates $\vec O$ by composing the output vectors
+of $k$ reference pattern vectors closest to $\vec P$.
 
-The other classifier is based on a~multi-layer feed-forward Artificial Neural Network:
+The other classifier is a~multi-layer feed-forward Artificial Neural Network:
 the neural network can learn correlations between input and output vectors
 and generalize the ``knowledge'' to unknown vectors; it can be more flexible
 in the interpretation of different pattern vector elements and discern more
-complex relations than the kNN classifier, but e.g.~requires larger training sample.
+complex relations than the kNN classifier,
+but may not be as stable and requires a larger training sample.
 
 \subsection{Principal Component Analysis}
 \label{data-mining}
 We use Principal Component Analysis \emph{PCA} \cite{Jolliffe1986}
-to reduce the dimensions of the \emph{pattern vectors} while preserving
-as much information as possible.
+to reduce the dimensions of the pattern vectors while preserving
+as much information as possible, assuming inter-dependencies between
+pattern vector dimensions are linear.
 
-Briefly, PCA is an eigenvalue decomposition of a~covariance matrix of centered \emph{pattern vectors},
+Briefly, PCA is an eigenvalue decomposition of a~covariance matrix of centered pattern vectors,
 producing a~linear mapping $o$ from $n$-dimensional vector space
 to a~reduced $m$-dimensional vector space.
 The $m$ eigenvectors of the original vectors' covariance matrix
 with the largest eigenvalues are used as the base of the reduced vector space;
-the eigenvectors form the transformation matrix $W$.
+the eigenvectors form the projection matrix $W$.
-For each original \emph{pattern vector} $\vec p_i$,
+For each original pattern vector $\vec p_i$,
 we obtain its new representation $\vec r_i$ in the PCA base
 as shown in the following equation:
 
@@ -521,7 +525,7 @@ The whole process is described in the Algorithm \ref{alg:pca}.
 \caption{PCA -- Principal Component Analysis}
 \begin{algorithmic}
 \label{alg:pca}
-\REQUIRE{$m > 0$, set of players $R$ with \emph{pattern vectors} $p_r$}
+\REQUIRE{$m > 0$, set of players $R$ with pattern vectors $p_r$}
 \STATE $\vec \mu \leftarrow 1/|R| \cdot \sum_{r \in R}{\vec p_r}$
 \FOR{ $r \in R$}
 \STATE $\vec p_r \leftarrow \vec p_r - \vec \mu$
@@ -539,10 +543,11 @@ The whole process is described in the Algorithm \ref{alg:pca}.
 \end{algorithm}
 
 \label{pearson}
-We will want to find correlations between PCA dimensions and
+We want to find correlations between PCA dimensions and
 some prior knowledge (player rank, style vector).
-We compute the well-known {\em Pearson product-moment correlation coefficient} \cite{Pearson}
-values for this purpose, measuring the strength of the linear dependence%
+For this purpose, we compute the well-known
+{\em Pearson product-moment correlation coefficient} \cite{Pearson},
+measuring the strength of the linear dependence%
 \footnote{A desirable property of PMCC is that it is invariant to translations and rescaling
 of the vectors.}
 between the dimensions:
@@ -551,26 +556,26 @@ $$r_{X,Y} = {{\rm cov}(X,Y) \over \sigma_X \sigma_Y}$$
 
 \subsection{Kohonen Maps}
 \label{koh}
-Kohonen map is a self-organizing network with neurons organized in a~two-dimensional plane.
-Neurons in the map compete for representation of portions of the input vector space.
-Each neuron $\vec n$ represents a vector and the network is trained so that the neurons
-that are topologically close tend to represent vectors that are close as well.
+A Kohonen map is a self-organizing network with neurons spread evenly over a~two-dimensional plane.
+Neurons $\vec n$ in the map compete for representation of portions of the input vector space,
+each vector being represented by some neuron.
+The network is trained so that the neurons
+that are topologically close tend to represent vectors that are close in a suitable metric as well.
 
 First, a~randomly initialized network is sequentially trained;
 in each iteration, we choose a~random training vector $\vec t$
-and find the neuron $\vec w$ that is closest to $\vec t$ in Euclidean metric
-(we call $\vec w$ a~\emph{winner}).
+and find the {\em winner neuron} $\vec w$ that is closest to $\vec t$ in Euclidean metric.
 
-We then adapt neurons $n$ from the neighbourhood of $\vec w$ employing an equation:
+We then adapt neurons $n$ from the neighborhood of $\vec w$ employing the equation
 $$
 \vec n = \vec n + \alpha \cdot \mathit{Influence}(\vec w, \vec n) \cdot (\vec t - \vec n)
 $$
 where $\alpha$ is a learning parameter, usually decreasing in time.
 $Influence()$ is a function that forces neurons to spread.
 Such function is usually realised using a mexican hat function or a difference-of-gaussians
-(see \cite{TODO} for details).
+\cite{TODO}.
 
 The state of the network can be evaluated by calculating mean square difference
-between each $\vec t \in T$ and its corresponding \emph{winner neuron} $\vec w_t$:
+between each $\vec t \in T$ and its corresponding winner neuron $\vec w_t$:
 
 $$\mathit{Error}(N,T) = \sum_{\vec t \in T}{|\vec w_t - \vec t|}$$
@@ -600,9 +605,10 @@ between each $\vec t \in T$ and its corresponding \emph{winner neuron} $\vec w_t
 
 \subsection{k-nearest Neighbors Classifier}
 \label{knn}
-Our goal is to approximate player's \emph{output vector} $\vec O$; we know his \emph{pattern vector} $\vec P$.
-We further assume that similarities in players' \emph{pattern vectors}
-uniformly correlate with similarities in players' \emph{output vectors}.
+Our goal is to approximate a player's output vector $\vec O$;
+we know his pattern vector $\vec P$.
+We further assume that similarities in players' pattern vectors
+uniformly correlate with similarities in players' output vectors.
 
 We require a set of reference players $R$ with known \emph{pattern vectors} $\vec p_r$
 and \emph{output vectors} $\vec o_r$.
@@ -610,7 +616,7 @@ and \emph{output vectors} $\vec o_r$.
 $\vec O$ is approximated as a~weighted average of \emph{output vectors} $\vec o_i$
 of $k$ players with \emph{pattern vectors} $\vec p_i$ closest to $\vec P$.
 This is illustrated in the Algorithm \ref{alg:knn}.
-Note that the weight is a function of distance and it is not explicitly defined in Algorithm \ref{alg:knn}.
+Note that the weight is a function of distance and is not explicitly defined in Algorithm \ref{alg:knn}.
 During our research, exponentially decreasing weight has proven to be sufficient.
 
 \begin{algorithm}
@@ -632,11 +638,9 @@ During our research, exponentially decreasing weight has proven to be sufficient
 
 \subsection{Neural Network Classifier}
 \label{neural-net}
-Feedforward neural networks \cite{TODO} are known for their ability to generalize
-and find correlations and patterns between input and output data, working as a classifier.
-
+Feed-forward neural networks \cite{ANN} are known for their ability to generalize
+and find correlations between input patterns and output classifications.
 Before use, the network is iteratively trained on the training data
-(again consisting of pairs of \emph{pattern vectors} as input and \emph{output vectors})
 until the error on the training set is reasonably small.
 
 %Neural network is an adaptive system that must undergo a training
@@ -644,7 +648,8 @@ until the error on the training set is reasonably small.
 %of reference vectors for the k-Nearest Neighbors algorithm above.
 
 \subsubsection{Computation and activation of the NN}
-Technically, the neural network is a network of interconnected computational units called neurons.
+Technically, the neural network is a network of interconnected
+computational units called neurons.
 A feedforward neural network has a layered topology;
 it usually has one \emph{input layer}, one \emph{output layer}
 and an arbitrary number of \emph{hidden layers} between.
@@ -671,19 +676,17 @@ $\sigma(x)=\frac{1}{1+e^{-(rx+k)}}$; parameters control
 the growth rate ($r$) and the x-position ($k$).}
 
 \subsubsection{Training}
-The training of the feed-forward neural network usually involves some
-modification of supervised Backpropagation learning algorithm. \cite{TODO}
-We use first-order optimization algorithm called RPROP \cite{Riedmiller1993}.
+Training of the feed-forward neural network usually involves some
+modification of supervised Backpropagation learning algorithm.
+We use a first-order optimization algorithm called RPROP \cite{Riedmiller1993}.
 
 %Because the \emph{reference set} is usually not very large,
 %we have devised a simple method for its extension.
 %This enhancement is based upon adding random linear combinations
 %of \emph{style and pattern vectors} to the training set.
 
-As outlined above, the training set $T$ consists of pairs of
-input vectors (\emph{pattern vectors} $\vec p_i)$ and
-desired \emph{output vectors} $\vec o_i$.
-
+As outlined above, the training set $T$ consists of
+$(\vec p_i, \vec o_i)$ pairs.
 The training algorithm is shown in Algorithm \ref{alg:tnn}.
 
 \begin{algorithm}
@@ -711,20 +714,17 @@ The training algorithm is shown in Algorithm \ref{alg:tnn}.
 \end{algorithmic}
 \end{algorithm}
 
-\subsubsection{Architecture details}
-TODO num layers, num neurons, ..
-TODO patri to vubec sem, spise ne
-
 \subsection{Implementation}
 
-We have implemented the data mining methods as an open-source framework ``gostyle'' \cite{TODO},
+We have implemented the data mining methods as the
+``gostyle'' open-source framework \cite{GoStyle},
 made available under the GNU GPL licence.
 
-The majority of our basic processing and the analysis parts are implemented in the Python \cite{Python2005} programming language.
-
-Nonetheless, we use a number of external libraries, such as the MDP library \cite{MDP} (used for PCA analysis),
-Kohonen library \cite{KohonenPy}.
-The neural network part of the project is written using the excellent libfann C library\cite{Nissen2003}.
+The majority of our basic processing and the analysis parts
+are implemented in the Python \cite{Python2005} programming language.
+We use several external libraries, most notably the MDP library \cite{MDP} (used for PCA analysis)
+and the Kohonen library \cite{KohonenPy}.
+The neural network part of the project is written using the libfann C library \cite{Nissen2003}.
 
 \section{Strength Estimator}
-- 
2.10.5.GIT
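
Note on the k-nearest Neighbors hunk (@@ -610,7 +616,7 @@): the text states that the weight is a function of distance, that it is not defined in Algorithm alg:knn, and that an exponentially decreasing weight proved sufficient. The weighted average it describes might be sketched in Python as below. This is only an illustration under stated assumptions: Euclidean distance, a weight of exp(-decay * distance), and the names `knn_estimate` and `decay` are all hypothetical and are not taken from the gostyle sources.

```python
import math


def knn_estimate(pattern, reference, k=3, decay=1.0):
    """Approximate an output vector O for a pattern vector P.

    reference: list of (pattern_vector, output_vector) pairs with known outputs.
    decay: hypothetical rate of the assumed weight exp(-decay * distance).
    """
    if not reference:
        raise ValueError("reference set must not be empty")
    # Euclidean distance from P to every reference pattern vector (assumed metric).
    scored = sorted(
        ((math.dist(pattern, p), o) for p, o in reference),
        key=lambda pair: pair[0],
    )
    nearest = scored[:k]
    # Exponentially decreasing weights; shifting by the smallest distance keeps
    # the largest weight at 1.0 and does not change the normalized average.
    d_min = nearest[0][0]
    weights = [math.exp(-decay * (d - d_min)) for d, _ in nearest]
    total = sum(weights)
    dims = len(nearest[0][1])
    # Weighted average of the k nearest output vectors, one component at a time.
    return [
        sum(w * o[i] for w, (_, o) in zip(weights, nearest)) / total
        for i in range(dims)
    ]
```

With `k=1` the sketch simply returns the output vector of the single closest reference player; larger `k` blends neighbors, with closer ones dominating as `decay` grows.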