From d283be7bddb2101b6d0edd311ecadd90415fc632 Mon Sep 17 00:00:00 2001
From: Petr Baudis
Date: Tue, 9 Mar 2010 15:46:14 +0100
Subject: [PATCH] tex: Style, grammar corrections and other minor changes all
 over the map

---
 tex/gostyle.tex | 209 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 123 insertions(+), 86 deletions(-)

diff --git a/tex/gostyle.tex b/tex/gostyle.tex
index 1fc9397..f3a6efa 100644
--- a/tex/gostyle.tex
+++ b/tex/gostyle.tex
@@ -435,7 +435,7 @@ not statistically significant.
 We have chosen an intuitive and simple approach inspired by pattern features
 used when computing ELO ratings for candidate patterns in Computer Go play. \cite{ELO}
 Each pattern is a~combination of several {\em pattern features}
-(name-value pairs) matched at the position of the played move.
+(name--value pairs) matched at the position of the played move.
 We use these features:
 
 \begin{itemize}
@@ -450,7 +450,7 @@ We use these features:
 The spatial patterns are normalized (using a dictionary)
 to be always black-to-play and maintain translational and rotational symmetry.
-Configurations with diameter between 2 and 9 in the gridcular metric%
+Configurations of radius between 2 and 9 in the gridcular metric%
 \footnote{The {\em gridcular} metric $d(x,y) = |\delta x| + |\delta y| + \max(|\delta x|, |\delta y|)$
 defines a circle-like structure on the Go board square grid. \cite{SpatPat} }
@@ -462,8 +462,10 @@ We have implemented the data extraction by making use of the pattern features
 matching implementation within the Pachi go-playing program (TODO).
 We extract information on players by converting the SGF game
 records to GTP (TODO) stream that feeds Pachi's {\tt patternscan}
-engine which outputs a~single patternspec per move. We can then gather
-all encountered patternspecs belonging to a~given player and summarize
+engine which outputs a~single {\em patternspec} (a~string representation
+of the particular combination of pattern features) per move.
+
+We can then gather all patternspecs played by a~given player and summarize
 them; the $\vec p$ vector then consists of normalized counts of
 the given $n$ most frequent patternspecs.
@@ -471,44 +473,60 @@ the given $n$ most frequent patternspecs.
 
 \section{Data Mining}
 \label{data-mining}
 
-To assess the properties of gathered \emph{pattern vectors} and their influence on playing styles,
-we have analysed the data by a~few basic data minining techniques.
-
-First two methods rely purely on data gathered from the GoGoD database. Principal component
-analysis finds orthogonal vector components that have biggest variance. Reversing the process then
-indicates which patterns correlate with each style. Additionally, PCA can be used as a vector-preprocessing
-for methods that are negatively sensitive to \emph{pattern vector} component correlations.
-
-A~second method -- Kohonen's networks -- is based on the theory of self-organizing maps of neurons that
-compete against each other for representation of the input space. Because neurons in the network are
-organized in a two-dimensional plane, the trained network virtually spreads vectors to the 2D plane,
-allowing for simple visualization.
-
-In addition to methods operating just upon the set of \emph{pattern vectors}, we have used and compared two methods
-that approximate an unknown \emph{style vector} $\vec S$ of a player with known \emph{pattern vector} $\vec P$.
-
-First of them is called $k$-Nearest Neighbor (kNN) classification.
-Simpy said kNN approximates $\vec S$ by the ``average'' of \emph{style vectors} of $k$ nearest \emph{pattern vectors}.
-The last method is based on the theory of Neural Networks -- networks of artificial neurons, that are used
-for their generalization abilities. Neural network can learn correlations between input and output vectors and
-generalize the ``knowledge'' to unknown vectors sufficiently good.
-
-TODO rozdelit na algo/results??
+To assess the properties of gathered \emph{pattern vectors}
+and their influence on playing styles,
+we have processed the data using a~few basic data mining techniques.
+
+The first two methods ({\em analytic}) rely purely on data gathered
+from the game collection
+and serve to show the internal structure and correlations within the data set.
+Principal component analysis finds orthogonal vector components that
+have the largest variance.
+Reversing the process then indicates which patterns correlate with each style.
+Additionally, PCA can be used as a vector-preprocessing step for methods
+that are negatively sensitive to \emph{pattern vector} component correlations.
+
+The second method -- Kohonen maps -- is based on the theory of self-organizing maps of abstract neurons that
+compete against each other for representation of the input space.
+Because neurons in the network are organized in a two-dimensional plane,
+the trained network virtually spreads vectors to the 2D plane,
+allowing for simple visualization of clusters of players with similar styles.
+
+TODO: style vector -> output vector?
+
+Furthermore, we have used and compared two \emph{classification} methods
+that approximate a~well-defined but unknown \emph{style vector} $\vec S$
+based on an input \emph{pattern vector} $\vec P$.
+The methods are calibrated based on expert or prior knowledge about
+the training pattern vectors, and their error is then measured on a testing
+set of pattern vectors.
+
+One of the methods is the $k$-Nearest Neighbor (kNN) classifier:
+we approximate $\vec S$ by combining the \emph{style vectors} of the $k$ nearest \emph{pattern vectors}.
+The other is based on a multi-layer feed-forward Artificial Neural Network:
+the neural network can learn correlations between input and output vectors
+and generalize the ``knowledge'' to unknown vectors; it can be more flexible
+in the interpretation of different pattern vector elements and discern more
+complex relations than the kNN classifier, but it requires, e.g., a~larger training sample.
+
+TODO: Does the last sentence make any sense?!
 
 \subsection{Principal Component Analysis}
 \label{data-mining}
 
-Principal Component Analysis \emph{PCA} \cite{Jolliffe1986} is a~method we use to reduce the dimensions of
-\emph{pattern vectors} while preserving as much information as possible.
-
-Shortly, PCA is an eigenvalue decomposition of a~covariance matrix of centered \emph{pattern vectors}.
-It can be thought of as a~mapping $o$ from $n$-dimensional vector space to a~reduced $m$-dimensional vector space.
-The base of this reduced vector space comprises $m$ eigenvectors of original vectors' covariance matrix.
-We choose them to be the eigenvectors with biggest eigenvalues.
-Ordered by decreasing eigenvalues, the eigenvectors form rows of the transformation matrix $W$.
-
-Finally, we represent reduced \emph{pattern vectors} as a vector of coeficients of this eigenvector-base.
-For each original \emph{pattern vector} $\vec p_i$, we obtain its new representation $\vec r_i$ as shown
-in the following equation:
+We use Principal Component Analysis \emph{PCA} \cite{Jolliffe1986}
+to reduce the dimensions of the \emph{pattern vectors} while preserving
+as much information as possible.
+
+Briefly, PCA is an eigenvalue decomposition of a~covariance matrix of centered \emph{pattern vectors},
+producing a~linear mapping $o$ from $n$-dimensional vector space
+to a~reduced $m$-dimensional vector space.
+The $m$ eigenvectors of the original vectors' covariance matrix
+with the largest eigenvalues are used as the base of the reduced vector space;
+the eigenvectors form the rows of the transformation matrix $W$.
+
+For each original \emph{pattern vector} $\vec p_i$,
+we obtain its new representation $\vec r_i$ in the PCA base
+as shown in the following equation:
 \begin{equation}
 \vec r_i = W \cdot \vec p_i
 \end{equation}
@@ -516,7 +534,7 @@ in the following equation:
 The whole process is described in the Algorithm \ref{alg:pca}.
 
 \begin{algorithm}
-\caption{PCA -- Principal Component analysis}
+\caption{PCA -- Principal Component Analysis}
 \begin{algorithmic}
 \label{alg:pca}
 \REQUIRE{$m > 0$, set of players $R$ with \emph{pattern vectors} $p_r$}
@@ -528,8 +546,8 @@ The whole process is described in the Algorithm \ref{alg:pca}.
 \STATE $\mathit{Cov}[i,j] \leftarrow 1/|R| \cdot \sum_{r \in R}{\vec p_{ri} \cdot \vec p_{rj}}$
 \ENDFOR
 \STATE Compute Eigenvalue Decomposition of $\mathit{Cov}$ matrix
-\STATE Get $m$ biggest eigenvalues
-\STATE According eigenvectors ordered by decreasing eigenvalues form rows of matrix $W$
+\STATE Get the $m$ largest eigenvalues
+\STATE The corresponding eigenvectors, ordered by decreasing eigenvalues, form the rows of matrix $W$
 \FOR{ $r \in R$}
 \STATE $\vec r_r\leftarrow W \vec p_r$
 \ENDFOR
@@ -538,21 +556,27 @@ The whole process is described in the Algorithm \ref{alg:pca}.
 
 \subsection{Kohonen Maps}
 \label{koh}
 
-Kohonen map is a self-organizing network with neurons organized in a two-dimensional plane.
-Neurons in the map compete for representation of input vector space. Each neuron $\vec n$ represents a vector
-and the network is trained so that the neurons that are topologically close tend to represent vectors that
-are close as well. That is realised as follows. Initially random network is sequentially trained;
-in each iterations we choose a random training vector $\vec t$ and find neuron $\vec w$ that is closest
-to $\vec t$ in Euclidean metric (we call $\vec w$ a \emph{winner neuron}).
+A Kohonen map is a self-organizing network with neurons spread over a two-dimensional plane.
+Neurons in the map compete for representation of portions of the input vector space.
+Each neuron $\vec n$ represents a vector
+and the network is trained so that the neurons that are topologically close
+tend to represent vectors that are close as well.
+
+This is realised as follows: a randomly initialized network is sequentially trained;
+in each iteration, we choose a random training vector $\vec t$
+and find the neuron $\vec w$ that is closest to $\vec t$ in Euclidean metric
+(we call $\vec w$ a \emph{winner neuron}).
 We then adapt neurons from the neighbourhood of $\vec w$ employing an equation:
 \begin{equation}
 \vec n = \vec n + \alpha \cdot \mathit{Influence}(\vec w, \vec n) \cdot (\vec t - \vec n)
 \end{equation}
-where $\alpha$ is a learning parameter, usually decreasing in time. $Influence()$ is a function that forces neurons
-to spread. Such function is usually realised using a mexican hat function or a difference-of-gaussians (see \cite{TODO}
-for details). A state of the network can be valued by calculating mean square difference between each $\vec t \in T$
-and its corresponding \emph{winner neuron} $\vec w_t$:
+where $\alpha$ is a learning parameter, usually decreasing in time.
+$Influence()$ is a function that forces neurons to spread.
+Such a~function is usually realised using a Mexican hat function or a difference of Gaussians
+(see \cite{TODO} for details).
+The state of the network can be evaluated by calculating the mean square difference
+between each $\vec t \in T$ and its corresponding \emph{winner neuron} $\vec w_t$:
 \begin{equation}
 \mathit{Error}(N,T) = \sum_{\vec t \in T}{|\vec w_t - \vec t|}
 \end{equation}
@@ -582,17 +606,18 @@ and its corresponding \emph{winner neuron} $\vec w_t$:
 
 \subsection{k-nearest Neighbors Classifier}
 \label{knn}
 
-K-nearest neigbors is an essential classification technique.
-We use it to approximate player's \emph{style vector} $\vec S$, assuming that his \emph{pattern vector} $\vec P$ is known.
-To achieve this, we utilize \emph{reference style vectors} (see section \ref{style-vectors}).
-
-The idea is based on a assumption that similarities in players' \emph{pattern vectors}
-correlate with similarities in players' \emph{style vectors}. We try to approximate $\vec S$
-as a weighted average of \emph{style vectors}
+Our goal is to approximate the player's \emph{style vector} $\vec S$
+based on their \emph{pattern vector} $\vec P$.
+To achieve this, we require prior knowledge of \emph{reference style vectors}
+(see section \ref{style-vectors}).
+
+In this method, we assume that similarities in players' \emph{pattern vectors}
+uniformly correlate with similarities in players' \emph{style vectors}.
+We try to approximate $\vec S$ as a weighted average of \emph{style vectors}
 $\vec s_i$ of $k$ players with \emph{pattern vectors} $\vec p_i$ closest to $\vec P$.
 This is illustrated in the Algorithm \ref{alg:knn}.
 Note that the weight is a function of distance and it is not explicitly defined in Algorithm \ref{alg:knn}.
-During our research, exponentialy decreasing weight has proven to be sufficient.
+During our research, an exponentially decreasing weight has proven to be sufficient.
 
 \begin{algorithm}
 \caption{k-Nearest Neighbors}
@@ -613,45 +638,57 @@ During our research, exponentialy decreasing weight has proven to be sufficient.
 
 \subsection{Neural Network Classifier}
 \label{neural-net}
 
-As an alternative to the k-Nearest Neigbors algorithm (section \ref{knn}), we have used
-a classificator based on feed-forward artificial neural networks \cite{TODO}.
-Neural networks (NN) are known for their ability to generalize and find correlations and patterns between
-input and output data. Neural network is an adaptive system and it
-must undergo a certain training before it can be reasonably used. Basically, we use
-information for \emph{reference players} (for which either \emph{pattern vectors} and
-\emph{style vectors} are known) as training data.
+As an alternative to the k-Nearest Neighbors algorithm (section \ref{knn}),
+we have used a classifier based on feed-forward artificial neural networks \cite{TODO}.
+Neural networks (NN) are known for their ability to generalize
+and find correlations and patterns between input and output data.
+A neural network is an adaptive system that must undergo a training
+period before it can be reasonably used, similarly to the requirement
+of reference vectors for the k-Nearest Neighbors algorithm above.
 
 \subsubsection{Computation and activation of the NN}
 Technically, neural network is a network of interconnected computational units called neurons.
-A feedforward neural network has a layered topology; it usually has one \emph{input layer}, one \emph{output
-layer} and an arbitrary number of \emph{hidden layers} between.
+A feedforward neural network has a layered topology;
+it usually has one \emph{input layer}, one \emph{output layer}
+and an arbitrary number of \emph{hidden layers} in between.
 
 Each neuron $i$ is connected to all neurons in the previous layer and each connection has its weight $w_{ij}$
 
-The computation proceeds in a discrete time steps.
-In the first step, \emph{activation} of neurons in the \emph{input layer} is set according to the \emph{input vector}.
-Then, we iteratively compute output of each neuron in next layer until the output layer is reached. The activity of
-output layer is then presented as a result.
+The computation proceeds in discrete time steps.
+In the first step, the neurons in the \emph{input layer}
+are \emph{activated} according to the \emph{input vector}.
+Then, we iteratively compute the output of each neuron in the next layer
+until the output layer is reached.
+The activity of the output layer is then presented as the result.
 
-The activation $y_i$ of neuron $i$ from the layer $I$ is computed using the following equation:
+The activation $y_i$ of neuron $i$ from the layer $I$ is computed as
 \begin{equation}
 y_i = f(\sum_{j \in J}{w_{ij} y_j})
 \end{equation}
-where $J$ is a previous layer, while $y_j$ is the activation for neurons from $J$ layer. Function $f()$ is
-called \emph{activation function} and its purpose is to bound outputs of neurons. A typical example of an activation
-function is a sigmoid function.\footnote{The sigmoid function is a special case of the logistic function; it is defined by the formula
-$\sigma(x)=\frac{1}{1+e^{-(rx+k)}}$, parameters control the growth rate ($r$) and the x-position ($k$).}
+where $J$ is the previous layer, while $y_j$ is the activation of neurons from layer $J$.
+Function $f()$ is the so-called \emph{activation function}
+and its purpose is to bound the outputs of neurons.
+A typical example of an activation function is the sigmoid function.%
+\footnote{A special case of the logistic function, defined by the formula
+$\sigma(x)=\frac{1}{1+e^{-(rx+k)}}$; parameters control the growth rate ($r$)
+and the x-position ($k$).}
 
 \subsubsection{Training}
-The training of the feedforward neural network usually involves some
+The training of the feed-forward neural network usually involves some
 modification of supervised Backpropagation learning algorithm. \cite{TODO}
 We use first-order optimization algorithm called RPROP \cite{Riedmiller1993}.
 
-Because the \emph{reference set} is not usually very large, we have devised a simple method for its extension.
-This enhancement is based upon adding random linear combinations of \emph{style and pattern vectors} to the training set.
+Because the \emph{reference set} is usually not very large,
+we have devised a simple method for its extension.
+This enhancement is based upon adding random linear combinations
+of \emph{style and pattern vectors} to the training set.
+
+TODO: Is this an original idea?
-As insinuated above, the training set consist of pairs of input vectors (\emph{pattern vectors}) and
-desired output vectors (\emph{style vectors}). The training set $T$ is then enlarged by adding linear combinations:
+As outlined above, the training set consists of pairs of
+input vectors (\emph{pattern vectors}) and
+desired output vectors (\emph{style vectors}).
+The training set $T$ is then extended by adding the linear combinations:
 \begin{equation}
 T_\mathit{base} = \{(\vec p_r, \vec s_r) | r \in R\}\\
 \end{equation}
@@ -659,8 +696,8 @@
 T_\mathit{ext} = \{(\vec p, \vec s) | \exists D \subseteq R : \vec p = \sum_{d \in D}{g_d \vec p_d}, \vec s = \sum_{d \in D}{g_d \vec s_d}\}
 \end{equation}
 TODO zabudovat $g_d$ dovnitr?
-where $g_d, d \in D$ are random coeficients, so that $\sum_{d \in D}{g_d} = 1$. The training set
-is then constructed as:
+where $g_d, d \in D$ are random coefficients, so that $\sum_{d \in D}{g_d} = 1$.
+The training set is then constructed as:
 \begin{equation}
 T = T_\mathit{base} \cup \mathit{SomeFiniteSubset}(T_\mathit{ext})
 \end{equation}
@@ -700,7 +737,7 @@ TODO num layers, num neurons, ..
 
 \subsection{Implementation}
 
 We have implemented the data mining methods as an open-source project ``gostyle'' \cite{TODO},
-licensed under GNU GPL.
+made available under the GNU GPL licence.
 
 PCA: In our implementation, we use a~library called MDP \cite{MDP}.
 TODO libfann
-- 
2.11.4.GIT
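
For readers who want to experiment with the PCA reduction described in the patch above (Algorithm alg:pca), a minimal sketch in Python with NumPy follows. This is an illustration only, not the gostyle/MDP implementation; the function and variable names are made up.

    # Illustrative sketch of Algorithm alg:pca (not the gostyle/MDP code); assumes NumPy.
    import numpy as np

    def pca_reduce(pattern_vectors, m):
        """Reduce n-dimensional pattern vectors to m dimensions; rows of W are eigenvectors."""
        P = np.asarray(pattern_vectors, dtype=float)   # shape (|R|, n)
        P = P - P.mean(axis=0)                         # center the pattern vectors
        cov = P.T @ P / P.shape[0]                     # covariance matrix, shape (n, n)
        eigvals, eigvecs = np.linalg.eigh(cov)         # symmetric matrix -> real eigendecomposition
        order = np.argsort(eigvals)[::-1][:m]          # indices of the m largest eigenvalues
        W = eigvecs[:, order].T                        # eigenvectors as rows, ordered by eigenvalue
        return P @ W.T, W                              # r_i = W . p_i for every player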
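The Kohonen map update rule from the "Kohonen Maps" subsection can be sketched similarly. Here a Gaussian neighbourhood stands in for the Influence() function, and the grid size, learning-rate schedule and iteration count are arbitrary illustrative choices, not values used in gostyle.

    # Minimal sketch of Kohonen map training; assumes NumPy.
    import numpy as np

    def train_som(T, grid=(8, 8), dim=10, iters=2000, alpha0=0.5, sigma0=3.0, seed=0):
        """T: array of training vectors, shape (num_vectors, dim)."""
        rng = np.random.default_rng(seed)
        neurons = rng.random((grid[0], grid[1], dim))             # randomly initialized network
        gy, gx = np.mgrid[0:grid[0], 0:grid[1]]
        for it in range(iters):
            t = T[rng.integers(len(T))]                           # random training vector
            dist = np.linalg.norm(neurons - t, axis=2)            # Euclidean distance to every neuron
            wy, wx = np.unravel_index(dist.argmin(), dist.shape)  # winner neuron
            alpha = alpha0 * (1 - it / iters)                     # learning parameter decreasing in time
            sigma = sigma0 * (1 - it / iters) + 0.5
            grid_d2 = (gy - wy) ** 2 + (gx - wx) ** 2
            influence = np.exp(-grid_d2 / (2 * sigma ** 2))       # neighbourhood of the winner
            neurons += alpha * influence[..., None] * (t - neurons)
        return neurons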
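The kNN approximation of Algorithm alg:knn with an exponentially decreasing weight fits in a few lines as well; the decay constant and the default value of k below are illustrative assumptions, not values taken from the paper.

    # Sketch of the kNN style approximation with an exponentially decreasing weight; assumes NumPy.
    import numpy as np

    def knn_style(P, ref_patterns, ref_styles, k=4, decay=10.0):
        """Approximate the style vector S for pattern vector P from the k nearest reference players."""
        ref_patterns = np.asarray(ref_patterns, dtype=float)
        ref_styles = np.asarray(ref_styles, dtype=float)
        d = np.linalg.norm(ref_patterns - np.asarray(P, dtype=float), axis=1)
        nearest = np.argsort(d)[:k]               # k closest pattern vectors
        w = np.exp(-decay * d[nearest])           # weight as a function of distance
        w /= w.sum()
        return w @ ref_styles[nearest]            # weighted average of their style vectors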
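The feed-forward activation y_i = f(sum_j w_ij y_j) with a sigmoid f can be sketched as below. The layer sizes and random weights are placeholders for a network that would actually be trained (e.g. with RPROP through libfann); none of these numbers come from the paper.

    # Sketch of the feed-forward pass from the "Computation and activation of the NN" subsection.
    import numpy as np

    def sigmoid(x, r=1.0, k=0.0):
        return 1.0 / (1.0 + np.exp(-(r * x + k)))

    def forward(layers, x):
        """layers: list of weight matrices, one per non-input layer; x: input (pattern) vector."""
        y = np.asarray(x, dtype=float)
        for W in layers:
            y = sigmoid(W @ y)        # y_i = f(sum_j w_ij * y_j)
        return y                      # activity of the output layer

    rng = np.random.default_rng(0)
    net = [rng.normal(size=(20, 400)), rng.normal(size=(6, 20))]  # 400-d pattern vector -> 6-d output; sizes are illustrative
    print(forward(net, rng.random(400)))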
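Finally, a sketch of the training-set extension by random linear combinations with sum(g_d) = 1. Drawing the coefficients from a Dirichlet distribution is one convenient way to satisfy that constraint; this choice, the subset size and the helper name are assumptions of the sketch, not something stated in the text.

    # Sketch of the training-set extension T = T_base + random linear combinations; assumes NumPy.
    import numpy as np

    def extend_training_set(patterns, styles, n_extra, subset_size=3, seed=0):
        rng = np.random.default_rng(seed)
        patterns = np.asarray(patterns, dtype=float)
        styles = np.asarray(styles, dtype=float)
        extra_p, extra_s = [], []
        for _ in range(n_extra):
            D = rng.choice(len(patterns), size=subset_size, replace=False)
            g = rng.dirichlet(np.ones(subset_size))    # g_d >= 0, sum_d g_d = 1
            extra_p.append(g @ patterns[D])            # sum_d g_d * p_d
            extra_s.append(g @ styles[D])              # sum_d g_d * s_d
        return np.vstack([patterns, extra_p]), np.vstack([styles, extra_s])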