From 30a5ff575d420de4ba01747d83ecf5553f90b691 Mon Sep 17 00:00:00 2001
From: Joe Moudrik <j.moudrik@gmail.com>
Date: Wed, 5 Jun 2013 15:45:33 +0200
Subject: [PATCH] clanek: results and typocheck

---
 PAPERS/clanek_go_congress/clanek.bib |   1 +
 PAPERS/clanek_go_congress/clanek.tex | 135 +++++++++++++++++++++++++++--------
 2 files changed, 107 insertions(+), 29 deletions(-)

diff --git a/PAPERS/clanek_go_congress/clanek.bib b/PAPERS/clanek_go_congress/clanek.bib
index 2f2deaa..b2443bd 100644
--- a/PAPERS/clanek_go_congress/clanek.bib
+++ b/PAPERS/clanek_go_congress/clanek.bib
@@ -3,6 +3,7 @@
 @TECHREPORT{style_quest,
   author = "{M}oud{\v{r}\'{i}}k, {J}osef and {B}audi{\v{s}}, {P}etr",
   title = "Style Consensus: Style of Professional Players, Judged by Strong Players",
+  url = "http://gostyle.j2m.cz/FILES/style_consensus_27-05-2013.pdf",
   year = "2013",
   month = "May",
 }
diff --git a/PAPERS/clanek_go_congress/clanek.tex b/PAPERS/clanek_go_congress/clanek.tex
index b621189..588bd41 100644
--- a/PAPERS/clanek_go_congress/clanek.tex
+++ b/PAPERS/clanek_go_congress/clanek.tex
@@ -95,7 +95,7 @@ The professional games have also been used in computer Go;
 patterns from the professional games
 are used as a heuristic to improve the tree
 searching, e.g.~\citep{PatElo}. Apart from these, we are not aware of
-any other uses.
+any principally other uses.
 
 Following up our initial research \citep{GoStyleArxiv},
 we present a deeper approach. We extract different
@@ -125,7 +125,7 @@ Section~\ref{sec:disc} discusses applications and future work.
 \section{Feature Extraction}
 \label{sec:feat}
 This section presents the methods for extracting the evaluation 
-vector (call it $ev$) from a set of games. Because we should
+vector (we call it $ev$) from a set of games. Because we should
 distinguish between both players in any particular game,
 each game in the
 set is accompanied by the color which specifies our player of
@@ -141,9 +141,8 @@ Some of the explanations are simplified to fit the size of
 the paper, please see \citep{Moudrik13} for precise details and algorithms.
 
 \subsection{Raw Game Processing}
-Firstly, we need to specify how do we process the games.
-\footnote{
-We use the standard \emph{.sgf} file format as input, \cite{SGF}.
+Firstly, we need to specify how do we process the games.\footnote{
+    We use the standard \emph{.sgf} file format as input, \cite{SGF}.
 }
 We have used the Pachi Go
 Engine~\citep{Pachi} which -- apart
@@ -157,7 +156,7 @@ Pachi outputs a list of key-value pairs regarding the current move:
     \item \textbf{atari escape flag} --- whether the move saved own stones from atari,
     \item \textbf{capture} --- number of enemy stones the move captured,
     \item \textbf{contiguity to last move} --- the gridcular
-        distance \footnotemark{\ref{grid}} from the last move,
+        distance\footnotemark[2] from the last move,
     \item \textbf{board edge distance} --- the distance from
         the nearest edge of the board,
     \item \textbf{spatial pattern} --- configuration of stones around the played move.
@@ -176,9 +175,10 @@ Spatial patterns of sizes 2 to 6 are regarded.
 \subsection{Patterns}
 The first feature collects a statistics of $N = 400$ most frequently ocurring
 spatial patterns (together with both atari flags). The list of the $N$ most frequently
-played patterns is computed beforehand from the whole database of games.
+played patterns is computed beforehand from the whole database of games. The patterns
+are normalized to be black to play and to be invariant under rotation and mirroring.
 
-Given a set of of colored games $GC$ we then count how many times was each of the $N$
+Given a set of colored games $GC$ we then count how many times was each of the $N$
 patterns played -- thus obtaining a vector $c$ of counts ($|c| = 400$).
 With simple occurences count however, particular counts $c_i$ increase proportionally to 
 number of games in $GC$. To maintain invariancy under the number of games in the sample,
@@ -193,9 +193,7 @@ for human players, we restricted ourselves to what we call $\omega$-local (sente
 and gote) sequences. The simplification has a clear assumption -- the responses to
 a sente move are always local. We say, that a move is $\omega$-local (with respect
 to the previous move) if its gridcular distance from previous move
-is smaller than a fixed number $\omega$ (in this work, we used $\omega=10$\footnote{
-    Of course, different values are possible, see \citep{Moudrik13} for details.
-}).
+is smaller than a fixed number $\omega$; in this work, we used $\omega=10$.
 Of course, this assumption might not always hold, but
 the feature proves to be useful nonetheless.
 
@@ -228,8 +226,9 @@ $$ByDist = \{ \langle1, 2\rangle, \langle 3 \rangle, \langle4\rangle, \langle 5,
 higher plays for the rest).
 
 If we use the $ByMoves$ and $ByDist$ intervals to divide the domains, we obtain a histogram
-of total $|ByMoves| * |ByDist| = 16$ field. For each move, we increase the count in the
-appropriate histogram fields.  In the end, the whole histogram is normalized
+of total $|ByMoves| * |ByDist| = 16$ fields. For each move in the games $GC$,
+we increase the count in the
+appropriate histogram field. In the end, the whole histogram is normalized
 to establish invariancy under the number of games scanned by dividing the 
 histogram elements by $|GC|$. These 16 numbers form the third feature.
 
@@ -237,17 +236,20 @@ histogram elements by $|GC|$. These 16 numbers form the third feature.
 Apart from the border distance feature, we realized a two-dimensional histogram
 which counts numbers of captured stones in different game stages. The motivation is
 simple -- especially beginners tend to capture stones because ``they could'' instead of
-because it is the ''best move''. For example, in the opening such capture might
-be a grave mistake.
+because it is the ''best move''. For example, such capture might
+be a grave mistake in the opening.
 
 As before, one of the dimensions is given by intervals 
 $$ByMoves = \{ \langle1, 60\rangle, \langle 61, 240\rangle, \langle 241, \infty)\}$$
-which try to specify the game stages (opening, middle game, endgame).
+which try to specify the game stages (roughly opening, middle game, endgame).
 The second dimension has a fixed size of three bins. Along the number of captives
 of the player of interest (the first bin), we also count the number of his
 opponent's captives (the second bin) and a difference between the two numbers
 (the third bin). Together, we obtain a histogram of $|ByMoves| * 3 = 9$ elements.
-These 9 numbers (again normalized by dividing by $|GC|$) are the output of the fourth
+
+Again, the colored games $GC$ are processed move by move by increasing
+the counts of captivated stones (or 0) in the appropriate field.
+The 9 numbers (again normalized by dividing by $|GC|$) are the output of the fourth
 feature.
 
 \subsection{Win/Loss Statistics}
@@ -257,7 +259,7 @@ wins and losses and whether they were by points or by resignation\footnote{
     because the frequency of these events is so small it would
     require a very large dataset to utilize them reliably.
 }. 
-For example, quite a lot of weak players continues playing already lost games
+For example, quite a lot of weak players continue playing already lost games
 until the end, mainly because their counting is not very good (they do not
 know there is no way to win), while professionals do not hesitate to resign 
 if they think that nothing can be done.
@@ -304,25 +306,25 @@ Neural networks are a standard technique in machine learning. The network is
 composed of simple computational units which are organized in a layered topology.
 Please see the monograph by \citet{haykin_nn} to learn more.
 We have used a simple feedforward neural network with 20 hidden units, trained
-using the RPROP algorithm \citep{Riedmiller1993}.
+using the RPROP algorithm \citep{Riedmiller1993}, for at most 100 iterations.
 
 The bagging \citep{breimanbag96} is a method which combines an ensemble of
 $N$ models (trained on differently sampled data) to improve their
-performance and robustness. In this work, we used $N=20$. Please refer to the
-paper to learn more about bagging.
+performance and robustness. In this work, we used a bag of $N=20$ neural networks
+(previous paragraph). Please refer to the paper to learn more about bagging.
 
 \subsection{Measuring the Performance}
 Given a dataset $D_{ev}$, it would be nice to estimate performance of a given machine
 learning algorithm (in our case the bagged neural network). A performance measure
 allows to compare different algorithms and give estimates of method precision for
-unseen inputs. A standard way to do this is to divide the $D_{ev}$ in to training
+unseen inputs. A standard way to do this is to divide the $D_{ev}$ into training
 and testing parts and compute the error of the method on the testing part.
 
 A commonly used measure is the mean square error ($MSE$) which estimates variance of
 the error distribution. We use its square root ($RMSE$) which is an estimate of
 standard deviation of the predictions.
 
-$$ MSE = \frac{1}{|Ts|} \sum_{(ev, y) \in Ts}{ (predict(ev) - y)^2} $$
+$$ RMSE = \sqrt{\frac{1}{|Ts|} \sum_{(ev, y) \in Ts}{ (predict(ev) - y)^2}} $$
 
 Where the machine learning model $predict$ is trained on the
 training data $Tr$ and $Ts$ denotes the testing data.
@@ -338,7 +340,7 @@ iteratively compose the training and testing sets and measure errors.
 %all the remaining $k-1$ folds form the training data. The division into the folds is
 %done randomly, and so that the folds have approximately the
 %same size.
-Refer to~\citep{crossval} to learn more. In this work, we have used 10-fold
+Refer to~\citep{crossval} to learn more. In this work, we have used 5-fold
 cross validation.
 
 \section{Experiments and Results}
@@ -367,13 +369,46 @@ chosen from interval $\langle 10, 50\rangle$).\footnote{
 }
 
 For each of the 26 ranks, we gathered 120 such $GC_p$.
-The target variable $y$ we learn directly corresponds to the ranks:
+The target variable $y$ to learn from directly corresponds to the ranks:
 $y=20$ for rank of 20-kyu, $y=1$ for 1-kyu, $y=0$ for 1-dan, $y=-5$
 for 6-dan, other values similarly. (With increasing strength, the $y$
 decreases.)
 
 \subsubsection*{Results}
 
+The performance of the prediction of strength is given in Table~\ref{tab:str_reg_res}.
+The table compares the performance of the Bagged neural network (Section~\ref{sec:mach}),
+with simple reference method of Mean regression, which works by constantly 
+predicting average of the strengths in the dataset regardless of the evaluation vector.
+
+The results show that the prediction of strength has standard deviation $\sigma$
+(estimated by the $RMSE$ error)
+of approximately $2.7$ rank. Under the assumption of normality
+of errors, we can say that 68\% of predictions fall within distance of
+$\sigma$ from the real strength and 95\% of predictions are within $2\sigma$.
+
+\begin{table}
+\begin{center}
+\begin{tabular}{|c|c|c|}
+\hline
+\textbf{Machine learning method} & $\mathbf{RMSE}$ error\\
+
+\hline
+Mean regression & 7.507 \\ \hline
+Bagged NN & 2.712 \\
+
+\hline
+\end{tabular}
+\end{center}
+\caption{
+    $RMSE$ performance of the strength prediction. The mean regression is 
+    a reference model which predicts constant value (average of the
+    strengths in the dataset) regardless of the set of games. The results
+    are computed by 5-fold cross-validation.
+} 
+\label{tab:str_reg_res}
+\end{table}
+
 \subsection{Style}
 
 The second domain is the prediction of different aspects of player styles.
@@ -415,14 +450,50 @@ some of the traditionally perceived playing styles.\footnote{
 For example, the first scale (\emph{territoriality})
 stresses whether a player prefers safe, yet inherently smaller territory (number 10 on the scale),
 or roughly sketched large territory (\emph{moyo}, 1 on the scale), which is however insecure. 
+
+
 For each of the selected professionals, we took 192 of his games from the GoGoD database
 at random. We divided these games (at random) into 12 colored sets $GC$ of 16 games.
-
 The target variable (for each of the four styles) $y$ is given by average of the answers of
-the experts. Results of the questionnaire are published in~\citep{style_quest}.
+the experts. Results of the questionnaire are published online in~\citep{style_quest}.
 
 \subsubsection*{Results}
 
+The results of style prediction are given in Table~\ref{tab:sty_reg_res}. 
+Regarding that the style scales have range of 1 to 10, we consider the average 
+standard deviation from correct answers of 1.6 to be a good precision.
+
+However we should note that the mean regression has very small $RMSE$
+for the scale of thickness.
+This stems from the fact, that the experts' answers from the questionnaire 
+have themselves very little variance --- our conclusion is that the scale of
+thickness was not chosen well. Refer to~\citep{style_quest} for further discussion.
+
+\begin{table}[h]
+\begin{center}
+\begin{tabular}{|c|c|c|c|c|}
+\hline
+&
+\multicolumn{4}{|c|}{ $\mathbf{RMSE}$ error } \\
+\hline
+\textbf{Learner} & Territoriality & Orthodoxity & Aggressivity & Thickness \\
+
+
+\hline
+Mean regression & 2.403 & 2.421 & 2.179 & 1.682 \\
+Bagged NN & 1.527 & 1.734 & 1.548 & 1.572  \\
+\hline
+\end{tabular}
+\end{center}
+\caption{
+    $RMSE$ performance for prediction of different styles. The mean regression is 
+    a reference model which predicts constant value (average of the
+    values of each particular style) regardless of the set of games. The results
+    are computed by 5-fold cross-validation.
+}
+\label{tab:sty_reg_res}
+\end{table}
+
 
 \section{Discussion}
 \label{sec:disc}
@@ -433,7 +504,13 @@ different kinds of player attributes. This might have a number of possible appli
 So far, we have utilized some of our findings in an online web
 application\footnote{\url{http://gostyle.j2m.cz/webapp.html}}. Based on data submitted
 by an user, it computes his evaluation and predicts his playing style
-and recommends relevant professional players to review.
+and recommends relevant professional players to review. Of course,
+our methods for style estimation are trained on very strong players and they might
+not thus be fully generalizable to ordinary players. Weak players might not have a consistent
+style or the whole concept of style might not be even applicable for them. Estimating this 
+effect is however not easily possible, since we do not have data about weak players' styles.
+Our webapp allows the user to submit his own opinion about his style, so we plan to
+investigate this deeper in the future.
 
 Other possible applications include helping the ranking algorithms to converge faster ---
 usually, the ranking of a player is determined from his opponents' ranking by looking
@@ -448,7 +525,7 @@ Also, it is possible to study dependencies between single elements of the evalua
 and the target variable $y$ directly. By pinpointing e.g. the patterns
 that correlate most strongly with small strength (players who play them are weak), we can 
 warn the user not to play these. We have made some initial research into this in~\citep{Moudrik13},
-we do not present these results here, because of space constraints.
+we do not present these results here because of space constraints.
 
 \section{Conclusion}
 \label{sec:conc}
-- 
2.11.4.GIT