From 7314d436d82ea8b4532de68a04f65c6e58fefa04 Mon Sep 17 00:00:00 2001
From: Petr Baudis <pasky@ucw.cz>
Date: Thu, 11 Mar 2010 20:59:53 +0100
Subject: [PATCH] tex: Introduction, data extraction cleanups

---
 tex/gostyle.tex | 76 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 45 insertions(+), 31 deletions(-)

diff --git a/tex/gostyle.tex b/tex/gostyle.tex
index 54d60d6..204106d 100644
--- a/tex/gostyle.tex
+++ b/tex/gostyle.tex
@@ -337,9 +337,11 @@ neural networks, Kohonen maps, principal component analysis
 % and "HIS" in caps to complete the first word.
 \IEEEPARstart{T}{he} field of Computer Go usually focuses on the problem
 of creating a~program to play the game, finding the best move from a~given
-board position. We will make use of one method developed in the course
+board position. \cite{GellySilver2008}
+We will make use of one method developed in the course
 of such research and apply it to the analysis of existing game records
-with the aim of helping humans to play the game better instead.
+with the aim of helping humans to play and understand the game better
+instead.
 
 Go is a~two-player full-information board game played
 on a~square grid (usually $19\times19$ lines) with black and white
@@ -351,9 +353,8 @@ the internet) and review games played by others on computers as well.
 This means that large amounts of game records are collected and digitally
 stored, enabling easy processing of such collections. However, so far
 only little has been done with the available data --- we are aware
-only of uses for simple win/loss statistics (TODO: KGS Stats, KGS Analytics,
-Pro Go Rating) and ''next move'' statistics on a~specific position (TODO:
-Kombilo, Moyo Go Studio).
+only of uses for simple win/loss statistics \cite{KGSStats} \cite{KGSAnalytics} \cite{ProGoR}
+and ``next move'' statistics on a~specific position \cite{Kombilo} \cite{MoyoGo}.
 
 We present a~more in-depth approach --- from all played moves, we devise
 a~compact evaluation of each player. We then explore correlations between
@@ -367,29 +368,40 @@ should yield similar moves characteristics.
 \section{Data Extraction}
 \label{pattern-vectors}
 
-As the input of our analysis, we use large collections of game records\footnote{We
-use the SGF format (TODO) in our implementation.} organized by player names.
-In order to generate the required compact description of most frequently played moves,
-we construct a set of $n$ most occuring patterns (\emph{top patterns})
-across all players and games from the database.\footnote{We use $n=500$ in our analysis.}
-
-For each player, we then count how many times was each of those $n$ patterns played
-during all his games and finally assign him a~{\em pattern vector} $\vec p$ of dimension $n$, with each
-dimension corresponding to the relative number of occurences of a given pattern
-(relative with respect to player's most played \emph{top pattern}). Using relative numbers of occurences ensures that
-each dimension of player's \emph{pattern vector} is scaled to range $[0,1]$ and
-therefore even players with different number of games in the database have comparable \emph{pattern vectors}.
+As the input of our analysis, we use large collections of game records%
+\footnote{We use the SGF format \cite{SGF} in our implementation.}
+grouped by the primary object of analysis (player name, player rank, etc.).
+We process the games by object, generating a description for each
+played move -- a {\em pattern}, being a combination of several
+{\em pattern features} described below.
+
+We keep track of the most
+occuring patterns, finally composing $n$-dimensional {\em pattern vector}
+$\vec p$ of per-pattern counts from the $n$ globally most frequent patterns%
+\footnote{We use $n=500$ in our analysis.}
+(the mapping from patterns to vector elements is common for all objects).
+We can then process and compare just the pattern vectors.
+
+The pattern vector elements can have diverse values since for each object,
+we consider different number of games (and thus patterns).
+Therefore, we linearly rescale and normalize the values to range $[-1,1]$,
+the most frequent pattern having the value of $1$ and the least occuring
+one being $-1$.%
+\footnote{We did not investigate different methods of re-scaling the vectors;
+that might be a good way of improving accuracy of our analysis.}
+Thus, we obtain vectors describing relative frequency of played patterns
+independent on number of gathered patterns.
 
 \subsection{Pattern Features}
-We need to define how to compose the patterns we use to describe moves.
-However, there are some tradeoffs -- overly general descriptions carry too few
+When deciding how to compose the patterns we use to describe moves,
+we need to consider a specificity tradeoff --- overly general descriptions carry too few
 information to discern various player attributes; too specific descriptions
 gather too few specimen over the games sample and the vector differences are
 not statistically significant.
 
 We have chosen an intuitive and simple approach inspired by pattern features
-used when computing ELO ratings for candidate patterns in Computer Go play.
-\cite{ELO} Each pattern is a~combination of several {\em pattern features}
+used when computing Elo ratings for candidate patterns in Computer Go play.
+\cite{Elo} Each pattern is a~combination of several {\em pattern features}
 (name--value pairs) matched at the position of the played move.
 We use these features:
 
@@ -418,20 +430,22 @@ opponent's move elsewhere to return to an urgent local situation.
 The shapes most frequently correspond to opening moves
 (either in empty corners and sides, or as part of {\em joseki}
 --- commonly played sequences) characteristic for a certain
-strategic aim.
+strategic aim. In the opening, even a single-line difference
+in the distance from the border can have dramatic impact on
+further local and global development.
 
 \subsection{Implementation}
 
 We have implemented the data extraction by making use of the pattern
-features matching implementation within the Pachi go-playing program
-(TODO). We extract information on players by converting the SGF game
-records to GTP (TODO) stream that feeds Pachi's {\tt patternscan}
-engine which outputs a~single {\em patternspec} (string representation
-of the particular pattern features combination) per move.
-
-%We can then gather all patternspecs played by a~given player and summarize
-%them; the $\vec p$ vector then consists of normalized counts of
-%the given $n$ most frequent patternspecs.
+features matching implementation%
+\footnote{The pattern features matching was developed according
+to the Elo-rating playing scheme. \cite{Elo}}
+within the Pachi go-playing program \cite{Pachi}.
+We extract information on players by converting the SGF game
+records to GTP stream \cite{GTP} that feeds Pachi's {\tt patternscan}
+engine, outputting a~single {\em patternspec} (string representation
+of the particular pattern features combination) per move. Of course,
+only moves played by the appropriate color in the game are collected.
 
 \section{Data Mining}
 \label{data-mining}
-- 
2.11.4.GIT