clanek: ELO -> Elo
[gostyle.git] / PAPERS / clanek_go_congress / clanek.tex
blob01169d09597a2c48a5890f8295fc826acf5d918b
1 \documentclass[12pt,a4paper,notitlepage]{article}
3 \usepackage[a4paper,vmargin={20mm,20mm},hmargin={20mm,20mm}]{geometry}
5 %% Použité kódování znaků: obvykle latin2, cp1250 nebo utf8:
6 \usepackage[utf8]{inputenc}
8 %% Ostatní balíčky
9 \usepackage[titletoc]{appendix}
10 \usepackage{graphicx}
11 \usepackage{wrapfig}
12 \usepackage{color}
13 \usepackage[multiple]{footmisc}
14 \usepackage{amsthm}
15 \usepackage{amsmath}
16 \usepackage{threeparttable}
17 \usepackage{longtable}
18 \usepackage{tabularx}
19 \usepackage{amsfonts}
20 \usepackage{caption}
21 \usepackage[lined, ruled, boxed, linesnumbered]{algorithm2e}
23 \usepackage[round]{natbib} % sazba pouzite literatury
25 \usepackage{psfrag}
27 \usepackage{psgo,array}
28 \usepackage{url} % sazba URL
30 \usepackage[ps2pdf,unicode]{hyperref} % Musí být za všemi ostatními balíčky
31 \usepackage{breakurl}
34 %\hypersetup{pdftitle=Meta-learning methods for analyzing Go playing trends}
35 %\hypersetup{pdfauthor=Josef Moudřík}
37 \begin{document}
39 % paper title
40 % can use linebreaks \\ within to get better formatting as desired
41 %\title{On Move Pattern Trends\\in Large Go Games Corpus}
42 \title{Evaluating Go Game Records\\for Prediction of Player Attributes }
44 % use \thanks{} to gain access to the first footnote area
45 % a separate \thanks must be used for each paragraph as LaTeX2e's \thanks
46 % was not built to handle multiple paragraphs
47 \author{Josef~Moud\v{r}\'{i}k%
48 \thanks{J. Moud\v{r}\'{i}k is student at the Faculty of Math and Physics, Charles University, Prague, CZ.},~Petr~Baudi\v{s}%
49 \thanks{P. Baudi\v{s} is an independent researcher. Part of his work
50 in this area has been supported by the SUSE Labs and the Department
51 of Applied Mathematics, Faculty of Mathematics and Physics, Charles University.}}
52 \maketitle
54 \begin{abstract}
55 We propose a~way of extracting and aggrgating per-move evaluations from sets of Go game records.
56 The evaluations capture different aspects of the games such as the played patterns
57 or statistics of sente/gote sequences (among others); using machine learning
58 algorithms, they can be used to predict arbitrary relevant target variables.
59 We apply this methodology to predict strength and playing style (e.g.
60 territoriality or aggressivity) of a player and make our predictor
61 available as an online tool, a part of the GoStyle project.
62 %% No, na tohle neni v clanku misto, pze to ma mit jen 8 stranek
63 % navic bych tyhle veci chtel zverejnit i samy o sobe, nejak dukladnejc,
65 %By inspecting the dependencies between the evaluations and the target variable,
66 %we are able to tell which patterns are bad or good (in case of strength as the
67 %target variable), or which moves e.g. constitute the territorial style of play.
69 We propose a number of possible applications including seeding real-work ranks
70 of internet players, aiding in Go study and tuning of Go-playing programs, or
71 a contribution to Go-theoretical discussion on the scope of ``playing style''.
72 \end{abstract}
75 \section{Introduction}
76 The field of Computer Go usually focuses on the problem
77 of creating a~program to play the game, finding the best move from a~given
78 board position \cite{GellySilver2008}. We focus on analyzing existing game
79 records with the aim of helping humans to play and understand the game better
80 instead.
82 Go is a~two-player full-information board game played
83 on a~square grid (usually $19\times19$ lines) with black and white
84 stones; the goal of the game is to surround the most territory and
85 capture enemy stones. We assume basic familiarity with the game.
87 Since the game has a worldwide popularity, large collections
88 of Go game records have been compiled, covering both amateur and professional games,
89 e.g. \citep{KGS,GoGoD}.
90 So far, not much has been done to analyze these records using computers.
91 There are programs that serve as tools to study the opening phase of the game
92 by giving simple statistics of next move candidates based on professional
93 games~\citep{Kombilo,MoyoGo}.
94 The professional games have also been used in computer Go;
95 patterns from the professional games
96 are used as a heuristic to improve the tree
97 search, e.g.~\citep{PatElo}. Apart from these, we are not aware of
98 any principially different usage.
100 Following up on our initial research \citep{GoStyleArxiv},
101 we present a deeper approach. We extract different
102 kinds of information from the records to create a complex
103 evaluation of the game sample. The \emph{evaluation} is a vector
104 composed of independent features -- each of the features
105 captures different aspect of the sample. For example,
106 we use statistics of most frequent
107 local patterns played, statistics of high and low plays
108 in different game stages, etc.
110 Using machine learning, the evaluation of the sample
111 can be used to predict relevant variables. In this work
112 for instance,
113 the sample consists of the games of a player
114 and we predict his strength or playing style.
116 This paper is organized as follows. Section~\ref{sec:feat}
117 presents the features comprising the evaluation.
118 Section~\ref{sec:mach} gives details about the machine
119 learning method we have used.
120 In Section~\ref{sec:expe} we give details about our
121 datasets -- for prediction of strength and style -- and
122 show how precisely can the prediction be conducted.
123 Section~\ref{sec:disc} discusses applications and future work.
125 \section{Feature Extraction}
126 \label{sec:feat}
127 This section presents the methods for extracting the evaluation
128 vector (we call it $ev$) from a set of games. Because we aggregate
129 data by player, each game in the set is accompanied by the color
130 which specifies our player of interest.
131 The sample is therefore is regarded as a \emph{set
132 of colored games}, $GC = \{ (game_1, color_1), \ldots\}$.
134 The evaluation vector $ev$ is composed by concatenating several
135 sub-vectors we call \emph{features} -- examples include the
136 aforementioned local patterns or statistics of sente and gote
137 sequences. These will be detailed in the rest of this section.
138 Some of the details are edited for brevity,
139 please see \citep{Moudrik13} for an extended description.
141 \subsection{Raw Game Processing}
142 The games\footnote{
143 We use the standard \emph{.sgf} file format as input, \cite{SGF}.
144 } are processed by the Pachi Go
145 Engine~\citep{Pachi} which exports a variety of analytical data
146 about each move in the game.
147 For each move,
148 Pachi outputs a list of key-value pairs regarding the current move:
150 \begin{itemize}
151 \item \textbf{atari flag} --- whether the move put enemy stones in atari,
152 \item \textbf{atari escape flag} --- whether the move saved own stones from atari,
153 \item \textbf{capture} --- number of enemy stones the move captured,
154 \item \textbf{contiguity to last move} --- the gridcular
155 distance\footnotemark[2] from the last move,
156 \item \textbf{board edge distance} --- the distance from
157 the nearest edge of the board,
158 \item \textbf{spatial pattern} --- configuration of stones around the played move.
159 \end{itemize}
161 We use this information to compute the higher level features given below.
162 The spatial pattern pictures positions of stones around the current move up to
163 a certain distance.\footnote{
164 \label{grid}
165 The distance is given by the {\em gridcular} metric
166 $d(x,y) = |\delta x| + |\delta y| + \max(|\delta x|, |\delta y|)$, which produces
167 a circle-like structure on the Go board square grid \cite{SpatPat}.
168 Spatial patterns of sizes 2 to 6 are regarded.
171 \subsection{Patterns}
172 The first feature collects a statistics of $N = 400$ most frequently ocurring
173 spatial patterns (together with both atari flags). The list of the $N$ most frequently
174 played patterns is computed beforehand from the whole database of games. The patterns
175 are normalized to be black to play and are invariant under rotation and mirroring.
177 Given a set of colored games $GC$ we then count how many times was each of the $N$
178 patterns played -- thus obtaining a vector $\vec c$ of counts ($|\vec c| = 400$).
179 With simple occurence count however, particular counts $c_i$ increase proportionally to
180 number of games in $GC$. To maintain invariancy under the number of games in the sample,
181 a normalization is needed. We do this by dividing the $\vec c$ by $|GC|$, though other schemes
182 are possible, see \citep{Moudrik13}.
184 \subsection{$\omega$-local Sente and Gote Sequences}
185 Because the concept of sente and gote is very important in real games, we devised
186 a statistics which tries to capture distribution of sente and gote plays in the games
187 from the sample. Because deciding what moves are sente or gote can be hard even
188 for human players, we restricted ourselves to what we call $\omega$-local (sente
189 and gote) sequences. We say that a move is $\omega$-local (with respect
190 to the previous move) if its gridcular distance from previous move
191 is smaller than a fixed number $\omega$; in this work, we used $\omega=10$.
192 The simplification has a clear assumption -- the responses to
193 a sente move are always local.
194 Of course, this assumption might not always hold, but
195 the feature proves to be useful nonetheless.
197 We then partition each game into $\omega$-local sequences (that is, each move in the
198 sequence is $\omega$-local with respect to its directly previous move) and observe
199 whether the player who started the sequence is different from the player who ended it.
200 If it is so, the $\omega$-local sequence is said to be sente for the player who started it
201 because he gets to play somewhere else first (tenuki). Similarly if the player who
202 started the sequence had to respond last we say that the sequence is gote for him.
203 Based on this partitioning, we can count the average number of sente and gote
204 sequences per game from the sample $GC$. These two numbers, along with their difference,
205 form the second feature.
207 \subsection{Border Distance}
208 The third feature is a two dimensional histogram counting the average number of moves
209 in the sample played low or high in different game stages. The original idea was to help
210 to distinguish between territorial and influence based moves in the opening.
212 The first dimension is specified by the move's border distance,
213 the second one by the number of the current move from the beginning of the game.
214 The granularity of each dimension is given by intervals dividing the domains.
215 We use the
216 $$ByDist = \{ \langle1, 2\rangle, \langle 3 \rangle, \langle4\rangle, \langle 5, \infty)\}$$
217 division for the border distance dimension
218 (distinguishing between the first 2 lines, 3rd line of territory, 4th line of influence and
219 higher plays for the rest).
220 The move number division is given by
221 $$ByMoves = \{ \langle1, 10\rangle, \langle 11, 64\rangle, \langle 65,200\rangle, \langle 201, \infty)\}$$
222 with the motivation to (very roughly) distinguish
223 between the opening (say first ten moves), early middle game (moves 11-64), middle game
224 and endgame.
226 If we use the $ByMoves$ and $ByDist$ intervals to divide the domains, we obtain a histogram
227 of total $|ByMoves| \times |ByDist| = 16$ fields. For each move in the games $GC$,
228 we increase the count in the
229 appropriate histogram field. In the end, the whole histogram is normalized
230 to establish invariancy under the number of games scanned by dividing the
231 histogram elements by $|GC|$. These 16 numbers form the third feature.
233 \subsection{Captured Stones}
234 Apart from the border distance feature, we also maintain a two-dimensional histogram
235 which counts the numbers of captured stones in different game stages. The motivation is
236 simple -- especially beginners tend to capture stones because ``they could'' instead of
237 because it is the ''best move''. For example, such a capture might
238 be a grave mistake in the opening.
240 As before, one of the dimensions is given by the intervals
241 $$ByMoves = \{ \langle1, 60\rangle, \langle 61, 240\rangle, \langle 241, \infty)\}$$
242 which try to specify the game stages (roughly: opening, middle game, endgame).
243 The second dimension has a fixed size of three bins. Along the number of captives
244 of the player of interest (the first bin), we also count the number of his
245 opponent's captives (the second bin) and a difference between the two numbers
246 (the third bin). Together, we obtain a histogram of $|ByMoves| \times 3 = 9$ elements.
248 Again, the colored games $GC$ are processed move by move by increasing
249 the counts of captivated stones (or 0) in the appropriate field.
250 The 9 numbers (again normalized by dividing by $|GC|$) are the output of the fourth
251 feature.
253 \subsection{Win/Loss Statistics}
254 Finally, we used a simple feature that keeps statistics of
255 wins and losses and whether they were by points or by resignation\footnote{
256 We disregard forfeited, unfinished or jigo games in this feature
257 because the frequency of these events is so small it would
258 require a very large dataset to utilize them reliably.
260 For example, many weak players continue playing games that are already lost
261 until the end, mainly because their counting is not very good (they do not
262 know there is no way to win), while professionals do not hesitate to resign
263 if they think that nothing can be done.
265 In the colored games of $GC$, we count how many times did the player of interest:
266 \begin{itemize}
267 \item win by counting,
268 \item win by resignation,
269 \item lost by counting,
270 \item and lost by resignation.
271 \end{itemize}
272 Again, we divide these four numbers by $|GC|$ to maintain the invariancy under the number of games
273 in $GC$. Furthermore, for the games won or lost in counting we count the average
274 size of the win or loss in points. The six numbers form the last feature.
276 \section{Machine Learning}
277 \label{sec:mach}
278 So far, we have considered how we can turn a set of coloured games $GC$ into
279 an evaluation vector. Now, we are going to show how to utilize the evaluation.
280 To predict various player attributes, we use start with an input dataset $D$ consisting
281 of pairs $D=\{ (GC_i, y_i), \dots\}$, where $GC_i$
282 corresponds to a set of colored games of $i$-th player and $y_i$ is the
283 target attribute. The $y_i$ might be fairly arbitrary, as long as it has
284 \emph{some} relation to the $GC_i$. For example, $y_i$ might be $i$'s strength.
286 Now, let's denote our evaluation process we presented before as $eval$ and
287 let $ev_i$ be evaluation of $i$-th player, $ev_i = eval(GC_i)$. Then,
288 we can transform $D$ into $D_{ev} = \{(ev_i, y_i), \dots\}$, which forms
289 our training dataset.
291 The task of our machine learning algorithm is to generalize the knowledge
292 from the dataset $D_{ev}$ to predict correct $y_X$ even for previously unseen $GC_X$.
293 In the case of strength, we might therefore be able to predict strength $y_X$
294 of an unknown player $X$ given a set of his games $GC_X$ (from which we can
295 compute the evaluation $ev_X$).
297 After testing multiple approaches, we have settled on using a bagged artificial neural network
298 as the predictor.
299 Neural networks are a standard technique in machine learning. The network is
300 composed of simple computational units which are organized in a layered topology,
301 as described e.g. in monograph by \citet{haykin_nn}.
302 We have used a simple feedforward neural network with 20 hidden units, trained
303 using the RPROP algorithm \citep{Riedmiller1993} for at most 100 iterations.
305 The bagging \citep{breimanbag96} is a method that combines an ensemble of
306 $N$ models (trained on differently sampled data) to improve their
307 performance and robustness. In this work, we used a bag of $N=20$ neural networks
308 (previous paragraph).
310 \subsection{Measuring the Performance}
312 To assess the efficiency of our method and give estimates of its precision for
313 unseen inputs, we measure the performance of our algorithm given a dataset $D_{ev}$.
314 A standard way to do this is to divide the $D_{ev}$ into training
315 and testing parts and compute the error of the method on the testing part.
317 A commonly used measure is the mean square error ($MSE$) which estimates variance of
318 the error distribution. We use its square root ($RMSE$) which is an estimate of
319 standard deviation of the predictions
320 $$ RMSE = \sqrt{\frac{1}{|T_s|} \sum_{(ev, y) \in T_s}{ (predict(ev) - y)^2}} $$
321 where the machine learning model $predict$ is trained on the
322 training data $T_r$ and $T_s$ denotes the testing data.
324 \subsubsection*{Cross-Validation}
326 For the error estimation to be robust, we use the method of cross-validation \citep{crossval},
327 which is a standard statistical technique for robust estimation of parameters.
328 We split the data into $k$ disjunct subsets (called \emph{folds}) and then
329 iteratively compose the training and testing sets and measure errors.
330 %In each of the $k$ iterations, $k$-th fold is chosen as the testing data, and
331 %all the remaining $k-1$ folds form the training data. The division into the folds is
332 %done randomly, and so that the folds have approximately the
333 %same size.
334 In this work, we have used 5-fold cross validation.
336 \section{Experiments and Results}
337 \label{sec:expe}
339 \subsection{Strength}
340 One of the two major domains we have tested our framework on is the prediction of player
341 strengths.
342 \subsubsection*{Dataset}
343 We have collected a large sample of games from the public
344 archives of the Kiseido Go server~\citep{KGSArchives}.
345 The sample consists of over 100 000 records of games in the \emph{.sgf} format~\citep{SGF}.
347 For each rank $r$ in the range of 6-dan to 20-kyu, we gathered a
348 list of players $P_r$ of the particular rank. To avoid biases caused by
349 different strategies, the sample only consists of games played on $19 \times 19$ board without
350 handicap stones.
351 The set of colored games $GC_p$ for a~player $p \in P_r$ consists of the games player $p$
352 played when he had the rank $r$. We only use the $GC_p$ if the number of
353 games is not smaller than 10 games; if the sample is larger than 50 games, we
354 randomly choose a subset of the sample (the size of subset is uniformly randomly
355 chosen from interval $\langle 10, 50\rangle$).\footnote{
356 By cutting the number of games to a fixed number (say 50) for large
357 samples, we would create an artificial disproportion in sizes of $GC_p$,
358 which could introduce bias into the process.
361 For each of the 26 ranks, we gathered 120 such $GC_p$.
362 The target variable $y$ to learn from directly corresponds to the ranks:
363 $y=20$ for rank of 20-kyu, $y=1$ for 1-kyu, $y=0$ for 1-dan, $y=-5$
364 for 6-dan, other values similarly. (With increasing strength, the $y$
365 decreases.)
367 \subsubsection*{Results}
369 The performance of the prediction of strength is given in Table~\ref{tab:str_reg_res}.
370 The table compares the performance of the Bagged neural network (Section~\ref{sec:mach}),
371 with simple reference method of Mean regression, which just constantly
372 predicts average of the strengths in the dataset regardless of the evaluation vector.
374 The results show that the prediction of strength has standard deviation $\sigma$
375 (estimated by the $RMSE$ error)
376 of approximately $2.7$ rank. Under the assumption of normality
377 of errors, we can say that 68\% of predictions fall within distance of
378 $\sigma$ from the real strength and 95\% of predictions are within $2\sigma$.
380 \begin{table}
381 \begin{center}
382 \begin{tabular}{|c|c|c|}
383 \hline
384 \textbf{Machine learning method} & $\mathbf{RMSE}$ error\\
386 \hline
387 Mean regression & 7.507 \\ \hline
388 Bagged NN & 2.712 \\
390 \hline
391 \end{tabular}
392 \end{center}
393 \caption{
394 $RMSE$ performance of the strength prediction. The mean regression is
395 a reference model which predicts constant value (average of the
396 strengths in the dataset) regardless of the set of games. The results
397 are computed by 5-fold cross-validation.
399 \label{tab:str_reg_res}
400 \end{table}
402 \subsection{Style}
404 The second domain is the prediction of different aspects of player styles.
406 \subsubsection*{Dataset}
407 The collection of games in this dataset comes from the Games of Go on Disk database by \citet{GoGoD}.
408 This database contains more than 70 000 professional games, spanning from the ancient times
409 to the present.
411 We chose a small subset of popular professional players (mainly from the 20th century) and
412 asked several experts (professional and strong amateur players)
413 to evaluate these players using a questionnaire. The experts (Alexander
414 Dinerschtein 3-pro, Motoki Noguchi 7-dan,
415 Vladim\'{i}r Dan\v{e}k 5-dan and V\'{i}t Brunner 4-dan)
416 were asked to assess the players on four scales, each ranging from 1 to 10.
418 %\begin{table}[h!]
419 \begin{center}
420 %\caption{Styles}
421 \begin{tabular}{|c|c|c|}
422 \hline
423 \textbf{Style} & \textbf{1} & \textbf{10}\\ \hline
424 Territoriality & Moyo & Territory \\
425 Orthodoxity & Classic & Novel \\
426 Aggressivity& Calm & Fighting \\
427 Thickness & Safe & Shinogi \\ \hline
428 \end{tabular}
429 \end{center}
430 %\caption[Definition of the style scales]{
431 %The definition of the style scales.
433 %\label{tab:style_def}
434 %\end{table}
436 The scales try to reflect
437 some of the traditionally perceived playing styles.\footnote{
438 Refer also to~\citet{GoGoD:styles}, or~\citet{senseis:styles} regarding Go playing styles.
440 For example, the first scale (\emph{territoriality})
441 stresses whether a player prefers safe, yet inherently smaller territory (number 10 on the scale),
442 or roughly sketched large territory (\emph{moyo}, 1 on the scale), which is however insecure.
445 For each of the selected professionals, we took 192 of his games from the GoGoD database
446 at random. We divided these games (at random) into 12 colored sets $GC$ of 16 games.
447 The target variable (for each of the four styles) $y$ is given by average of the answers of
448 the experts. Results of the questionnaire are published online in~\citep{style_quest}.
450 \subsubsection*{Results}
452 The results of style prediction are given in Table~\ref{tab:sty_reg_res}.
453 Given that the style scales have range of 1 to 10, we consider the average
454 standard deviation from correct answers of around 1.6 to be a good precision.
456 We should note that the mean regression has very small $RMSE$
457 for the scale of thickness.
458 This stems from the fact that the experts' answers from the questionnaire
459 have themselves very little variance --- our conclusion is that the scale of
460 thickness was not chosen well. Refer to~\citep{style_quest} for further discussion.
462 \begin{table}[h]
463 \begin{center}
464 \begin{tabular}{|c|c|c|c|c|}
465 \hline
467 \multicolumn{4}{|c|}{ $\mathbf{RMSE}$ error } \\
468 \hline
469 \textbf{Method} & Territoriality & Orthodoxity & Aggressivity & Thickness \\
472 \hline
473 Mean regression & 2.403 & 2.421 & 2.179 & 1.682 \\
474 Bagged NN & 1.527 & 1.734 & 1.548 & 1.572 \\
475 \hline
476 \end{tabular}
477 \end{center}
478 \caption{
479 $RMSE$ performance for prediction of different styles. The mean regression is
480 a reference model which predicts constant value (average of the
481 values of each particular style) regardless of the set of games. The results
482 are computed by 5-fold cross-validation.
484 \label{tab:sty_reg_res}
485 \end{table}
488 \section{Discussion}
489 \label{sec:disc}
491 The results in both domains have shown that our evaluations are useful in predicting
492 different kinds of player attributes. This can have a number of possible applications.
494 So far, we have utilized some of our findings in an online web
495 application\footnote{\url{http://gostyle.j2m.cz/webapp.html}}. Based on data submitted
496 by the users, it evaluates their games and predicts their playing style
497 and recommends relevant professional players to review. Of course,
498 our methods for style estimation are trained on very strong players and they might
499 not thus be fully generalizable to ordinary players. Weak players might not have a consistent
500 style or the whole concept of style might not be even applicable for them. Estimating this
501 effect is however not easily possible, since we do not have data about weak players' styles.
502 Our webapp allows the users to submit their own opinion about their style, therefore we should
503 be able to consider this effect in the future research.
505 Other possible applications include helping the ranking algorithms to converge faster ---
506 usually, the ranking of a player is determined from his opponents' ranking by looking
507 at the numbers of wins and losses (e.g. by computing an Elo rating). Our methods might improve this
508 by including the domain knowledge.
509 Similarly, a computer Go program can quickly classify the level of its
510 human opponent based on the evaluation from their previous games
511 and auto-adjust its difficulty settings accordingly
512 to provide more even games for beginners.
514 It is also possible to study dependencies between single elements of the evaluation vector
515 and the target variable $y$ directly. By pinpointing e.g. the patterns
516 of the strongest correlation with bad strength (players who play them are weak), we can
517 warn the users not to play these. We have made some initial research into this in~\citep{Moudrik13},
518 we do not present these results here because of space constraints.
520 \section{Conclusion}
521 \label{sec:conc}
522 We presented a method for evaluating players based on a sample of their games.
523 These summary evaluations turn out to be useful in many cases --- they allow us to predict
524 different player attributes (such as strength, or playing style) with reasonable accuracy.
525 We believe that the applications of these findings can help to improve both human and computer
526 understanding in the game of Go.
528 \section{Implementation}
529 \label{sec:impl}
531 The code used in this work
532 is released online as a part of GoStyle project~\citep{GoStyleWeb}.
533 The majority of the source code is implemented in
534 the Python programming language~\citep{Python27}.
536 The machine learning was implemented using the
537 Orange Datamining suite~\citep{curk05} and
538 the Fast Artificial Neural Network library FANN~\citep{Nissen2003}.
539 We used the Pachi Go engine~\citep{Pachi} for the raw game processing.
541 \bibliographystyle{abbrvnat}
542 \bibliography{clanek}
544 \end{document}