From 9877d700502ddeea82a7eb6309c87335536482ef Mon Sep 17 00:00:00 2001
From: Petr Baudis <pasky@ucw.cz>
Date: Sat, 13 Mar 2010 19:38:55 +0100
Subject: [PATCH] tex: Cover log-rescale experiments

---
 tex/gostyle.tex | 79 +++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 60 insertions(+), 19 deletions(-)

diff --git a/tex/gostyle.tex b/tex/gostyle.tex
index 2efe0c1..ba9c4c3 100644
--- a/tex/gostyle.tex
+++ b/tex/gostyle.tex
@@ -383,16 +383,6 @@ $\vec p$ of per-pattern counts from the $n$ globally most frequent patterns%
 (the mapping from patterns to vector elements is common for all objects).
 We can then process and compare just the pattern vectors.
 
-The pattern vector elements can have diverse values since for each object,
-we consider different number of games (and thus patterns).
-Therefore, we linearly rescale and normalize the values to range $[-1,1]$,
-the most frequent pattern having the value of $1$ and the least occuring
-one being $-1$.%
-\footnote{We did not investigate different methods of re-scaling the vectors;
-that might be a good way of improving accuracy of our analysis.}
-Thus, we obtain vectors describing relative frequency of played patterns
-independent on number of gathered patterns.
-
 \subsection{Pattern Features}
 When deciding how to compose the patterns we use to describe moves,
 we need to consider a specificity tradeoff --- overly general descriptions carry too few
@@ -435,6 +425,43 @@ strategic aim. In the opening, even a single-line difference
 in the distance from the border can have dramatic impact on
 further local and global development.
 
+\subsection{Vector Rescaling}
+
+The pattern vector elements can have diverse values since for each object,
+we consider a different number of games (and thus patterns).
+Therefore, we normalize the values to the range $[-1,1]$,
+the most frequent pattern having the value of $1$ and the least occurring
+one being $-1$.
+Thus, we obtain vectors describing the relative frequency of played patterns
+independent of the number of gathered patterns.
+However, there are multiple ways to approach the normalization.
+
+\subsubsection{Linear Normalization}
+
+One is simply to linearly re-scale the values using:
+$$y_i = 2\,{x_i - x_{\rm min} \over x_{\rm max} - x_{\rm min}} - 1$$
+This is the default approach; we have used data processed only by this
+computation unless we note otherwise.
+As shown in fig. \ref{fig:patcountdist}, most of the spectrum is covered
+by the few most frequent patterns (describing mostly large-diameter
+shapes from the game opening). This means that most patterns will
+always be represented by only very small values near the lower bound.
+
+\subsubsection{Extended Normalization}
+\label{xnorm}
+
+To alleviate this problem, we have also tried to modify the linear
+normalization by applying two steps --- {\em pre-processing}
+the raw counts using
+$$x_i' = \log (x_i + 1)$$
+and {\em post-processing} the re-scaled values by the logistic function with a steepness constant $c$:
+$$y_i' = {2 \over 1 + e^{-cy_i}}-1$$
+However, we have found that this method is not universally beneficial.
+In our style case study (sec. \ref{styleest}), this normalization
+produced a PCA decomposition whose significant dimensions corresponded
+better to some of the prior knowledge and were more instructive for manual
+inspection, but it ultimately worsened the accuracy of our classifiers.
+
 \subsection{Implementation}
 
 We have implemented the data extraction by making use of the pattern
@@ -778,7 +805,9 @@ using Pearson's $r$ (see \ref{pearson}), yielding quite satisfying value of $r=0
 implying extremely strong correlation.
 Using the eigenvector position directly for classification
 of players within the test group yields MSE TODO, thus providing
-reasonably satisfying accuracy by itself.
+reasonably satisfying accuracy by itself.%
+\footnote{Extended vector normalization (sec. \ref{xnorm})
+produced noticeably less clear-cut results.}
 
 To further enhance the strength estimator accuracy,
 we have tried to train a NN classifier on our train set, consisting
@@ -790,6 +819,7 @@ on average.
 
 
 \section{Style Estimator}
+\label{styleest}
 
 As a~second case study for our pattern analysis,
 we investigate pattern vectors $\vec p$ of various well-known players,
@@ -1150,11 +1180,21 @@ information in the opening stage.%
 but here all the spatial patterns are big enough to reach to the edge
 on their own.}
 
-We have not found significant correspondence to the style aspects
-representing aggressiveness and novelty of play; this means either
-these are not as well defined, the prior information do not represent
-them accurately, or we cannot capture them well with our chosen pattern
-extraction techniques.
+The PCA results presented above do not show much correlation between
+the significant PCA dimensions and the $\omega$ and $\alpha$ style dimensions.
+However, when we applied the extended vector normalization
+(sec. \ref{xnorm}; see fig. \ref{fig:style_normpca}),
+some less significant PCA dimensions exhibited clear correlations.%
+\footnote{We have found that $c=6$ in the post-processing logistic function
+produces the most instructive PCA output on our particular game collection.}
+It appears that less frequent patterns that occur only in the middle-game
+phase\footnote{In the middle game, the board is much fuller and thus
+particular specific-shape patterns repeat less often.} are decisive
+for these dimensions, and such patterns are not represented in the pattern
+vectors as well as the common opening patterns are.
+However, we do not use the extended normalization results since
+they produced noticeably less accurate classifiers in all dimensions,
+including $\omega$ and $\alpha$.
 
 We believe that the next step
 in interpreting our results will be more refined prior information input
@@ -1345,9 +1385,10 @@ Since we are not aware of any previous research on this topic and we
 are limited by space and time constraints, plenty of research remains
 to be done, in all parts of our analysis --- we have already noted
 many in the text above. Most significantly, different methods of generating
-the $\vec p$ vectors can be explored and other data mining methods could
-be investigated. Better ways of visualising the relationships would be
-desirable, together with thorough dissemination of internal structure
+and normalizing the $\vec p$ vectors can be explored
+and other data mining methods could be investigated.
+Better ways of visualising the relationships would be desirable,
+together with a thorough dissection of the internal structure
 of the player pattern vectors space.
 
 It can be argued that many players adjust their style by game conditions
-- 
2.11.4.GIT