From a432722e196b340d152f0bece917aca1780168cf Mon Sep 17 00:00:00 2001
From: r6144 <rainy6144@gmail.com>
Date: Fri, 15 Jul 2011 15:27:34 +0800
Subject: [PATCH] Now prints last_urgency to help debugging.  Seems that the
 problematic moves in 110713-85x is indeed due to the average being biased by
 the not-so-good follow-up moves, whose weight has increased when the
 algorithm does more exploring.  Consequently, even when some opponent moves
 have a poor outcome for us, the value of our preceding move may still be
 quite high.

---
 NOTES                 | 4 +++-
 uct/policy/ucb1amaf.c | 1 +
 uct/tree.c            | 3 ++-
 uct/tree.h            | 1 +
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/NOTES b/NOTES
index 6b98032..7dd34ee 100644
--- a/NOTES
+++ b/NOTES
@@ -22,4 +22,6 @@ Generate passes a bit earlier by giving it the appropriate reward or prior value
 
 UCT computes the upper confidence interval by assuming that the outcome of each playout is i.i.d., but actually the outcome distribution can change due to the discovery of a new follow-up move, change in dynkomi, etc.  If we assume that the distribution changes with probability proportional to 1/(n0+n), where n0 is a constant (maybe 10 or 1000) and n is the playout count at that node, a good (I don't know whether it is optimal) estimate of the mean of the distribution at playout n would be one with the outcome of playout i weighted by (n0+n)^beta, where beta is a constant at maybe 2 or 4.  In case the distribution does not change, this weighting will only increase the variance by a constant factor of about (beta+1)^2/(2*beta+1), so this isn't a big problem.  To implement this, just store the weighted partial sum along with the playout count; the total weight can be uniquely determined by the playout count.  Merging of other statistics needs some care; perhaps we can just regard the priors as the outcomes of some initial playouts that should no longer matter much after a significant number of playouts are available.
 
-The above change will allow promising follow-up moves to affect the value of their parent nodes after a smaller number of playouts.  However, when these follow-up moves are initially discovered, their playout count may still be too small to affect the parent value much, so the parent node may not receive enough playouts to allow the follow-up move to be played more.  Perhaps we should instead use the expectation of the min/max value of the children as the value of the parent, and use the upper confidence level as the urgency; this computation is straightforward if we approximate all distributions as e.g. Laplacian (a distribution with an easy-to-compute cdf in general, preferably limited in [0,1]) and then compute the cdf of the minimum.  As the pdf of min/max random variables are often highly asymmetric, it may be a good idea to record the 5% and 95% quantiles rather than the means, and regard the distributions as uniform between neighboring quantile lines.  Be careful when pruning turns a non-leaf node back into a leaf node, though.
+The above change will allow promising follow-up moves to affect the value of their parent nodes after a smaller number of playouts.  However, when these follow-up moves are initially discovered, their playout count may still be too small to affect the parent value much, so the parent node may not receive enough playouts to allow the follow-up move to be played more.  Perhaps we should instead use the expectation of the min/max value of the children as the value of the parent, and use the upper confidence level as the urgency; this computation is straightforward if we approximate all distributions as e.g. Laplacian (a distribution with an easy-to-compute cdf in general, preferably limited in [0,1]) and then compute the cdf of the minimum.  As the pdf of min/max random variables are often highly asymmetric, it may be a good idea to record the 5% and 95% quantiles rather than the means, and regard the distributions as uniform between neighboring quantile lines.  When a leaf node is expanded into a non-leaf node, its previous playout results can be turned into another component of the min/max operation.  Also be careful when pruning turns a non-leaf node back into a leaf node.  (FIXME: This cdf method is not very robust, as the result will change suddenly when a new child node with very few playouts is introduced.)  Since only leaf node now contains playout data, in the absence of dynkomi the weighting method in the above paragraph should not be necessary; in the presence of varying dynkomi, the method introduced earlier should help somewhat, and indeed scale_value() might no longer be so important when min/max is used.
+
+Adjust the dynkomi thresholds to obtain the correct aggressiveness.  Currently the green winrate threshold is 0.5, which is low and may be causing the engine to play a bit too aggressively, since only overplays can win.
diff --git a/uct/policy/ucb1amaf.c b/uct/policy/ucb1amaf.c
index e1f6b5c..d4f63e0 100644
--- a/uct/policy/ucb1amaf.c
+++ b/uct/policy/ucb1amaf.c
@@ -153,6 +153,7 @@ ucb1rave_descend(struct uct_policy *p, struct tree *tree, struct uct_descent *de
 	uctd_try_node_children(tree, descent, allow_pass, parity, p->uct->tenuki_d, di, urgency) {
 		struct tree_node *ni = di.node;
 		urgency = ucb1rave_evaluate(p, tree, &di, parity);
+		ni->last_urgency = tree_node_get_value(tree, parity, urgency); // convert it back to a black-oriented value
 
 		if (explore_p > 0) {
 			/* It is probably safe to include passes, although passes will still be generated without the extra urgency. */
diff --git a/uct/tree.c b/uct/tree.c
index ccaf202..17d3281 100644
--- a/uct/tree.c
+++ b/uct/tree.c
@@ -195,8 +195,9 @@ tree_node_dump(struct tree *tree, struct tree_node *node, int l, int thres)
 		children++;
 	/* We use 1 as parity, since for all nodes we want to know the
 	 * win probability of _us_, not the node color. */
-	fprintf(stderr, "[%s] %.3f/%d [prior %.3f/%d amaf %.3f/%d crit %.3f] h=%x c#=%d <%"PRIhash">\n",
+	fprintf(stderr, "[%s] last_urgency=%.3f [u %.3f/%d prior %.3f/%d amaf %.3f/%d crit %.3f] h=%x c#=%d <%"PRIhash">\n",
 		coord2sstr(node->coord, tree->board),
+		tree_node_get_value(tree, 1, node->last_urgency),
 		tree_node_get_value(tree, 1, node->u.value), node->u.playouts,
 		tree_node_get_value(tree, 1, node->prior.value), node->prior.playouts,
 		tree_node_get_value(tree, 1, node->amaf.value), node->amaf.playouts,
diff --git a/uct/tree.h b/uct/tree.h
index fdcb933..8500785 100644
--- a/uct/tree.h
+++ b/uct/tree.h
@@ -90,6 +90,7 @@ struct tree_node {
 	 * of the tree coordinate corresponding to the node */
 	struct move_stats winner_owner; // owner == winner
 	struct move_stats black_owner; // owner == black
+	floating_t last_urgency; // the urgency of this node when last considered in ucb1rave_descend, also in black's perspective; just for debugging
 };
 
 struct tree_hash;
-- 
2.11.4.GIT