From 9287acff08e1e92ef0a84866449d712795c9d7c3 Mon Sep 17 00:00:00 2001 From: r6144 Date: Fri, 15 Jul 2011 16:07:31 +0800 Subject: [PATCH] More notes on the possible min/max method. --- NOTES | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/NOTES b/NOTES index 7dd34ee..3bad032 100644 --- a/NOTES +++ b/NOTES @@ -22,6 +22,6 @@ Generate passes a bit earlier by giving it the appropriate reward or prior value UCT computes the upper confidence interval by assuming that the outcome of each playout is i.i.d., but actually the outcome distribution can change due to the discovery of a new follow-up move, change in dynkomi, etc. If we assume that the distribution changes with probability proportional to 1/(n0+n), where n0 is a constant (maybe 10 or 1000) and n is the playout count at that node, a good (I don't know whether it is optimal) estimate of the mean of the distribution at playout n would be one with the outcome of playout i weighted by (n0+n)^beta, where beta is a constant at maybe 2 or 4. In case the distribution does not change, this weighting will only increase the variance by a constant factor of about (beta+1)^2/(2*beta+1), so this isn't a big problem. To implement this, just store the weighted partial sum along with the playout count; the total weight can be uniquely determined by the playout count. Merging of other statistics needs some care; perhaps we can just regard the priors as the outcomes of some initial playouts that should no longer matter much after a significant number of playouts are available. -The above change will allow promising follow-up moves to affect the value of their parent nodes after a smaller number of playouts. However, when these follow-up moves are initially discovered, their playout count may still be too small to affect the parent value much, so the parent node may not receive enough playouts to allow the follow-up move to be played more. Perhaps we should instead use the expectation of the min/max value of the children as the value of the parent, and use the upper confidence level as the urgency; this computation is straightforward if we approximate all distributions as e.g. Laplacian (a distribution with an easy-to-compute cdf in general, preferably limited in [0,1]) and then compute the cdf of the minimum. As the pdf of min/max random variables are often highly asymmetric, it may be a good idea to record the 5% and 95% quantiles rather than the means, and regard the distributions as uniform between neighboring quantile lines. When a leaf node is expanded into a non-leaf node, its previous playout results can be turned into another component of the min/max operation. Also be careful when pruning turns a non-leaf node back into a leaf node. (FIXME: This cdf method is not very robust, as the result will change suddenly when a new child node with very few playouts is introduced.) Since only leaf node now contains playout data, in the absence of dynkomi the weighting method in the above paragraph should not be necessary; in the presence of varying dynkomi, the method introduced earlier should help somewhat, and indeed scale_value() might no longer be so important when min/max is used. +The above change will allow promising follow-up moves to affect the value of their parent nodes after a smaller number of playouts. However, when these follow-up moves are initially discovered, their playout count may still be too small to affect the parent value much, so the parent node may not receive enough playouts to allow the follow-up move to be played more. Perhaps we should instead use the expectation of the min/max value of the children as the value of the parent, and use the upper confidence level as the urgency; this computation is straightforward if we approximate all distributions as e.g. Laplacian (a distribution with an easy-to-compute cdf in general, preferably limited in [0,1]) and then compute the cdf of the minimum. As the pdf of min/max random variables are often highly asymmetric, it may be a good idea to record the 5% and 95% quantiles rather than the means, and regard the distributions as uniform between neighboring quantile lines. When a leaf node is expanded into a non-leaf node, its previous playout results can be turned into another component of the min/max operation by taking the a-posteriori distribution of the value according to the playout results. Also be careful when pruning turns a non-leaf node back into a leaf node. (NOTE: This cdf method is not very robust, as the result will change suddenly when a new child node with very few playouts is introduced. It may be better to do min/max on the cdf directly; equivalently, given any probability p in [0,1], we choose the arm with the best p-quantile and use this as the p-cdf of the value at the parent node.) Since only leaf node now contains playout data, in the absence of dynkomi the weighting method in the above paragraph should not be necessary; in the presence of varying dynkomi, the method introduced earlier should help somewhat, and indeed scale_value() might no longer be so important when min/max is used. Adjust the dynkomi thresholds to obtain the correct aggressiveness. Currently the green winrate threshold is 0.5, which is low and may be causing the engine to play a bit too aggressively, since only overplays can win. -- 2.11.4.GIT