mfbt/FastBernoulliTrial.h

   1 /* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
   2 /* vim: set ts=8 sts=2 et sw=2 tw=80: */
   3 /* This Source Code Form is subject to the terms of the Mozilla Public
   4  * License, v. 2.0. If a copy of the MPL was not distributed with this
   5  * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
   6
   7 #ifndef mozilla_FastBernoulliTrial_h
   8 #define mozilla_FastBernoulliTrial_h
   9
  10 #include "mozilla/Assertions.h"
  11 #include "mozilla/XorShift128PlusRNG.h"
  12
  13 #include <cmath>
  14 #include <stdint.h>
  15
  16 namespace mozilla {
  17
  18 /**
  19  * class FastBernoulliTrial: Efficient sampling with uniform probability
  20  *
  21  * When gathering statistics about a program's behavior, we may be observing
  22  * events that occur very frequently (e.g., function calls or memory
  23  * allocations) and we may be gathering information that is somewhat expensive
  24  * to produce (e.g., call stacks). Sampling all the events could have a
  25  * significant impact on the program's performance.
  26  *
  27  * Why not just sample every N'th event? This technique is called "systematic
  28  * sampling"; it's simple and efficient, and it's fine if we imagine a
  29  * patternless stream of events. But what if we're sampling allocations, and the
  30  * program happens to have a loop where each iteration does exactly N
  31  * allocations? You would end up sampling the same allocation every time through
  32  * the loop; the entire rest of the loop becomes invisible to your measurements!
  33  * More generally, if each iteration does M allocations, and M and N have any
  34  * common divisor at all, most allocation sites will never be sampled. If
  35  * they're both even, say, the odd-numbered allocations disappear from your
  36  * results.
  37  *
  38  * Ideally, we'd like each event to have some probability P of being sampled,
  39  * independent of its neighbors and of its position in the sequence. This is
  40  * called "Bernoulli sampling", and it doesn't suffer from any of the problems
  41  * mentioned above.
  42  *
  43  * One disadvantage of Bernoulli sampling is that you can't be sure exactly how
  44  * many samples you'll get: technically, it's possible that you might sample
  45  * none of them, or all of them. But if the number of events N is large, these
  46  * aren't likely outcomes; you can generally expect somewhere around P * N
  47  * events to be sampled.
  48  *
  49  * The other disadvantage of Bernoulli sampling is that you have to generate a
  50  * random number for every event, which can be slow.
  51  *
  52  * [significant pause]
  53  *
  54  * BUT NOT WITH THIS CLASS! FastBernoulliTrial lets you do true Bernoulli
  55  * sampling, while generating a fresh random number only when we do decide to
  56  * sample an event, not on every trial. When it decides not to sample, a call to
  57  * |FastBernoulliTrial::trial| is nothing but decrementing a counter and
  58  * comparing it to zero. So the lower your sampling probability is, the less
  59  * overhead FastBernoulliTrial imposes.
  60  *
  61  * Probabilities of 0 and 1 are handled efficiently. (In neither case need we
  62  * ever generate a random number at all.)
  63  *
  64  * The essential API:
  65  *
  66  * - FastBernoulliTrial(double P)
  67  *   Construct an instance that selects events with probability P.
  68  *
  69  * - FastBernoulliTrial::trial()
  70  *   Return true with probability P. Call this each time an event occurs, to
  71  *   decide whether to sample it or not.
  72  *
  73  * - FastBernoulliTrial::trial(size_t n)
  74  *   Equivalent to calling trial() |n| times, and returning true if any of those
  75  *   calls do. However, like trial, this runs in fast constant time.
  76  *
  77  *   What is this good for? In some applications, some events are "bigger" than
  78  *   others. For example, large allocations are more significant than small
  79  *   allocations. Perhaps we'd like to imagine that we're drawing allocations
  80  *   from a stream of bytes, and performing a separate Bernoulli trial on every
  81  *   byte from the stream. We can accomplish this by calling |t.trial(S)| for
  82  *   the number of bytes S, and sampling the event if that returns true.
  83  *
  84  *   Of course, this style of sampling needs to be paired with analysis and
  85  *   presentation that makes the size of the event apparent, lest trials with
  86  *   large values for |n| appear to be indistinguishable from those with small
  87  *   values for |n|.
  88  */
  89 class FastBernoulliTrial {
  90   /*
  91    * This comment should just read, "Generate skip counts with a geometric
  92    * distribution", and leave everyone to go look that up and see why it's the
  93    * right thing to do, if they don't know already.
  94    *
  95    * BUT IF YOU'RE CURIOUS, COMMENTS ARE FREE...
  96    *
  97    * Instead of generating a fresh random number for every trial, we can
  98    * randomly generate a count of how many times we should return false before
  99    * the next time we return true. We call this a "skip count". Once we've
 100    * returned true, we generate a fresh skip count, and begin counting down
 101    * again.
 102    *
 103    * Here's an awesome fact: by exercising a little care in the way we generate
 104    * skip counts, we can produce results indistinguishable from those we would
 105    * get "rolling the dice" afresh for every trial.
 106    *
 107    * In short, skip counts in Bernoulli trials of probability P obey a geometric
 108    * distribution. If a random variable X is uniformly distributed from [0..1),
 109    * then std::floor(std::log(X) / std::log(1-P)) has the appropriate geometric
 110    * distribution for the skip counts.
 111    *
 112    * Why that formula?
 113    *
 114    * Suppose we're to return |true| with some probability P, say, 0.3. Spread
 115    * all possible futures along a line segment of length 1. In portion P of
 116    * those cases, we'll return true on the next call to |trial|; the skip count
 117    * is 0. For the remaining portion 1-P of cases, the skip count is 1 or more.
 118    *
 119    * skip:                0                         1 or more
 120    *             |------------------^-----------------------------------------|
 121    * portion:            0.3                            0.7
 122    *                      P                             1-P
 123    *
 124    * But the "1 or more" section of the line is subdivided the same way: *within
 125    * that section*, in portion P the second call to |trial()| returns true, and
 126    * in portion 1-P it returns false a second time; the skip count is two or
 127    * more. So we return true on the second call in proportion 0.7 * 0.3, and
 128    * skip at least the first two in proportion 0.7 * 0.7.
 129    *
 130    * skip:                0                1              2 or more
 131    *             |------------------^------------^----------------------------|
 132    * portion:            0.3           0.7 * 0.3          0.7 * 0.7
 133    *                      P             (1-P)*P            (1-P)^2
 134    *
 135    * We can continue to subdivide:
 136    *
 137    * skip >= 0:  |------------------------------------------------- (1-P)^0 --|
 138    * skip >= 1:  |                  ------------------------------- (1-P)^1 --|
 139    * skip >= 2:  |                               ------------------ (1-P)^2 --|
 140    * skip >= 3:  |                                 ^     ---------- (1-P)^3 --|
 141    * skip >= 4:  |                                 .            --- (1-P)^4 --|
 142    *                                               .
 143    *                                               ^X, see below
 144    *
 145    * In other words, the likelihood of the next n calls to |trial| returning
 146    * false is (1-P)^n. The longer a run we require, the more the likelihood
 147    * drops. Further calls may return false too, but this is the probability
 148    * we'll skip at least n.
 149    *
 150    * This is interesting, because we can pick a point along this line segment
 151    * and see which skip count's range it falls within; the point X above, for
 152    * example, is within the ">= 2" range, but not within the ">= 3" range, so it
 153    * designates a skip count of 2. So if we pick points on the line at random
 154    * and use the skip counts they fall under, that will be indistinguishable
 155    * from generating a fresh random number between 0 and 1 for each trial and
 156    * comparing it to P.
 157    *
 158    * So to find the skip count for a point X, we must ask: To what whole power
 159    * must we raise 1-P such that we include X, but the next power would exclude
 160    * it? This is exactly std::floor(std::log(X) / std::log(1-P)).
 161    *
 162    * Our algorithm is then, simply: When constructed, compute an initial skip
 163    * count. Return false from |trial| that many times, and then compute a new
 164    * skip count.
 165    *
 166    * For a call to |trial(n)|, if the skip count is greater than n, return false
 167    * and subtract n from the skip count. If the skip count is less than n,
 168    * return true and compute a new skip count. Since each trial is independent,
 169    * it doesn't matter by how much n overshoots the skip count; we can actually
 170    * compute a new skip count at *any* time without affecting the distribution.
 171    * This is really beautiful.
 172    */
 173  public:
 174   /**
 175    * Construct a fast Bernoulli trial generator. Calls to |trial()| return true
 176    * with probability |aProbability|. Use |aState0| and |aState1| to seed the
 177    * random number generator; both may not be zero.
 178    */
 179   FastBernoulliTrial(double aProbability, uint64_t aState0, uint64_t aState1)
 180       : mProbability(0),
 181         mInvLogNotProbability(0),
 182         mGenerator(aState0, aState1),
 183         mSkipCount(0) {
 184     setProbability(aProbability);
 185   }
 186
 187   /**
 188    * Return true with probability |mProbability|. Call this each time an event
 189    * occurs, to decide whether to sample it or not. The lower |mProbability| is,
 190    * the faster this function runs.
 191    */
 192   bool trial() {
 193     if (mSkipCount) {
 194       mSkipCount--;
 195       return false;
 196     }
 197
 198     return chooseSkipCount();
 199   }
 200
 201   /**
 202    * Equivalent to calling trial() |n| times, and returning true if any of those
 203    * calls do. However, like trial, this runs in fast constant time.
 204    *
 205    * What is this good for? In some applications, some events are "bigger" than
 206    * others. For example, large allocations are more significant than small
 207    * allocations. Perhaps we'd like to imagine that we're drawing allocations
 208    * from a stream of bytes, and performing a separate Bernoulli trial on every
 209    * byte from the stream. We can accomplish this by calling |t.trial(S)| for
 210    * the number of bytes S, and sampling the event if that returns true.
 211    *
 212    * Of course, this style of sampling needs to be paired with analysis and
 213    * presentation that makes the "size" of the event apparent, lest trials with
 214    * large values for |n| appear to be indistinguishable from those with small
 215    * values for |n|, despite being potentially much more likely to be sampled.
 216    */
 217   bool trial(size_t aCount) {
 218     if (mSkipCount > aCount) {
 219       mSkipCount -= aCount;
 220       return false;
 221     }
 222
 223     return chooseSkipCount();
 224   }
 225
 226   void setRandomState(uint64_t aState0, uint64_t aState1) {
 227     mGenerator.setState(aState0, aState1);
 228   }
 229
 230   void setProbability(double aProbability) {
 231     MOZ_ASSERT(0 <= aProbability && aProbability <= 1);
 232     mProbability = aProbability;
 233     if (0 < mProbability && mProbability < 1) {
 234       /*
 235        * Let's look carefully at how this calculation plays out in floating-
 236        * point arithmetic. We'll assume IEEE, but the final C++ code we arrive
 237        * at would still be fine if our numbers were mathematically perfect. So,
 238        * while we've considered IEEE's edge cases, we haven't done anything that
 239        * should be actively bad when using other representations.
 240        *
 241        * (In the below, read comparisons as exact mathematical comparisons: when
 242        * we say something "equals 1", that means it's exactly equal to 1. We
 243        * treat approximation using intervals with open boundaries: saying a
 244        * value is in (0,1) doesn't specify how close to 0 or 1 the value gets.
 245        * When we use closed boundaries like [2**-53, 1], we're careful to ensure
 246        * the boundary values are actually representable.)
 247        *
 248        * - After the comparison above, we know mProbability is in (0,1).
 249        *
 250        * - The gaps below 1 are 2**-53, so that interval is (0, 1-2**-53].
 251        *
 252        * - Because the floating-point gaps near 1 are wider than those near
 253        *   zero, there are many small positive doubles ε such that 1-ε rounds to
 254        *   exactly 1. However, 2**-53 can be represented exactly. So
 255        *   1-mProbability is in [2**-53, 1].
 256        *
 257        * - log(1 - mProbability) is thus in (-37, 0].
 258        *
 259        *   That range includes zero, but when we use mInvLogNotProbability, it
 260        *   would be helpful if we could trust that it's negative. So when log(1
 261        *   - mProbability) is 0, we'll just set mProbability to 0, so that
 262        *   mInvLogNotProbability is not used in chooseSkipCount.
 263        *
 264        * - How much of the range of mProbability does this cause us to ignore?
 265        *   The only value for which log returns 0 is exactly 1; the slope of log
 266        *   at 1 is 1, so for small ε such that 1 - ε != 1, log(1 - ε) is -ε,
 267        *   never 0. The gaps near one are larger than the gaps near zero, so if
 268        *   1 - ε wasn't 1, then -ε is representable. So if log(1 - mProbability)
 269        *   isn't 0, then 1 - mProbability isn't 1, which means that mProbability
 270        *   is at least 2**-53, as discussed earlier. This is a sampling
 271        *   likelihood of roughly one in ten trillion, which is unlikely to be
 272        *   distinguishable from zero in practice.
 273        *
 274        *   So by forbidding zero, we've tightened our range to (-37, -2**-53].
 275        *
 276        * - Finally, 1 / log(1 - mProbability) is in [-2**53, -1/37). This all
 277        *   falls readily within the range of an IEEE double.
 278        *
 279        * ALL THAT HAVING BEEN SAID: here are the five lines of actual code:
 280        */
 281       double logNotProbability = std::log(1 - mProbability);
 282       if (logNotProbability == 0.0)
 283         mProbability = 0.0;
 284       else
 285         mInvLogNotProbability = 1 / logNotProbability;
 286     }
 287
 288     chooseSkipCount();
 289   }
 290
 291  private:
 292   /* The likelihood that any given call to |trial| should return true. */
 293   double mProbability;
 294
 295   /*
 296    * The value of 1/std::log(1 - mProbability), cached for repeated use.
 297    *
 298    * If mProbability is exactly 0 or exactly 1, we don't use this value.
 299    * Otherwise, we guarantee this value is in the range [-2**53, -1/37), i.e.
 300    * definitely negative, as required by chooseSkipCount. See setProbability for
 301    * the details.
 302    */
 303   double mInvLogNotProbability;
 304
 305   /* Our random number generator. */
 306   non_crypto::XorShift128PlusRNG mGenerator;
 307
 308   /* The number of times |trial| should return false before next returning true.
 309    */
 310   size_t mSkipCount;
 311
 312   /*
 313    * Choose the next skip count. This also returns the value that |trial| should
 314    * return, since we have to check for the extreme values for mProbability
 315    * anyway, and |trial| should never return true at all when mProbability is 0.
 316    */
 317   bool chooseSkipCount() {
 318     /*
 319      * If the probability is 1.0, every call to |trial| returns true. Make sure
 320      * mSkipCount is 0.
 321      */
 322     if (mProbability == 1.0) {
 323       mSkipCount = 0;
 324       return true;
 325     }
 326
 327     /*
 328      * If the probabilility is zero, |trial| never returns true. Don't bother us
 329      * for a while.
 330      */
 331     if (mProbability == 0.0) {
 332       mSkipCount = SIZE_MAX;
 333       return false;
 334     }
 335
 336     /*
 337      * What sorts of values can this call to std::floor produce?
 338      *
 339      * Since mGenerator.nextDouble returns a value in [0, 1-2**-53], std::log
 340      * returns a value in the range [-infinity, -2**-53], all negative. Since
 341      * mInvLogNotProbability is negative (see its comments), the product is
 342      * positive and possibly infinite. std::floor returns +infinity unchanged.
 343      * So the result will always be positive.
 344      *
 345      * Converting a double to an integer that is out of range for that integer
 346      * is undefined behavior, so we must clamp our result to SIZE_MAX, to ensure
 347      * we get an acceptable value for mSkipCount.
 348      *
 349      * The clamp is written carefully. Note that if we had said:
 350      *
 351      *    if (skipCount > double(SIZE_MAX))
 352      *       mSkipCount = SIZE_MAX;
 353      *    else
 354      *       mSkipCount = skipCount;
 355      *
 356      * that leads to undefined behavior 64-bit machines: SIZE_MAX coerced to
 357      * double can equal 2^64, so if skipCount equaled 2^64 converting it to
 358      * size_t would induce undefined behavior.
 359      *
 360      * Jakob Olesen cleverly suggested flipping the sense of the comparison to
 361      * skipCount < double(SIZE_MAX).  The conversion will evaluate to 2^64 or
 362      * the double just below it: either way, skipCount is guaranteed to have a
 363      * value that's safely convertible to size_t.
 364      *
 365      * (On 32-bit machines, all size_t values can be represented exactly in
 366      * double, so all is well.)
 367      */
 368     double skipCount =
 369         std::floor(std::log(mGenerator.nextDouble()) * mInvLogNotProbability);
 370     if (skipCount < double(SIZE_MAX))
 371       mSkipCount = skipCount;
 372     else
 373       mSkipCount = SIZE_MAX;
 374
 375     return true;
 376   }
 377 };
 378
 379 } /* namespace mozilla */
 380
 381 #endif /* mozilla_FastBernoulliTrial_h */