src/backend/statistics/README.mcv

   1 MCV lists
   2 =========
   3
   4 Multivariate MCV (most-common values) lists are a straightforward extension of
   5 regular MCV lists, tracking most frequent combinations of values for a group of
   6 attributes.
   7
   8 This works particularly well for columns with a small number of distinct values,
   9 as the list may include all the combinations and approximate the distribution
  10 very accurately.
  11
  12 For columns with a large number of distinct values (e.g. those with continuous
  13 domains), the list will only track the most frequent combinations. If the
  14 distribution is mostly uniform (all combinations about equally frequent), the
  15 MCV list will be empty.
  16
  17 Estimates of some clauses (e.g. equality) based on MCV lists are more accurate
  18 than when using histograms.
  19
  20 Also, MCV lists don't necessarily require sorting of the values (the fact that
  21 we use sorting when building them is an implementation detail), but even more
  22 importantly the ordering is not built into the approximation (while histograms
  23 are built on ordering). So MCV lists work well even for attributes where the
  24 ordering of the data type is disconnected from the meaning of the data. For
  25 example we know how to sort strings, but it's unlikely to make much sense for
  26 city names (or other label-like attributes).
  27
  28
  29 Selectivity estimation
  30 ----------------------
  31
  32 The estimation, implemented in mcv_clauselist_selectivity(), is quite simple
  33 in principle - we need to identify MCV items matching all the clauses and sum
  34 frequencies of all those items.
  35
  36 Currently MCV lists support estimation of the following clause types:
  37
  38     (a) equality clauses      WHERE (a = 1) AND (b = 2)
  39     (b) inequality clauses    WHERE (a < 1) AND (b >= 2)
  40     (c) NULL clauses          WHERE (a IS NULL) AND (b IS NOT NULL)
  41     (d) OR clauses            WHERE (a < 1) OR (b >= 2)
  42
  43 It's possible to add support for additional clauses, for example:
  44
  45     (e) multi-var clauses     WHERE (a > b)
  46
  47 and possibly others. These are tasks for the future, not yet implemented.
  48
  49
  50 Hashed MCV (not yet implemented)
  51 --------------------------------
  52
  53 Regular MCV lists have to include actual values for each item, so if those items
  54 are large the list may be quite large. This is especially true for multivariate
  55 MCV lists, although the current implementation partially mitigates this by
  56 de-duplicating the values before storing them on disk.
  57
  58 It's possible to only store hashes (32-bit values) instead of the actual values,
  59 significantly reducing the space requirements. Obviously, this would only make
  60 the MCV lists useful for estimating equality conditions (assuming the 32-bit
  61 hashes make the collisions rare enough).
  62
  63 This might also complicate matching the columns to available stats.
  64
  65
  66 TODO Consider implementing hashed MCV list, storing just 32-bit hashes instead
  67      of the actual values. This type of MCV list will be useful only for
  68      estimating equality clauses, and will reduce space requirements for large
  69      varlena types (in such cases we usually only want equality anyway).
  70
  71
  72 Inspecting the MCV list
  73 -----------------------
  74
  75 Inspecting the regular (per-attribute) MCV lists is trivial, as it's enough
  76 to select the columns from pg_stats. The data is encoded as anyarrays, and
  77 all the items have the same data type, so anyarray provides a simple way to
  78 get a text representation.
  79
  80 With multivariate MCV lists, the columns may use different data types, making
  81 it impossible to use anyarrays. It might be possible to produce a similar
  82 array-like representation, but that would complicate further processing and
  83 analysis of the MCV list.
  84
  85 So instead the MCV lists are stored in a custom data type (pg_mcv_list),
  86 which however makes it more difficult to inspect the contents. To make that
  87 easier, there's a SRF returning detailed information about the MCV lists.
  88
  89     SELECT m.* FROM pg_statistic_ext s,
  90                     pg_statistic_ext_data d,
  91                     pg_mcv_list_items(stxdmcv) m
  92               WHERE s.stxname = 'stts2'
  93                 AND d.stxoid = s.oid;
  94
  95 It accepts one parameter - a pg_mcv_list value (which can only be obtained
  96 from pg_statistic_ext_data catalog, to defend against malicious input), and
  97 returns these columns:
  98
  99     - item index (0, ..., (nitems-1))
 100     - values (string array)
 101     - nulls only (boolean array)
 102     - frequency (double precision)
 103     - base_frequency (double precision)