doc/src/sgml/gin.sgml

   1 <!-- $PostgreSQL$ -->
   2
   3 <chapter id="GIN">
   4 <title>GIN Indexes</title>
   5
   6    <indexterm>
   7     <primary>index</primary>
   8     <secondary>GIN</secondary>
   9    </indexterm>
  10
  11 <sect1 id="gin-intro">
  12  <title>Introduction</title>
  13
  14  <para>
  15    <acronym>GIN</acronym> stands for Generalized Inverted Index.  It is
  16    an index structure storing a set of (key, posting list) pairs, where
  17    a <quote>posting list</> is a set of rows in which the key occurs. Each
  18    indexed value can contain many keys, so the same row ID can appear in
  19    multiple posting lists.
  20  </para>
  21
  22  <para>
  23    It is generalized in the sense that a <acronym>GIN</acronym> index
  24    does not need to be aware of the operation that it accelerates.
  25    Instead, it uses custom strategies defined for particular data types.
  26  </para>
  27
  28  <para>
  29   One advantage of <acronym>GIN</acronym> is that it allows the development
  30   of custom data types with the appropriate access methods, by
  31   an expert in the domain of the data type, rather than a database expert.
  32   This is much the same advantage as using <acronym>GiST</acronym>.
  33  </para>
  34
  35  <para>
  36   The <acronym>GIN</acronym>
  37   implementation in <productname>PostgreSQL</productname> is primarily
  38   maintained by Teodor Sigaev and Oleg Bartunov. There is more
  39   information about <acronym>GIN</acronym> on their
  40   <ulink url="http://www.sai.msu.su/~megera/wiki/Gin">website</ulink>.
  41  </para>
  42 </sect1>
  43
  44 <sect1 id="gin-extensibility">
  45  <title>Extensibility</title>
  46
  47  <para>
  48    The <acronym>GIN</acronym> interface has a high level of abstraction,
  49    requiring the access method implementer only to implement the semantics of
  50    the data type being accessed.  The <acronym>GIN</acronym> layer itself
  51    takes care of concurrency, logging and searching the tree structure.
  52  </para>
  53
  54  <para>
  55    All it takes to get a <acronym>GIN</acronym> access method working is to
  56    implement four (or five) user-defined methods, which define the behavior of
  57    keys in the tree and the relationships between keys, indexed values,
  58    and indexable queries. In short, <acronym>GIN</acronym> combines
  59    extensibility with generality, code reuse, and a clean interface.
  60  </para>
  61
  62  <para>
  63    The four methods that an operator class for
  64    <acronym>GIN</acronym> must provide are:
  65  </para>
  66
  67  <variablelist>
  68     <varlistentry>
  69      <term>int compare(Datum a, Datum b)</term>
  70      <listitem>
  71       <para>
  72        Compares keys (not indexed values!) and returns an integer less than
  73        zero, zero, or greater than zero, indicating whether the first key is
  74        less than, equal to, or greater than the second.
  75       </para>
  76      </listitem>
  77     </varlistentry>
  78
  79     <varlistentry>
  80      <term>Datum *extractValue(Datum inputValue, int32 *nkeys)</term>
  81      <listitem>
  82       <para>
  83        Returns an array of keys given a value to be indexed.  The
  84        number of returned keys must be stored into <literal>*nkeys</>.
  85       </para>
  86      </listitem>
  87     </varlistentry>
  88
  89     <varlistentry>
  90      <term>Datum *extractQuery(Datum query, int32 *nkeys,
  91         StrategyNumber n, bool **pmatch)</term>
  92      <listitem>
  93       <para>
  94        Returns an array of keys given a value to be queried; that is,
  95        <literal>query</> is the value on the right-hand side of an
  96        indexable operator whose left-hand side is the indexed column.
  97        <literal>n</> is the strategy number of the operator within the
  98        operator class (see <xref linkend="xindex-strategies">).
  99        Often, <function>extractQuery</> will need
 100        to consult <literal>n</> to determine the data type of
 101        <literal>query</> and the key values that need to be extracted.
 102        The number of returned keys must be stored into <literal>*nkeys</>.
 103        If the query contains no keys then <function>extractQuery</>
 104        should store 0 or -1 into <literal>*nkeys</>, depending on the
 105        semantics of the operator.  0 means that every
 106        value matches the <literal>query</> and a sequential scan should be
 107        produced.  -1 means nothing can match the <literal>query</>.
 108        <literal>pmatch</> is an output argument for use when partial match
 109        is supported.  To use it, <function>extractQuery</> must allocate
 110        an array of <literal>*nkeys</> booleans and store its address at
 111        <literal>*pmatch</>.  Each element of the array should be set to TRUE
 112        if the corresponding key requires partial match, FALSE if not.
 113        If <literal>*pmatch</> is set to NULL then GIN assumes partial match
 114        is not required.  The variable is initialized to NULL before call,
 115        so this argument can simply be ignored by operator classes that do
 116        not support partial match.
 117       </para>
 118
 119      </listitem>
 120     </varlistentry>
 121
 122     <varlistentry>
 123      <term>bool consistent(bool check[], StrategyNumber n, Datum query, bool *recheck)</term>
 124      <listitem>
 125       <para>
 126        Returns TRUE if the indexed value satisfies the query operator with
 127        strategy number <literal>n</> (or might satisfy, if the recheck
 128        indication is returned).  The <literal>check</> array has
 129        the same length as the number of keys previously returned by
 130        <function>extractQuery</> for this query.  Each element of the
 131        <literal>check</> array is TRUE if the indexed value contains the
 132        corresponding query key, ie, if (check[i] == TRUE) the i-th key of the
 133        <function>extractQuery</> result array is present in the indexed value.
 134        The original <literal>query</> datum (not the extracted key array!) is
 135        passed in case the <function>consistent</> method needs to consult it.
 136        On success, <literal>*recheck</> should be set to TRUE if the heap
 137        tuple needs to be rechecked against the query operator, or FALSE if
 138        the index test is exact.
 139       </para>
 140      </listitem>
 141     </varlistentry>
 142
 143   </variablelist>
 144
 145  <para>
 146   Optionally, an operator class for
 147   <acronym>GIN</acronym> can supply a fifth method:
 148  </para>
 149
 150   <variablelist>
 151
 152     <varlistentry>
 153      <term>int comparePartial(Datum partial_key, Datum key, StrategyNumber n)</term>
 154      <listitem>
 155       <para>
 156        Compare a partial-match query to an index key.  Returns an integer
 157        whose sign indicates the result: less than zero means the index key
 158        does not match the query, but the index scan should continue; zero
 159        means that the index key does match the query; greater than zero
 160        indicates that the index scan should stop because no more matches
 161        are possible.  The strategy number <literal>n</> of the operator
 162        that generated the partial match query is provided, in case its
 163        semantics are needed to determine when to end the scan.
 164       </para>
 165      </listitem>
 166     </varlistentry>
 167
 168   </variablelist>
 169
 170  <para>
 171   To support <quote>partial match</> queries, an operator class must
 172   provide the <function>comparePartial</> method, and its
 173   <function>extractQuery</> method must set the <literal>pmatch</>
 174   parameter when a partial-match query is encountered.  See
 175   <xref linkend="gin-partial-match"> for details.
 176  </para>
 177
 178 </sect1>
 179
 180 <sect1 id="gin-implementation">
 181  <title>Implementation</title>
 182
 183  <para>
 184   Internally, a <acronym>GIN</acronym> index contains a B-tree index
 185   constructed over keys, where each key is an element of the indexed value
 186   (a member of an array, for example) and where each tuple in a leaf page is
 187   either a pointer to a B-tree over heap pointers (PT, posting tree), or a
 188   list of heap pointers (PL, posting list) if the list is small enough.
 189  </para>
 190
 191  <sect2 id="gin-partial-match">
 192   <title>Partial match algorithm</title>
 193
 194   <para>
 195    GIN can support <quote>partial match</> queries, in which the query
 196    does not determine an exact match for one or more keys, but the possible
 197    matches fall within a reasonably narrow range of key values (within the
 198    key sorting order determined by the <function>compare</> support method).
 199    The <function>extractQuery</> method, instead of returning a key value
 200    to be matched exactly, returns a key value that is the lower bound of
 201    the range to be searched, and sets the <literal>pmatch</> flag true.
 202    The key range is then searched using the <function>comparePartial</>
 203    method.  <function>comparePartial</> must return zero for an actual
 204    match, less than zero for a non-match that is still within the range
 205    to be searched, or greater than zero if the index key is past the range
 206    that could match.
 207   </para>
 208
 209   <para>
 210    During a partial-match scan, all <literal>itemPointer</>s for matching keys
 211    are OR'ed into a <literal>TIDBitmap</>.
 212    The scan fails if the <literal>TIDBitmap</> becomes lossy.
 213    In this case an error message will be reported with advice
 214    to increase <literal>work_mem</>.
 215   </para>
 216  </sect2>
 217
 218 </sect1>
 219
 220 <sect1 id="gin-tips">
 221 <title>GIN tips and tricks</title>
 222
 223  <variablelist>
 224   <varlistentry>
 225    <term>Create vs insert</term>
 226    <listitem>
 227     <para>
 228      In most cases, insertion into a <acronym>GIN</acronym> index is slow
 229      due to the likelihood of many keys being inserted for each value.
 230      So, for bulk insertions into a table it is advisable to drop the GIN
 231      index and recreate it after finishing bulk insertion.
 232     </para>
 233    </listitem>
 234   </varlistentry>
 235
 236   <varlistentry>
 237    <term><xref linkend="guc-maintenance-work-mem"></term>
 238    <listitem>
 239     <para>
 240      Build time for a <acronym>GIN</acronym> index is very sensitive to
 241      the <varname>maintenance_work_mem</> setting; it doesn't pay to
 242      skimp on work memory during index creation.
 243     </para>
 244    </listitem>
 245   </varlistentry>
 246
 247   <varlistentry>
 248    <term><xref linkend="guc-gin-fuzzy-search-limit"></term>
 249    <listitem>
 250     <para>
 251      The primary goal of developing <acronym>GIN</acronym> indexes was
 252      to create support for highly scalable, full-text search in
 253      <productname>PostgreSQL</productname>, and there are often situations when
 254      a full-text search returns a very large set of results.  Moreover, this
 255      often happens when the query contains very frequent words, so that the
 256      large result set is not even useful.  Since reading many
 257      tuples from the disk and sorting them could take a lot of time, this is
 258      unacceptable for production.  (Note that the index search itself is very
 259      fast.)
 260     </para>
 261     <para>
 262      To facilitate controlled execution of such queries
 263      <acronym>GIN</acronym> has a configurable soft upper limit on the
 264      number of rows returned, the
 265      <varname>gin_fuzzy_search_limit</varname> configuration parameter.
 266      It is set to 0 (meaning no limit) by default.
 267      If a non-zero limit is set, then the returned set is a subset of
 268      the whole result set, chosen at random.
 269     </para>
 270     <para>
 271      <quote>Soft</quote> means that the actual number of returned results
 272      could differ slightly from the specified limit, depending on the query
 273      and the quality of the system's random number generator.
 274     </para>
 275    </listitem>
 276   </varlistentry>
 277  </variablelist>
 278
 279 </sect1>
 280
 281 <sect1 id="gin-limit">
 282  <title>Limitations</title>
 283
 284  <para>
 285   <acronym>GIN</acronym> doesn't support full index scans: because there are
 286   often many keys per value, each heap pointer would be returned many times,
 287   and there is no easy way to prevent this.
 288  </para>
 289
 290  <para>
 291   When <function>extractQuery</function> returns zero keys,
 292   <acronym>GIN</acronym> will emit an error.  Depending on the operator,
 293   a void query might match all, some, or none of the indexed values (for
 294   example, every array contains the empty array, but does not overlap the
 295   empty array), and <acronym>GIN</acronym> cannot determine the correct
 296   answer, nor produce a full-index-scan result if it could determine that
 297   that was correct.
 298  </para>
 299
 300  <para>
 301   It is not an error for <function>extractValue</> to return zero keys,
 302   but in this case the indexed value will be unrepresented in the index.
 303   This is another reason why full index scan is not useful &mdash; it would
 304   miss such rows.
 305  </para>
 306
 307  <para>
 308   It is possible for an operator class to circumvent the restriction against
 309   full index scan.  To do that, <function>extractValue</> must return at least
 310   one (possibly dummy) key for every indexed value, and
 311   <function>extractQuery</function> must convert an unrestricted search into
 312   a partial-match query that will scan the whole index.  This is inefficient
 313   but might be necessary to avoid corner-case failures with operators such
 314   as LIKE.  Note however that failure could still occur if the intermediate
 315   <literal>TIDBitmap</> becomes lossy.
 316  </para>
 317 </sect1>
 318
 319 <sect1 id="gin-examples">
 320  <title>Examples</title>
 321
 322  <para>
 323   The <productname>PostgreSQL</productname> source distribution includes
 324   <acronym>GIN</acronym> operator classes for <type>tsvector</> and
 325   for one-dimensional arrays of all internal types.  Prefix searching in
 326   <type>tsvector</> is implemented using the <acronym>GIN</> partial match
 327   feature.
 328   The following <filename>contrib</> modules also contain
 329   <acronym>GIN</acronym> operator classes:
 330  </para>
 331
 332  <variablelist>
 333   <varlistentry>
 334    <term>hstore</term>
 335    <listitem>
 336     <para>Module for storing (key, value) pairs</para>
 337    </listitem>
 338   </varlistentry>
 339
 340   <varlistentry>
 341    <term>intarray</term>
 342    <listitem>
 343     <para>Enhanced support for int4[]</para>
 344    </listitem>
 345   </varlistentry>
 346
 347   <varlistentry>
 348    <term>pg_trgm</term>
 349    <listitem>
 350     <para>Text similarity using trigram matching</para>
 351    </listitem>
 352   </varlistentry>
 353  </variablelist>
 354 </sect1>
 355
 356 </chapter>