Update link to Oleg and Teodor's GIN page.
[PostgreSQL.git] / doc / src / sgml / gin.sgml
blob1c5841a2f2b46b8cff9be7f2b83e521349abdcdf
1 <!-- $PostgreSQL$ -->
3 <chapter id="GIN">
4 <title>GIN Indexes</title>
6 <indexterm>
7 <primary>index</primary>
8 <secondary>GIN</secondary>
9 </indexterm>
11 <sect1 id="gin-intro">
12 <title>Introduction</title>
14 <para>
15 <acronym>GIN</acronym> stands for Generalized Inverted Index. It is
16 an index structure storing a set of (key, posting list) pairs, where
17 a <quote>posting list</> is a set of rows in which the key occurs. Each
18 indexed value can contain many keys, so the same row ID can appear in
19 multiple posting lists.
20 </para>
22 <para>
23 It is generalized in the sense that a <acronym>GIN</acronym> index
24 does not need to be aware of the operation that it accelerates.
25 Instead, it uses custom strategies defined for particular data types.
26 </para>
28 <para>
29 One advantage of <acronym>GIN</acronym> is that it allows the development
30 of custom data types with the appropriate access methods, by
31 an expert in the domain of the data type, rather than a database expert.
32 This is much the same advantage as using <acronym>GiST</acronym>.
33 </para>
35 <para>
36 The <acronym>GIN</acronym>
37 implementation in <productname>PostgreSQL</productname> is primarily
38 maintained by Teodor Sigaev and Oleg Bartunov. There is more
39 information about <acronym>GIN</acronym> on their
40 <ulink url="http://www.sai.msu.su/~megera/wiki/Gin">website</ulink>.
41 </para>
42 </sect1>
44 <sect1 id="gin-extensibility">
45 <title>Extensibility</title>
47 <para>
48 The <acronym>GIN</acronym> interface has a high level of abstraction,
49 requiring the access method implementer only to implement the semantics of
50 the data type being accessed. The <acronym>GIN</acronym> layer itself
51 takes care of concurrency, logging and searching the tree structure.
52 </para>
54 <para>
55 All it takes to get a <acronym>GIN</acronym> access method working is to
56 implement four (or five) user-defined methods, which define the behavior of
57 keys in the tree and the relationships between keys, indexed values,
58 and indexable queries. In short, <acronym>GIN</acronym> combines
59 extensibility with generality, code reuse, and a clean interface.
60 </para>
62 <para>
63 The four methods that an operator class for
64 <acronym>GIN</acronym> must provide are:
65 </para>
67 <variablelist>
68 <varlistentry>
69 <term>int compare(Datum a, Datum b)</term>
70 <listitem>
71 <para>
72 Compares keys (not indexed values!) and returns an integer less than
73 zero, zero, or greater than zero, indicating whether the first key is
74 less than, equal to, or greater than the second.
75 </para>
76 </listitem>
77 </varlistentry>
79 <varlistentry>
80 <term>Datum *extractValue(Datum inputValue, int32 *nkeys)</term>
81 <listitem>
82 <para>
83 Returns an array of keys given a value to be indexed. The
84 number of returned keys must be stored into <literal>*nkeys</>.
85 </para>
86 </listitem>
87 </varlistentry>
89 <varlistentry>
90 <term>Datum *extractQuery(Datum query, int32 *nkeys,
91 StrategyNumber n, bool **pmatch)</term>
92 <listitem>
93 <para>
94 Returns an array of keys given a value to be queried; that is,
95 <literal>query</> is the value on the right-hand side of an
96 indexable operator whose left-hand side is the indexed column.
97 <literal>n</> is the strategy number of the operator within the
98 operator class (see <xref linkend="xindex-strategies">).
99 Often, <function>extractQuery</> will need
100 to consult <literal>n</> to determine the data type of
101 <literal>query</> and the key values that need to be extracted.
102 The number of returned keys must be stored into <literal>*nkeys</>.
103 If the query contains no keys then <function>extractQuery</>
104 should store 0 or -1 into <literal>*nkeys</>, depending on the
105 semantics of the operator. 0 means that every
106 value matches the <literal>query</> and a sequential scan should be
107 produced. -1 means nothing can match the <literal>query</>.
108 <literal>pmatch</> is an output argument for use when partial match
109 is supported. To use it, <function>extractQuery</> must allocate
110 an array of <literal>*nkeys</> booleans and store its address at
111 <literal>*pmatch</>. Each element of the array should be set to TRUE
112 if the corresponding key requires partial match, FALSE if not.
113 If <literal>*pmatch</> is set to NULL then GIN assumes partial match
114 is not required. The variable is initialized to NULL before call,
115 so this argument can simply be ignored by operator classes that do
116 not support partial match.
117 </para>
119 </listitem>
120 </varlistentry>
122 <varlistentry>
123 <term>bool consistent(bool check[], StrategyNumber n, Datum query, bool *recheck)</term>
124 <listitem>
125 <para>
126 Returns TRUE if the indexed value satisfies the query operator with
127 strategy number <literal>n</> (or might satisfy, if the recheck
128 indication is returned). The <literal>check</> array has
129 the same length as the number of keys previously returned by
130 <function>extractQuery</> for this query. Each element of the
131 <literal>check</> array is TRUE if the indexed value contains the
132 corresponding query key, ie, if (check[i] == TRUE) the i-th key of the
133 <function>extractQuery</> result array is present in the indexed value.
134 The original <literal>query</> datum (not the extracted key array!) is
135 passed in case the <function>consistent</> method needs to consult it.
136 On success, <literal>*recheck</> should be set to TRUE if the heap
137 tuple needs to be rechecked against the query operator, or FALSE if
138 the index test is exact.
139 </para>
140 </listitem>
141 </varlistentry>
143 </variablelist>
145 <para>
146 Optionally, an operator class for
147 <acronym>GIN</acronym> can supply a fifth method:
148 </para>
150 <variablelist>
152 <varlistentry>
153 <term>int comparePartial(Datum partial_key, Datum key, StrategyNumber n)</term>
154 <listitem>
155 <para>
156 Compare a partial-match query to an index key. Returns an integer
157 whose sign indicates the result: less than zero means the index key
158 does not match the query, but the index scan should continue; zero
159 means that the index key does match the query; greater than zero
160 indicates that the index scan should stop because no more matches
161 are possible. The strategy number <literal>n</> of the operator
162 that generated the partial match query is provided, in case its
163 semantics are needed to determine when to end the scan.
164 </para>
165 </listitem>
166 </varlistentry>
168 </variablelist>
170 <para>
171 To support <quote>partial match</> queries, an operator class must
172 provide the <function>comparePartial</> method, and its
173 <function>extractQuery</> method must set the <literal>pmatch</>
174 parameter when a partial-match query is encountered. See
175 <xref linkend="gin-partial-match"> for details.
176 </para>
178 </sect1>
180 <sect1 id="gin-implementation">
181 <title>Implementation</title>
183 <para>
184 Internally, a <acronym>GIN</acronym> index contains a B-tree index
185 constructed over keys, where each key is an element of the indexed value
186 (a member of an array, for example) and where each tuple in a leaf page is
187 either a pointer to a B-tree over heap pointers (PT, posting tree), or a
188 list of heap pointers (PL, posting list) if the list is small enough.
189 </para>
191 <sect2 id="gin-partial-match">
192 <title>Partial match algorithm</title>
194 <para>
195 GIN can support <quote>partial match</> queries, in which the query
196 does not determine an exact match for one or more keys, but the possible
197 matches fall within a reasonably narrow range of key values (within the
198 key sorting order determined by the <function>compare</> support method).
199 The <function>extractQuery</> method, instead of returning a key value
200 to be matched exactly, returns a key value that is the lower bound of
201 the range to be searched, and sets the <literal>pmatch</> flag true.
202 The key range is then searched using the <function>comparePartial</>
203 method. <function>comparePartial</> must return zero for an actual
204 match, less than zero for a non-match that is still within the range
205 to be searched, or greater than zero if the index key is past the range
206 that could match.
207 </para>
209 <para>
210 During a partial-match scan, all <literal>itemPointer</>s for matching keys
211 are OR'ed into a <literal>TIDBitmap</>.
212 The scan fails if the <literal>TIDBitmap</> becomes lossy.
213 In this case an error message will be reported with advice
214 to increase <literal>work_mem</>.
215 </para>
216 </sect2>
218 </sect1>
220 <sect1 id="gin-tips">
221 <title>GIN tips and tricks</title>
223 <variablelist>
224 <varlistentry>
225 <term>Create vs insert</term>
226 <listitem>
227 <para>
228 In most cases, insertion into a <acronym>GIN</acronym> index is slow
229 due to the likelihood of many keys being inserted for each value.
230 So, for bulk insertions into a table it is advisable to drop the GIN
231 index and recreate it after finishing bulk insertion.
232 </para>
233 </listitem>
234 </varlistentry>
236 <varlistentry>
237 <term><xref linkend="guc-maintenance-work-mem"></term>
238 <listitem>
239 <para>
240 Build time for a <acronym>GIN</acronym> index is very sensitive to
241 the <varname>maintenance_work_mem</> setting; it doesn't pay to
242 skimp on work memory during index creation.
243 </para>
244 </listitem>
245 </varlistentry>
247 <varlistentry>
248 <term><xref linkend="guc-gin-fuzzy-search-limit"></term>
249 <listitem>
250 <para>
251 The primary goal of developing <acronym>GIN</acronym> indexes was
252 to create support for highly scalable, full-text search in
253 <productname>PostgreSQL</productname>, and there are often situations when
254 a full-text search returns a very large set of results. Moreover, this
255 often happens when the query contains very frequent words, so that the
256 large result set is not even useful. Since reading many
257 tuples from the disk and sorting them could take a lot of time, this is
258 unacceptable for production. (Note that the index search itself is very
259 fast.)
260 </para>
261 <para>
262 To facilitate controlled execution of such queries
263 <acronym>GIN</acronym> has a configurable soft upper limit on the
264 number of rows returned, the
265 <varname>gin_fuzzy_search_limit</varname> configuration parameter.
266 It is set to 0 (meaning no limit) by default.
267 If a non-zero limit is set, then the returned set is a subset of
268 the whole result set, chosen at random.
269 </para>
270 <para>
271 <quote>Soft</quote> means that the actual number of returned results
272 could differ slightly from the specified limit, depending on the query
273 and the quality of the system's random number generator.
274 </para>
275 </listitem>
276 </varlistentry>
277 </variablelist>
279 </sect1>
281 <sect1 id="gin-limit">
282 <title>Limitations</title>
284 <para>
285 <acronym>GIN</acronym> doesn't support full index scans: because there are
286 often many keys per value, each heap pointer would be returned many times,
287 and there is no easy way to prevent this.
288 </para>
290 <para>
291 When <function>extractQuery</function> returns zero keys,
292 <acronym>GIN</acronym> will emit an error. Depending on the operator,
293 a void query might match all, some, or none of the indexed values (for
294 example, every array contains the empty array, but does not overlap the
295 empty array), and <acronym>GIN</acronym> cannot determine the correct
296 answer, nor produce a full-index-scan result if it could determine that
297 that was correct.
298 </para>
300 <para>
301 It is not an error for <function>extractValue</> to return zero keys,
302 but in this case the indexed value will be unrepresented in the index.
303 This is another reason why full index scan is not useful &mdash; it would
304 miss such rows.
305 </para>
307 <para>
308 It is possible for an operator class to circumvent the restriction against
309 full index scan. To do that, <function>extractValue</> must return at least
310 one (possibly dummy) key for every indexed value, and
311 <function>extractQuery</function> must convert an unrestricted search into
312 a partial-match query that will scan the whole index. This is inefficient
313 but might be necessary to avoid corner-case failures with operators such
314 as LIKE. Note however that failure could still occur if the intermediate
315 <literal>TIDBitmap</> becomes lossy.
316 </para>
317 </sect1>
319 <sect1 id="gin-examples">
320 <title>Examples</title>
322 <para>
323 The <productname>PostgreSQL</productname> source distribution includes
324 <acronym>GIN</acronym> operator classes for <type>tsvector</> and
325 for one-dimensional arrays of all internal types. Prefix searching in
326 <type>tsvector</> is implemented using the <acronym>GIN</> partial match
327 feature.
328 The following <filename>contrib</> modules also contain
329 <acronym>GIN</acronym> operator classes:
330 </para>
332 <variablelist>
333 <varlistentry>
334 <term>hstore</term>
335 <listitem>
336 <para>Module for storing (key, value) pairs</para>
337 </listitem>
338 </varlistentry>
340 <varlistentry>
341 <term>intarray</term>
342 <listitem>
343 <para>Enhanced support for int4[]</para>
344 </listitem>
345 </varlistentry>
347 <varlistentry>
348 <term>pg_trgm</term>
349 <listitem>
350 <para>Text similarity using trigram matching</para>
351 </listitem>
352 </varlistentry>
353 </variablelist>
354 </sect1>
356 </chapter>