Fix spelling error in docs.
[PostgreSQL.git] / doc / src / sgml / gist.sgml
blobd577523be14264c004dee4193c482b0fc4b3a04b
1 <!-- $PostgreSQL$ -->
3 <chapter id="GiST">
4 <title>GiST Indexes</title>
6 <indexterm>
7 <primary>index</primary>
8 <secondary>GiST</secondary>
9 </indexterm>
11 <sect1 id="gist-intro">
12 <title>Introduction</title>
14 <para>
15 <acronym>GiST</acronym> stands for Generalized Search Tree. It is a
16 balanced, tree-structured access method, that acts as a base template in
17 which to implement arbitrary indexing schemes. B-trees, R-trees and many
18 other indexing schemes can be implemented in <acronym>GiST</acronym>.
19 </para>
21 <para>
22 One advantage of <acronym>GiST</acronym> is that it allows the development
23 of custom data types with the appropriate access methods, by
24 an expert in the domain of the data type, rather than a database expert.
25 </para>
27 <para>
28 Some of the information here is derived from the University of California at
29 Berkeley's GiST Indexing Project
30 <ulink url="http://gist.cs.berkeley.edu/">web site</ulink> and
31 <ulink url="http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz">
32 Marcel Kornacker's thesis, Access Methods for Next-Generation Database Systems</ulink>.
33 The <acronym>GiST</acronym>
34 implementation in <productname>PostgreSQL</productname> is primarily
35 maintained by Teodor Sigaev and Oleg Bartunov, and there is more
36 information on their
37 <ulink url="http://www.sai.msu.su/~megera/postgres/gist/">website</ulink>.
38 </para>
40 </sect1>
42 <sect1 id="gist-extensibility">
43 <title>Extensibility</title>
45 <para>
46 Traditionally, implementing a new index access method meant a lot of
47 difficult work. It was necessary to understand the inner workings of the
48 database, such as the lock manager and Write-Ahead Log. The
49 <acronym>GiST</acronym> interface has a high level of abstraction,
50 requiring the access method implementer to only implement the semantics of
51 the data type being accessed. The <acronym>GiST</acronym> layer itself
52 takes care of concurrency, logging and searching the tree structure.
53 </para>
55 <para>
56 This extensibility should not be confused with the extensibility of the
57 other standard search trees in terms of the data they can handle. For
58 example, <productname>PostgreSQL</productname> supports extensible B-trees
59 and hash indexes. That means that you can use
60 <productname>PostgreSQL</productname> to build a B-tree or hash over any
61 data type you want. But B-trees only support range predicates
62 (<literal>&lt;</literal>, <literal>=</literal>, <literal>&gt;</literal>),
63 and hash indexes only support equality queries.
64 </para>
66 <para>
67 So if you index, say, an image collection with a
68 <productname>PostgreSQL</productname> B-tree, you can only issue queries
69 such as <quote>is imagex equal to imagey</quote>, <quote>is imagex less
70 than imagey</quote> and <quote>is imagex greater than imagey</quote>?
71 Depending on how you define <quote>equals</quote>, <quote>less than</quote>
72 and <quote>greater than</quote> in this context, this could be useful.
73 However, by using a <acronym>GiST</acronym> based index, you could create
74 ways to ask domain-specific questions, perhaps <quote>find all images of
75 horses</quote> or <quote>find all over-exposed images</quote>.
76 </para>
78 <para>
79 All it takes to get a <acronym>GiST</acronym> access method up and running
80 is to implement seven user-defined methods, which define the behavior of
81 keys in the tree. Of course these methods have to be pretty fancy to
82 support fancy queries, but for all the standard queries (B-trees,
83 R-trees, etc.) they're relatively straightforward. In short,
84 <acronym>GiST</acronym> combines extensibility along with generality, code
85 reuse, and a clean interface.
86 </para>
88 </sect1>
90 <sect1 id="gist-implementation">
91 <title>Implementation</title>
93 <para>
94 There are seven methods that an index operator class for
95 <acronym>GiST</acronym> must provide:
96 </para>
98 <variablelist>
99 <varlistentry>
100 <term>consistent</term>
101 <listitem>
102 <para>
103 Given a predicate <literal>p</literal> on a tree page, and a user
104 query, <literal>q</literal>, this method will return false if it is
105 certain that both <literal>p</literal> and <literal>q</literal> cannot
106 be true for a given data item. For a true result, a
107 <literal>recheck</> flag must also be returned; this indicates whether
108 the predicate implies the query (<literal>recheck</> = false) or
109 not (<literal>recheck</> = true).
110 </para>
111 </listitem>
112 </varlistentry>
114 <varlistentry>
115 <term>union</term>
116 <listitem>
117 <para>
118 This method consolidates information in the tree. Given a set of
119 entries, this function generates a new predicate that is true for all
120 the entries.
121 </para>
122 </listitem>
123 </varlistentry>
125 <varlistentry>
126 <term>compress</term>
127 <listitem>
128 <para>
129 Converts the data item into a format suitable for physical storage in
130 an index page.
131 </para>
132 </listitem>
133 </varlistentry>
135 <varlistentry>
136 <term>decompress</term>
137 <listitem>
138 <para>
139 The reverse of the <function>compress</function> method. Converts the
140 index representation of the data item into a format that can be
141 manipulated by the database.
142 </para>
143 </listitem>
144 </varlistentry>
146 <varlistentry>
147 <term>penalty</term>
148 <listitem>
149 <para>
150 Returns a value indicating the <quote>cost</quote> of inserting the new
151 entry into a particular branch of the tree. items will be inserted
152 down the path of least <function>penalty</function> in the tree.
153 </para>
154 </listitem>
155 </varlistentry>
157 <varlistentry>
158 <term>picksplit</term>
159 <listitem>
160 <para>
161 When a page split is necessary, this function decides which entries on
162 the page are to stay on the old page, and which are to move to the new
163 page.
164 </para>
165 </listitem>
166 </varlistentry>
168 <varlistentry>
169 <term>same</term>
170 <listitem>
171 <para>
172 Returns true if two entries are identical, false otherwise.
173 </para>
174 </listitem>
175 </varlistentry>
177 </variablelist>
179 </sect1>
181 <sect1 id="gist-examples">
182 <title>Examples</title>
184 <para>
185 The <productname>PostgreSQL</productname> source distribution includes
186 several examples of index methods implemented using
187 <acronym>GiST</acronym>. The core system currently provides text search
188 support (indexing for <type>tsvector</> and <type>tsquery</>) as well as
189 R-Tree equivalent functionality for some of the built-in geometric data types
190 (see <filename>src/backend/access/gist/gistproc.c</>). The following
191 <filename>contrib</> modules also contain <acronym>GiST</acronym>
192 operator classes:
193 </para>
195 <variablelist>
196 <varlistentry>
197 <term>btree_gist</term>
198 <listitem>
199 <para>B-Tree equivalent functionality for several data types</para>
200 </listitem>
201 </varlistentry>
203 <varlistentry>
204 <term>cube</term>
205 <listitem>
206 <para>Indexing for multidimensional cubes</para>
207 </listitem>
208 </varlistentry>
210 <varlistentry>
211 <term>hstore</term>
212 <listitem>
213 <para>Module for storing (key, value) pairs</para>
214 </listitem>
215 </varlistentry>
217 <varlistentry>
218 <term>intarray</term>
219 <listitem>
220 <para>RD-Tree for one-dimensional array of int4 values</para>
221 </listitem>
222 </varlistentry>
224 <varlistentry>
225 <term>ltree</term>
226 <listitem>
227 <para>Indexing for tree-like structures</para>
228 </listitem>
229 </varlistentry>
231 <varlistentry>
232 <term>pg_trgm</term>
233 <listitem>
234 <para>Text similarity using trigram matching</para>
235 </listitem>
236 </varlistentry>
238 <varlistentry>
239 <term>seg</term>
240 <listitem>
241 <para>Indexing for <quote>float ranges</quote></para>
242 </listitem>
243 </varlistentry>
244 </variablelist>
246 </sect1>
248 <sect1 id="gist-recovery">
249 <title>Crash Recovery</title>
251 <para>
252 Usually, replay of the WAL log is sufficient to restore the integrity
253 of a GiST index following a database crash. However, there are some
254 corner cases in which the index state is not fully rebuilt. The index
255 will still be functionally correct, but there might be some performance
256 degradation. When this occurs, the index can be repaired by
257 <command>VACUUM</>ing its table, or by rebuilding the index using
258 <command>REINDEX</>. In some cases a plain <command>VACUUM</> is
259 not sufficient, and either <command>VACUUM FULL</> or <command>REINDEX</>
260 is needed. The need for one of these procedures is indicated by occurrence
261 of this log message during crash recovery:
262 <programlisting>
263 LOG: index NNN/NNN/NNN needs VACUUM or REINDEX to finish crash recovery
264 </programlisting>
265 or this log message during routine index insertions:
266 <programlisting>
267 LOG: index "FOO" needs VACUUM or REINDEX to finish crash recovery
268 </programlisting>
269 If a plain <command>VACUUM</> finds itself unable to complete recovery
270 fully, it will return a notice:
271 <programlisting>
272 NOTICE: index "FOO" needs VACUUM FULL or REINDEX to finish crash recovery
273 </programlisting>
274 </para>
275 </sect1>
277 </chapter>