<!-- $PostgreSQL$ -->

<chapter id="textsearch">
<title id="textsearch-title">Full Text Search</title>

<indexterm zone="textsearch">
<primary>full text search</primary>
</indexterm>

<indexterm zone="textsearch">
<primary>text search</primary>
</indexterm>

<sect1 id="textsearch-intro">
<title>Introduction</title>

<para>
Full Text Searching (or just <firstterm>text search</firstterm>) provides
the capability to identify natural-language <firstterm>documents</> that
satisfy a <firstterm>query</firstterm>, and optionally to sort them by
relevance to the query. The most common type of search
is to find all documents containing given <firstterm>query terms</firstterm>
and return them in order of their <firstterm>similarity</firstterm> to the
query. Notions of <varname>query</varname> and
<varname>similarity</varname> are very flexible and depend on the specific
application. The simplest search considers <varname>query</varname> as a
set of words and <varname>similarity</varname> as the frequency of query
words in the document.
</para>

<para>
Textual search operators have existed in databases for years.
<productname>PostgreSQL</productname> has
<literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and
<literal>ILIKE</literal> operators for textual data types, but they lack
many essential properties required by modern information systems:
</para>

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
There is no linguistic support, even for English. Regular expressions
are not sufficient because they cannot easily handle derived words, e.g.,
<literal>satisfies</literal> and <literal>satisfy</literal>. You might
miss documents that contain <literal>satisfies</literal>, although you
probably would like to find them when searching for
<literal>satisfy</literal>. It is possible to use <literal>OR</literal>
to search for multiple derived forms, but this is tedious and error-prone
(some words can have several thousand derivatives).
</para>
</listitem>

<listitem>
<para>
They provide no ordering (ranking) of search results, which makes them
ineffective when thousands of matching documents are found.
</para>
</listitem>

<listitem>
<para>
They tend to be slow because there is no index support, so they must
process all documents for every search.
</para>
</listitem>
</itemizedlist>

<para>
Full text indexing allows documents to be <emphasis>preprocessed</emphasis>
and an index saved for later rapid searching. Preprocessing includes:
</para>

<itemizedlist mark="none">
<listitem>
<para>
<emphasis>Parsing documents into <firstterm>tokens</></emphasis>. It is
useful to identify various classes of tokens, e.g., numbers, words,
complex words, email addresses, so that they can be processed
differently. In principle token classes depend on the specific
application, but for most purposes it is adequate to use a predefined
set of classes.
<productname>PostgreSQL</productname> uses a <firstterm>parser</> to
perform this step. A standard parser is provided, and custom parsers
can be created for specific needs.
</para>
</listitem>

<listitem>
<para>
<emphasis>Converting tokens into <firstterm>lexemes</></emphasis>.
A lexeme is a string, just like a token, but it has been
<firstterm>normalized</> so that different forms of the same word
are made alike. For example, normalization almost always includes
folding upper-case letters to lower-case, and often involves removal
of suffixes (such as <literal>s</> or <literal>es</> in English).
This allows searches to find variant forms of the
same word, without tediously entering all the possible variants.
Also, this step typically eliminates <firstterm>stop words</>, which
are words that are so common that they are useless for searching.
(In short, then, tokens are raw fragments of the document text, while
lexemes are words that are believed useful for indexing and searching.)
<productname>PostgreSQL</productname> uses <firstterm>dictionaries</> to
perform this step. Various standard dictionaries are provided, and
custom ones can be created for specific needs.
</para>
</listitem>

<listitem>
<para>
<emphasis>Storing preprocessed documents optimized for
searching</emphasis>. For example, each document can be represented
as a sorted array of normalized lexemes. Along with the lexemes it is
often desirable to store positional information to use for
<firstterm>proximity ranking</firstterm>, so that a document that
contains a more <quote>dense</> region of query words is
assigned a higher rank than one with scattered query words.
</para>
</listitem>
</itemizedlist>

<para>
Dictionaries allow fine-grained control over how tokens are normalized.
With appropriate dictionaries, you can:
</para>

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
Define stop words that should not be indexed.
</para>
</listitem>

<listitem>
<para>
Map synonyms to a single word using <application>Ispell</>.
</para>
</listitem>

<listitem>
<para>
Map phrases to a single word using a thesaurus.
</para>
</listitem>

<listitem>
<para>
Map different variations of a word to a canonical form using
an <application>Ispell</> dictionary.
</para>
</listitem>

<listitem>
<para>
Map different variations of a word to a canonical form using
<application>Snowball</> stemmer rules.
</para>
</listitem>
</itemizedlist>

<para>
A data type <type>tsvector</type> is provided for storing preprocessed
documents, along with a type <type>tsquery</type> for representing processed
queries (<xref linkend="datatype-textsearch">). There are many
functions and operators available for these data types
(<xref linkend="functions-textsearch">), the most important of which is
the match operator <literal>@@</literal>, which we introduce in
<xref linkend="textsearch-matching">. Full text searches can be accelerated
using indexes (<xref linkend="textsearch-indexes">).
</para>

<sect2 id="textsearch-document">
<title>What Is a Document?</title>

<indexterm zone="textsearch-document">
<primary>document</primary>
<secondary>text search</secondary>
</indexterm>

<para>
A <firstterm>document</> is the unit of searching in a full text search
system; for example, a magazine article or email message. The text search
engine must be able to parse documents and store associations of lexemes
(key words) with their parent document. Later, these associations are
used to search for documents that contain query words.
</para>

<para>
For searches within <productname>PostgreSQL</productname>,
a document is normally a textual field within a row of a database table,
or possibly a combination (concatenation) of such fields, perhaps stored
in several tables or obtained dynamically. In other words, a document can
be constructed from different parts for indexing and it might not be
stored anywhere as a whole. For example:

<programlisting>
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
FROM messages
WHERE mid = 12;

SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
FROM messages m, docs d
WHERE mid = did AND mid = 12;
</programlisting>
</para>

<note>
<para>
Actually, in these example queries, <function>coalesce</function>
should be used to prevent a single <literal>NULL</literal> attribute from
causing a <literal>NULL</literal> result for the whole document.
</para>
</note>
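
<para>
For instance, the first example above can be made null-safe like this:

<programlisting>
SELECT coalesce(title,'') || ' ' || coalesce(author,'') || ' ' ||
       coalesce(abstract,'') || ' ' || coalesce(body,'') AS document
FROM messages
WHERE mid = 12;
</programlisting>
</para>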

<para>
Another possibility is to store the documents as simple text files in the
file system. In this case, the database can be used to store the full text
index and to execute searches, and some unique identifier can be used to
retrieve the document from the file system. However, retrieving files
from outside the database requires superuser permissions or special
function support, so this is usually less convenient than keeping all
the data inside <productname>PostgreSQL</productname>. Also, keeping
everything inside the database allows easy access
to document metadata to assist in indexing and display.
</para>

<para>
For text search purposes, each document must be reduced to the
preprocessed <type>tsvector</> format. Searching and ranking
are performed entirely on the <type>tsvector</> representation
of a document &mdash; the original text need only be retrieved
when the document has been selected for display to a user.
We therefore often speak of the <type>tsvector</> as being the
document, but of course it is only a compact representation of
the full document.
</para>
</sect2>

<sect2 id="textsearch-matching">
<title>Basic Text Matching</title>

<para>
Full text searching in <productname>PostgreSQL</productname> is based on
the match operator <literal>@@</literal>, which returns
<literal>true</literal> if a <type>tsvector</type>
(document) matches a <type>tsquery</type> (query).
It doesn't matter which data type is written first:

<programlisting>
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat &amp; rat'::tsquery;
 ?column?
----------
 t

SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 f
</programlisting>
</para>

<para>
As the above example suggests, a <type>tsquery</type> is not just raw
text, any more than a <type>tsvector</type> is. A <type>tsquery</type>
contains search terms, which must be already-normalized lexemes, and
may combine multiple terms using AND, OR, and NOT operators.
(For details see <xref linkend="datatype-textsearch">.) There are
functions <function>to_tsquery</> and <function>plainto_tsquery</>
that are helpful in converting user-written text into a proper
<type>tsquery</type>, for example by normalizing words appearing in
the text. Similarly, <function>to_tsvector</> is used to parse and
normalize a document string. So in practice a text search match would
look more like this:

<programlisting>
SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat &amp; rat');
 ?column?
----------
 t
</programlisting>

Observe that this match would not succeed if written as

<programlisting>
SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat &amp; rat');
 ?column?
----------
 f
</programlisting>

since here no normalization of the word <literal>rats</> will occur.
The elements of a <type>tsvector</> are lexemes, which are assumed
already normalized, so <literal>rats</> does not match <literal>rat</>.
</para>

<para>
The <literal>@@</literal> operator also
supports <type>text</type> input, allowing explicit conversion of a text
string to <type>tsvector</type> or <type>tsquery</> to be skipped
in simple cases. The variants available are:

<programlisting>
tsvector @@ tsquery
tsquery @@ tsvector
text @@ tsquery
text @@ text
</programlisting>
</para>

<para>
The first two of these we saw already.
The form <type>text</type> <literal>@@</literal> <type>tsquery</type>
is equivalent to <literal>to_tsvector(x) @@ y</literal>.
The form <type>text</type> <literal>@@</literal> <type>text</type>
is equivalent to <literal>to_tsvector(x) @@ plainto_tsquery(y)</literal>.
</para>
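
<para>
For example, assuming the default configuration reduces
<literal>rats</> to <literal>rat</>, these two queries are equivalent:

<programlisting>
SELECT 'fat cats ate fat rats'::text @@ 'fat rat'::text;
SELECT to_tsvector('fat cats ate fat rats') @@ plainto_tsquery('fat rat');
</programlisting>
</para>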
</sect2>

<sect2 id="textsearch-intro-configurations">
<title>Configurations</title>

<para>
The above are all simple text search examples. As mentioned before, full
text search functionality includes the ability to do many more things:
skip indexing certain words (stop words), process synonyms, and use
sophisticated parsing, e.g., parse based on more than just white space.
This functionality is controlled by <firstterm>text search
configurations</>. <productname>PostgreSQL</> comes with predefined
configurations for many languages, and you can easily create your own
configurations. (<application>psql</>'s <command>\dF</> command
shows all available configurations.)
</para>

<para>
During installation an appropriate configuration is selected and
<xref linkend="guc-default-text-search-config"> is set accordingly
in <filename>postgresql.conf</>. If you are using the same text search
configuration for the entire cluster you can use the value in
<filename>postgresql.conf</>. To use different configurations
throughout the cluster but the same configuration within any one database,
use <command>ALTER DATABASE ... SET</>. Otherwise, you can set
<varname>default_text_search_config</varname> in each session.
</para>
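
<para>
For example (<literal>mydb</> here stands for whatever database you use):

<programlisting>
-- pick a configuration for one database
ALTER DATABASE mydb SET default_text_search_config TO 'english';

-- or just for the current session
SET default_text_search_config TO 'english';
</programlisting>
</para>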

<para>
Each text search function that depends on a configuration has an optional
<type>regconfig</> argument, so that the configuration to use can be
specified explicitly. <varname>default_text_search_config</varname>
is used only when this argument is omitted.
</para>
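
<para>
For example, both of these calls are valid; the first selects the
configuration explicitly, while the second falls back on
<varname>default_text_search_config</varname>:

<programlisting>
SELECT to_tsvector('english', 'The quick brown foxes');
SELECT to_tsvector('The quick brown foxes');
</programlisting>
</para>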

<para>
To make it easier to build custom text search configurations, a
configuration is built up from simpler database objects.
<productname>PostgreSQL</>'s text search facility provides
four types of configuration-related database objects:
</para>

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
<firstterm>Text search parsers</> break documents into tokens
and classify each token (for example, as words or numbers).
</para>
</listitem>

<listitem>
<para>
<firstterm>Text search dictionaries</> convert tokens to normalized
form and reject stop words.
</para>
</listitem>

<listitem>
<para>
<firstterm>Text search templates</> provide the functions underlying
dictionaries. (A dictionary simply specifies a template and a set
of parameters for the template.)
</para>
</listitem>

<listitem>
<para>
<firstterm>Text search configurations</> select a parser and a set
of dictionaries to use to normalize the tokens produced by the parser.
</para>
</listitem>
</itemizedlist>

<para>
Text search parsers and templates are built from low-level C functions;
therefore it requires C programming ability to develop new ones, and
superuser privileges to install one into a database. (There are examples
of add-on parsers and templates in the <filename>contrib/</> area of the
<productname>PostgreSQL</> distribution.) Since dictionaries and
configurations just parameterize and connect together some underlying
parsers and templates, no special privilege is needed to create a new
dictionary or configuration. Examples of creating custom dictionaries and
configurations appear later in this chapter.
</para>

</sect2>

</sect1>

<sect1 id="textsearch-tables">
<title>Tables and Indexes</title>

<para>
The examples in the previous section illustrated full text matching using
simple constant strings. This section shows how to search table data,
optionally using indexes.
</para>

<sect2 id="textsearch-tables-search">
<title>Searching a Table</title>

<para>
It is possible to do a full text search without an index. A simple query
to print the <structname>title</> of each row that contains the word
<literal>friend</> in its <structfield>body</> field is:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
</programlisting>

This will also find related words such as <literal>friends</>
and <literal>friendly</>, since all these are reduced to the same
normalized lexeme.
</para>

<para>
The query above specifies that the <literal>english</> configuration
is to be used to parse and normalize the strings. Alternatively we
could omit the configuration parameters:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
</programlisting>

This query will use the configuration set by <xref
linkend="guc-default-text-search-config">.
</para>

<para>
A more complex example is to
select the ten most recent documents that contain <literal>create</> and
<literal>table</> in the <structname>title</> or <structname>body</>:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create &amp; table')
ORDER BY last_mod_date DESC
LIMIT 10;
</programlisting>

For clarity we omitted the <function>coalesce</function> function calls
which would be needed to find rows that contain <literal>NULL</literal>
in one of the two fields.
</para>
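
<para>
Written out in full, that query becomes:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector(coalesce(title,'') || ' ' || coalesce(body,''))
      @@ to_tsquery('create &amp; table')
ORDER BY last_mod_date DESC
LIMIT 10;
</programlisting>
</para>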

<para>
Although these queries will work without an index, most applications
will find this approach too slow, except perhaps for occasional ad-hoc
searches. Practical use of text searching usually requires creating
an index.
</para>

</sect2>

<sect2 id="textsearch-tables-index">
<title>Creating Indexes</title>

<para>
We can create a <acronym>GIN</acronym> index (<xref
linkend="textsearch-indexes">) to speed up text searches:

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));
</programlisting>

Notice that the 2-argument version of <function>to_tsvector</function> is
used. Only text search functions that specify a configuration name can
be used in expression indexes (<xref linkend="indexes-expressional">).
This is because the index contents must be unaffected by <xref
linkend="guc-default-text-search-config">. If they were affected, the
index contents might be inconsistent because different entries could
contain <type>tsvector</>s that were created with different text search
configurations, and there would be no way to guess which was which. It
would be impossible to dump and restore such an index correctly.
</para>

<para>
Because the two-argument version of <function>to_tsvector</function> was
used in the index above, only a query reference that uses the 2-argument
version of <function>to_tsvector</function> with the same configuration
name will use that index. That is, <literal>WHERE
to_tsvector('english', body) @@ 'a &amp; b'</> can use the index,
but <literal>WHERE to_tsvector(body) @@ 'a &amp; b'</> cannot.
This ensures that an index will be used only with the same configuration
used to create the index entries.
</para>

<para>
It is possible to set up more complex expression indexes wherein the
configuration name is specified by another column, e.g.:

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));
</programlisting>

where <literal>config_name</> is a column in the <literal>pgweb</>
table. This allows mixed configurations in the same index while
recording which configuration was used for each index entry. This
would be useful, for example, if the document collection contained
documents in different languages. Again,
queries that are meant to use the index must be phrased to match, e.g.,
<literal>WHERE to_tsvector(config_name, body) @@ 'a &amp; b'</>.
</para>

<para>
Indexes can even concatenate columns:

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || ' ' || body));
</programlisting>
</para>

<para>
Another approach is to create a separate <type>tsvector</> column
to hold the output of <function>to_tsvector</>. This example is a
concatenation of <literal>title</literal> and <literal>body</literal>,
using <function>coalesce</> to ensure that one field will still be
indexed when the other is <literal>NULL</>:

<programlisting>
ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;
UPDATE pgweb SET textsearchable_index_col =
    to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
</programlisting>

Then we create a <acronym>GIN</acronym> index to speed up the search:

<programlisting>
CREATE INDEX textsearch_idx ON pgweb USING gin(textsearchable_index_col);
</programlisting>

Now we are ready to perform a fast full text search:

<programlisting>
SELECT title
FROM pgweb
WHERE textsearchable_index_col @@ to_tsquery('create &amp; table')
ORDER BY last_mod_date DESC
LIMIT 10;
</programlisting>
</para>

<para>
When using a separate column to store the <type>tsvector</>
representation,
it is necessary to create a trigger to keep the <type>tsvector</>
column current anytime <literal>title</> or <literal>body</> changes.
<xref linkend="textsearch-update-triggers"> explains how to do that.
</para>

<para>
One advantage of the separate-column approach over an expression index
is that it is not necessary to explicitly specify the text search
configuration in queries in order to make use of the index. As shown
in the example above, the query can depend on
<varname>default_text_search_config</>. Another advantage is that
searches will be faster, since it will not be necessary to redo the
<function>to_tsvector</> calls to verify index matches. (This is more
important when using a GiST index than a GIN index; see <xref
linkend="textsearch-indexes">.) The expression-index approach is
simpler to set up, however, and it requires less disk space since the
<type>tsvector</> representation is not stored explicitly.
</para>

</sect2>

</sect1>

<sect1 id="textsearch-controls">
<title>Controlling Text Search</title>

<para>
To implement full text searching there must be a function to create a
<type>tsvector</type> from a document and a <type>tsquery</type> from a
user query. Also, we need to return results in a useful order, so we need
a function that compares documents with respect to their relevance to
the query. It's also important to be able to display the results nicely.
<productname>PostgreSQL</productname> provides support for all of these
functions.
</para>

<sect2 id="textsearch-parsing-documents">
<title>Parsing Documents</title>

<para>
<productname>PostgreSQL</productname> provides the
function <function>to_tsvector</function> for converting a document to
the <type>tsvector</type> data type.
</para>

<indexterm>
<primary>to_tsvector</primary>
</indexterm>

<synopsis>
to_tsvector(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>) returns <type>tsvector</>
</synopsis>

<para>
<function>to_tsvector</function> parses a textual document into tokens,
reduces the tokens to lexemes, and returns a <type>tsvector</type> which
lists the lexemes together with their positions in the document.
The document is processed according to the specified or default
text search configuration.
Here is a simple example:

<programlisting>
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</programlisting>
</para>

<para>
In the example above we see that the resulting <type>tsvector</type> does not
contain the words <literal>a</literal>, <literal>on</literal>, or
<literal>it</literal>, the word <literal>rats</literal> became
<literal>rat</literal>, and the punctuation sign <literal>-</literal> was
ignored.
</para>

<para>
The <function>to_tsvector</function> function internally calls a parser
which breaks the document text into tokens and assigns a type to
each token. For each token, a list of
dictionaries (<xref linkend="textsearch-dictionaries">) is consulted,
where the list can vary depending on the token type. The first dictionary
that <firstterm>recognizes</> the token emits one or more normalized
<firstterm>lexemes</firstterm> to represent the token. For example,
<literal>rats</literal> became <literal>rat</literal> because one of the
dictionaries recognized that the word <literal>rats</literal> is a plural
form of <literal>rat</literal>. Some words are recognized as
<firstterm>stop words</> (<xref linkend="textsearch-stopwords">), which
causes them to be ignored since they occur too frequently to be useful in
searching. In our example these are
<literal>a</literal>, <literal>on</literal>, and <literal>it</literal>.
If no dictionary in the list recognizes the token then it is also ignored.
In this example that happened to the punctuation sign <literal>-</literal>
because there are in fact no dictionaries assigned for its token type
(<literal>Space symbols</literal>), meaning space tokens will never be
indexed. The choices of parser, dictionaries and which types of tokens to
index are determined by the selected text search configuration (<xref
linkend="textsearch-configuration">). It is possible to have
many different configurations in the same database, and predefined
configurations are available for various languages. In our example
we used the default configuration <literal>english</literal> for the
English language.
</para>

<para>
The function <function>setweight</function> can be used to label the
entries of a <type>tsvector</type> with a given <firstterm>weight</>,
where a weight is one of the letters <literal>A</>, <literal>B</>,
<literal>C</>, or <literal>D</>.
This is typically used to mark entries coming from
different parts of a document, such as title versus body. Later, this
information can be used for ranking of search results.
</para>

<para>
Because <function>to_tsvector</function>(<literal>NULL</literal>) will
return <literal>NULL</literal>, it is recommended to use
<function>coalesce</function> whenever a field might be null.
Here is the recommended method for creating
a <type>tsvector</type> from a structured document:

<programlisting>
UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');
</programlisting>

Here we have used <function>setweight</function> to label the source
of each lexeme in the finished <type>tsvector</type>, and then merged
the labeled <type>tsvector</type> values using the <type>tsvector</>
concatenation operator <literal>||</>. (<xref
linkend="textsearch-manipulate-tsvector"> gives details about these
operations.)
</para>

</sect2>

<sect2 id="textsearch-parsing-queries">
<title>Parsing Queries</title>

<para>
<productname>PostgreSQL</productname> provides the
functions <function>to_tsquery</function> and
<function>plainto_tsquery</function> for converting a query to
the <type>tsquery</type> data type. <function>to_tsquery</function>
offers access to more features than <function>plainto_tsquery</function>,
but is less forgiving about its input.
</para>

<indexterm>
<primary>to_tsquery</primary>
</indexterm>

<synopsis>
to_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
</synopsis>

<para>
<function>to_tsquery</function> creates a <type>tsquery</> value from
<replaceable>querytext</replaceable>, which must consist of single tokens
separated by the Boolean operators <literal>&amp;</literal> (AND),
<literal>|</literal> (OR) and <literal>!</literal> (NOT). These operators
can be grouped using parentheses. In other words, the input to
<function>to_tsquery</function> must already follow the general rules for
<type>tsquery</> input, as described in <xref
linkend="datatype-textsearch">. The difference is that while basic
<type>tsquery</> input takes the tokens at face value,
<function>to_tsquery</function> normalizes each token to a lexeme using
the specified or default configuration, and discards any tokens that are
stop words according to the configuration. For example:

<programlisting>
SELECT to_tsquery('english', 'The &amp; Fat &amp; Rats');
  to_tsquery
---------------
 'fat' &amp; 'rat'
</programlisting>

As in basic <type>tsquery</> input, weight(s) can be attached to each
lexeme to restrict it to match only <type>tsvector</> lexemes of those
weight(s). For example:

<programlisting>
SELECT to_tsquery('english', 'Fat | Rats:AB');
    to_tsquery
------------------
 'fat' | 'rat':AB
</programlisting>

Also, <literal>*</> can be attached to a lexeme to specify prefix matching:

<programlisting>
SELECT to_tsquery('supern:*A &amp; star:A*B');
        to_tsquery
--------------------------
 'supern':*A &amp; 'star':*AB
</programlisting>

Such a lexeme will match any word in a <type>tsvector</> that begins
with the given string.
</para>

<para>
<function>to_tsquery</function> can also accept single-quoted
phrases. This is primarily useful when the configuration includes a
thesaurus dictionary that may trigger on such phrases.
In the example below, a thesaurus contains the rule <literal>supernovae
stars : sn</literal>:

<programlisting>
SELECT to_tsquery('''supernovae stars'' &amp; !crab');
  to_tsquery
---------------
 'sn' &amp; !'crab'
</programlisting>

Without quotes, <function>to_tsquery</function> will generate a syntax
error for tokens that are not separated by an AND or OR operator.
</para>

<indexterm>
<primary>plainto_tsquery</primary>
</indexterm>

<synopsis>
plainto_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
</synopsis>

<para>
<function>plainto_tsquery</> transforms unformatted text
<replaceable>querytext</replaceable> to <type>tsquery</type>.
The text is parsed and normalized much as for <function>to_tsvector</>,
then the <literal>&amp;</literal> (AND) Boolean operator is inserted
between surviving words.
</para>

<para>
Example:

<programlisting>
SELECT plainto_tsquery('english', 'The Fat Rats');
 plainto_tsquery
-----------------
 'fat' &amp; 'rat'
</programlisting>

Note that <function>plainto_tsquery</> cannot
recognize Boolean operators, weight labels, or prefix-match labels
in its input:

<programlisting>
SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');
   plainto_tsquery
---------------------
 'fat' &amp; 'rat' &amp; 'c'
</programlisting>

Here, all the input punctuation was discarded as being space symbols.
</para>

</sect2>

<sect2 id="textsearch-ranking">
<title>Ranking Search Results</title>

<para>
Ranking attempts to measure how relevant documents are to a particular
query, so that when there are many matches the most relevant ones can be
shown first. <productname>PostgreSQL</productname> provides two
predefined ranking functions, which take into account lexical, proximity,
and structural information; that is, they consider how often the query
terms appear in the document, how close together the terms are in the
document, and how important is the part of the document where they occur.
However, the concept of relevancy is vague and very application-specific.
Different applications might require additional information for ranking,
e.g., document modification time. The built-in ranking functions are only
examples. You can write your own ranking functions and/or combine their
results with additional factors to fit your specific needs.
</para>

<para>
The two ranking functions currently available are:

<variablelist>

<varlistentry>

<indexterm>
<primary>ts_rank</primary>
</indexterm>

<term>
<synopsis>
ts_rank(<optional> <replaceable class="PARAMETER">weights</replaceable> <type>float4[]</>, </optional> <replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>,
        <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">normalization</replaceable> <type>integer</> </optional>) returns <type>float4</>
</synopsis>
</term>

<listitem>
<para>
Standard ranking function.<!-- TODO document this better -->
</para>
</listitem>
</varlistentry>

<varlistentry>

<indexterm>
<primary>ts_rank_cd</primary>
</indexterm>

<term>
<synopsis>
ts_rank_cd(<optional> <replaceable class="PARAMETER">weights</replaceable> <type>float4[]</>, </optional> <replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>,
           <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">normalization</replaceable> <type>integer</> </optional>) returns <type>float4</>
</synopsis>
</term>

<listitem>
<para>
This function computes the <firstterm>cover density</firstterm>
ranking for the given document vector and query, as described in
Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three
Term Queries" in the journal "Information Processing and Management",
1999.
</para>

<para>
This function requires positional information in its input.
Therefore it will not work on <quote>stripped</> <type>tsvector</>
values &mdash; it will always return zero.
</para>
</listitem>
</varlistentry>

</variablelist>

</para>

<para>
For both these functions,
the optional <replaceable class="PARAMETER">weights</replaceable>
argument offers the ability to weigh word instances more or less
heavily depending on how they are labeled. The weight arrays specify
how heavily to weigh each category of word, in the order:

<programlisting>
{D-weight, C-weight, B-weight, A-weight}
</programlisting>

If no <replaceable class="PARAMETER">weights</replaceable> are provided,
then these defaults are used:

<programlisting>
{0.1, 0.2, 0.4, 1.0}
</programlisting>

Typically weights are used to mark words from special areas of the
document, like the title or an initial abstract, so they can be
treated with more or less importance than words in the document body.
</para>
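
<para>
As an illustrative sketch (the document strings here are made up), this
call counts <literal>A</>-labeled words ten times as heavily as
<literal>D</>-labeled ones; the array shown just spells out the defaults:

<programlisting>
SELECT ts_rank('{0.1, 0.2, 0.4, 1.0}',
               setweight(to_tsvector('english', 'The Title'), 'A') ||
               setweight(to_tsvector('english', 'the long body text'), 'D'),
               to_tsquery('english', 'title'));
</programlisting>
</para>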

<para>
Since a longer document has a greater chance of containing a query term
it is reasonable to take into account document size, e.g., a hundred-word
document with five instances of a search word is probably more relevant
than a thousand-word document with five instances. Both ranking functions
take an integer <replaceable>normalization</replaceable> option that
specifies whether and how a document's length should impact its rank.
The integer option controls several behaviors, so it is a bit mask:
you can specify one or more behaviors using
<literal>|</literal> (for example, <literal>2|4</literal>).

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
0 (the default) ignores the document length
</para>
</listitem>
<listitem>
<para>
1 divides the rank by 1 + the logarithm of the document length
</para>
</listitem>
<listitem>
<para>
2 divides the rank by the document length
</para>
</listitem>
<listitem>
<para>
4 divides the rank by the mean harmonic distance between extents
(this is implemented only by <function>ts_rank_cd</>)
</para>
</listitem>
<listitem>
<para>
8 divides the rank by the number of unique words in the document
</para>
</listitem>
<listitem>
<para>
16 divides the rank by 1 + the logarithm of the number
of unique words in the document
</para>
</listitem>
<listitem>
<para>
32 divides the rank by itself + 1
</para>
</listitem>
</itemizedlist>

If more than one flag bit is specified, the transformations are
applied in the order listed.
</para>
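
<para>
For example, to divide the rank both by the document length (flag 2) and
by the mean harmonic distance between extents (flag 4):

<programlisting>
SELECT ts_rank_cd(to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat'),
                  to_tsquery('english', 'cat &amp; rat'), 2|4);
</programlisting>
</para>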

<para>
It is important to note that the ranking functions do not use any global
information, so it is impossible to produce a fair normalization to 1% or
100% as sometimes desired. Normalization option 32
(<literal>rank/(rank+1)</literal>) can be applied to scale all ranks
into the range zero to one, but of course this is just a cosmetic change;
it will not affect the ordering of the search results.
</para>

<para>
Here is an example that selects only the ten highest-ranked matches:

<programlisting>
SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218
</programlisting>

This is the same example using normalized ranking:

<programlisting>
SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |        rank
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
</programlisting>
</para>

<para>
Ranking can be expensive since it requires consulting the
<type>tsvector</type> of each matching document, which can be I/O bound and
therefore slow. Unfortunately, it is almost impossible to avoid since
practical queries often result in large numbers of matches.
</para>

</sect2>

<sect2 id="textsearch-headline">
<title>Highlighting Results</title>

<para>
To present search results it is ideal to show a part of each document and
how it is related to the query. Usually, search engines show fragments of
the document with marked search terms. <productname>PostgreSQL</>
provides a function <function>ts_headline</function> that
implements this functionality.
</para>

<indexterm>
<primary>ts_headline</primary>
</indexterm>

<synopsis>
ts_headline(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>, <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">options</replaceable> <type>text</> </optional>) returns <type>text</>
</synopsis>

<para>
<function>ts_headline</function> accepts a document along
with a query, and returns an excerpt from
the document in which terms from the query are highlighted. The
configuration to be used to parse the document can be specified by
<replaceable>config</replaceable>; if <replaceable>config</replaceable>
is omitted, the
<varname>default_text_search_config</varname> configuration is used.
</para>

<para>
If an <replaceable>options</replaceable> string is specified it must
consist of a comma-separated list of one or more
<replaceable>option</><literal>=</><replaceable>value</> pairs.
The available options are:

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
<literal>StartSel</>, <literal>StopSel</literal>: the strings with
which to delimit query words appearing in the document, to distinguish
them from other excerpted words. You must double-quote these strings
if they contain spaces or commas.
</para>
</listitem>
<listitem>
<para>
<literal>MaxWords</>, <literal>MinWords</literal>: these numbers
determine the longest and shortest headlines to output.
</para>
</listitem>
<listitem>
<para>
<literal>ShortWord</literal>: words of this length or less will be
dropped at the start and end of a headline. The default
value of three eliminates common English articles.
</para>
</listitem>
<listitem>
<para>
<literal>HighlightAll</literal>: Boolean flag; if
<literal>true</literal> the whole document will be used as the
headline, ignoring the preceding three parameters.
</para>
</listitem>
<listitem>
<para>
<literal>MaxFragments</literal>: maximum number of text excerpts
or fragments to display. The default value of zero selects a
non-fragment-oriented headline generation method. A value greater than
zero selects fragment-based headline generation. This method
finds text fragments with as many query words as possible and
stretches those fragments around the query words. As a result
query words are close to the middle of each fragment and have words on
each side. Each fragment will be of at most <literal>MaxWords</> and
words of length <literal>ShortWord</> or less are dropped at the start
and end of each fragment. If not all query words are found in the
document, then a single fragment of the first <literal>MinWords</>
in the document will be displayed.
</para>
</listitem>
<listitem>
<para>
<literal>FragmentDelimiter</literal>: When more than one fragment is
displayed, the fragments will be separated by this string.
</para>
</listitem>
</itemizedlist>

Any unspecified options receive these defaults:

<programlisting>
StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;,
MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE,
MaxFragments=0, FragmentDelimiter=" ... "
</programlisting>
</para>

<para>
For example:

<programlisting>
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query &amp; similarity'));
                        ts_headline
------------------------------------------------------------
 containing given &lt;b&gt;query&lt;/b&gt; terms
 and return them in order of their &lt;b&gt;similarity&lt;/b&gt; to the
 &lt;b&gt;query&lt;/b&gt;.

SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query &amp; similarity'),
  'StartSel = &lt;, StopSel = &gt;');
                      ts_headline
-------------------------------------------------------
 containing given &lt;query&gt; terms
 and return them in order of their &lt;similarity&gt; to the
 &lt;query&gt;.
</programlisting>
</para>

<para>
<function>ts_headline</> uses the original document, not a
<type>tsvector</type> summary, so it can be slow and should be used with
care. A typical mistake is to call <function>ts_headline</function> for
<emphasis>every</emphasis> matching document when only ten documents are
to be shown. <acronym>SQL</acronym> subqueries can help; here is an
example:

<programlisting>
SELECT id, ts_headline(body, q), rank
FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
      FROM apod, to_tsquery('stars') q
      WHERE ti @@ q
      ORDER BY rank DESC
      LIMIT 10) AS foo;
</programlisting>
</para>

</sect2>

</sect1>

<sect1 id="textsearch-features">
<title>Additional Features</title>

<para>
This section describes additional functions and operators that are
useful in connection with text search.
</para>

<sect2 id="textsearch-manipulate-tsvector">
<title>Manipulating Documents</title>

<para>
<xref linkend="textsearch-parsing-documents"> showed how raw textual
documents can be converted into <type>tsvector</> values.
<productname>PostgreSQL</productname> also provides functions and
operators that can be used to manipulate documents that are already
in <type>tsvector</> form.
</para>

<variablelist>

<varlistentry>

<indexterm>
<primary>tsvector concatenation</primary>
</indexterm>

<term>
<synopsis>
<type>tsvector</> || <type>tsvector</>
</synopsis>
</term>

<listitem>
<para>
The <type>tsvector</> concatenation operator
returns a vector which combines the lexemes and positional information
of the two vectors given as arguments. Positions and weight labels
are retained during the concatenation.
Positions appearing in the right-hand vector are offset by the largest
position mentioned in the left-hand vector, so that the result is
nearly equivalent to the result of performing <function>to_tsvector</>
on the concatenation of the two original document strings. (The
equivalence is not exact, because any stop-words removed from the
end of the left-hand argument will not affect the result, whereas
they would have affected the positions of the lexemes in the
right-hand argument if textual concatenation were used.)
</para>

<para>
One advantage of using concatenation in the vector form, rather than
concatenating text before applying <function>to_tsvector</>, is that
you can use different configurations to parse different sections
of the document. Also, because the <function>setweight</> function
marks all lexemes of the given vector the same way, it is necessary
to parse the text and do <function>setweight</> before concatenating
if you want to label different parts of the document with different
weights.
</para>
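
<para>
For example:

<programlisting>
SELECT 'fat:2,4 cat:3'::tsvector || 'fat:1 rat:2'::tsvector;
          ?column?
-----------------------------
 'cat':3 'fat':2,4,5 'rat':6
</programlisting>

Here the positions from the right-hand vector were offset by 4, the
largest position in the left-hand vector.
</para>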
</listitem>
</varlistentry>

<varlistentry>

<indexterm>
<primary>setweight</primary>
</indexterm>

<term>
<synopsis>
setweight(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>, <replaceable class="PARAMETER">weight</replaceable> <type>"char"</>) returns <type>tsvector</>
</synopsis>
</term>

<listitem>
<para>
<function>setweight</> returns a copy of the input vector in which every
position has been labeled with the given <replaceable>weight</>, either
<literal>A</literal>, <literal>B</literal>, <literal>C</literal>, or
<literal>D</literal>. (<literal>D</literal> is the default for new
vectors and as such is not displayed on output.) These labels are
retained when vectors are concatenated, allowing words from different
parts of a document to be weighted differently by ranking functions.
</para>

<para>
Note that weight labels apply to <emphasis>positions</>, not
<emphasis>lexemes</>. If the input vector has been stripped of
positions then <function>setweight</> does nothing.
</para>
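
<para>
For example:

<programlisting>
SELECT setweight('fat:2,4 cat:3'::tsvector, 'A');
      setweight
---------------------
 'cat':3A 'fat':2A,4A
</programlisting>
</para>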
</listitem>
</varlistentry>

<varlistentry>

<indexterm>
<primary>length(tsvector)</primary>
</indexterm>

<term>
<synopsis>
length(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>) returns <type>integer</>
</synopsis>
</term>

<listitem>
<para>
Returns the number of lexemes stored in the vector.
</para>
</listitem>
</varlistentry>

<varlistentry>

<indexterm>
<primary>strip</primary>
</indexterm>

<term>
<synopsis>
strip(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>) returns <type>tsvector</>
</synopsis>
</term>

<listitem>
<para>
Returns a vector which lists the same lexemes as the given vector, but
which lacks any position or weight information. While the returned
vector is much less useful than an unstripped vector for relevance
ranking, it will usually be much smaller.
</para>
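
<para>
For example:

<programlisting>
SELECT strip('fat:2,4 cat:3 rat:5A'::tsvector);
       strip
-------------------
 'cat' 'fat' 'rat'
</programlisting>
</para>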
</listitem>

</varlistentry>

</variablelist>

</sect2>

<sect2 id="textsearch-manipulate-tsquery">
<title>Manipulating Queries</title>

<para>
<xref linkend="textsearch-parsing-queries"> showed how raw textual
queries can be converted into <type>tsquery</> values.
<productname>PostgreSQL</productname> also provides functions and
operators that can be used to manipulate queries that are already
in <type>tsquery</> form.
</para>

<variablelist>

<varlistentry>

<term>
<synopsis>
<type>tsquery</> &amp;&amp; <type>tsquery</>
</synopsis>
</term>

<listitem>
<para>
Returns the AND-combination of the two given queries.
</para>
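
<para>
For example:

<programlisting>
SELECT 'fat | rat'::tsquery &amp;&amp; 'cat'::tsquery;
         ?column?
---------------------------
 ( 'fat' | 'rat' ) &amp; 'cat'
</programlisting>
</para>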
</listitem>

</varlistentry>

<varlistentry>

<term>
<synopsis>
<type>tsquery</> || <type>tsquery</>
</synopsis>
</term>

<listitem>
<para>
Returns the OR-combination of the two given queries.
</para>
</listitem>

</varlistentry>

<varlistentry>

<term>
<synopsis>
!! <type>tsquery</>
</synopsis>
</term>

<listitem>
<para>
Returns the negation (NOT) of the given query.
</para>
</listitem>

</varlistentry>

<varlistentry>

<indexterm>
<primary>numnode</primary>
</indexterm>

<term>
<synopsis>
numnode(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>) returns <type>integer</>
</synopsis>
</term>

<listitem>
<para>
Returns the number of nodes (lexemes plus operators) in a
<type>tsquery</>. This function is useful
to determine if the <replaceable>query</replaceable> is meaningful
(returns &gt; 0), or contains only stop words (returns 0).
Examples:

<programlisting>
SELECT numnode(plainto_tsquery('the any'));
NOTICE:  query contains only stopword(s) or doesn't contain lexeme(s), ignored
 numnode
---------
       0

SELECT numnode('foo &amp; bar'::tsquery);
 numnode
---------
       3
</programlisting>
</para>
</listitem>
</varlistentry>

<varlistentry>

<indexterm>
<primary>querytree</primary>
</indexterm>

<term>
<synopsis>
querytree(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>) returns <type>text</>
</synopsis>
</term>

<listitem>
<para>
Returns the portion of a <type>tsquery</> that can be used for
searching an index. This function is useful for detecting
unindexable queries, for example those containing only stop words
or only negated terms. For example:

<programlisting>
SELECT querytree(to_tsquery('!defined'));
 querytree
-----------
 T
</programlisting>
</para>
</listitem>
</varlistentry>

</variablelist>

<sect3 id="textsearch-query-rewriting">
<title>Query Rewriting</title>

<indexterm zone="textsearch-query-rewriting">
<primary>ts_rewrite</primary>
</indexterm>

<para>
The <function>ts_rewrite</function> family of functions search a
given <type>tsquery</> for occurrences of a target
subquery, and replace each occurrence with a
substitute subquery. In essence this operation is a
<type>tsquery</>-specific version of substring replacement.
A target and substitute combination can be
thought of as a <firstterm>query rewrite rule</>. A collection
of such rewrite rules can be a powerful search aid.
For example, you can expand the search using synonyms
(e.g., <literal>new york</>, <literal>big apple</>, <literal>nyc</>,
<literal>gotham</>) or narrow the search to direct the user to some hot
topic. There is some overlap in functionality between this feature
and thesaurus dictionaries (<xref linkend="textsearch-thesaurus">).
However, you can modify a set of rewrite rules on-the-fly without
reindexing, whereas updating a thesaurus requires reindexing to be
effective.
</para>

<variablelist>

<varlistentry>

<term>
<synopsis>
ts_rewrite (<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">target</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">substitute</replaceable> <type>tsquery</>) returns <type>tsquery</>
</synopsis>
</term>

<listitem>
<para>
This form of <function>ts_rewrite</> simply applies a single
rewrite rule: <replaceable class="PARAMETER">target</replaceable>
is replaced by <replaceable class="PARAMETER">substitute</replaceable>
wherever it appears in <replaceable
class="PARAMETER">query</replaceable>. For example:

<programlisting>
SELECT ts_rewrite('a &amp; b'::tsquery, 'a'::tsquery, 'c'::tsquery);
 ts_rewrite
------------
 'b' &amp; 'c'
</programlisting>
</para>
</listitem>
</varlistentry>

<varlistentry>

<term>
<synopsis>
ts_rewrite (<replaceable class="PARAMETER">query</> <type>tsquery</>, <replaceable class="PARAMETER">select</> <type>text</>) returns <type>tsquery</>
</synopsis>
</term>

<listitem>
<para>
This form of <function>ts_rewrite</> accepts a starting
<replaceable>query</> and a SQL <replaceable>select</> command, which
is given as a text string. The <replaceable>select</> must yield two
columns of <type>tsquery</> type. For each row of the
<replaceable>select</> result, occurrences of the first column value
(the target) are replaced by the second column value (the substitute)
within the current <replaceable>query</> value. For example:

<programlisting>
CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
INSERT INTO aliases VALUES('a', 'c');

SELECT ts_rewrite('a &amp; b'::tsquery, 'SELECT t,s FROM aliases');
 ts_rewrite
------------
 'b' &amp; 'c'
</programlisting>
</para>

<para>
Note that when multiple rewrite rules are applied in this way,
the order of application can be important; so in practice you will
want the source query to <literal>ORDER BY</> some ordering key.
</para>
</listitem>
</varlistentry>

</variablelist>
1560 <para>
1561 Let's consider a real-life astronomical example. We'll expand query
1562 <literal>supernovae</literal> using table-driven rewriting rules:
1564 <programlisting>
1565 CREATE TABLE aliases (t tsquery primary key, s tsquery);
1566 INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));
1568 SELECT ts_rewrite(to_tsquery('supernovae &amp; crab'), 'SELECT * FROM aliases');
1569 ts_rewrite
1570 ---------------------------------
1571 'crab' &amp; ( 'supernova' | 'sn' )
1572 </programlisting>
1574 We can change the rewriting rules just by updating the table:
1576 <programlisting>
1577 UPDATE aliases
1578 SET s = to_tsquery('supernovae|sn &amp; !nebulae')
1579 WHERE t = to_tsquery('supernovae');
1581 SELECT ts_rewrite(to_tsquery('supernovae &amp; crab'), 'SELECT * FROM aliases');
1582 ts_rewrite
1583 ---------------------------------------------
1584 'crab' &amp; ( 'supernova' | 'sn' &amp; !'nebula' )
1585 </programlisting>
1586 </para>
1588 <para>
1589 Rewriting can be slow when there are many rewriting rules, since it
1590 checks every rule for a possible match. To filter out obvious non-candidate
1591 rules we can use the containment operators for the <type>tsquery</type>
1592 type. In the example below, we select only those rules which might match
1593 the original query:
1595 <programlisting>
1596 SELECT ts_rewrite('a &amp; b'::tsquery,
1597 'SELECT t,s FROM aliases WHERE ''a &amp; b''::tsquery @&gt; t');
1598 ts_rewrite
1599 ------------
1600 'b' &amp; 'c'
1601 </programlisting>
1602 </para>
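<para>
If the rules table is large, an index can speed up the rule selection,
since <type>tsquery</> columns can be indexed with GiST (see <xref
linkend="textsearch-indexes">). A sketch (the index name is arbitrary):

<programlisting>
CREATE INDEX aliases_t_idx ON aliases USING gist(t);
</programlisting>
</para>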
1604 </sect3>
1606 </sect2>
1608 <sect2 id="textsearch-update-triggers">
1609 <title>Triggers for Automatic Updates</title>
1611 <indexterm>
1612 <primary>trigger</primary>
1613 <secondary>for updating a derived tsvector column</secondary>
1614 </indexterm>
1616 <para>
1617 When using a separate column to store the <type>tsvector</> representation
1618 of your documents, it is necessary to create a trigger to update the
1619 <type>tsvector</> column when the document content columns change.
1620 Two built-in trigger functions are available for this, or you can write
1621 your own.
1622 </para>
1624 <synopsis>
1625 tsvector_update_trigger(<replaceable class="PARAMETER">tsvector_column_name</replaceable>, <replaceable class="PARAMETER">config_name</replaceable>, <replaceable class="PARAMETER">text_column_name</replaceable> <optional>, ... </optional>)
1626 tsvector_update_trigger_column(<replaceable class="PARAMETER">tsvector_column_name</replaceable>, <replaceable class="PARAMETER">config_column_name</replaceable>, <replaceable class="PARAMETER">text_column_name</replaceable> <optional>, ... </optional>)
1627 </synopsis>
1629 <para>
1630 These trigger functions automatically compute a <type>tsvector</>
1631 column from one or more textual columns, under the control of
1632 parameters specified in the <command>CREATE TRIGGER</> command.
1633 An example of their use is:
1635 <programlisting>
1636 CREATE TABLE messages (
1637 title text,
1638 body text,
    tsv tsvector
);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
1643 ON messages FOR EACH ROW EXECUTE PROCEDURE
1644 tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
1646 INSERT INTO messages VALUES('title here', 'the body text is here');
1648 SELECT * FROM messages;
1649 title | body | tsv
1650 ------------+-----------------------+----------------------------
1651 title here | the body text is here | 'bodi':4 'text':5 'titl':1
1653 SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title &amp; body');
1654 title | body
1655 ------------+-----------------------
1656 title here | the body text is here
1657 </programlisting>
1659 Having created this trigger, any change in <structfield>title</> or
1660 <structfield>body</> will automatically be reflected into
1661 <structfield>tsv</>, without the application having to worry about it.
1662 </para>
1664 <para>
1665 The first trigger argument must be the name of the <type>tsvector</>
1666 column to be updated. The second argument specifies the text search
1667 configuration to be used to perform the conversion. For
1668 <function>tsvector_update_trigger</>, the configuration name is simply
1669 given as the second trigger argument. It must be schema-qualified as
1670 shown above, so that the trigger behavior will not change with changes
1671 in <varname>search_path</>. For
1672 <function>tsvector_update_trigger_column</>, the second trigger argument
1673 is the name of another table column, which must be of type
1674 <type>regconfig</>. This allows a per-row selection of configuration
1675 to be made. The remaining argument(s) are the names of textual columns
1676 (of type <type>text</>, <type>varchar</>, or <type>char</>). These
1677 will be included in the document in the order given. NULL values will
1678 be skipped (but the other columns will still be indexed).
1679 </para>
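<para>
As a sketch of the per-row case, a table could carry its own
configuration choice in a <type>regconfig</> column (the table and
column names here are illustrative only):

<programlisting>
CREATE TABLE messages_i18n (
    config_name regconfig,
    body        text,
    tsv         tsvector
);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
    ON messages_i18n FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger_column(tsv, config_name, body);
</programlisting>
</para>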
1681 <para>
1682 A limitation of these built-in triggers is that they treat all the
1683 input columns alike. To process columns differently &mdash; for
1684 example, to weight title differently from body &mdash; it is necessary
1685 to write a custom trigger. Here is an example using
1686 <application>PL/pgSQL</application> as the trigger language:
1688 <programlisting>
1689 CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
1690 begin
1691 new.tsv :=
1692 setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
1693 setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
  return new;
end
$$ LANGUAGE plpgsql;
1698 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
1699 ON messages FOR EACH ROW EXECUTE PROCEDURE messages_trigger();
1700 </programlisting>
1701 </para>
1703 <para>
1704 Keep in mind that it is important to specify the configuration name
1705 explicitly when creating <type>tsvector</> values inside triggers,
1706 so that the column's contents will not be affected by changes to
1707 <varname>default_text_search_config</>. Failure to do this is likely to
1708 lead to problems such as search results changing after a dump and reload.
1709 </para>
1711 </sect2>
1713 <sect2 id="textsearch-statistics">
1714 <title>Gathering Document Statistics</title>
1716 <indexterm>
1717 <primary>ts_stat</primary>
1718 </indexterm>
1720 <para>
1721 The function <function>ts_stat</> is useful for checking your
1722 configuration and for finding stop-word candidates.
1723 </para>
1725 <synopsis>
1726 ts_stat(<replaceable class="PARAMETER">sqlquery</replaceable> <type>text</>, <optional> <replaceable class="PARAMETER">weights</replaceable> <type>text</>, </optional>
1727 OUT <replaceable class="PARAMETER">word</replaceable> <type>text</>, OUT <replaceable class="PARAMETER">ndoc</replaceable> <type>integer</>,
1728 OUT <replaceable class="PARAMETER">nentry</replaceable> <type>integer</>) returns <type>setof record</>
1729 </synopsis>
1731 <para>
1732 <replaceable>sqlquery</replaceable> is a text value containing an SQL
1733 query which must return a single <type>tsvector</type> column.
1734 <function>ts_stat</> executes the query and returns statistics about
1735 each distinct lexeme (word) contained in the <type>tsvector</type>
1736 data. The columns returned are
1738 <itemizedlist spacing="compact" mark="bullet">
1739 <listitem>
1740 <para>
1741 <replaceable>word</> <type>text</> &mdash; the value of a lexeme
1742 </para>
1743 </listitem>
1744 <listitem>
1745 <para>
1746 <replaceable>ndoc</> <type>integer</> &mdash; number of documents
1747 (<type>tsvector</>s) the word occurred in
1748 </para>
1749 </listitem>
1750 <listitem>
1751 <para>
1752 <replaceable>nentry</> <type>integer</> &mdash; total number of
1753 occurrences of the word
1754 </para>
1755 </listitem>
1756 </itemizedlist>
1758 If <replaceable>weights</replaceable> is supplied, only occurrences
1759 having one of those weights are counted.
1760 </para>
1762 <para>
1763 For example, to find the ten most frequent words in a document collection:
1765 <programlisting>
1766 SELECT * FROM ts_stat('SELECT vector FROM apod')
1767 ORDER BY nentry DESC, ndoc DESC, word
1768 LIMIT 10;
1769 </programlisting>
1771 The same, but counting only word occurrences with weight <literal>A</>
1772 or <literal>B</>:
1774 <programlisting>
1775 SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
1776 ORDER BY nentry DESC, ndoc DESC, word
1777 LIMIT 10;
1778 </programlisting>
1779 </para>
1781 </sect2>
1783 </sect1>
1785 <sect1 id="textsearch-parsers">
1786 <title>Parsers</title>
1788 <para>
1789 Text search parsers are responsible for splitting raw document text
1790 into <firstterm>tokens</> and identifying each token's type, where
1791 the set of possible types is defined by the parser itself.
1792 Note that a parser does not modify the text at all &mdash; it simply
1793 identifies plausible word boundaries. Because of this limited scope,
1794 there is less need for application-specific custom parsers than there is
1795 for custom dictionaries. At present <productname>PostgreSQL</productname>
1796 provides just one built-in parser, which has been found to be useful for a
1797 wide range of applications.
1798 </para>
1800 <para>
1801 The built-in parser is named <literal>pg_catalog.default</>.
1802 It recognizes 23 token types:
1803 </para>
1805 <table id="textsearch-default-parser">
1806 <title>Default Parser's Token Types</title>
1807 <tgroup cols="3">
1808 <thead>
1809 <row>
1810 <entry>Alias</entry>
1811 <entry>Description</entry>
1812 <entry>Example</entry>
1813 </row>
1814 </thead>
1815 <tbody>
1816 <row>
1817 <entry><literal>asciiword</></entry>
1818 <entry>Word, all ASCII letters</entry>
1819 <entry><literal>elephant</literal></entry>
1820 </row>
1821 <row>
1822 <entry><literal>word</></entry>
1823 <entry>Word, all letters</entry>
1824 <entry><literal>ma&ntilde;ana</literal></entry>
1825 </row>
1826 <row>
1827 <entry><literal>numword</></entry>
1828 <entry>Word, letters and digits</entry>
1829 <entry><literal>beta1</literal></entry>
1830 </row>
1831 <row>
1832 <entry><literal>asciihword</></entry>
1833 <entry>Hyphenated word, all ASCII</entry>
1834 <entry><literal>up-to-date</literal></entry>
1835 </row>
1836 <row>
1837 <entry><literal>hword</></entry>
1838 <entry>Hyphenated word, all letters</entry>
1839 <entry><literal>l&oacute;gico-matem&aacute;tica</literal></entry>
1840 </row>
1841 <row>
1842 <entry><literal>numhword</></entry>
1843 <entry>Hyphenated word, letters and digits</entry>
1844 <entry><literal>postgresql-beta1</literal></entry>
1845 </row>
1846 <row>
1847 <entry><literal>hword_asciipart</></entry>
1848 <entry>Hyphenated word part, all ASCII</entry>
1849 <entry><literal>postgresql</literal> in the context <literal>postgresql-beta1</literal></entry>
1850 </row>
1851 <row>
1852 <entry><literal>hword_part</></entry>
1853 <entry>Hyphenated word part, all letters</entry>
1854 <entry><literal>l&oacute;gico</literal> or <literal>matem&aacute;tica</literal>
1855 in the context <literal>l&oacute;gico-matem&aacute;tica</literal></entry>
1856 </row>
1857 <row>
1858 <entry><literal>hword_numpart</></entry>
1859 <entry>Hyphenated word part, letters and digits</entry>
1860 <entry><literal>beta1</literal> in the context
1861 <literal>postgresql-beta1</literal></entry>
1862 </row>
1863 <row>
1864 <entry><literal>email</></entry>
1865 <entry>Email address</entry>
1866 <entry><literal>foo@example.com</literal></entry>
1867 </row>
1868 <row>
1869 <entry><literal>protocol</></entry>
1870 <entry>Protocol head</entry>
1871 <entry><literal>http://</literal></entry>
1872 </row>
1873 <row>
1874 <entry><literal>url</></entry>
1875 <entry>URL</entry>
1876 <entry><literal>example.com/stuff/index.html</literal></entry>
1877 </row>
1878 <row>
1879 <entry><literal>host</></entry>
1880 <entry>Host</entry>
1881 <entry><literal>example.com</literal></entry>
1882 </row>
1883 <row>
1884 <entry><literal>url_path</></entry>
1885 <entry>URL path</entry>
1886 <entry><literal>/stuff/index.html</literal>, in the context of a URL</entry>
1887 </row>
1888 <row>
1889 <entry><literal>file</></entry>
1890 <entry>File or path name</entry>
1891 <entry><literal>/usr/local/foo.txt</literal>, if not within a URL</entry>
1892 </row>
1893 <row>
1894 <entry><literal>sfloat</></entry>
1895 <entry>Scientific notation</entry>
1896 <entry><literal>-1.234e56</literal></entry>
1897 </row>
1898 <row>
1899 <entry><literal>float</></entry>
1900 <entry>Decimal notation</entry>
1901 <entry><literal>-1.234</literal></entry>
1902 </row>
1903 <row>
1904 <entry><literal>int</></entry>
1905 <entry>Signed integer</entry>
1906 <entry><literal>-1234</literal></entry>
1907 </row>
1908 <row>
1909 <entry><literal>uint</></entry>
1910 <entry>Unsigned integer</entry>
1911 <entry><literal>1234</literal></entry>
1912 </row>
1913 <row>
1914 <entry><literal>version</></entry>
1915 <entry>Version number</entry>
1916 <entry><literal>8.3.0</literal></entry>
1917 </row>
1918 <row>
1919 <entry><literal>tag</></entry>
1920 <entry>XML tag</entry>
1921 <entry><literal>&lt;a href="dictionaries.html"&gt;</literal></entry>
1922 </row>
1923 <row>
1924 <entry><literal>entity</></entry>
1925 <entry>XML entity</entry>
1926 <entry><literal>&amp;amp;</literal></entry>
1927 </row>
1928 <row>
1929 <entry><literal>blank</></entry>
1930 <entry>Space symbols</entry>
1931 <entry>(any whitespace or punctuation not otherwise recognized)</entry>
1932 </row>
1933 </tbody>
1934 </tgroup>
1935 </table>
1937 <note>
1938 <para>
1939 The parser's notion of a <quote>letter</> is determined by the database's
1940 locale setting, specifically <varname>lc_ctype</>. Words containing
1941 only the basic ASCII letters are reported as a separate token type,
1942 since it is sometimes useful to distinguish them. In most European
1943 languages, token types <literal>word</> and <literal>asciiword</>
1944 should be treated alike.
1945 </para>
1946 </note>
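<para>
You can check how the parser classifies a particular word with
<function>ts_debug</function> (see <xref linkend="textsearch-debugging">).
For example, in a database whose <varname>lc_ctype</> treats
<literal>&ntilde;</literal> as a letter, the following reports
<literal>ma&ntilde;ana</literal> as token type <literal>word</> rather
than <literal>asciiword</>:

<programlisting>
SELECT alias, token FROM ts_debug('ma&ntilde;ana');
</programlisting>
</para>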
1948 <para>
1949 It is possible for the parser to produce overlapping tokens from the same
1950 piece of text. As an example, a hyphenated word will be reported both
1951 as the entire word and as each component:
1953 <programlisting>
1954 SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
1955 alias | description | token
1956 -----------------+------------------------------------------+---------------
1957 numhword | Hyphenated word, letters and digits | foo-bar-beta1
1958 hword_asciipart | Hyphenated word part, all ASCII | foo
1959 blank | Space symbols | -
1960 hword_asciipart | Hyphenated word part, all ASCII | bar
1961 blank | Space symbols | -
1962 hword_numpart | Hyphenated word part, letters and digits | beta1
1963 </programlisting>
1965 This behavior is desirable since it allows searches to work for both
1966 the whole compound word and for components. Here is another
1967 instructive example:
1969 <programlisting>
1970 SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
1971 alias | description | token
1972 ----------+---------------+------------------------------
1973 protocol | Protocol head | http://
1974 url | URL | example.com/stuff/index.html
1975 host | Host | example.com
1976 url_path | URL path | /stuff/index.html
1977 </programlisting>
1978 </para>
1980 </sect1>
1982 <sect1 id="textsearch-dictionaries">
1983 <title>Dictionaries</title>
1985 <para>
1986 Dictionaries are used to eliminate words that should not be considered in a
1987 search (<firstterm>stop words</>), and to <firstterm>normalize</> words so
1988 that different derived forms of the same word will match. A successfully
1989 normalized word is called a <firstterm>lexeme</>. Aside from
1990 improving search quality, normalization and removal of stop words reduce the
1991 size of the <type>tsvector</type> representation of a document, thereby
1992 improving performance. Normalization does not always have linguistic meaning
1993 and usually depends on application semantics.
1994 </para>
1996 <para>
1997 Some examples of normalization:
1999 <itemizedlist spacing="compact" mark="bullet">
2001 <listitem>
2002 <para>
2003 Linguistic - Ispell dictionaries try to reduce input words to a
2004 normalized form; stemmer dictionaries remove word endings
2005 </para>
2006 </listitem>
2007 <listitem>
2008 <para>
2009 <acronym>URL</acronym> locations can be canonicalized to make
2010 equivalent URLs match:
2012 <itemizedlist spacing="compact" mark="bullet">
2013 <listitem>
2014 <para>
2015 http://www.pgsql.ru/db/mw/index.html
2016 </para>
2017 </listitem>
2018 <listitem>
2019 <para>
2020 http://www.pgsql.ru/db/mw/
2021 </para>
2022 </listitem>
2023 <listitem>
2024 <para>
2025 http://www.pgsql.ru/db/../db/mw/index.html
2026 </para>
2027 </listitem>
2028 </itemizedlist>
2029 </para>
2030 </listitem>
2031 <listitem>
2032 <para>
2033 Color names can be replaced by their hexadecimal values, e.g.,
2034 <literal>red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</literal>
2035 </para>
2036 </listitem>
2037 <listitem>
2038 <para>
2039 If indexing numbers, we can
2040 remove some fractional digits to reduce the range of possible
2041 numbers, so for example <emphasis>3.14</emphasis>159265359,
2042 <emphasis>3.14</emphasis>15926, <emphasis>3.14</emphasis> will be the same
2043 after normalization if only two digits are kept after the decimal point.
2044 </para>
2045 </listitem>
2046 </itemizedlist>
2048 </para>
2050 <para>
2051 A dictionary is a program that accepts a token as
2052 input and returns:
2053 <itemizedlist spacing="compact" mark="bullet">
2054 <listitem>
2055 <para>
2056 an array of lexemes if the input token is known to the dictionary
2057 (notice that one token can produce more than one lexeme)
2058 </para>
2059 </listitem>
2060 <listitem>
2061 <para>
2062 an empty array if the dictionary knows the token, but it is a stop word
2063 </para>
2064 </listitem>
2065 <listitem>
2066 <para>
2067 <literal>NULL</literal> if the dictionary does not recognize the input token
2068 </para>
2069 </listitem>
2070 </itemizedlist>
2071 </para>
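<para>
The function <function>ts_lexize</function> (described in <xref
linkend="textsearch-dictionary-testing">) exposes these outcomes
directly. For example, with the built-in English Snowball stemmer,
<literal>stars</literal> is a known word while <literal>a</literal> is a
stop word:

<programlisting>
SELECT ts_lexize('english_stem', 'stars');  -- returns {star}
SELECT ts_lexize('english_stem', 'a');      -- returns {}
</programlisting>
</para>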
2073 <para>
2074 <productname>PostgreSQL</productname> provides predefined dictionaries for
2075 many languages. There are also several predefined templates that can be
2076 used to create new dictionaries with custom parameters. Each predefined
2077 dictionary template is described below. If no existing
2078 template is suitable, it is possible to create new ones; see the
2079 <filename>contrib/</> area of the <productname>PostgreSQL</> distribution
2080 for examples.
2081 </para>
2083 <para>
2084 A text search configuration binds a parser together with a set of
2085 dictionaries to process the parser's output tokens. For each token
2086 type that the parser can return, a separate list of dictionaries is
2087 specified by the configuration. When a token of that type is found
2088 by the parser, each dictionary in the list is consulted in turn,
2089 until some dictionary recognizes it as a known word. If it is identified
2090 as a stop word, or if no dictionary recognizes the token, it will be
2091 discarded and not indexed or searched for.
2092 The general rule for configuring a list of dictionaries
2093 is to place first the most narrow, most specific dictionary, then the more
2094 general dictionaries, finishing with a very general dictionary, like
2095 a <application>Snowball</> stemmer or <literal>simple</>, which
2096 recognizes everything. For example, for an astronomy-specific search
2097 (<literal>astro_en</literal> configuration) one could bind token type
2098 <type>asciiword</type> (ASCII word) to a synonym dictionary of astronomical
2099 terms, a general English dictionary and a <application>Snowball</> English
2100 stemmer:
2102 <programlisting>
2103 ALTER TEXT SEARCH CONFIGURATION astro_en
2104 ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
2105 </programlisting>
2106 </para>
2108 <sect2 id="textsearch-stopwords">
2109 <title>Stop Words</title>
2111 <para>
2112 Stop words are words that are very common, appear in almost every
2113 document, and have no discrimination value. Therefore, they can be ignored
2114 in the context of full text searching. For example, every English text
2115 contains words like <literal>a</literal> and <literal>the</>, so it is
2116 useless to store them in an index. However, stop words do affect the
2117 positions in <type>tsvector</type>, which in turn affect ranking:
2119 <programlisting>
2120 SELECT to_tsvector('english','in the list of stop words');
2121 to_tsvector
2122 ----------------------------
2123 'list':3 'stop':5 'word':6
2124 </programlisting>
The missing positions 1, 2, and 4 are due to stop words. Ranks
calculated for documents with and without stop words are quite different:
2129 <programlisting>
2130 SELECT ts_rank_cd (to_tsvector('english','in the list of stop words'), to_tsquery('list &amp; stop'));
2131 ts_rank_cd
2132 ------------
2133 0.05
2135 SELECT ts_rank_cd (to_tsvector('english','list stop words'), to_tsquery('list &amp; stop'));
2136 ts_rank_cd
2137 ------------
        0.1
</programlisting>
2141 </para>
2143 <para>
2144 It is up to the specific dictionary how it treats stop words. For example,
2145 <literal>ispell</literal> dictionaries first normalize words and then
2146 look at the list of stop words, while <literal>Snowball</literal> stemmers
2147 first check the list of stop words. The reason for the different
2148 behavior is an attempt to decrease noise.
2149 </para>
2151 </sect2>
2153 <sect2 id="textsearch-simple-dictionary">
2154 <title>Simple Dictionary</title>
2156 <para>
2157 The <literal>simple</> dictionary template operates by converting the
2158 input token to lower case and checking it against a file of stop words.
2159 If it is found in the file then an empty array is returned, causing
2160 the token to be discarded. If not, the lower-cased form of the word
2161 is returned as the normalized lexeme. Alternatively, the dictionary
2162 can be configured to report non-stop-words as unrecognized, allowing
2163 them to be passed on to the next dictionary in the list.
2164 </para>
2166 <para>
2167 Here is an example of a dictionary definition using the <literal>simple</>
2168 template:
2170 <programlisting>
2171 CREATE TEXT SEARCH DICTIONARY public.simple_dict (
2172 TEMPLATE = pg_catalog.simple,
2173 STOPWORDS = english
);
</programlisting>
2177 Here, <literal>english</literal> is the base name of a file of stop words.
2178 The file's full name will be
2179 <filename>$SHAREDIR/tsearch_data/english.stop</>,
2180 where <literal>$SHAREDIR</> means the
2181 <productname>PostgreSQL</productname> installation's shared-data directory,
2182 often <filename>/usr/local/share/postgresql</> (use <command>pg_config
2183 --sharedir</> to determine it if you're not sure).
2184 The file format is simply a list
2185 of words, one per line. Blank lines and trailing spaces are ignored,
2186 and upper case is folded to lower case, but no other processing is done
2187 on the file contents.
2188 </para>
2190 <para>
2191 Now we can test our dictionary:
2193 <programlisting>
2194 SELECT ts_lexize('public.simple_dict','YeS');
2195 ts_lexize
2196 -----------
2197 {yes}
2199 SELECT ts_lexize('public.simple_dict','The');
2200 ts_lexize
2201 -----------
 {}
</programlisting>
2204 </para>
2206 <para>
2207 We can also choose to return <literal>NULL</>, instead of the lower-cased
2208 word, if it is not found in the stop words file. This behavior is
2209 selected by setting the dictionary's <literal>Accept</> parameter to
2210 <literal>false</>. Continuing the example:
2212 <programlisting>
2213 ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );
2215 SELECT ts_lexize('public.simple_dict','YeS');
2216 ts_lexize
-----------


SELECT ts_lexize('public.simple_dict','The');
 ts_lexize
-----------
 {}
</programlisting>
2225 </para>
2227 <para>
2228 With the default setting of <literal>Accept</> = <literal>true</>,
2229 it is only useful to place a <literal>simple</> dictionary at the end
2230 of a list of dictionaries, since it will never pass on any token to
2231 a following dictionary. Conversely, <literal>Accept</> = <literal>false</>
2232 is only useful when there is at least one following dictionary.
2233 </para>
2235 <caution>
2236 <para>
2237 Most types of dictionaries rely on configuration files, such as files of
2238 stop words. These files <emphasis>must</> be stored in UTF-8 encoding.
2239 They will be translated to the actual database encoding, if that is
2240 different, when they are read into the server.
2241 </para>
2242 </caution>
2244 <caution>
2245 <para>
2246 Normally, a database session will read a dictionary configuration file
2247 only once, when it is first used within the session. If you modify a
2248 configuration file and want to force existing sessions to pick up the
2249 new contents, issue an <command>ALTER TEXT SEARCH DICTIONARY</> command
2250 on the dictionary. This can be a <quote>dummy</> update that doesn't
2251 actually change any parameter values.
2252 </para>
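<para>
For example, reissuing an unchanged parameter value for the dictionary
defined above is sufficient to force a reload:

<programlisting>
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( StopWords = english );
</programlisting>
</para>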
2253 </caution>
2255 </sect2>
2257 <sect2 id="textsearch-synonym-dictionary">
2258 <title>Synonym Dictionary</title>
2260 <para>
2261 This dictionary template is used to create dictionaries that replace a
2262 word with a synonym. Phrases are not supported (use the thesaurus
2263 template (<xref linkend="textsearch-thesaurus">) for that). A synonym
2264 dictionary can be used to overcome linguistic problems, for example, to
2265 prevent an English stemmer dictionary from reducing the word 'Paris' to
2266 'pari'. It is enough to have a <literal>Paris paris</literal> line in the
2267 synonym dictionary and put it before the <literal>english_stem</>
2268 dictionary. For example:
2270 <programlisting>
2271 SELECT * FROM ts_debug('english', 'Paris');
2272 alias | description | token | dictionaries | dictionary | lexemes
2273 -----------+-----------------+-------+----------------+--------------+---------
2274 asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}
2276 CREATE TEXT SEARCH DICTIONARY my_synonym (
2277 TEMPLATE = synonym,
2278 SYNONYMS = my_synonyms
);

ALTER TEXT SEARCH CONFIGURATION english
2282 ALTER MAPPING FOR asciiword
2283 WITH my_synonym, english_stem;
2285 SELECT * FROM ts_debug('english', 'Paris');
2286 alias | description | token | dictionaries | dictionary | lexemes
2287 -----------+-----------------+-------+---------------------------+------------+---------
2288 asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
2289 </programlisting>
2290 </para>
2292 <para>
2293 The only parameter required by the <literal>synonym</> template is
2294 <literal>SYNONYMS</>, which is the base name of its configuration file
2295 &mdash; <literal>my_synonyms</> in the above example.
2296 The file's full name will be
2297 <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</>
2298 (where <literal>$SHAREDIR</> means the
2299 <productname>PostgreSQL</> installation's shared-data directory).
2300 The file format is just one line
2301 per word to be substituted, with the word followed by its synonym,
2302 separated by white space. Blank lines and trailing spaces are ignored.
2303 </para>
2305 <para>
2306 The <literal>synonym</> template also has an optional parameter
2307 <literal>CaseSensitive</>, which defaults to <literal>false</>. When
2308 <literal>CaseSensitive</> is <literal>false</>, words in the synonym file
2309 are folded to lower case, as are input tokens. When it is
2310 <literal>true</>, words and tokens are not folded to lower case,
2311 but are compared as-is.
2312 </para>
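<para>
For example, a case-sensitive variant of the dictionary above could be
defined like this (a sketch; the dictionary name is arbitrary):

<programlisting>
CREATE TEXT SEARCH DICTIONARY my_synonym_cs (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms,
    CaseSensitive = true
);
</programlisting>
</para>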
2313 </sect2>
2315 <sect2 id="textsearch-thesaurus">
2316 <title>Thesaurus Dictionary</title>
2318 <para>
2319 A thesaurus dictionary (sometimes abbreviated as <acronym>TZ</acronym>) is
2320 a collection of words that includes information about the relationships
2321 of words and phrases, i.e., broader terms (<acronym>BT</acronym>), narrower
2322 terms (<acronym>NT</acronym>), preferred terms, non-preferred terms, related
2323 terms, etc.
2324 </para>
2326 <para>
2327 Basically a thesaurus dictionary replaces all non-preferred terms by one
2328 preferred term and, optionally, preserves the original terms for indexing
2329 as well. <productname>PostgreSQL</>'s current implementation of the
2330 thesaurus dictionary is an extension of the synonym dictionary with added
2331 <firstterm>phrase</firstterm> support. A thesaurus dictionary requires
2332 a configuration file of the following format:
2334 <programlisting>
2335 # this is a comment
2336 sample word(s) : indexed word(s)
2337 more sample word(s) : more indexed word(s)
2339 </programlisting>
where the colon (<symbol>:</symbol>) symbol acts as a delimiter between
a phrase and its replacement.
2343 </para>
2345 <para>
2346 A thesaurus dictionary uses a <firstterm>subdictionary</firstterm> (which
2347 is specified in the dictionary's configuration) to normalize the input
2348 text before checking for phrase matches. It is only possible to select one
2349 subdictionary. An error is reported if the subdictionary fails to
2350 recognize a word. In that case, you should remove the use of the word or
2351 teach the subdictionary about it. You can place an asterisk
2352 (<symbol>*</symbol>) at the beginning of an indexed word to skip applying
2353 the subdictionary to it, but all sample words <emphasis>must</> be known
2354 to the subdictionary.
2355 </para>
2357 <para>
2358 The thesaurus dictionary chooses the longest match if there are multiple
2359 phrases matching the input, and ties are broken by using the last
2360 definition.
2361 </para>
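<para>
For example, given these (hypothetical) configuration-file entries:

<programlisting>
supernovae : sn
supernovae stars : sn
</programlisting>

an input phrase <literal>supernovae stars</literal> is replaced using the
second, longer rule rather than the one-word rule.
</para>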
2363 <para>
2364 Specific stop words recognized by the subdictionary cannot be
2365 specified; instead use <literal>?</> to mark the location where any
2366 stop word can appear. For example, assuming that <literal>a</> and
2367 <literal>the</> are stop words according to the subdictionary:
2369 <programlisting>
2370 ? one ? two : swsw
2371 </programlisting>
2373 matches <literal>a one the two</> and <literal>the one a two</>;
2374 both would be replaced by <literal>swsw</>.
2375 </para>
2377 <para>
Since a thesaurus dictionary has the capability to recognize phrases, it
must remember its state and interact with the parser. A thesaurus
dictionary uses the configuration's token-type assignments to decide
whether it should handle the next word or stop accumulating a phrase,
so its mappings must be configured carefully. For example, if the
thesaurus dictionary is assigned to handle only the
<literal>asciiword</literal> token, then a thesaurus dictionary
definition like <literal>one 7</> will not work, since token type
<literal>uint</literal> is not assigned to the thesaurus dictionary.
2386 </para>
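<para>
To make such a rule usable, the thesaurus must also be listed for the
other token types it needs; a sketch (using the
<literal>thesaurus_simple</literal> dictionary defined below, with
<literal>simple</> as a fallback):

<programlisting>
ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR uint WITH thesaurus_simple, simple;
</programlisting>
</para>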
2388 <caution>
2389 <para>
Thesauruses are used during indexing, so any change in the thesaurus
dictionary's parameters <emphasis>requires</emphasis> reindexing.
For most other dictionary types, small changes such as adding or
removing stop words do not force reindexing.
2394 </para>
2395 </caution>
2397 <sect3 id="textsearch-thesaurus-config">
2398 <title>Thesaurus Configuration</title>
2400 <para>
2401 To define a new thesaurus dictionary, use the <literal>thesaurus</>
2402 template. For example:
2404 <programlisting>
2405 CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
2406 TEMPLATE = thesaurus,
2407 DictFile = mythesaurus,
2408 Dictionary = pg_catalog.english_stem
);
</programlisting>
2412 Here:
2413 <itemizedlist spacing="compact" mark="bullet">
2414 <listitem>
2415 <para>
2416 <literal>thesaurus_simple</literal> is the new dictionary's name
2417 </para>
2418 </listitem>
2419 <listitem>
2420 <para>
2421 <literal>mythesaurus</literal> is the base name of the thesaurus
2422 configuration file.
2423 (Its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</>,
2424 where <literal>$SHAREDIR</> means the installation shared-data
2425 directory.)
2426 </para>
2427 </listitem>
2428 <listitem>
2429 <para>
2430 <literal>pg_catalog.english_stem</literal> is the subdictionary (here,
2431 a Snowball English stemmer) to use for thesaurus normalization.
2432 Notice that the subdictionary will have its own
2433 configuration (for example, stop words), which is not shown here.
2434 </para>
2435 </listitem>
2436 </itemizedlist>
2438 Now it is possible to bind the thesaurus dictionary <literal>thesaurus_simple</literal>
2439 to the desired token types in a configuration, for example:
2441 <programlisting>
2442 ALTER TEXT SEARCH CONFIGURATION russian
2443 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
2444 WITH thesaurus_simple;
2445 </programlisting>
2446 </para>
2448 </sect3>
2450 <sect3 id="textsearch-thesaurus-examples">
2451 <title>Thesaurus Example</title>
2453 <para>
2454 Consider a simple astronomical thesaurus <literal>thesaurus_astro</literal>,
2455 which contains some astronomical word combinations:
2457 <programlisting>
2458 supernovae stars : sn
2459 crab nebulae : crab
2460 </programlisting>
2462 Below we create a dictionary and bind some token types to
2463 an astronomical thesaurus and English stemmer:
2465 <programlisting>
2466 CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
2467 TEMPLATE = thesaurus,
2468 DictFile = thesaurus_astro,
2469 Dictionary = english_stem
);

ALTER TEXT SEARCH CONFIGURATION russian
2473 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
2474 WITH thesaurus_astro, english_stem;
2475 </programlisting>
2477 Now we can see how it works.
2478 <function>ts_lexize</function> is not very useful for testing a thesaurus,
2479 because it treats its input as a single token. Instead we can use
2480 <function>plainto_tsquery</function> and <function>to_tsvector</function>
2481 which will break their input strings into multiple tokens:
2483 <programlisting>
2484 SELECT plainto_tsquery('supernova star');
2485 plainto_tsquery
2486 -----------------
2487 'sn'
2489 SELECT to_tsvector('supernova star');
2490 to_tsvector
2491 -------------
2492 'sn':1
2493 </programlisting>
In principle, you can use <function>to_tsquery</function> as well, if you
quote the argument:
2498 <programlisting>
2499 SELECT to_tsquery('''supernova star''');
2500 to_tsquery
2501 ------------
2502 'sn'
2503 </programlisting>
2505 Notice that <literal>supernova star</literal> matches <literal>supernovae
2506 stars</literal> in <literal>thesaurus_astro</literal> because we specified
2507 the <literal>english_stem</literal> stemmer in the thesaurus definition.
2508 The stemmer removed the <literal>e</> and <literal>s</>.
2509 </para>
2511 <para>
2512 To index the original phrase as well as the substitute, just include it
2513 in the right-hand part of the definition:
2515 <programlisting>
2516 supernovae stars : sn supernovae stars
2518 SELECT plainto_tsquery('supernova star');
2519 plainto_tsquery
2520 -----------------------------
2521 'sn' &amp; 'supernova' &amp; 'star'
2522 </programlisting>
2523 </para>
2525 </sect3>
2527 </sect2>
2529 <sect2 id="textsearch-ispell-dictionary">
2530 <title><application>Ispell</> Dictionary</title>
2532 <para>
2533 The <application>Ispell</> dictionary template supports
2534 <firstterm>morphological dictionaries</>, which can normalize many
2535 different linguistic forms of a word into the same lexeme. For example,
2536 an English <application>Ispell</> dictionary can match all declensions and
2537 conjugations of the search term <literal>bank</literal>, e.g.,
2538 <literal>banking</>, <literal>banked</>, <literal>banks</>,
2539 <literal>banks'</>, and <literal>bank's</>.
2540 </para>
2542 <para>
2543 The standard <productname>PostgreSQL</productname> distribution does
2544 not include any <application>Ispell</> configuration files.
2545 Dictionaries for a large number of languages are available from <ulink
2546 url="http://ficus-www.cs.ucla.edu/geoff/ispell.html">Ispell</ulink>.
2547 Also, some more modern dictionary file formats are supported &mdash; <ulink
url="http://en.wikipedia.org/wiki/MySpell">MySpell</ulink> (OpenOffice.org
versions before 2.0.1) and <ulink
url="http://sourceforge.net/projects/hunspell">Hunspell</ulink>
(OpenOffice.org 2.0.2 and later). A large list of dictionaries is
available on the <ulink
2551 url="http://wiki.services.openoffice.org/wiki/Dictionaries">OpenOffice
2552 Wiki</ulink>.
2553 </para>
2555 <para>
2556 To create an <application>Ispell</> dictionary, use the built-in
2557 <literal>ispell</literal> template and specify several parameters:
2558 </para>
2560 <programlisting>
2561 CREATE TEXT SEARCH DICTIONARY english_ispell (
2562 TEMPLATE = ispell,
2563 DictFile = english,
2564 AffFile = english,
2565 StopWords = english
);
</programlisting>
2569 <para>
2570 Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
2571 specify the base names of the dictionary, affixes, and stop-words files.
2572 The stop-words file has the same format explained above for the
2573 <literal>simple</> dictionary type. The format of the other files is
2574 not specified here but is available from the above-mentioned web sites.
2575 </para>
2577 <para>
2578 Ispell dictionaries usually recognize a limited set of words, so they
2579 should be followed by another broader dictionary; for
2580 example, a Snowball dictionary, which recognizes everything.
2581 </para>
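<para>
For example, assuming the <literal>english_ispell</literal> dictionary
defined above exists, it could be chained ahead of the built-in English
stemmer:

<programlisting>
ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
</programlisting>
</para>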
2583 <para>
Ispell dictionaries support splitting compound words, a useful feature.
2586 Notice that the affix file should specify a special flag using the
2587 <literal>compoundwords controlled</literal> statement that marks dictionary
2588 words that can participate in compound formation:
2590 <programlisting>
2591 compoundwords controlled z
2592 </programlisting>
2594 Here are some examples for the Norwegian language:
2596 <programlisting>
2597 SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
2598 {over,buljong,terning,pakk,mester,assistent}
2599 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
2600 {sjokoladefabrikk,sjokolade,fabrikk}
2601 </programlisting>
2602 </para>
2604 <note>
2605 <para>
2606 <application>MySpell</> does not support compound words.
2607 <application>Hunspell</> has sophisticated support for compound words. At
2608 present, <productname>PostgreSQL</productname> implements only the basic
2609 compound word operations of Hunspell.
2610 </para>
2611 </note>
2613 </sect2>
2615 <sect2 id="textsearch-snowball-dictionary">
2616 <title><application>Snowball</> Dictionary</title>
2618 <para>
2619 The <application>Snowball</> dictionary template is based on a project
2620 by Martin Porter, inventor of the popular Porter's stemming algorithm
2621 for the English language. Snowball now provides stemming algorithms for
2622 many languages (see the <ulink url="http://snowball.tartarus.org">Snowball
2623 site</ulink> for more information). Each algorithm understands how to
2624 reduce common variant forms of words to a base, or stem, spelling within
2625 its language. A Snowball dictionary requires a <literal>language</>
2626 parameter to identify which stemmer to use, and optionally can specify a
2627 <literal>stopword</> file name that gives a list of words to eliminate.
2628 (<productname>PostgreSQL</productname>'s standard stopword lists are also
2629 provided by the Snowball project.)
2630 For example, there is a built-in definition equivalent to
2632 <programlisting>
2633 CREATE TEXT SEARCH DICTIONARY english_stem (
2634 TEMPLATE = snowball,
2635 Language = english,
2636 StopWords = english
);
</programlisting>
2640 The stopword file format is the same as already explained.
2641 </para>
2643 <para>
2644 A <application>Snowball</> dictionary recognizes everything, whether
2645 or not it is able to simplify the word, so it should be placed
2646 at the end of the dictionary list. It is useless to have it
2647 before any other dictionary because a token will never pass through it to
2648 the next dictionary.
2649 </para>
2651 </sect2>
2653 </sect1>
2655 <sect1 id="textsearch-configuration">
2656 <title>Configuration Example</title>
2658 <para>
2659 A text search configuration specifies all options necessary to transform a
2660 document into a <type>tsvector</type>: the parser to use to break text
2661 into tokens, and the dictionaries to use to transform each token into a
2662 lexeme. Every call of
2663 <function>to_tsvector</function> or <function>to_tsquery</function>
2664 needs a text search configuration to perform its processing.
2665 The configuration parameter
2666 <xref linkend="guc-default-text-search-config">
2667 specifies the name of the default configuration, which is the
2668 one used by text search functions if an explicit configuration
2669 parameter is omitted.
2670 It can be set in <filename>postgresql.conf</filename>, or set for an
2671 individual session using the <command>SET</> command.
2672 </para>
2674 <para>
2675 Several predefined text search configurations are available, and
2676 you can create custom configurations easily. To facilitate management
2677 of text search objects, a set of <acronym>SQL</acronym> commands
2678 is available, and there are several <application>psql</application> commands that display information
2679 about text search objects (<xref linkend="textsearch-psql">).
2680 </para>
2682 <para>
2683 As an example we will create a configuration
2684 <literal>pg</literal>, starting by duplicating the built-in
2685 <literal>english</> configuration:
2687 <programlisting>
2688 CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english );
2689 </programlisting>
2690 </para>
2692 <para>
2693 We will use a PostgreSQL-specific synonym list
2694 and store it in <filename>$SHAREDIR/tsearch_data/pg_dict.syn</filename>.
2695 The file contents look like:
2697 <programlisting>
2698 postgres pg
2699 pgsql pg
2700 postgresql pg
2701 </programlisting>
2703 We define the synonym dictionary like this:
2705 <programlisting>
2706 CREATE TEXT SEARCH DICTIONARY pg_dict (
2707 TEMPLATE = synonym,
2708 SYNONYMS = pg_dict
);
</programlisting>
2712 Next we register the <productname>Ispell</> dictionary
2713 <literal>english_ispell</literal>, which has its own configuration files:
2715 <programlisting>
2716 CREATE TEXT SEARCH DICTIONARY english_ispell (
2717 TEMPLATE = ispell,
2718 DictFile = english,
2719 AffFile = english,
2720 StopWords = english
);
</programlisting>
2724 Now we can set up the mappings for words in configuration
2725 <literal>pg</>:
2727 <programlisting>
2728 ALTER TEXT SEARCH CONFIGURATION pg
2729 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
2730 word, hword, hword_part
2731 WITH pg_dict, english_ispell, english_stem;
2732 </programlisting>
2734 We choose not to index or search some token types that the built-in
2735 configuration does handle:
2737 <programlisting>
2738 ALTER TEXT SEARCH CONFIGURATION pg
2739 DROP MAPPING FOR email, url, url_path, sfloat, float;
2740 </programlisting>
2741 </para>
2743 <para>
2744 Now we can test our configuration:
2746 <programlisting>
2747 SELECT * FROM ts_debug('public.pg', '
2748 PostgreSQL, the highly scalable, SQL compliant, open source object-relational
2749 database management system, is now undergoing beta testing of the next
2750 version of our software.
');
</programlisting>
2753 </para>
2755 <para>
2756 The next step is to set the session to use the new configuration, which was
2757 created in the <literal>public</> schema:
2759 <programlisting>
2760 =&gt; \dF
2761 List of text search configurations
2762 Schema | Name | Description
2763 ---------+------+-------------
2764 public | pg |
2766 SET default_text_search_config = 'public.pg';
SET

SHOW default_text_search_config;
2770 default_text_search_config
2771 ----------------------------
2772 public.pg
2773 </programlisting>
2774 </para>
2776 </sect1>
2778 <sect1 id="textsearch-debugging">
2779 <title>Testing and Debugging Text Search</title>
2781 <para>
2782 The behavior of a custom text search configuration can easily become
2783 confusing. The functions described
2784 in this section are useful for testing text search objects. You can
2785 test a complete configuration, or test parsers and dictionaries separately.
2786 </para>
2788 <sect2 id="textsearch-configuration-testing">
2789 <title>Configuration Testing</title>
2791 <para>
2792 The function <function>ts_debug</function> allows easy testing of a
2793 text search configuration.
2794 </para>
2796 <indexterm>
2797 <primary>ts_debug</primary>
2798 </indexterm>
2800 <synopsis>
2801 ts_debug(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>,
2802 OUT <replaceable class="PARAMETER">alias</> <type>text</>,
2803 OUT <replaceable class="PARAMETER">description</> <type>text</>,
2804 OUT <replaceable class="PARAMETER">token</> <type>text</>,
2805 OUT <replaceable class="PARAMETER">dictionaries</> <type>regdictionary[]</>,
2806 OUT <replaceable class="PARAMETER">dictionary</> <type>regdictionary</>,
2807 OUT <replaceable class="PARAMETER">lexemes</> <type>text[]</>)
2808 returns setof record
2809 </synopsis>
2811 <para>
2812 <function>ts_debug</> displays information about every token of
2813 <replaceable class="PARAMETER">document</replaceable> as produced by the
2814 parser and processed by the configured dictionaries. It uses the
2815 configuration specified by <replaceable
2816 class="PARAMETER">config</replaceable>,
2817 or <varname>default_text_search_config</varname> if that argument is
2818 omitted.
2819 </para>
2821 <para>
2822 <function>ts_debug</> returns one row for each token identified in the text
2823 by the parser. The columns returned are
2825 <itemizedlist spacing="compact" mark="bullet">
2826 <listitem>
2827 <para>
2828 <replaceable>alias</> <type>text</> &mdash; short name of the token type
2829 </para>
2830 </listitem>
2831 <listitem>
2832 <para>
2833 <replaceable>description</> <type>text</> &mdash; description of the
2834 token type
2835 </para>
2836 </listitem>
2837 <listitem>
2838 <para>
2839 <replaceable>token</> <type>text</> &mdash; text of the token
2840 </para>
2841 </listitem>
2842 <listitem>
2843 <para>
2844 <replaceable>dictionaries</> <type>regdictionary[]</> &mdash; the
2845 dictionaries selected by the configuration for this token type
2846 </para>
2847 </listitem>
2848 <listitem>
2849 <para>
2850 <replaceable>dictionary</> <type>regdictionary</> &mdash; the dictionary
2851 that recognized the token, or <literal>NULL</> if none did
2852 </para>
2853 </listitem>
2854 <listitem>
2855 <para>
2856 <replaceable>lexemes</> <type>text[]</> &mdash; the lexeme(s) produced
2857 by the dictionary that recognized the token, or <literal>NULL</> if
2858 none did; an empty array (<literal>{}</>) means it was recognized as a
2859 stop word
2860 </para>
2861 </listitem>
2862 </itemizedlist>
2863 </para>
2865 <para>
2866 Here is a simple example:
2868 <programlisting>
2869 SELECT * FROM ts_debug('english','a fat cat sat on a mat - it ate a fat rats');
2870 alias | description | token | dictionaries | dictionary | lexemes
2871 -----------+-----------------+-------+----------------+--------------+---------
2872 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
2873 blank | Space symbols | | {} | |
2874 asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat}
2875 blank | Space symbols | | {} | |
2876 asciiword | Word, all ASCII | cat | {english_stem} | english_stem | {cat}
2877 blank | Space symbols | | {} | |
2878 asciiword | Word, all ASCII | sat | {english_stem} | english_stem | {sat}
2879 blank | Space symbols | | {} | |
2880 asciiword | Word, all ASCII | on | {english_stem} | english_stem | {}
2881 blank | Space symbols | | {} | |
2882 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
2883 blank | Space symbols | | {} | |
2884 asciiword | Word, all ASCII | mat | {english_stem} | english_stem | {mat}
2885 blank | Space symbols | | {} | |
2886 blank | Space symbols | - | {} | |
2887 asciiword | Word, all ASCII | it | {english_stem} | english_stem | {}
2888 blank | Space symbols | | {} | |
2889 asciiword | Word, all ASCII | ate | {english_stem} | english_stem | {ate}
2890 blank | Space symbols | | {} | |
2891 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
2892 blank | Space symbols | | {} | |
2893 asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat}
2894 blank | Space symbols | | {} | |
2895 asciiword | Word, all ASCII | rats | {english_stem} | english_stem | {rat}
2896 </programlisting>
2897 </para>
2899 <para>
2900 For a more extensive demonstration, we
2901 first create a <literal>public.english</literal> configuration and
2902 Ispell dictionary for the English language:
2903 </para>
2905 <programlisting>
2906 CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );
2908 CREATE TEXT SEARCH DICTIONARY english_ispell (
2909 TEMPLATE = ispell,
2910 DictFile = english,
2911 AffFile = english,
2912 StopWords = english
);

ALTER TEXT SEARCH CONFIGURATION public.english
2916 ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
2917 </programlisting>
2919 <programlisting>
2920 SELECT * FROM ts_debug('public.english','The Brightest supernovaes');
2921 alias | description | token | dictionaries | dictionary | lexemes
2922 -----------+-----------------+-------------+-------------------------------+----------------+-------------
2923 asciiword | Word, all ASCII | The | {english_ispell,english_stem} | english_ispell | {}
2924 blank | Space symbols | | {} | |
2925 asciiword | Word, all ASCII | Brightest | {english_ispell,english_stem} | english_ispell | {bright}
2926 blank | Space symbols | | {} | |
2927 asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem | {supernova}
2928 </programlisting>
2930 <para>
2931 In this example, the word <literal>Brightest</> was recognized by the
2932 parser as an <literal>ASCII word</literal> (alias <literal>asciiword</literal>).
2933 For this token type the dictionary list is
2934 <literal>english_ispell</> and
2935 <literal>english_stem</literal>. The word was recognized by
<literal>english_ispell</literal>, which reduced it to the base word
<literal>bright</literal>. The word <literal>supernovaes</literal> is
2938 unknown to the <literal>english_ispell</literal> dictionary so it
2939 was passed to the next dictionary, and, fortunately, was recognized (in
2940 fact, <literal>english_stem</literal> is a Snowball dictionary which
2941 recognizes everything; that is why it was placed at the end of the
2942 dictionary list).
2943 </para>
2945 <para>
2946 The word <literal>The</literal> was recognized by the
2947 <literal>english_ispell</literal> dictionary as a stop word (<xref
2948 linkend="textsearch-stopwords">) and will not be indexed.
2949 The spaces are discarded too, since the configuration provides no
2950 dictionaries at all for them.
2951 </para>
2953 <para>
2954 You can reduce the width of the output by explicitly specifying which columns
2955 you want to see:
2957 <programlisting>
2958 SELECT alias, token, dictionary, lexemes
2959 FROM ts_debug('public.english','The Brightest supernovaes');
2960 alias | token | dictionary | lexemes
2961 -----------+-------------+----------------+-------------
2962 asciiword | The | english_ispell | {}
2963 blank | | |
2964 asciiword | Brightest | english_ispell | {bright}
2965 blank | | |
2966 asciiword | supernovaes | english_stem | {supernova}
2967 </programlisting>
2968 </para>
2970 </sect2>
2972 <sect2 id="textsearch-parser-testing">
2973 <title>Parser Testing</title>
2975 <para>
2976 The following functions allow direct testing of a text search parser.
2977 </para>
2979 <indexterm>
2980 <primary>ts_parse</primary>
2981 </indexterm>
2983 <synopsis>
2984 ts_parse(<replaceable class="PARAMETER">parser_name</replaceable> <type>text</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>,
2985 OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
2986 ts_parse(<replaceable class="PARAMETER">parser_oid</replaceable> <type>oid</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>,
2987 OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
2988 </synopsis>
2990 <para>
2991 <function>ts_parse</> parses the given <replaceable>document</replaceable>
2992 and returns a series of records, one for each token produced by
2993 parsing. Each record includes a <varname>tokid</varname> showing the
2994 assigned token type and a <varname>token</varname> which is the text of the
2995 token. For example:
2997 <programlisting>
2998 SELECT * FROM ts_parse('default', '123 - a number');
2999 tokid | token
3000 -------+--------
3001 22 | 123
3002 12 |
3003 12 | -
3004 1 | a
3005 12 |
3006 1 | number
3007 </programlisting>
3008 </para>
3010 <indexterm>
3011 <primary>ts_token_type</primary>
3012 </indexterm>
3014 <synopsis>
3015 ts_token_type(<replaceable class="PARAMETER">parser_name</> <type>text</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>,
3016 OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
3017 ts_token_type(<replaceable class="PARAMETER">parser_oid</> <type>oid</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>,
3018 OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
3019 </synopsis>
3021 <para>
3022 <function>ts_token_type</> returns a table which describes each type of
3023 token the specified parser can recognize. For each token type, the table
3024 gives the integer <varname>tokid</varname> that the parser uses to label a
3025 token of that type, the <varname>alias</varname> that names the token type
3026 in configuration commands, and a short <varname>description</varname>. For
3027 example:
3029 <programlisting>
3030 SELECT * FROM ts_token_type('default');
3031 tokid | alias | description
3032 -------+-----------------+------------------------------------------
3033 1 | asciiword | Word, all ASCII
3034 2 | word | Word, all letters
3035 3 | numword | Word, letters and digits
3036 4 | email | Email address
3037 5 | url | URL
3038 6 | host | Host
3039 7 | sfloat | Scientific notation
3040 8 | version | Version number
3041 9 | hword_numpart | Hyphenated word part, letters and digits
3042 10 | hword_part | Hyphenated word part, all letters
3043 11 | hword_asciipart | Hyphenated word part, all ASCII
3044 12 | blank | Space symbols
3045 13 | tag | XML tag
3046 14 | protocol | Protocol head
3047 15 | numhword | Hyphenated word, letters and digits
3048 16 | asciihword | Hyphenated word, all ASCII
3049 17 | hword | Hyphenated word, all letters
3050 18 | url_path | URL path
3051 19 | file | File or path name
3052 20 | float | Decimal notation
3053 21 | int | Signed integer
3054 22 | uint | Unsigned integer
3055 23 | entity | XML entity
3056 </programlisting>
3057 </para>
3059 </sect2>
3061 <sect2 id="textsearch-dictionary-testing">
3062 <title>Dictionary Testing</title>
3064 <para>
3065 The <function>ts_lexize</> function facilitates dictionary testing.
3066 </para>
3068 <indexterm>
3069 <primary>ts_lexize</primary>
3070 </indexterm>
3072 <synopsis>
3073 ts_lexize(<replaceable class="PARAMETER">dict</replaceable> <type>regdictionary</>, <replaceable class="PARAMETER">token</replaceable> <type>text</>) returns <type>text[]</>
3074 </synopsis>
3076 <para>
3077 <function>ts_lexize</> returns an array of lexemes if the input
3078 <replaceable>token</replaceable> is known to the dictionary,
3079 or an empty array if the token
3080 is known to the dictionary but it is a stop word, or
3081 <literal>NULL</literal> if it is an unknown word.
3082 </para>
3084 <para>
3085 Examples:
3087 <programlisting>
3088 SELECT ts_lexize('english_stem', 'stars');
3089 ts_lexize
3090 -----------
3091 {star}
3093 SELECT ts_lexize('english_stem', 'a');
3094 ts_lexize
3095 -----------
 {}
</programlisting>
3098 </para>
3100 <note>
3101 <para>
3102 The <function>ts_lexize</function> function expects a single
3103 <emphasis>token</emphasis>, not text. Here is a case
3104 where this can be confusing:
3106 <programlisting>
3107 SELECT ts_lexize('thesaurus_astro','supernovae stars') is null;
3108 ?column?
3109 ----------
 t
</programlisting>
3113 The thesaurus dictionary <literal>thesaurus_astro</literal> does know the
3114 phrase <literal>supernovae stars</literal>, but <function>ts_lexize</>
3115 fails since it does not parse the input text but treats it as a single
3116 token. Use <function>plainto_tsquery</> or <function>to_tsvector</> to
3117 test thesaurus dictionaries, for example:
3119 <programlisting>
3120 SELECT plainto_tsquery('supernovae stars');
3121 plainto_tsquery
3122 -----------------
3123 'sn'
3124 </programlisting>
3125 </para>
3126 </note>
3128 </sect2>
3130 </sect1>
3132 <sect1 id="textsearch-indexes">
3133 <title>GiST and GIN Index Types</title>
3135 <indexterm zone="textsearch-indexes">
3136 <primary>text search</primary>
3137 <secondary>indexes</secondary>
3138 </indexterm>
3140 <para>
3141 There are two kinds of indexes that can be used to speed up full text
3142 searches.
3143 Note that indexes are not mandatory for full text searching, but in
3144 cases where a column is searched on a regular basis, an index is
3145 usually desirable.
3147 <variablelist>
3149 <varlistentry>
3151 <indexterm zone="textsearch-indexes">
3152 <primary>index</primary>
3153 <secondary>GiST</secondary>
3154 <tertiary>text search</tertiary>
3155 </indexterm>
3157 <term>
3158 <synopsis>
3159 CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gist(<replaceable>column</replaceable>);
3160 </synopsis>
3161 </term>
3163 <listitem>
3164 <para>
3165 Creates a GiST (Generalized Search Tree)-based index.
3166 The <replaceable>column</replaceable> can be of <type>tsvector</> or
3167 <type>tsquery</> type.
3168 </para>
3169 </listitem>
3170 </varlistentry>
3172 <varlistentry>
3174 <indexterm zone="textsearch-indexes">
3175 <primary>index</primary>
3176 <secondary>GIN</secondary>
3177 <tertiary>text search</tertiary>
3178 </indexterm>
3180 <term>
3181 <synopsis>
3182 CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gin(<replaceable>column</replaceable>);
3183 </synopsis>
3184 </term>
3186 <listitem>
3187 <para>
3188 Creates a GIN (Generalized Inverted Index)-based index.
3189 The <replaceable>column</replaceable> must be of <type>tsvector</> type.
3190 </para>
3191 </listitem>
3192 </varlistentry>
3194 </variablelist>
3195 </para>
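<para>
For illustration, here is a minimal sketch that stores precomputed
<type>tsvector</> data and indexes it with GIN. The table and column
names (<literal>documents</>, <literal>body_tsv</>) are hypothetical:

<programlisting>
-- a table with a precomputed tsvector column (names are examples only)
CREATE TABLE documents (id serial PRIMARY KEY, body text, body_tsv tsvector);

-- populate the tsvector column from the document text
UPDATE documents SET body_tsv = to_tsvector('english', body);

-- build a GIN index over the tsvector column
CREATE INDEX documents_body_idx ON documents USING gin(body_tsv);

-- searches of this form can now use the index
SELECT id FROM documents
WHERE body_tsv @@ to_tsquery('english', 'star &amp; supernova');
</programlisting>
</para>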
3197 <para>
3198 There are substantial performance differences between the two index types,
3199 so it is important to understand their characteristics.
3200 </para>
<para>
A GiST index is <firstterm>lossy</firstterm>, meaning that the index
may produce false matches, and it is necessary
to check the actual table row to eliminate such false matches.
(<productname>PostgreSQL</productname> does this automatically when needed.)
GiST indexes are lossy because each document is represented in the
index by a fixed-length signature. The signature is generated by hashing
each word into a random bit in an n-bit string, with all these bits OR-ed
together to produce an n-bit document signature. When two different words
hash to the same bit position there will be a false match: for example, if
<literal>star</> and <literal>quasar</> happened to hash to the same bit, a
search for <literal>quasar</> would also match documents containing only
<literal>star</>. If all words in
the query have matches (real or false) then the table row must be
retrieved to see if the match is correct.
</para>
3216 <para>
3217 Lossiness causes performance degradation due to unnecessary fetches of table
3218 records that turn out to be false matches. Since random access to table
3219 records is slow, this limits the usefulness of GiST indexes. The
3220 likelihood of false matches depends on several factors, in particular the
3221 number of unique words, so using dictionaries to reduce this number is
3222 recommended.
3223 </para>
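<para>
One way to gauge the number of unique lexemes in a document collection
is the <function>ts_stat</> function; this sketch reuses the hypothetical
<literal>documents</> table from above:

<programlisting>
SELECT count(*) FROM ts_stat('SELECT body_tsv FROM documents');
</programlisting>

If the count is large, consider configuring dictionaries that map more
word forms onto the same lexeme before relying on a GiST index.
</para>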
3225 <para>
3226 GIN indexes are not lossy for standard queries, but their performance
3227 depends logarithmically on the number of unique words.
3228 (However, GIN indexes store only the words (lexemes) of <type>tsvector</>
3229 values, and not their weight labels. Thus a table row recheck is needed
3230 when using a query that involves weights.)
3231 </para>
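<para>
A query involves weights whenever one of its terms carries a weight
label. For example (the stemmed output shown is illustrative):

<programlisting>
SELECT to_tsquery('english', 'fat:A &amp; rats');
   to_tsquery
-----------------
 'fat':A &amp; 'rat'
</programlisting>

Matching such a query against a GIN index requires rechecking the
table row, since the index does not store the weight labels.
</para>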
3233 <para>
3234 In choosing which index type to use, GiST or GIN, consider these
3235 performance differences:
3237 <itemizedlist spacing="compact" mark="bullet">
3238 <listitem>
3239 <para>
3240 GIN index lookups are about three times faster than GiST
3241 </para>
3242 </listitem>
3243 <listitem>
3244 <para>
3245 GIN indexes take about three times longer to build than GiST
3246 </para>
3247 </listitem>
3248 <listitem>
3249 <para>
3250 GIN indexes are moderately slower to update than GiST indexes, but
3251 about 10 times slower if fast-update support was disabled
3252 (see <xref linkend="gin-fast-update"> for details)
3253 </para>
3254 </listitem>
3255 <listitem>
3256 <para>
3257 GIN indexes are two-to-three times larger than GiST indexes
3258 </para>
3259 </listitem>
3260 </itemizedlist>
3261 </para>
<para>
As a rule of thumb, <acronym>GIN</acronym> indexes are best for static data
because lookups are faster. For dynamic data, <acronym>GiST</acronym>
indexes are faster to update; they perform well as long as the number of
unique words (lexemes) stays under 100,000, while <acronym>GIN</acronym>
indexes handle 100,000+ lexemes better but are slower to update.
</para>
3272 <para>
3273 Note that <acronym>GIN</acronym> index build time can often be improved
3274 by increasing <xref linkend="guc-maintenance-work-mem">, while
3275 <acronym>GiST</acronym> index build time is not sensitive to that
3276 parameter.
3277 </para>
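<para>
For example, to speed up a one-off <acronym>GIN</acronym> index build,
the parameter can be raised for just the current session (the value and
object names below are illustrative):

<programlisting>
SET maintenance_work_mem = '512MB';
CREATE INDEX documents_body_idx ON documents USING gin(body_tsv);
RESET maintenance_work_mem;
</programlisting>
</para>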
<para>
Partitioning of big collections and the proper use of GiST and GIN indexes
allow the implementation of very fast searches with online update.
Partitioning can be done at the database level using table inheritance,
or by distributing documents over
servers and collecting search results using the <filename>contrib/dblink</>
extension module. The latter is possible because ranking functions use
only local information. A sketch of the <filename>dblink</> approach is
shown below.
</para>
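<para>
The following minimal sketch collects ranked results from two servers
with <filename>dblink</> (connection strings, table, and column names
are hypothetical):

<programlisting>
SELECT id, rank FROM dblink('host=server1 dbname=docs',
    'SELECT id, ts_rank(body_tsv, to_tsquery(''star'')) AS rank
       FROM documents WHERE body_tsv @@ to_tsquery(''star'')')
  AS t1(id integer, rank real)
UNION ALL
SELECT id, rank FROM dblink('host=server2 dbname=docs',
    'SELECT id, ts_rank(body_tsv, to_tsquery(''star'')) AS rank
       FROM documents WHERE body_tsv @@ to_tsquery(''star'')')
  AS t2(id integer, rank real)
ORDER BY rank DESC LIMIT 10;
</programlisting>

Because the ranks are computed from per-row information only, the values
returned by the two servers are directly comparable.
</para>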
3289 </sect1>
3291 <sect1 id="textsearch-psql">
3292 <title><application>psql</> Support</title>
3294 <para>
3295 Information about text search configuration objects can be obtained
3296 in <application>psql</application> using a set of commands:
3297 <synopsis>
3298 \dF{d,p,t}<optional>+</optional> <optional>PATTERN</optional>
3299 </synopsis>
3300 An optional <literal>+</literal> produces more details.
3301 </para>
3303 <para>
3304 The optional parameter <literal>PATTERN</literal> can be the name of
3305 a text search object, optionally schema-qualified. If
3306 <literal>PATTERN</literal> is omitted then information about all
3307 visible objects will be displayed. <literal>PATTERN</literal> can be a
3308 regular expression and can provide <emphasis>separate</emphasis> patterns
3309 for the schema and object names. The following examples illustrate this:
<programlisting>
=&gt; \dF *fulltext*
       List of text search configurations
 Schema |     Name     | Description
--------+--------------+-------------
 public | fulltext_cfg |
</programlisting>
<programlisting>
=&gt; \dF *.fulltext*
       List of text search configurations
  Schema  |     Name     | Description
----------+--------------+-------------
 fulltext | fulltext_cfg |
 public   | fulltext_cfg |
</programlisting>
3328 The available commands are:
3329 </para>
3331 <variablelist>
3333 <varlistentry>
3334 <term><synopsis>\dF<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3336 <listitem>
3337 <para>
3338 List text search configurations (add <literal>+</> for more detail).
3339 </para>
3341 <para>
3343 <programlisting>
3344 =&gt; \dF russian
3345 List of text search configurations
3346 Schema | Name | Description
3347 ------------+---------+------------------------------------
3348 pg_catalog | russian | configuration for russian language
3350 =&gt; \dF+ russian
3351 Text search configuration "pg_catalog.russian"
3352 Parser: "pg_catalog.default"
3353 Token | Dictionaries
3354 -----------------+--------------
3355 asciihword | english_stem
3356 asciiword | english_stem
3357 email | simple
3358 file | simple
3359 float | simple
3360 host | simple
3361 hword | russian_stem
3362 hword_asciipart | english_stem
3363 hword_numpart | simple
3364 hword_part | russian_stem
3365 int | simple
3366 numhword | simple
3367 numword | simple
3368 sfloat | simple
3369 uint | simple
3370 url | simple
3371 url_path | simple
3372 version | simple
3373 word | russian_stem
3374 </programlisting>
3375 </para>
3376 </listitem>
3377 </varlistentry>
3379 <varlistentry>
3380 <term><synopsis>\dFd<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3381 <listitem>
3382 <para>
3383 List text search dictionaries (add <literal>+</> for more detail).
3384 </para>
3386 <para>
3387 <programlisting>
3388 =&gt; \dFd
3389 List of text search dictionaries
3390 Schema | Name | Description
3391 ------------+-----------------+-----------------------------------------------------------
3392 pg_catalog | danish_stem | snowball stemmer for danish language
3393 pg_catalog | dutch_stem | snowball stemmer for dutch language
3394 pg_catalog | english_stem | snowball stemmer for english language
3395 pg_catalog | finnish_stem | snowball stemmer for finnish language
3396 pg_catalog | french_stem | snowball stemmer for french language
3397 pg_catalog | german_stem | snowball stemmer for german language
3398 pg_catalog | hungarian_stem | snowball stemmer for hungarian language
3399 pg_catalog | italian_stem | snowball stemmer for italian language
3400 pg_catalog | norwegian_stem | snowball stemmer for norwegian language
3401 pg_catalog | portuguese_stem | snowball stemmer for portuguese language
3402 pg_catalog | romanian_stem | snowball stemmer for romanian language
3403 pg_catalog | russian_stem | snowball stemmer for russian language
3404 pg_catalog | simple | simple dictionary: just lower case and check for stopword
3405 pg_catalog | spanish_stem | snowball stemmer for spanish language
3406 pg_catalog | swedish_stem | snowball stemmer for swedish language
3407 pg_catalog | turkish_stem | snowball stemmer for turkish language
3408 </programlisting>
3409 </para>
3410 </listitem>
3411 </varlistentry>
3413 <varlistentry>
3415 <term><synopsis>\dFp<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3416 <listitem>
3417 <para>
3418 List text search parsers (add <literal>+</> for more detail).
3419 </para>
3421 <para>
3422 <programlisting>
3423 =&gt; \dFp
3424 List of text search parsers
3425 Schema | Name | Description
3426 ------------+---------+---------------------
3427 pg_catalog | default | default word parser
3428 =&gt; \dFp+
3429 Text search parser "pg_catalog.default"
3430 Method | Function | Description
3431 -----------------+----------------+-------------
3432 Start parse | prsd_start |
3433 Get next token | prsd_nexttoken |
3434 End parse | prsd_end |
3435 Get headline | prsd_headline |
3436 Get token types | prsd_lextype |
3438 Token types for parser "pg_catalog.default"
3439 Token name | Description
3440 -----------------+------------------------------------------
3441 asciihword | Hyphenated word, all ASCII
3442 asciiword | Word, all ASCII
3443 blank | Space symbols
3444 email | Email address
3445 entity | XML entity
3446 file | File or path name
3447 float | Decimal notation
3448 host | Host
3449 hword | Hyphenated word, all letters
3450 hword_asciipart | Hyphenated word part, all ASCII
3451 hword_numpart | Hyphenated word part, letters and digits
3452 hword_part | Hyphenated word part, all letters
3453 int | Signed integer
3454 numhword | Hyphenated word, letters and digits
3455 numword | Word, letters and digits
3456 protocol | Protocol head
3457 sfloat | Scientific notation
3458 tag | XML tag
3459 uint | Unsigned integer
3460 url | URL
3461 url_path | URL path
3462 version | Version number
3463 word | Word, all letters
3464 (23 rows)
3465 </programlisting>
3466 </para>
3467 </listitem>
3468 </varlistentry>
3470 <varlistentry>
3472 <term><synopsis>\dFt<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3473 <listitem>
3474 <para>
3475 List text search templates (add <literal>+</> for more detail).
3476 </para>
3478 <para>
3479 <programlisting>
3480 =&gt; \dFt
3481 List of text search templates
3482 Schema | Name | Description
3483 ------------+-----------+-----------------------------------------------------------
3484 pg_catalog | ispell | ispell dictionary
3485 pg_catalog | simple | simple dictionary: just lower case and check for stopword
3486 pg_catalog | snowball | snowball stemmer
3487 pg_catalog | synonym | synonym dictionary: replace word by its synonym
3488 pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution
3489 </programlisting>
3490 </para>
3491 </listitem>
3492 </varlistentry>
3494 </variablelist>
3496 </sect1>
3498 <sect1 id="textsearch-limitations">
3499 <title>Limitations</title>
3501 <para>
3502 The current limitations of <productname>PostgreSQL</productname>'s
3503 text search features are:
3504 <itemizedlist spacing="compact" mark="bullet">
3505 <listitem>
<para>The length of each lexeme must be less than 2 kilobytes (see the
example after this list)</para>
3507 </listitem>
3508 <listitem>
3509 <para>The length of a <type>tsvector</type> (lexemes + positions) must be
3510 less than 1 megabyte</para>
3511 </listitem>
3512 <listitem>
3513 <!-- TODO: number of lexemes in what? This is unclear -->
3514 <para>The number of lexemes must be less than
3515 2<superscript>64</superscript></para>
3516 </listitem>
3517 <listitem>
3518 <para>Position values in <type>tsvector</> must be greater than 0 and
3519 no more than 16,383</para>
3520 </listitem>
3521 <listitem>
3522 <para>No more than 256 positions per lexeme</para>
3523 </listitem>
3524 <listitem>
3525 <para>The number of nodes (lexemes + operators) in a <type>tsquery</type>
3526 must be less than 32,768</para>
3527 </listitem>
3528 </itemizedlist>
3529 </para>
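<para>
The lexeme-length limit is easy to observe: over-long words are dropped
with a notice rather than causing an error. A sketch (the exact notice
wording can vary across versions):

<programlisting>
SELECT to_tsvector('simple', repeat('x', 5000));
NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.
 to_tsvector
-------------

(1 row)
</programlisting>
</para>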
3531 <para>
3532 For comparison, the <productname>PostgreSQL</productname> 8.1 documentation
3533 contained 10,441 unique words, a total of 335,420 words, and the most
3534 frequent word <quote>postgresql</> was mentioned 6,127 times in 655
3535 documents.
3536 </para>
3538 <!-- TODO we need to put a date on these numbers? -->
3539 <para>
3540 Another example &mdash; the <productname>PostgreSQL</productname> mailing
3541 list archives contained 910,989 unique words with 57,491,343 lexemes in
3542 461,020 messages.
3543 </para>
3545 </sect1>
3547 <sect1 id="textsearch-migration">
3548 <title>Migration from Pre-8.3 Text Search</title>
3550 <para>
3551 Applications that used the <filename>contrib/tsearch2</> add-on module
3552 for text searching will need some adjustments to work with the
3553 built-in features:
3554 </para>
3556 <itemizedlist>
3557 <listitem>
3558 <para>
3559 Some functions have been renamed or had small adjustments in their
3560 argument lists, and all of them are now in the <literal>pg_catalog</>
3561 schema, whereas in a previous installation they would have been in
3562 <literal>public</> or another non-system schema. There is a new
3563 version of <filename>contrib/tsearch2</> (see <xref linkend="tsearch2">)
3564 that provides a compatibility layer to solve most problems in this
3565 area.
3566 </para>
3567 </listitem>
3569 <listitem>
3570 <para>
3571 The old <filename>contrib/tsearch2</> functions and other objects
3572 <emphasis>must</> be suppressed when loading <application>pg_dump</>
3573 output from a pre-8.3 database. While many of them won't load anyway,
3574 a few will and then cause problems. One simple way to deal with this
3575 is to load the new <filename>contrib/tsearch2</> module before restoring
3576 the dump; then it will block the old objects from being loaded.
3577 </para>
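<para>
For example (paths, file names, and database names are illustrative;
adjust them for your installation and dump format):

<programlisting>
createdb newdb
psql -d newdb -f `pg_config --sharedir`/contrib/tsearch2.sql
pg_restore -d newdb olddb.dump
</programlisting>
</para>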
3578 </listitem>
3580 <listitem>
3581 <para>
3582 Text search configuration setup is completely different now.
3583 Instead of manually inserting rows into configuration tables,
3584 search is configured through the specialized SQL commands shown
3585 earlier in this chapter. There is no automated
3586 support for converting an existing custom configuration for 8.3;
3587 you're on your own here.
3588 </para>
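<para>
A hand-converted configuration would use commands like these (the
configuration and mapping shown are examples only):

<programlisting>
CREATE TEXT SEARCH CONFIGURATION public.my_config ( COPY = pg_catalog.english );

ALTER TEXT SEARCH CONFIGURATION public.my_config
    ALTER MAPPING FOR asciiword, word WITH english_stem;
</programlisting>
</para>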
3589 </listitem>
3591 <listitem>
3592 <para>
3593 Most types of dictionaries rely on some outside-the-database
3594 configuration files. These are largely compatible with pre-8.3
3595 usage, but note the following differences:
3597 <itemizedlist spacing="compact" mark="bullet">
3598 <listitem>
3599 <para>
3600 Configuration files now must be placed in a single specified
3601 directory (<filename>$SHAREDIR/tsearch_data</>), and must have
3602 a specific extension depending on the type of file, as noted
3603 previously in the descriptions of the various dictionary types.
3604 This restriction was added to forestall security problems.
3605 </para>
3606 </listitem>
3608 <listitem>
<para>
Configuration files must be stored in UTF-8 encoding,
regardless of what database encoding is used.
</para>
3613 </listitem>
3615 <listitem>
3616 <para>
3617 In thesaurus configuration files, stop words must be marked with
3618 <literal>?</>.
3619 </para>
3620 </listitem>
3621 </itemizedlist>
3622 </para>
3623 </listitem>
3625 </itemizedlist>
3627 </sect1>
3629 </chapter>