1 <!-- $PostgreSQL$ -->
3 <chapter id="textsearch">
4 <title id="textsearch-title">Full Text Search</title>
6 <indexterm zone="textsearch">
7 <primary>full text search</primary>
8 </indexterm>
10 <indexterm zone="textsearch">
11 <primary>text search</primary>
12 </indexterm>
14 <sect1 id="textsearch-intro">
15 <title>Introduction</title>
17 <para>
18 Full Text Searching (or just <firstterm>text search</firstterm>) provides
19 the capability to identify natural-language <firstterm>documents</> that
20 satisfy a <firstterm>query</firstterm>, and optionally to sort them by
21 relevance to the query. The most common type of search
22 is to find all documents containing given <firstterm>query terms</firstterm>
23 and return them in order of their <firstterm>similarity</firstterm> to the
24 query. Notions of <varname>query</varname> and
25 <varname>similarity</varname> are very flexible and depend on the specific
26 application. The simplest search considers <varname>query</varname> as a
27 set of words and <varname>similarity</varname> as the frequency of query
28 words in the document.
29 </para>
31 <para>
32 Textual search operators have existed in databases for years.
33 <productname>PostgreSQL</productname> has
34 <literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and
35 <literal>ILIKE</literal> operators for textual data types, but they lack
36 many essential properties required by modern information systems:
37 </para>
39 <itemizedlist spacing="compact" mark="bullet">
40 <listitem>
41 <para>
42 There is no linguistic support, even for English. Regular expressions
43 are not sufficient because they cannot easily handle derived words, e.g.,
44 <literal>satisfies</literal> and <literal>satisfy</literal>. You might
45 miss documents that contain <literal>satisfies</literal>, although you
46 probably would like to find them when searching for
47 <literal>satisfy</literal>. It is possible to use <literal>OR</literal>
48 to search for multiple derived forms, but this is tedious and error-prone
49 (some words can have several thousand derivatives).
50 </para>
51 </listitem>
53 <listitem>
54 <para>
55 They provide no ordering (ranking) of search results, which makes them
56 ineffective when thousands of matching documents are found.
57 </para>
58 </listitem>
60 <listitem>
61 <para>
62 They tend to be slow because there is no index support, so they must
63 process all documents for every search.
64 </para>
65 </listitem>
66 </itemizedlist>
68 <para>
69 Full text indexing allows documents to be <emphasis>preprocessed</emphasis>
70 and an index saved for later rapid searching. Preprocessing includes:
71 </para>
73 <itemizedlist mark="none">
74 <listitem>
75 <para>
76 <emphasis>Parsing documents into <firstterm>tokens</></emphasis>. It is
77 useful to identify various classes of tokens, e.g. numbers, words,
78 complex words, email addresses, so that they can be processed
79 differently. In principle token classes depend on the specific
80 application, but for most purposes it is adequate to use a predefined
81 set of classes.
82 <productname>PostgreSQL</productname> uses a <firstterm>parser</> to
83 perform this step. A standard parser is provided, and custom parsers
84 can be created for specific needs.
85 </para>
86 </listitem>
88 <listitem>
89 <para>
90 <emphasis>Converting tokens into <firstterm>lexemes</></emphasis>.
91 A lexeme is a string, just like a token, but it has been
92 <firstterm>normalized</> so that different forms of the same word
93 are made alike. For example, normalization almost always includes
94 folding upper-case letters to lower-case, and often involves removal
95 of suffixes (such as <literal>s</> or <literal>es</> in English).
96 This allows searches to find variant forms of the
97 same word, without tediously entering all the possible variants.
98 Also, this step typically eliminates <firstterm>stop words</>, which
99 are words that are so common that they are useless for searching.
100 (In short, then, tokens are raw fragments of the document text, while
101 lexemes are words that are believed useful for indexing and searching.)
102 <productname>PostgreSQL</productname> uses <firstterm>dictionaries</> to
103 perform this step. Various standard dictionaries are provided, and
104 custom ones can be created for specific needs.
105 </para>
106 </listitem>
108 <listitem>
109 <para>
110 <emphasis>Storing preprocessed documents optimized for
111 searching</emphasis>. For example, each document can be represented
112 as a sorted array of normalized lexemes. Along with the lexemes it is
113 often desirable to store positional information to use for
114 <firstterm>proximity ranking</firstterm>, so that a document that
115 contains a more <quote>dense</> region of query words is
116 assigned a higher rank than one with scattered query words.
117 </para>
118 </listitem>
119 </itemizedlist>
121 <para>
122 Dictionaries allow fine-grained control over how tokens are normalized.
123 With appropriate dictionaries, you can:
124 </para>
126 <itemizedlist spacing="compact" mark="bullet">
127 <listitem>
128 <para>
129 Define stop words that should not be indexed.
130 </para>
131 </listitem>
133 <listitem>
134 <para>
135 Map synonyms to a single word using <application>Ispell</>.
136 </para>
137 </listitem>
139 <listitem>
140 <para>
141 Map phrases to a single word using a thesaurus.
142 </para>
143 </listitem>
145 <listitem>
146 <para>
147 Map different variations of a word to a canonical form using
148 an <application>Ispell</> dictionary.
149 </para>
150 </listitem>
152 <listitem>
153 <para>
154 Map different variations of a word to a canonical form using
155 <application>Snowball</> stemmer rules.
156 </para>
157 </listitem>
158 </itemizedlist>
160 <para>
161 A data type <type>tsvector</type> is provided for storing preprocessed
162 documents, along with a type <type>tsquery</type> for representing processed
163 queries (<xref linkend="datatype-textsearch">). There are many
164 functions and operators available for these data types
165 (<xref linkend="functions-textsearch">), the most important of which is
166 the match operator <literal>@@</literal>, which we introduce in
167 <xref linkend="textsearch-matching">. Full text searches can be accelerated
168 using indexes (<xref linkend="textsearch-indexes">).
169 </para>
172 <sect2 id="textsearch-document">
173 <title>What Is a Document?</title>
175 <indexterm zone="textsearch-document">
176 <primary>document</primary>
177 <secondary>text search</secondary>
178 </indexterm>
180 <para>
181 A <firstterm>document</> is the unit of searching in a full text search
182 system; for example, a magazine article or email message. The text search
183 engine must be able to parse documents and store associations of lexemes
184 (key words) with their parent document. Later, these associations are
185 used to search for documents that contain query words.
186 </para>
188 <para>
189 For searches within <productname>PostgreSQL</productname>,
190 a document is normally a textual field within a row of a database table,
191 or possibly a combination (concatenation) of such fields, perhaps stored
192 in several tables or obtained dynamically. In other words, a document can
193 be constructed from different parts for indexing and it might not be
194 stored anywhere as a whole. For example:
196 <programlisting>
197 SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
198 FROM messages
199 WHERE mid = 12;
201 SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
202 FROM messages m, docs d
203 WHERE mid = did AND mid = 12;
204 </programlisting>
205 </para>
207 <note>
208 <para>
209 Actually, in these example queries, <function>coalesce</function>
210 should be used to prevent a single <literal>NULL</literal> attribute from
211 causing a <literal>NULL</literal> result for the whole document.
212 </para>
213 </note>
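
   <para>
    For example, the first of the queries above might be written in a
    null-safe way like this (a sketch reusing the same
    <structname>messages</> table):

<programlisting>
SELECT coalesce(title,'') || ' ' || coalesce(author,'') || ' ' ||
       coalesce(abstract,'') || ' ' || coalesce(body,'') AS document
FROM messages
WHERE mid = 12;
</programlisting>
   </para>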
215 <para>
216 Another possibility is to store the documents as simple text files in the
217 file system. In this case, the database can be used to store the full text
218 index and to execute searches, and some unique identifier can be used to
219 retrieve the document from the file system. However, retrieving files
220 from outside the database requires superuser permissions or special
221 function support, so this is usually less convenient than keeping all
222 the data inside <productname>PostgreSQL</productname>. Also, keeping
223 everything inside the database allows easy access
224 to document metadata to assist in indexing and display.
225 </para>
227 <para>
228 For text search purposes, each document must be reduced to the
229 preprocessed <type>tsvector</> format. Searching and ranking
230 are performed entirely on the <type>tsvector</> representation
231 of a document &mdash; the original text need only be retrieved
232 when the document has been selected for display to a user.
233 We therefore often speak of the <type>tsvector</> as being the
234 document, but of course it is only a compact representation of
235 the full document.
236 </para>
237 </sect2>
239 <sect2 id="textsearch-matching">
240 <title>Basic Text Matching</title>
242 <para>
243 Full text searching in <productname>PostgreSQL</productname> is based on
244 the match operator <literal>@@</literal>, which returns
245 <literal>true</literal> if a <type>tsvector</type>
246 (document) matches a <type>tsquery</type> (query).
247 It doesn't matter which data type is written first:
<programlisting>
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat &amp; rat'::tsquery;
 ?column?
----------
 t

SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 f
</programlisting>
260 </para>
262 <para>
263 As the above example suggests, a <type>tsquery</type> is not just raw
264 text, any more than a <type>tsvector</type> is. A <type>tsquery</type>
265 contains search terms, which must be already-normalized lexemes, and
266 may combine multiple terms using AND, OR, and NOT operators.
267 (For details see <xref linkend="datatype-textsearch">.) There are
268 functions <function>to_tsquery</> and <function>plainto_tsquery</>
269 that are helpful in converting user-written text into a proper
270 <type>tsquery</type>, for example by normalizing words appearing in
271 the text. Similarly, <function>to_tsvector</> is used to parse and
272 normalize a document string. So in practice a text search match would
273 look more like this:
<programlisting>
SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat &amp; rat');
 ?column?
----------
 t
</programlisting>
282 Observe that this match would not succeed if written as
<programlisting>
SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat &amp; rat');
 ?column?
----------
 f
</programlisting>
291 since here no normalization of the word <literal>rats</> will occur.
292 The elements of a <type>tsvector</> are lexemes, which are assumed
293 already normalized, so <literal>rats</> does not match <literal>rat</>.
294 </para>
296 <para>
297 The <literal>@@</literal> operator also
298 supports <type>text</type> input, allowing explicit conversion of a text
299 string to <type>tsvector</type> or <type>tsquery</> to be skipped
300 in simple cases. The variants available are:
302 <programlisting>
303 tsvector @@ tsquery
304 tsquery @@ tsvector
305 text @@ tsquery
306 text @@ text
307 </programlisting>
308 </para>
310 <para>
311 The first two of these we saw already.
312 The form <type>text</type> <literal>@@</literal> <type>tsquery</type>
313 is equivalent to <literal>to_tsvector(x) @@ y</literal>.
314 The form <type>text</type> <literal>@@</literal> <type>text</type>
315 is equivalent to <literal>to_tsvector(x) @@ plainto_tsquery(y)</literal>.
316 </para>
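
  <para>
   For example, the following relies on both implicit conversions
   (assuming <varname>default_text_search_config</varname> is set to
   <literal>english</>, so that <literal>rats</> and <literal>rat</>
   normalize to the same lexeme):

<programlisting>
SELECT 'fat cats ate fat rats' @@ 'fat rat';
 ?column?
----------
 t
</programlisting>
  </para>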
317 </sect2>
319 <sect2 id="textsearch-intro-configurations">
320 <title>Configurations</title>
322 <para>
323 The above are all simple text search examples. As mentioned before, full
324 text search functionality includes the ability to do many more things:
skip indexing certain words (stop words), process synonyms, and use
sophisticated parsing, e.g., parsing based on more than just white space.
327 This functionality is controlled by <firstterm>text search
328 configurations</>. <productname>PostgreSQL</> comes with predefined
329 configurations for many languages, and you can easily create your own
330 configurations. (<application>psql</>'s <command>\dF</> command
331 shows all available configurations.)
332 </para>
334 <para>
335 During installation an appropriate configuration is selected and
336 <xref linkend="guc-default-text-search-config"> is set accordingly
337 in <filename>postgresql.conf</>. If you are using the same text search
338 configuration for the entire cluster you can use the value in
339 <filename>postgresql.conf</>. To use different configurations
340 throughout the cluster but the same configuration within any one database,
341 use <command>ALTER DATABASE ... SET</>. Otherwise, you can set
342 <varname>default_text_search_config</varname> in each session.
343 </para>
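
  <para>
   For example, one might pin the configuration for a single database, or
   override it for the current session only (the database name
   <literal>mydb</> is just an illustration):

<programlisting>
ALTER DATABASE mydb SET default_text_search_config TO 'english';

SET default_text_search_config = 'english';
</programlisting>
  </para>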
345 <para>
346 Each text search function that depends on a configuration has an optional
347 <type>regconfig</> argument, so that the configuration to use can be
348 specified explicitly. <varname>default_text_search_config</varname>
349 is used only when this argument is omitted.
350 </para>
352 <para>
353 To make it easier to build custom text search configurations, a
354 configuration is built up from simpler database objects.
355 <productname>PostgreSQL</>'s text search facility provides
356 four types of configuration-related database objects:
357 </para>
359 <itemizedlist spacing="compact" mark="bullet">
360 <listitem>
361 <para>
362 <firstterm>Text search parsers</> break documents into tokens
363 and classify each token (for example, as words or numbers).
364 </para>
365 </listitem>
367 <listitem>
368 <para>
369 <firstterm>Text search dictionaries</> convert tokens to normalized
370 form and reject stop words.
371 </para>
372 </listitem>
374 <listitem>
375 <para>
376 <firstterm>Text search templates</> provide the functions underlying
377 dictionaries. (A dictionary simply specifies a template and a set
378 of parameters for the template.)
379 </para>
380 </listitem>
382 <listitem>
383 <para>
384 <firstterm>Text search configurations</> select a parser and a set
385 of dictionaries to use to normalize the tokens produced by the parser.
386 </para>
387 </listitem>
388 </itemizedlist>
390 <para>
391 Text search parsers and templates are built from low-level C functions;
392 therefore it requires C programming ability to develop new ones, and
393 superuser privileges to install one into a database. (There are examples
394 of add-on parsers and templates in the <filename>contrib/</> area of the
395 <productname>PostgreSQL</> distribution.) Since dictionaries and
396 configurations just parameterize and connect together some underlying
397 parsers and templates, no special privilege is needed to create a new
398 dictionary or configuration. Examples of creating custom dictionaries and
399 configurations appear later in this chapter.
400 </para>
402 </sect2>
404 </sect1>
406 <sect1 id="textsearch-tables">
407 <title>Tables and Indexes</title>
409 <para>
410 The examples in the previous section illustrated full text matching using
411 simple constant strings. This section shows how to search table data,
412 optionally using indexes.
413 </para>
415 <sect2 id="textsearch-tables-search">
416 <title>Searching a Table</title>
418 <para>
419 It is possible to do full text search with no index. A simple query
420 to print the <structname>title</> of each row that contains the word
421 <literal>friend</> in its <structfield>body</> field is:
423 <programlisting>
424 SELECT title
425 FROM pgweb
426 WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
427 </programlisting>
429 This will also find related words such as <literal>friends</>
430 and <literal>friendly</>, since all these are reduced to the same
431 normalized lexeme.
432 </para>
434 <para>
435 The query above specifies that the <literal>english</> configuration
436 is to be used to parse and normalize the strings. Alternatively we
437 could omit the configuration parameters:
439 <programlisting>
440 SELECT title
441 FROM pgweb
442 WHERE to_tsvector(body) @@ to_tsquery('friend');
443 </programlisting>
445 This query will use the configuration set by <xref
446 linkend="guc-default-text-search-config">.
447 </para>
449 <para>
450 A more complex example is to
451 select the ten most recent documents that contain <literal>create</> and
452 <literal>table</> in the <structname>title</> or <structname>body</>:
<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create &amp; table')
ORDER BY last_mod_date DESC LIMIT 10;
</programlisting>
For clarity we omitted the <function>coalesce</function> function,
which would be needed to search rows that contain <literal>NULL</literal>
in one of the two fields.
464 </para>
466 <para>
467 Although these queries will work without an index, most applications
468 will find this approach too slow, except perhaps for occasional ad-hoc
469 searches. Practical use of text searching usually requires creating
470 an index.
471 </para>
473 </sect2>
475 <sect2 id="textsearch-tables-index">
476 <title>Creating Indexes</title>
478 <para>
479 We can create a <acronym>GIN</acronym> index (<xref
480 linkend="textsearch-indexes">) to speed up text searches:
482 <programlisting>
483 CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));
484 </programlisting>
486 Notice that the 2-argument version of <function>to_tsvector</function> is
487 used. Only text search functions that specify a configuration name can
488 be used in expression indexes (<xref linkend="indexes-expressional">).
489 This is because the index contents must be unaffected by <xref
490 linkend="guc-default-text-search-config">. If they were affected, the
491 index contents might be inconsistent because different entries could
492 contain <type>tsvector</>s that were created with different text search
493 configurations, and there would be no way to guess which was which. It
494 would be impossible to dump and restore such an index correctly.
495 </para>
497 <para>
498 Because the two-argument version of <function>to_tsvector</function> was
499 used in the index above, only a query reference that uses the 2-argument
500 version of <function>to_tsvector</function> with the same configuration
501 name will use that index. That is, <literal>WHERE
502 to_tsvector('english', body) @@ 'a &amp; b'</> can use the index,
503 but <literal>WHERE to_tsvector(body) @@ 'a &amp; b'</> cannot.
504 This ensures that an index will be used only with the same configuration
505 used to create the index entries.
506 </para>
508 <para>
509 It is possible to set up more complex expression indexes wherein the
510 configuration name is specified by another column, e.g.:
512 <programlisting>
513 CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));
514 </programlisting>
516 where <literal>config_name</> is a column in the <literal>pgweb</>
517 table. This allows mixed configurations in the same index while
518 recording which configuration was used for each index entry. This
519 would be useful, for example, if the document collection contained
520 documents in different languages. Again,
521 queries that are to use the index must be phrased to match, e.g.
522 <literal>WHERE to_tsvector(config_name, body) @@ 'a &amp; b'</>.
523 </para>
525 <para>
526 Indexes can even concatenate columns:
<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || ' ' || body));
</programlisting>
531 </para>
533 <para>
534 Another approach is to create a separate <type>tsvector</> column
535 to hold the output of <function>to_tsvector</>. This example is a
536 concatenation of <literal>title</literal> and <literal>body</literal>,
537 using <function>coalesce</> to ensure that one field will still be
538 indexed when the other is <literal>NULL</>:
<programlisting>
ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;
UPDATE pgweb SET textsearchable_index_col =
    to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
</programlisting>
546 Then we create a <acronym>GIN</acronym> index to speed up the search:
548 <programlisting>
549 CREATE INDEX textsearch_idx ON pgweb USING gin(textsearchable_index_col);
550 </programlisting>
552 Now we are ready to perform a fast full text search:
554 <programlisting>
555 SELECT title
556 FROM pgweb
557 WHERE textsearchable_index_col @@ to_tsquery('create &amp; table')
558 ORDER BY last_mod_date DESC LIMIT 10;
559 </programlisting>
560 </para>
562 <para>
563 When using a separate column to store the <type>tsvector</>
564 representation,
565 it is necessary to create a trigger to keep the <type>tsvector</>
566 column current anytime <literal>title</> or <literal>body</> changes.
567 <xref linkend="textsearch-update-triggers"> explains how to do that.
568 </para>
570 <para>
571 One advantage of the separate-column approach over an expression index
572 is that it is not necessary to explicitly specify the text search
573 configuration in queries in order to make use of the index. As shown
574 in the example above, the query can depend on
575 <varname>default_text_search_config</>. Another advantage is that
576 searches will be faster, since it will not be necessary to redo the
577 <function>to_tsvector</> calls to verify index matches. (This is more
578 important when using a GiST index than a GIN index; see <xref
579 linkend="textsearch-indexes">.) The expression-index approach is
580 simpler to set up, however, and it requires less disk space since the
581 <type>tsvector</> representation is not stored explicitly.
582 </para>
584 </sect2>
586 </sect1>
588 <sect1 id="textsearch-controls">
589 <title>Controlling Text Search</title>
591 <para>
592 To implement full text searching there must be a function to create a
593 <type>tsvector</type> from a document and a <type>tsquery</type> from a
594 user query. Also, we need to return results in a useful order, so we need
595 a function that compares documents with respect to their relevance to
596 the query. It's also important to be able to display the results nicely.
597 <productname>PostgreSQL</productname> provides support for all of these
598 functions.
599 </para>
601 <sect2 id="textsearch-parsing-documents">
602 <title>Parsing Documents</title>
604 <para>
605 <productname>PostgreSQL</productname> provides the
606 function <function>to_tsvector</function> for converting a document to
607 the <type>tsvector</type> data type.
608 </para>
610 <indexterm>
611 <primary>to_tsvector</primary>
612 </indexterm>
614 <synopsis>
615 to_tsvector(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>) returns <type>tsvector</>
616 </synopsis>
618 <para>
619 <function>to_tsvector</function> parses a textual document into tokens,
620 reduces the tokens to lexemes, and returns a <type>tsvector</type> which
621 lists the lexemes together with their positions in the document.
622 The document is processed according to the specified or default
623 text search configuration.
624 Here is a simple example:
626 <programlisting>
627 SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
628 to_tsvector
629 -----------------------------------------------------
630 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
631 </programlisting>
632 </para>
634 <para>
635 In the example above we see that the resulting <type>tsvector</type> does not
636 contain the words <literal>a</literal>, <literal>on</literal>, or
637 <literal>it</literal>, the word <literal>rats</literal> became
638 <literal>rat</literal>, and the punctuation sign <literal>-</literal> was
639 ignored.
640 </para>
642 <para>
643 The <function>to_tsvector</function> function internally calls a parser
644 which breaks the document text into tokens and assigns a type to
645 each token. For each token, a list of
646 dictionaries (<xref linkend="textsearch-dictionaries">) is consulted,
647 where the list can vary depending on the token type. The first dictionary
648 that <firstterm>recognizes</> the token emits one or more normalized
649 <firstterm>lexemes</firstterm> to represent the token. For example,
650 <literal>rats</literal> became <literal>rat</literal> because one of the
651 dictionaries recognized that the word <literal>rats</literal> is a plural
652 form of <literal>rat</literal>. Some words are recognized as
653 <firstterm>stop words</> (<xref linkend="textsearch-stopwords">), which
654 causes them to be ignored since they occur too frequently to be useful in
655 searching. In our example these are
656 <literal>a</literal>, <literal>on</literal>, and <literal>it</literal>.
657 If no dictionary in the list recognizes the token then it is also ignored.
658 In this example that happened to the punctuation sign <literal>-</literal>
659 because there are in fact no dictionaries assigned for its token type
660 (<literal>Space symbols</literal>), meaning space tokens will never be
661 indexed. The choices of parser, dictionaries and which types of tokens to
662 index are determined by the selected text search configuration (<xref
663 linkend="textsearch-configuration">). It is possible to have
664 many different configurations in the same database, and predefined
665 configurations are available for various languages. In our example
666 we used the default configuration <literal>english</literal> for the
667 English language.
668 </para>
670 <para>
671 The function <function>setweight</function> can be used to label the
672 entries of a <type>tsvector</type> with a given <firstterm>weight</>,
673 where a weight is one of the letters <literal>A</>, <literal>B</>,
674 <literal>C</>, or <literal>D</>.
675 This is typically used to mark entries coming from
676 different parts of a document, such as title versus body. Later, this
677 information can be used for ranking of search results.
678 </para>
680 <para>
681 Because <function>to_tsvector</function>(<literal>NULL</literal>) will
682 return <literal>NULL</literal>, it is recommended to use
683 <function>coalesce</function> whenever a field might be null.
684 Here is the recommended method for creating
685 a <type>tsvector</type> from a structured document:
687 <programlisting>
688 UPDATE tt SET ti =
689 setweight(to_tsvector(coalesce(title,'')), 'A') ||
690 setweight(to_tsvector(coalesce(keyword,'')), 'B') ||
691 setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
692 setweight(to_tsvector(coalesce(body,'')), 'D');
693 </programlisting>
695 Here we have used <function>setweight</function> to label the source
696 of each lexeme in the finished <type>tsvector</type>, and then merged
697 the labeled <type>tsvector</type> values using the <type>tsvector</>
698 concatenation operator <literal>||</>. (<xref
699 linkend="textsearch-manipulate-tsvector"> gives details about these
700 operations.)
701 </para>
703 </sect2>
705 <sect2 id="textsearch-parsing-queries">
706 <title>Parsing Queries</title>
708 <para>
709 <productname>PostgreSQL</productname> provides the
710 functions <function>to_tsquery</function> and
711 <function>plainto_tsquery</function> for converting a query to
712 the <type>tsquery</type> data type. <function>to_tsquery</function>
713 offers access to more features than <function>plainto_tsquery</function>,
714 but is less forgiving about its input.
715 </para>
717 <indexterm>
718 <primary>to_tsquery</primary>
719 </indexterm>
721 <synopsis>
722 to_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
723 </synopsis>
725 <para>
726 <function>to_tsquery</function> creates a <type>tsquery</> value from
727 <replaceable>querytext</replaceable>, which must consist of single tokens
728 separated by the Boolean operators <literal>&amp;</literal> (AND),
729 <literal>|</literal> (OR) and <literal>!</literal> (NOT). These operators
730 can be grouped using parentheses. In other words, the input to
731 <function>to_tsquery</function> must already follow the general rules for
732 <type>tsquery</> input, as described in <xref
733 linkend="datatype-textsearch">. The difference is that while basic
734 <type>tsquery</> input takes the tokens at face value,
735 <function>to_tsquery</function> normalizes each token to a lexeme using
736 the specified or default configuration, and discards any tokens that are
737 stop words according to the configuration. For example:
739 <programlisting>
740 SELECT to_tsquery('english', 'The &amp; Fat &amp; Rats');
741 to_tsquery
742 ---------------
743 'fat' &amp; 'rat'
744 </programlisting>
746 As in basic <type>tsquery</> input, weight(s) can be attached to each
747 lexeme to restrict it to match only <type>tsvector</> lexemes of those
748 weight(s). For example:
750 <programlisting>
751 SELECT to_tsquery('english', 'Fat | Rats:AB');
752 to_tsquery
753 ------------------
754 'fat' | 'rat':AB
755 </programlisting>
757 Also, <literal>*</> can be attached to a lexeme to specify prefix matching:
759 <programlisting>
760 SELECT to_tsquery('supern:*A &amp; star:A*B');
761 to_tsquery
762 --------------------------
763 'supern':*A &amp; 'star':*AB
764 </programlisting>
766 Such a lexeme will match any word in a <type>tsvector</> that begins
767 with the given string.
768 </para>
770 <para>
771 <function>to_tsquery</function> can also accept single-quoted
772 phrases. This is primarily useful when the configuration includes a
773 thesaurus dictionary that may trigger on such phrases.
774 In the example below, a thesaurus contains the rule <literal>supernovae
775 stars : sn</literal>:
777 <programlisting>
778 SELECT to_tsquery('''supernovae stars'' &amp; !crab');
779 to_tsquery
780 ---------------
781 'sn' &amp; !'crab'
782 </programlisting>
784 Without quotes, <function>to_tsquery</function> will generate a syntax
785 error for tokens that are not separated by an AND or OR operator.
786 </para>
788 <indexterm>
789 <primary>plainto_tsquery</primary>
790 </indexterm>
792 <synopsis>
793 plainto_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
794 </synopsis>
796 <para>
797 <function>plainto_tsquery</> transforms unformatted text
798 <replaceable>querytext</replaceable> to <type>tsquery</type>.
799 The text is parsed and normalized much as for <function>to_tsvector</>,
800 then the <literal>&amp;</literal> (AND) Boolean operator is inserted
801 between surviving words.
802 </para>
804 <para>
805 Example:
807 <programlisting>
808 SELECT plainto_tsquery('english', 'The Fat Rats');
809 plainto_tsquery
810 -----------------
811 'fat' &amp; 'rat'
812 </programlisting>
814 Note that <function>plainto_tsquery</> cannot
815 recognize Boolean operators, weight labels, or prefix-match labels
816 in its input:
818 <programlisting>
819 SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');
820 plainto_tsquery
821 ---------------------
822 'fat' &amp; 'rat' &amp; 'c'
823 </programlisting>
825 Here, all the input punctuation was discarded as being space symbols.
826 </para>
828 </sect2>
830 <sect2 id="textsearch-ranking">
831 <title>Ranking Search Results</title>
833 <para>
834 Ranking attempts to measure how relevant documents are to a particular
835 query, so that when there are many matches the most relevant ones can be
836 shown first. <productname>PostgreSQL</productname> provides two
837 predefined ranking functions, which take into account lexical, proximity,
838 and structural information; that is, they consider how often the query
839 terms appear in the document, how close together the terms are in the
840 document, and how important is the part of the document where they occur.
841 However, the concept of relevancy is vague and very application-specific.
842 Different applications might require additional information for ranking,
843 e.g. document modification time. The built-in ranking functions are only
844 examples. You can write your own ranking functions and/or combine their
845 results with additional factors to fit your specific needs.
846 </para>
848 <para>
849 The two ranking functions currently available are:
851 <variablelist>
853 <varlistentry>
855 <indexterm>
856 <primary>ts_rank</primary>
857 </indexterm>
859 <term>
860 <synopsis>
861 ts_rank(<optional> <replaceable class="PARAMETER">weights</replaceable> <type>float4[]</>, </optional> <replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>, <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">normalization</replaceable> <type>integer</> </optional>) returns <type>float4</>
862 </synopsis>
863 </term>
865 <listitem>
866 <para>
Standard ranking function, which ranks vectors based on the
frequency of their matching lexemes.
868 </para>
869 </listitem>
870 </varlistentry>
872 <varlistentry>
874 <indexterm>
875 <primary>ts_rank_cd</primary>
876 </indexterm>
878 <term>
879 <synopsis>
880 ts_rank_cd(<optional> <replaceable class="PARAMETER">weights</replaceable> <type>float4[]</>, </optional> <replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>, <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">normalization</replaceable> <type>integer</> </optional>) returns <type>float4</>
881 </synopsis>
882 </term>
884 <listitem>
885 <para>
886 This function computes the <firstterm>cover density</firstterm>
887 ranking for the given document vector and query, as described in
888 Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three
889 Term Queries" in the journal "Information Processing and Management",
890 1999.
891 </para>
893 <para>
894 This function requires positional information in its input.
895 Therefore it will not work on <quote>stripped</> <type>tsvector</>
896 values &mdash; it will always return zero.
897 </para>
898 </listitem>
899 </varlistentry>
901 </variablelist>
903 </para>
905 <para>
906 For both these functions,
907 the optional <replaceable class="PARAMETER">weights</replaceable>
908 argument offers the ability to weigh word instances more or less
909 heavily depending on how they are labeled. The weight arrays specify
910 how heavily to weigh each category of word, in the order:
912 <programlisting>
913 {D-weight, C-weight, B-weight, A-weight}
914 </programlisting>
916 If no <replaceable class="PARAMETER">weights</replaceable> are provided,
917 then these defaults are used:
919 <programlisting>
920 {0.1, 0.2, 0.4, 1.0}
921 </programlisting>
923 Typically weights are used to mark words from special areas of the
924 document, like the title or an initial abstract, so that they can be
925 treated as more or less important than words in the document body.
926 </para>
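
  <para>
   For example, to rank matches only by their <literal>A</>-labeled
   words (such as titles), one might pass an explicit array; this sketch
   borrows the <structname>apod</> table from the examples below and
   assumes its <structfield>textsearch</> column was built with
   <function>setweight</> labels:

<programlisting>
SELECT title, ts_rank('{0, 0, 0, 1.0}', textsearch, query) AS rank
FROM apod, to_tsquery('neutrino') query
WHERE query @@ textsearch
ORDER BY rank DESC LIMIT 10;
</programlisting>
  </para>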
928 <para>
929 Since a longer document has a greater chance of containing a query term
930 it is reasonable to take into account document size, e.g. a hundred-word
931 document with five instances of a search word is probably more relevant
932 than a thousand-word document with five instances. Both ranking functions
933 take an integer <replaceable>normalization</replaceable> option that
934 specifies whether and how a document's length should impact its rank.
935 The integer option controls several behaviors, so it is a bit mask:
936 you can specify one or more behaviors using
937 <literal>|</literal> (for example, <literal>2|4</literal>).
939 <itemizedlist spacing="compact" mark="bullet">
940 <listitem>
941 <para>
942 0 (the default) ignores the document length
943 </para>
944 </listitem>
945 <listitem>
946 <para>
947 1 divides the rank by 1 + the logarithm of the document length
948 </para>
949 </listitem>
950 <listitem>
951 <para>
952 2 divides the rank by the document length
953 </para>
954 </listitem>
955 <listitem>
956 <para>
957 4 divides the rank by the mean harmonic distance between extents
958 (this is implemented only by <function>ts_rank_cd</>)
959 </para>
960 </listitem>
961 <listitem>
962 <para>
8 divides the rank by the number of unique words in the document
964 </para>
965 </listitem>
966 <listitem>
967 <para>
16 divides the rank by 1 + the logarithm of the number
of unique words in the document
970 </para>
971 </listitem>
972 <listitem>
973 <para>
974 32 divides the rank by itself + 1
975 </para>
976 </listitem>
977 </itemizedlist>
979 If more than one flag bit is specified, the transformations are
980 applied in the order listed.
981 </para>
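
  <para>
   For example, to divide each rank both by the document length and by
   the mean harmonic distance between extents (a combination that is
   only meaningful for <function>ts_rank_cd</>), one might write a query
   like this sketch against the <structname>apod</> table used below:

<programlisting>
SELECT title, ts_rank_cd(textsearch, query, 2|4) AS rank
FROM apod, to_tsquery('neutrino') query
WHERE query @@ textsearch
ORDER BY rank DESC LIMIT 10;
</programlisting>
  </para>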
983 <para>
984 It is important to note that the ranking functions do not use any global
985 information, so it is impossible to produce a fair normalization to 1% or
986 100% as sometimes desired. Normalization option 32
987 (<literal>rank/(rank+1)</literal>) can be applied to scale all ranks
988 into the range zero to one, but of course this is just a cosmetic change;
989 it will not affect the ordering of the search results.
990 </para>
992 <para>
993 Here is an example that selects only the ten highest-ranked matches:
995 <programlisting>
996 SELECT title, ts_rank_cd(textsearch, query) AS rank
997 FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
998 WHERE query @@ textsearch
999 ORDER BY rank DESC LIMIT 10;
1000 title | rank
1001 -----------------------------------------------+----------
1002 Neutrinos in the Sun | 3.1
1003 The Sudbury Neutrino Detector | 2.4
1004 A MACHO View of Galactic Dark Matter | 2.01317
1005 Hot Gas and Dark Matter | 1.91171
1006 The Virgo Cluster: Hot Plasma and Dark Matter | 1.90953
1007 Rafting for Solar Neutrinos | 1.9
1008 NGC 4650A: Strange Galaxy and Dark Matter | 1.85774
1009 Hot Gas and Dark Matter | 1.6123
1010 Ice Fishing for Cosmic Neutrinos | 1.6
1011 Weak Lensing Distorts the Universe | 0.818218
1012 </programlisting>
1014 This is the same example using normalized ranking:
1016 <programlisting>
1017 SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
1018 FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
1019 WHERE query @@ textsearch
1020 ORDER BY rank DESC LIMIT 10;
1021 title | rank
1022 -----------------------------------------------+-------------------
1023 Neutrinos in the Sun | 0.756097569485493
1024 The Sudbury Neutrino Detector | 0.705882361190954
1025 A MACHO View of Galactic Dark Matter | 0.668123210574724
1026 Hot Gas and Dark Matter | 0.65655958650282
1027 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
1028 Rafting for Solar Neutrinos | 0.655172410958162
1029 NGC 4650A: Strange Galaxy and Dark Matter | 0.650072921219637
1030 Hot Gas and Dark Matter | 0.617195790024749
1031 Ice Fishing for Cosmic Neutrinos | 0.615384618911517
1032 Weak Lensing Distorts the Universe | 0.450010798361481
1033 </programlisting>
1034 </para>
1036 <para>
1037 Ranking can be expensive since it requires consulting the
1038 <type>tsvector</type> of each matching document, which can be I/O bound and
1039 therefore slow. Unfortunately, it is almost impossible to avoid since
1040 practical queries often result in large numbers of matches.
1041 </para>
1043 </sect2>
1045 <sect2 id="textsearch-headline">
1046 <title>Highlighting Results</title>
1048 <para>
1049 To present search results it is ideal to show a part of each document and
1050 how it is related to the query. Usually, search engines show fragments of
1051 the document with marked search terms. <productname>PostgreSQL</>
1052 provides a function <function>ts_headline</function> that
1053 implements this functionality.
1054 </para>
1056 <indexterm>
1057 <primary>ts_headline</primary>
1058 </indexterm>
1060 <synopsis>
1061 ts_headline(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>, <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">options</replaceable> <type>text</> </optional>) returns <type>text</>
1062 </synopsis>
1064 <para>
1065 <function>ts_headline</function> accepts a document along
1066 with a query, and returns an excerpt from
1067 the document in which terms from the query are highlighted. The
1068 configuration to be used to parse the document can be specified by
1069 <replaceable>config</replaceable>; if <replaceable>config</replaceable>
1070 is omitted, the
1071 <varname>default_text_search_config</varname> configuration is used.
1072 </para>
1074 <para>
1075 If an <replaceable>options</replaceable> string is specified it must
1076 consist of a comma-separated list of one or more
1077 <replaceable>option</><literal>=</><replaceable>value</> pairs.
1078 The available options are:
1080 <itemizedlist spacing="compact" mark="bullet">
1081 <listitem>
1082 <para>
1083 <literal>StartSel</>, <literal>StopSel</literal>: the strings with which
1084 query words appearing in the document should be delimited to distinguish
1085 them from other excerpted words.
1086 </para>
1087 </listitem>
<listitem>
1089 <para>
1090 <literal>MaxWords</>, <literal>MinWords</literal>: these numbers
1091 determine the longest and shortest headlines to output.
1092 </para>
1093 </listitem>
1094 <listitem>
1095 <para>
1096 <literal>ShortWord</literal>: words of this length or less will be
1097 dropped at the start and end of a headline. The default
1098 value of three eliminates the English articles.
1099 </para>
1100 </listitem>
1101 <listitem>
1102 <para>
1103 <literal>HighlightAll</literal>: Boolean flag; if
1104 <literal>true</literal> the whole document will be highlighted.
1105 </para>
1106 </listitem>
1107 </itemizedlist>
1109 Any unspecified options receive these defaults:
1111 <programlisting>
1112 StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
1113 </programlisting>
1114 </para>
1116 <para>
1117 For example:
1119 <programlisting>
1120 SELECT ts_headline('english', 'The most common type of search
1121 is to find all documents containing given query terms
1122 and return them in order of their similarity to the
1123 query.', to_tsquery('query &amp; similarity'));
1124 ts_headline
1125 ------------------------------------------------------------
1126 given &lt;b&gt;query&lt;/b&gt; terms
1127 and return them in order of their &lt;b&gt;similarity&lt;/b&gt; to the
1128 &lt;b&gt;query&lt;/b&gt;.
1130 SELECT ts_headline('english', 'The most common type of search
1131 is to find all documents containing given query terms
1132 and return them in order of their similarity to the
1133 query.',
1134 to_tsquery('query &amp; similarity'),
1135 'StartSel = &lt;, StopSel = &gt;');
1136 ts_headline
1137 -------------------------------------------------------
1138 given &lt;query&gt; terms
1139 and return them in order of their &lt;similarity&gt; to the
1140 &lt;query&gt;.
1141 </programlisting>
1142 </para>
1144 <para>
1145 <function>ts_headline</> uses the original document, not a
1146 <type>tsvector</type> summary, so it can be slow and should be used with
1147 care. A typical mistake is to call <function>ts_headline</function> for
1148 <emphasis>every</emphasis> matching document when only ten documents are
1149 to be shown. <acronym>SQL</acronym> subqueries can help; here is an
1150 example:
1152 <programlisting>
1153 SELECT id, ts_headline(body, q), rank
1154 FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
1155 FROM apod, to_tsquery('stars') q
1156 WHERE ti @@ q
1157 ORDER BY rank DESC LIMIT 10) AS foo;
1158 </programlisting>
1159 </para>
1161 </sect2>
1163 </sect1>
1165 <sect1 id="textsearch-features">
1166 <title>Additional Features</title>
1168 <para>
1169 This section describes additional functions and operators that are
1170 useful in connection with text search.
1171 </para>
1173 <sect2 id="textsearch-manipulate-tsvector">
1174 <title>Manipulating Documents</title>
1176 <para>
1177 <xref linkend="textsearch-parsing-documents"> showed how raw textual
1178 documents can be converted into <type>tsvector</> values.
1179 <productname>PostgreSQL</productname> also provides functions and
1180 operators that can be used to manipulate documents that are already
1181 in <type>tsvector</> form.
1182 </para>
1184 <variablelist>
1186 <varlistentry>
1188 <indexterm>
1189 <primary>tsvector concatenation</primary>
1190 </indexterm>
1192 <term>
1193 <synopsis>
1194 <type>tsvector</> || <type>tsvector</>
1195 </synopsis>
1196 </term>
1198 <listitem>
1199 <para>
1200 The <type>tsvector</> concatenation operator
1201 returns a vector which combines the lexemes and positional information
1202 of the two vectors given as arguments. Positions and weight labels
1203 are retained during the concatenation.
1204 Positions appearing in the right-hand vector are offset by the largest
1205 position mentioned in the left-hand vector, so that the result is
1206 nearly equivalent to the result of performing <function>to_tsvector</>
1207 on the concatenation of the two original document strings. (The
1208 equivalence is not exact, because any stop-words removed from the
1209 end of the left-hand argument will not affect the result, whereas
1210 they would have affected the positions of the lexemes in the
1211 right-hand argument if textual concatenation were used.)
1212 </para>
1214 <para>
1215 One advantage of using concatenation in the vector form, rather than
1216 concatenating text before applying <function>to_tsvector</>, is that
1217 you can use different configurations to parse different sections
1218 of the document. Also, because the <function>setweight</> function
1219 marks all lexemes of the given vector the same way, it is necessary
1220 to parse the text and do <function>setweight</> before concatenating
1221 if you want to label different parts of the document with different
1222 weights.
1223 </para>
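
     <para>
      For example, here the largest position on the left is 2, so the
      right-hand positions are offset by 2 and the two sets of positions
      for <literal>b</> are merged:

<programlisting>
SELECT 'a:1 b:2'::tsvector || 'c:1 d:2 b:3'::tsvector;
         ?column?
---------------------------
 'a':1 'b':2,5 'c':3 'd':4
</programlisting>
     </para>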
1224 </listitem>
1225 </varlistentry>
1227 <varlistentry>
1229 <indexterm>
1230 <primary>setweight</primary>
1231 </indexterm>
1233 <term>
1234 <synopsis>
1235 setweight(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>, <replaceable class="PARAMETER">weight</replaceable> <type>"char"</>) returns <type>tsvector</>
1236 </synopsis>
1237 </term>
1239 <listitem>
1240 <para>
1241 This function returns a copy of the input vector in which every
1242 position has been labeled with the given <replaceable>weight</>, either
1243 <literal>A</literal>, <literal>B</literal>, <literal>C</literal>, or
1244 <literal>D</literal>. (<literal>D</literal> is the default for new
1245 vectors and as such is not displayed on output.) These labels are
1246 retained when vectors are concatenated, allowing words from different
1247 parts of a document to be weighted differently by ranking functions.
1248 </para>
1250 <para>
1251 Note that weight labels apply to <emphasis>positions</>, not
1252 <emphasis>lexemes</>. If the input vector has been stripped of
1253 positions then <function>setweight</> does nothing.
1254 </para>
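
     <para>
      For example, every position in this vector receives the
      <literal>A</> label:

<programlisting>
SELECT setweight('fat:2,4 cat:3 rat:5'::tsvector, 'A');
           setweight
-------------------------------
 'cat':3A 'fat':2A,4A 'rat':5A
</programlisting>
     </para>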
1255 </listitem>
1256 </varlistentry>
1258 <varlistentry>
1259 <indexterm>
1260 <primary>length(tsvector)</primary>
1261 </indexterm>
1263 <term>
1264 <synopsis>
1265 length(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>) returns <type>integer</>
1266 </synopsis>
1267 </term>
1269 <listitem>
1270 <para>
1271 Returns the number of lexemes stored in the vector.
1272 </para>
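
     <para>
      For example:

<programlisting>
SELECT length('fat:2,4 cat:3 rat:5'::tsvector);
 length
--------
      3
</programlisting>
     </para>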
1273 </listitem>
1274 </varlistentry>
1276 <varlistentry>
1278 <indexterm>
1279 <primary>strip</primary>
1280 </indexterm>
1282 <term>
1283 <synopsis>
1284 strip(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>) returns <type>tsvector</>
1285 </synopsis>
1286 </term>
1288 <listitem>
1289 <para>
1290 Returns a vector which lists the same lexemes as the given vector, but
1291 which lacks any position or weight information. While the returned
1292 vector is much less useful than an unstripped vector for relevance
1293 ranking, it will usually be much smaller.
1294 </para>
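
     <para>
      For example, both the positions and the <literal>A</> label are
      discarded here:

<programlisting>
SELECT strip('fat:2,4 cat:3 rat:5A'::tsvector);
       strip
-------------------
 'cat' 'fat' 'rat'
</programlisting>
     </para>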
1295 </listitem>
1297 </varlistentry>
1299 </variablelist>
1301 </sect2>
1303 <sect2 id="textsearch-manipulate-tsquery">
1304 <title>Manipulating Queries</title>
1306 <para>
1307 <xref linkend="textsearch-parsing-queries"> showed how raw textual
1308 queries can be converted into <type>tsquery</> values.
1309 <productname>PostgreSQL</productname> also provides functions and
1310 operators that can be used to manipulate queries that are already
1311 in <type>tsquery</> form.
1312 </para>
1314 <variablelist>
1316 <varlistentry>
1318 <term>
1319 <synopsis>
1320 <type>tsquery</> &amp;&amp; <type>tsquery</>
1321 </synopsis>
1322 </term>
1324 <listitem>
1325 <para>
1326 Returns the AND-combination of the two given queries.
1327 </para>
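
     <para>
      For example:

<programlisting>
SELECT 'fat | rat'::tsquery &amp;&amp; 'cat'::tsquery;
         ?column?
---------------------------
 ( 'fat' | 'rat' ) &amp; 'cat'
</programlisting>
     </para>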
1328 </listitem>
1330 </varlistentry>
1332 <varlistentry>
1334 <term>
1335 <synopsis>
1336 <type>tsquery</> || <type>tsquery</>
1337 </synopsis>
1338 </term>
1340 <listitem>
1341 <para>
1342 Returns the OR-combination of the two given queries.
1343 </para>
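
     <para>
      For example:

<programlisting>
SELECT 'fat | rat'::tsquery || 'cat'::tsquery;
         ?column?
---------------------------
 ( 'fat' | 'rat' ) | 'cat'
</programlisting>
     </para>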
1344 </listitem>
1346 </varlistentry>
1348 <varlistentry>
1350 <term>
1351 <synopsis>
1352 !! <type>tsquery</>
1353 </synopsis>
1354 </term>
1356 <listitem>
1357 <para>
1358 Returns the negation (NOT) of the given query.
1359 </para>
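
     <para>
      For example:

<programlisting>
SELECT !! 'cat'::tsquery;
 ?column?
----------
 !'cat'
</programlisting>
     </para>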
1360 </listitem>
1362 </varlistentry>
1364 <varlistentry>
1366 <indexterm>
1367 <primary>numnode</primary>
1368 </indexterm>
1370 <term>
1371 <synopsis>
1372 numnode(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>) returns <type>integer</>
1373 </synopsis>
1374 </term>
1376 <listitem>
1377 <para>
1378 Returns the number of nodes (lexemes plus operators) in a
1379 <type>tsquery</>. This function is useful
1380 to determine if the <replaceable>query</replaceable> is meaningful
1381 (returns &gt; 0), or contains only stop words (returns 0).
1382 Examples:
<programlisting>
SELECT numnode(plainto_tsquery('the any'));
NOTICE:  query contains only stopword(s) or doesn't contain lexeme(s), ignored
 numnode
---------
       0

SELECT numnode('foo &amp; bar'::tsquery);
 numnode
---------
       3
</programlisting>
1396 </para>
1397 </listitem>
1398 </varlistentry>
1400 <varlistentry>
1402 <indexterm>
1403 <primary>querytree</primary>
1404 </indexterm>
1406 <term>
1407 <synopsis>
1408 querytree(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>) returns <type>text</>
1409 </synopsis>
1410 </term>
1412 <listitem>
1413 <para>
1414 Returns the portion of a <type>tsquery</> that can be used for
1415 searching an index. This function is useful for detecting
1416 unindexable queries, for example those containing only stop words
1417 or only negated terms. For example:
1419 <programlisting>
1420 SELECT querytree(to_tsquery('!defined'));
1421 querytree
1422 -----------
1424 </programlisting>
1425 </para>
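
     <para>
      A query that mixes indexable and negated terms keeps only the
      indexable part:

<programlisting>
SELECT querytree('cat &amp; !dog'::tsquery);
 querytree
-----------
 'cat'
</programlisting>
     </para>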
1426 </listitem>
1427 </varlistentry>
1429 </variablelist>
1431 <sect3 id="textsearch-query-rewriting">
1432 <title>Query Rewriting</title>
1434 <indexterm zone="textsearch-query-rewriting">
1435 <primary>ts_rewrite</primary>
1436 </indexterm>
1438 <para>
The <function>ts_rewrite</function> family of functions searches a
1440 given <type>tsquery</> for occurrences of a target
1441 subquery, and replace each occurrence with another
1442 substitute subquery. In essence this operation is a
1443 <type>tsquery</>-specific version of substring replacement.
1444 A target and substitute combination can be
1445 thought of as a <firstterm>query rewrite rule</>. A collection
1446 of such rewrite rules can be a powerful search aid.
1447 For example, you can expand the search using synonyms
1448 (e.g., <literal>new york</>, <literal>big apple</>, <literal>nyc</>,
1449 <literal>gotham</>) or narrow the search to direct the user to some hot
1450 topic. There is some overlap in functionality between this feature
1451 and thesaurus dictionaries (<xref linkend="textsearch-thesaurus">).
1452 However, you can modify a set of rewrite rules on-the-fly without
1453 reindexing, whereas updating a thesaurus requires reindexing to be
1454 effective.
1455 </para>
1457 <variablelist>
1459 <varlistentry>
1461 <term>
1462 <synopsis>
1463 ts_rewrite (<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">target</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">substitute</replaceable> <type>tsquery</>) returns <type>tsquery</>
1464 </synopsis>
1465 </term>
1467 <listitem>
1468 <para>
1469 This form of <function>ts_rewrite</> simply applies a single
1470 rewrite rule: <replaceable class="PARAMETER">target</replaceable>
1471 is replaced by <replaceable class="PARAMETER">substitute</replaceable>
1472 wherever it appears in <replaceable
1473 class="PARAMETER">query</replaceable>. For example:
1475 <programlisting>
1476 SELECT ts_rewrite('a &amp; b'::tsquery, 'a'::tsquery, 'c'::tsquery);
1477 ts_rewrite
1478 ------------
1479 'b' &amp; 'c'
1480 </programlisting>
1481 </para>
1482 </listitem>
1483 </varlistentry>
1485 <varlistentry>
1487 <term>
1488 <synopsis>
1489 ts_rewrite (<replaceable class="PARAMETER">query</> <type>tsquery</>, <replaceable class="PARAMETER">select</> <type>text</>) returns <type>tsquery</>
1490 </synopsis>
1491 </term>
1493 <listitem>
1494 <para>
1495 This form of <function>ts_rewrite</> accepts a starting
1496 <replaceable>query</> and a SQL <replaceable>select</> command, which
1497 is given as a text string. The <replaceable>select</> must yield two
1498 columns of <type>tsquery</> type. For each row of the
1499 <replaceable>select</> result, occurrences of the first column value
1500 (the target) are replaced by the second column value (the substitute)
1501 within the current <replaceable>query</> value. For example:
1503 <programlisting>
1504 CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
1505 INSERT INTO aliases VALUES('a', 'c');
1507 SELECT ts_rewrite('a &amp; b'::tsquery, 'SELECT t,s FROM aliases');
1508 ts_rewrite
1509 ------------
1510 'b' &amp; 'c'
1511 </programlisting>
1512 </para>
1514 <para>
1515 Note that when multiple rewrite rules are applied in this way,
1516 the order of application can be important; so in practice you will
1517 want the source query to <literal>ORDER BY</> some ordering key.
1518 </para>
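
      <para>
       For example, to make the application order deterministic one
       might write:

<programlisting>
SELECT ts_rewrite('a &amp; b'::tsquery, 'SELECT t, s FROM aliases ORDER BY t');
</programlisting>
      </para>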
1519 </listitem>
1520 </varlistentry>
1522 </variablelist>
1524 <para>
1525 Let's consider a real-life astronomical example. We'll expand query
1526 <literal>supernovae</literal> using table-driven rewriting rules:
1528 <programlisting>
1529 CREATE TABLE aliases (t tsquery primary key, s tsquery);
1530 INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));
1532 SELECT ts_rewrite(to_tsquery('supernovae &amp; crab'), 'SELECT * FROM aliases');
1533 ts_rewrite
1534 ---------------------------------
1535 'crab' &amp; ( 'supernova' | 'sn' )
1536 </programlisting>
1538 We can change the rewriting rules just by updating the table:
1540 <programlisting>
1541 UPDATE aliases SET s = to_tsquery('supernovae|sn &amp; !nebulae') WHERE t = to_tsquery('supernovae');
1543 SELECT ts_rewrite(to_tsquery('supernovae &amp; crab'), 'SELECT * FROM aliases');
1544 ts_rewrite
1545 ---------------------------------------------
1546 'crab' &amp; ( 'supernova' | 'sn' &amp; !'nebula' )
1547 </programlisting>
1548 </para>
1550 <para>
1551 Rewriting can be slow when there are many rewriting rules, since it
1552 checks every rule for a possible hit. To filter out obvious non-candidate
1553 rules we can use the containment operators for the <type>tsquery</type>
1554 type. In the example below, we select only those rules which might match
1555 the original query:
1557 <programlisting>
1558 SELECT ts_rewrite('a &amp; b'::tsquery,
1559 'SELECT t,s FROM aliases WHERE ''a &amp; b''::tsquery @&gt; t');
1560 ts_rewrite
1561 ------------
1562 'b' &amp; 'c'
1563 </programlisting>
1564 </para>
1566 </sect3>
1568 </sect2>
1570 <sect2 id="textsearch-update-triggers">
1571 <title>Triggers for Automatic Updates</title>
1573 <indexterm>
1574 <primary>trigger</primary>
1575 <secondary>for updating a derived tsvector column</secondary>
1576 </indexterm>
1578 <para>
1579 When using a separate column to store the <type>tsvector</> representation
1580 of your documents, it is necessary to create a trigger to update the
1581 <type>tsvector</> column when the document content columns change.
1582 Two built-in trigger functions are available for this, or you can write
1583 your own.
1584 </para>
1586 <synopsis>
1587 tsvector_update_trigger(<replaceable class="PARAMETER">tsvector_column_name</replaceable>, <replaceable class="PARAMETER">config_name</replaceable>, <replaceable class="PARAMETER">text_column_name</replaceable> <optional>, ... </optional>)
1588 tsvector_update_trigger_column(<replaceable class="PARAMETER">tsvector_column_name</replaceable>, <replaceable class="PARAMETER">config_column_name</replaceable>, <replaceable class="PARAMETER">text_column_name</replaceable> <optional>, ... </optional>)
1589 </synopsis>
1591 <para>
1592 These trigger functions automatically compute a <type>tsvector</>
1593 column from one or more textual columns, under the control of
1594 parameters specified in the <command>CREATE TRIGGER</> command.
1595 An example of their use is:
1597 <programlisting>
1598 CREATE TABLE messages (
1599 title text,
1600 body text,
1601 tsv tsvector
1602 );
1604 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
1605 ON messages FOR EACH ROW EXECUTE PROCEDURE
1606 tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
1608 INSERT INTO messages VALUES('title here', 'the body text is here');
1610 SELECT * FROM messages;
1611 title | body | tsv
1612 ------------+-----------------------+----------------------------
1613 title here | the body text is here | 'bodi':4 'text':5 'titl':1
1615 SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title &amp; body');
1616 title | body
1617 ------------+-----------------------
1618 title here | the body text is here
1619 </programlisting>
1621 Having created this trigger, any change in <structfield>title</> or
1622 <structfield>body</> will automatically be reflected in
1623 <structfield>tsv</>, without the application having to worry about it.
1624 </para>
1626 <para>
1627 The first trigger argument must be the name of the <type>tsvector</>
1628 column to be updated. The second argument specifies the text search
1629 configuration to be used to perform the conversion. For
1630 <function>tsvector_update_trigger</>, the configuration name is simply
1631 given as the second trigger argument. It must be schema-qualified as
1632 shown above, so that the trigger behavior will not change with changes
1633 in <varname>search_path</>. For
1634 <function>tsvector_update_trigger_column</>, the second trigger argument
1635 is the name of another table column, which must be of type
1636 <type>regconfig</>. This allows a per-row selection of configuration
1637 to be made. The remaining argument(s) are the names of textual columns
1638 (of type <type>text</>, <type>varchar</>, or <type>char</>). These
1639 will be included in the document in the order given. NULL values will
1640 be skipped (but the other columns will still be indexed).
1641 </para>
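<para>
As a sketch of the per-row-configuration variant (the
<literal>messages2</literal> table and its <structfield>config</structfield>
column are hypothetical, chosen only for illustration):

<programlisting>
CREATE TABLE messages2 (
    title text,
    body text,
    config regconfig,  -- text search configuration to use for this row
    tsv tsvector
);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages2 FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger_column(tsv, config, title, body);

INSERT INTO messages2 VALUES('title here', 'the body text is here',
                             'pg_catalog.english');
</programlisting>
</para>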
1643 <para>
1644 A limitation of the built-in triggers is that they treat all the
1645 input columns alike. To process columns differently &mdash; for
1646 example, to weight title differently from body &mdash; it is necessary
1647 to write a custom trigger. Here is an example using
1648 <application>PL/pgSQL</application> as the trigger language:
1650 <programlisting>
1651 CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
1652 begin
1653 new.tsv :=
1654 setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
1655 setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
1656 return new;
1657 end
1658 $$ LANGUAGE plpgsql;
1660 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
1661 ON messages FOR EACH ROW EXECUTE PROCEDURE messages_trigger();
1662 </programlisting>
1663 </para>
1665 <para>
1666 Keep in mind that it is important to specify the configuration name
1667 explicitly when creating <type>tsvector</> values inside triggers,
1668 so that the column's contents will not be affected by changes to
1669 <varname>default_text_search_config</>. Failure to do this is likely to
1670 lead to problems such as search results changing after a dump and reload.
1671 </para>
1673 </sect2>
1675 <sect2 id="textsearch-statistics">
1676 <title>Gathering Document Statistics</title>
1678 <indexterm>
1679 <primary>ts_stat</primary>
1680 </indexterm>
1682 <para>
1683 The function <function>ts_stat</> is useful for checking your
1684 configuration and for finding stop-word candidates.
1685 </para>
1687 <synopsis>
1688 ts_stat(<replaceable class="PARAMETER">sqlquery</replaceable> <type>text</>, <optional> <replaceable class="PARAMETER">weights</replaceable> <type>text</>, </optional> OUT <replaceable class="PARAMETER">word</replaceable> <type>text</>, OUT <replaceable class="PARAMETER">ndoc</replaceable> <type>integer</>, OUT <replaceable class="PARAMETER">nentry</replaceable> <type>integer</>) returns <type>setof record</>
1689 </synopsis>
1691 <para>
1692 <replaceable>sqlquery</replaceable> is a text value containing a SQL
1693 query which must return a single <type>tsvector</type> column.
1694 <function>ts_stat</> executes the query and returns statistics about
1695 each distinct lexeme (word) contained in the <type>tsvector</type>
1696 data. The columns returned are
1698 <itemizedlist spacing="compact" mark="bullet">
1699 <listitem>
1700 <para>
1701 <replaceable>word</> <type>text</> &mdash; the value of a lexeme
1702 </para>
1703 </listitem>
1704 <listitem>
1705 <para>
1706 <replaceable>ndoc</> <type>integer</> &mdash; number of documents
1707 (<type>tsvector</>s) the word occurred in
1708 </para>
1709 </listitem>
1710 <listitem>
1711 <para>
1712 <replaceable>nentry</> <type>integer</> &mdash; total number of
1713 occurrences of the word
1714 </para>
1715 </listitem>
1716 </itemizedlist>
1718 If <replaceable>weights</replaceable> is supplied, only occurrences
1719 having one of those weights are counted.
1720 </para>
1722 <para>
1723 For example, to find the ten most frequent words in a document collection:
1725 <programlisting>
1726 SELECT * FROM ts_stat('SELECT vector FROM apod')
1727 ORDER BY nentry DESC, ndoc DESC, word
1728 LIMIT 10;
1729 </programlisting>
1731 The same, but counting only word occurrences with weight <literal>A</>
1732 or <literal>B</>:
1734 <programlisting>
1735 SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
1736 ORDER BY nentry DESC, ndoc DESC, word
1737 LIMIT 10;
1738 </programlisting>
1739 </para>
1741 </sect2>
1743 </sect1>
1745 <sect1 id="textsearch-parsers">
1746 <title>Parsers</title>
1748 <para>
1749 Text search parsers are responsible for splitting raw document text
1750 into <firstterm>tokens</> and identifying each token's type, where
1751 the set of possible types is defined by the parser itself.
1752 Note that a parser does not modify the text at all &mdash; it simply
1753 identifies plausible word boundaries. Because of this limited scope,
1754 there is less need for application-specific custom parsers than there is
1755 for custom dictionaries. At present <productname>PostgreSQL</productname>
1756 provides just one built-in parser, which has been found to be useful for a
1757 wide range of applications.
1758 </para>
1760 <para>
1761 The built-in parser is named <literal>pg_catalog.default</>.
1762 It recognizes 23 token types:
1763 </para>
1765 <table id="textsearch-default-parser">
1766 <title>Default Parser's Token Types</title>
1767 <tgroup cols="3">
1768 <thead>
1769 <row>
1770 <entry>Alias</entry>
1771 <entry>Description</entry>
1772 <entry>Example</entry>
1773 </row>
1774 </thead>
1775 <tbody>
1776 <row>
1777 <entry><literal>asciiword</></entry>
1778 <entry>Word, all ASCII letters</entry>
1779 <entry><literal>elephant</literal></entry>
1780 </row>
1781 <row>
1782 <entry><literal>word</></entry>
1783 <entry>Word, all letters</entry>
1784 <entry><literal>ma&ntilde;ana</literal></entry>
1785 </row>
1786 <row>
1787 <entry><literal>numword</></entry>
1788 <entry>Word, letters and digits</entry>
1789 <entry><literal>beta1</literal></entry>
1790 </row>
1791 <row>
1792 <entry><literal>asciihword</></entry>
1793 <entry>Hyphenated word, all ASCII</entry>
1794 <entry><literal>up-to-date</literal></entry>
1795 </row>
1796 <row>
1797 <entry><literal>hword</></entry>
1798 <entry>Hyphenated word, all letters</entry>
1799 <entry><literal>l&oacute;gico-matem&aacute;tica</literal></entry>
1800 </row>
1801 <row>
1802 <entry><literal>numhword</></entry>
1803 <entry>Hyphenated word, letters and digits</entry>
1804 <entry><literal>postgresql-beta1</literal></entry>
1805 </row>
1806 <row>
1807 <entry><literal>hword_asciipart</></entry>
1808 <entry>Hyphenated word part, all ASCII</entry>
1809 <entry><literal>postgresql</literal> in the context <literal>postgresql-beta1</literal></entry>
1810 </row>
1811 <row>
1812 <entry><literal>hword_part</></entry>
1813 <entry>Hyphenated word part, all letters</entry>
1814 <entry><literal>l&oacute;gico</literal> or <literal>matem&aacute;tica</literal>
1815 in the context <literal>l&oacute;gico-matem&aacute;tica</literal></entry>
1816 </row>
1817 <row>
1818 <entry><literal>hword_numpart</></entry>
1819 <entry>Hyphenated word part, letters and digits</entry>
1820 <entry><literal>beta1</literal> in the context
1821 <literal>postgresql-beta1</literal></entry>
1822 </row>
1823 <row>
1824 <entry><literal>email</></entry>
1825 <entry>Email address</entry>
1826 <entry><literal>foo@example.com</literal></entry>
1827 </row>
1828 <row>
1829 <entry><literal>protocol</></entry>
1830 <entry>Protocol head</entry>
1831 <entry><literal>http://</literal></entry>
1832 </row>
1833 <row>
1834 <entry><literal>url</></entry>
1835 <entry>URL</entry>
1836 <entry><literal>example.com/stuff/index.html</literal></entry>
1837 </row>
1838 <row>
1839 <entry><literal>host</></entry>
1840 <entry>Host</entry>
1841 <entry><literal>example.com</literal></entry>
1842 </row>
1843 <row>
1844 <entry><literal>url_path</></entry>
1845 <entry>URL path</entry>
1846 <entry><literal>/stuff/index.html</literal>, in the context of a URL</entry>
1847 </row>
1848 <row>
1849 <entry><literal>file</></entry>
1850 <entry>File or path name</entry>
1851 <entry><literal>/usr/local/foo.txt</literal>, if not within a URL</entry>
1852 </row>
1853 <row>
1854 <entry><literal>sfloat</></entry>
1855 <entry>Scientific notation</entry>
1856 <entry><literal>-1.234e56</literal></entry>
1857 </row>
1858 <row>
1859 <entry><literal>float</></entry>
1860 <entry>Decimal notation</entry>
1861 <entry><literal>-1.234</literal></entry>
1862 </row>
1863 <row>
1864 <entry><literal>int</></entry>
1865 <entry>Signed integer</entry>
1866 <entry><literal>-1234</literal></entry>
1867 </row>
1868 <row>
1869 <entry><literal>uint</></entry>
1870 <entry>Unsigned integer</entry>
1871 <entry><literal>1234</literal></entry>
1872 </row>
1873 <row>
1874 <entry><literal>version</></entry>
1875 <entry>Version number</entry>
1876 <entry><literal>8.3.0</literal></entry>
1877 </row>
1878 <row>
1879 <entry><literal>tag</></entry>
1880 <entry>XML tag</entry>
1881 <entry><literal>&lt;a href="dictionaries.html"&gt;</literal></entry>
1882 </row>
1883 <row>
1884 <entry><literal>entity</></entry>
1885 <entry>XML entity</entry>
1886 <entry><literal>&amp;amp;</literal></entry>
1887 </row>
1888 <row>
1889 <entry><literal>blank</></entry>
1890 <entry>Space symbols</entry>
1891 <entry>(any whitespace or punctuation not otherwise recognized)</entry>
1892 </row>
1893 </tbody>
1894 </tgroup>
1895 </table>
1897 <note>
1898 <para>
1899 The parser's notion of a <quote>letter</> is determined by the database's
1900 locale setting, specifically <varname>lc_ctype</>. Words containing
1901 only the basic ASCII letters are reported as a separate token type,
1902 since it is sometimes useful to distinguish them. In most European
1903 languages, token types <literal>word</> and <literal>asciiword</>
1904 should always be treated alike.
1905 </para>
1906 </note>
1908 <para>
1909 It is possible for the parser to produce overlapping tokens from the same
1910 piece of text. As an example, a hyphenated word will be reported both
1911 as the entire word and as each component:
1913 <programlisting>
1914 SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
1915 alias | description | token
1916 -----------------+------------------------------------------+---------------
1917 numhword | Hyphenated word, letters and digits | foo-bar-beta1
1918 hword_asciipart | Hyphenated word part, all ASCII | foo
1919 blank | Space symbols | -
1920 hword_asciipart | Hyphenated word part, all ASCII | bar
1921 blank | Space symbols | -
1922 hword_numpart | Hyphenated word part, letters and digits | beta1
1923 </programlisting>
1925 This behavior is desirable since it allows searches to work for both
1926 the whole compound word and for components. Here is another
1927 instructive example:
1929 <programlisting>
1930 SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
1931 alias | description | token
1932 ----------+---------------+------------------------------
1933 protocol | Protocol head | http://
1934 url | URL | example.com/stuff/index.html
1935 host | Host | example.com
1936 url_path | URL path | /stuff/index.html
1937 </programlisting>
1938 </para>
1940 </sect1>
1942 <sect1 id="textsearch-dictionaries">
1943 <title>Dictionaries</title>
1945 <para>
1946 Dictionaries are used to eliminate words that should not be considered in a
1947 search (<firstterm>stop words</>), and to <firstterm>normalize</> words so
1948 that different derived forms of the same word will match. A successfully
1949 normalized word is called a <firstterm>lexeme</>. Aside from
1950 improving search quality, normalization and removal of stop words reduce the
1951 size of the <type>tsvector</type> representation of a document, thereby
1952 improving performance. Normalization does not always have linguistic meaning
1953 and usually depends on application semantics.
1954 </para>
1956 <para>
1957 Some examples of normalization:
1959 <itemizedlist spacing="compact" mark="bullet">
1961 <listitem>
1962 <para>
1963 Linguistic - Ispell dictionaries try to reduce input words to a
1964 normalized form; stemmer dictionaries remove word endings
1965 </para>
1966 </listitem>
1967 <listitem>
1968 <para>
1969 <acronym>URL</acronym> locations can be canonicalized to make
1970 equivalent URLs match:
1972 <itemizedlist spacing="compact" mark="bullet">
1973 <listitem>
1974 <para>
1975 http://www.pgsql.ru/db/mw/index.html
1976 </para>
1977 </listitem>
1978 <listitem>
1979 <para>
1980 http://www.pgsql.ru/db/mw/
1981 </para>
1982 </listitem>
1983 <listitem>
1984 <para>
1985 http://www.pgsql.ru/db/../db/mw/index.html
1986 </para>
1987 </listitem>
1988 </itemizedlist>
1989 </para>
1990 </listitem>
1991 <listitem>
1992 <para>
1993 Color names can be replaced by their hexadecimal values, e.g.,
1994 <literal>red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</literal>
1995 </para>
1996 </listitem>
1997 <listitem>
1998 <para>
1999 If indexing numbers, we can
2000 remove some fractional digits to reduce the range of possible
2001 numbers, so for example <emphasis>3.14</emphasis>159265359,
2002 <emphasis>3.14</emphasis>15926, <emphasis>3.14</emphasis> will be the same
2003 after normalization if only two digits are kept after the decimal point.
2004 </para>
2005 </listitem>
2006 </itemizedlist>
2008 </para>
2010 <para>
2011 A dictionary is a program that accepts a token as
2012 input and returns:
2013 <itemizedlist spacing="compact" mark="bullet">
2014 <listitem>
2015 <para>
2016 an array of lexemes if the input token is known to the dictionary
2017 (notice that one token can produce more than one lexeme)
2018 </para>
2019 </listitem>
2020 <listitem>
2021 <para>
2022 an empty array if the dictionary knows the token, but it is a stop word
2023 </para>
2024 </listitem>
2025 <listitem>
2026 <para>
2027 <literal>NULL</literal> if the dictionary does not recognize the input token
2028 </para>
2029 </listitem>
2030 </itemizedlist>
2031 </para>
2033 <para>
2034 <productname>PostgreSQL</productname> provides predefined dictionaries for
2035 many languages. There are also several predefined templates that can be
2036 used to create new dictionaries with custom parameters. Each predefined
2037 dictionary template is described below. If no existing
2038 template is suitable, it is possible to create new ones; see the
2039 <filename>contrib/</> area of the <productname>PostgreSQL</> distribution
2040 for examples.
2041 </para>
2043 <para>
2044 A text search configuration binds a parser together with a set of
2045 dictionaries to process the parser's output tokens. For each token
2046 type that the parser can return, a separate list of dictionaries is
2047 specified by the configuration. When a token of that type is found
2048 by the parser, each dictionary in the list is consulted in turn,
2049 until some dictionary recognizes it as a known word. If it is identified
2050 as a stop word, or if no dictionary recognizes the token, it will be
2051 discarded and not indexed or searched for.
2052 The general rule for configuring a list of dictionaries
2053 is to place first the most narrow, most specific dictionary, then the more
2054 general dictionaries, finishing with a very general dictionary, like
2055 a <application>Snowball</> stemmer or <literal>simple</>, which
2056 recognizes everything. For example, for an astronomy-specific search
2057 (<literal>astro_en</literal> configuration) one could bind token type
2058 <type>asciiword</type> (ASCII word) to a synonym dictionary of astronomical
2059 terms, a general English dictionary and a <application>Snowball</> English
2060 stemmer:
2062 <programlisting>
2063 ALTER TEXT SEARCH CONFIGURATION astro_en
2064 ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
2065 </programlisting>
2066 </para>
2068 <sect2 id="textsearch-stopwords">
2069 <title>Stop Words</title>
2071 <para>
2072 Stop words are words that are very common, appear in almost every
2073 document, and have no discrimination value. Therefore, they can be ignored
2074 in the context of full text searching. For example, every English text
2075 contains words like <literal>a</literal> and <literal>the</>, so it is
2076 useless to store them in an index. However, stop words do affect the
2077 positions in <type>tsvector</type>, which in turn affect ranking:
2079 <programlisting>
2080 SELECT to_tsvector('english','in the list of stop words');
2081 to_tsvector
2082 ----------------------------
2083 'list':3 'stop':5 'word':6
2084 </programlisting>
2086 The missing positions 1,2,4 are because of stop words. Ranks
2087 calculated for documents with and without stop words are quite different:
2089 <programlisting>
2090 SELECT ts_rank_cd (to_tsvector('english','in the list of stop words'), to_tsquery('list &amp; stop'));
2091 ts_rank_cd
2092 ------------
2093 0.05
2095 SELECT ts_rank_cd (to_tsvector('english','list stop words'), to_tsquery('list &amp; stop'));
2096 ts_rank_cd
2097 ------------
2098         0.1
2099 </programlisting>
2101 </para>
2103 <para>
2104 It is up to the specific dictionary how it treats stop words. For example,
2105 <literal>ispell</literal> dictionaries first normalize words and then
2106 look at the list of stop words, while <literal>Snowball</literal> stemmers
2107 first check the list of stop words. The reason for the different
2108 behavior is an attempt to decrease noise.
2109 </para>
2111 </sect2>
2113 <sect2 id="textsearch-simple-dictionary">
2114 <title>Simple Dictionary</title>
2116 <para>
2117 The <literal>simple</> dictionary template operates by converting the
2118 input token to lower case and checking it against a file of stop words.
2119 If it is found in the file then an empty array is returned, causing
2120 the token to be discarded. If not, the lower-cased form of the word
2121 is returned as the normalized lexeme. Alternatively, the dictionary
2122 can be configured to report non-stop-words as unrecognized, allowing
2123 them to be passed on to the next dictionary in the list.
2124 </para>
2126 <para>
2127 Here is an example of a dictionary definition using the <literal>simple</>
2128 template:
2130 <programlisting>
2131 CREATE TEXT SEARCH DICTIONARY public.simple_dict (
2132 TEMPLATE = pg_catalog.simple,
2133 STOPWORDS = english
2134 );
2135 </programlisting>
2137 Here, <literal>english</literal> is the base name of a file of stop words.
2138 The file's full name will be
2139 <filename>$SHAREDIR/tsearch_data/english.stop</>,
2140 where <literal>$SHAREDIR</> means the
2141 <productname>PostgreSQL</productname> installation's shared-data directory,
2142 often <filename>/usr/local/share/postgresql</> (use <command>pg_config
2143 --sharedir</> to determine it if you're not sure).
2144 The file format is simply a list
2145 of words, one per line. Blank lines and trailing spaces are ignored,
2146 and upper case is folded to lower case, but no other processing is done
2147 on the file contents.
2148 </para>
2150 <para>
2151 Now we can test our dictionary:
2153 <programlisting>
2154 SELECT ts_lexize('public.simple_dict','YeS');
2155 ts_lexize
2156 -----------
2157 {yes}
2159 SELECT ts_lexize('public.simple_dict','The');
2160 ts_lexize
2161 -----------
2162  {}
2163 </programlisting>
2164 </para>
2166 <para>
2167 We can also choose to return <literal>NULL</>, instead of the lower-cased
2168 word, if it is not found in the stop words file. This behavior is
2169 selected by setting the dictionary's <literal>Accept</> parameter to
2170 <literal>false</>. Continuing the example:
2172 <programlisting>
2173 ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );
2175 SELECT ts_lexize('public.simple_dict','YeS');
2176 ts_lexize
2177 -----------
2180 SELECT ts_lexize('public.simple_dict','The');
2181 ts_lexize
2182 -----------
2183  {}
2184 </programlisting>
2185 </para>
2187 <para>
2188 With the default setting of <literal>Accept</> = <literal>true</>,
2189 it is only useful to place a <literal>simple</> dictionary at the end
2190 of a list of dictionaries, since it will never pass on any token to
2191 a following dictionary. Conversely, <literal>Accept</> = <literal>false</>
2192 is only useful when there is at least one following dictionary.
2193 </para>
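<para>
For instance, here is a sketch of such a setup, in which
<literal>public.simple_dict</> (with <literal>Accept</> =
<literal>false</>) discards stop words and passes all other tokens on to
the stemmer (the configuration name <literal>mycfg</> is illustrative):

<programlisting>
ALTER TEXT SEARCH CONFIGURATION mycfg
    ALTER MAPPING FOR asciiword WITH public.simple_dict, english_stem;
</programlisting>
</para>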
2195 <caution>
2196 <para>
2197 Most types of dictionaries rely on configuration files, such as files of
2198 stop words. These files <emphasis>must</> be stored in UTF-8 encoding.
2199 They will be translated to the actual database encoding, if that is
2200 different, when they are read into the server.
2201 </para>
2202 </caution>
2204 <caution>
2205 <para>
2206 Normally, a database session will read a dictionary configuration file
2207 only once, when it is first used within the session. If you modify a
2208 configuration file and want to force existing sessions to pick up the
2209 new contents, issue an <command>ALTER TEXT SEARCH DICTIONARY</> command
2210 on the dictionary. This can be a <quote>dummy</> update that doesn't
2211 actually change any parameter values.
2212 </para>
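<para>
Such a dummy update can simply re-assign a parameter its current value,
for example (using the dictionary defined above):

<programlisting>
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( StopWords = english );
</programlisting>
</para>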
2213 </caution>
2215 </sect2>
2217 <sect2 id="textsearch-synonym-dictionary">
2218 <title>Synonym Dictionary</title>
2220 <para>
2221 This dictionary template is used to create dictionaries that replace a
2222 word with a synonym. Phrases are not supported (use the thesaurus
2223 template (<xref linkend="textsearch-thesaurus">) for that). A synonym
2224 dictionary can be used to overcome linguistic problems, for example, to
2225 prevent an English stemmer dictionary from reducing the word 'Paris' to
2226 'pari'. It is enough to have a <literal>Paris paris</literal> line in the
2227 synonym dictionary and put it before the <literal>english_stem</>
2228 dictionary. For example:
2230 <programlisting>
2231 SELECT * FROM ts_debug('english', 'Paris');
2232 alias | description | token | dictionaries | dictionary | lexemes
2233 -----------+-----------------+-------+----------------+--------------+---------
2234 asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}
2236 CREATE TEXT SEARCH DICTIONARY my_synonym (
2237 TEMPLATE = synonym,
2238 SYNONYMS = my_synonyms
2239 );
2241 ALTER TEXT SEARCH CONFIGURATION english
2242 ALTER MAPPING FOR asciiword WITH my_synonym, english_stem;
2244 SELECT * FROM ts_debug('english', 'Paris');
2245 alias | description | token | dictionaries | dictionary | lexemes
2246 -----------+-----------------+-------+---------------------------+------------+---------
2247 asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
2248 </programlisting>
2249 </para>
2251 <para>
2252 The only parameter required by the <literal>synonym</> template is
2253 <literal>SYNONYMS</>, which is the base name of its configuration file
2254 &mdash; <literal>my_synonyms</> in the above example.
2255 The file's full name will be
2256 <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</>
2257 (where <literal>$SHAREDIR</> means the
2258 <productname>PostgreSQL</> installation's shared-data directory).
2259 The file format is just one line
2260 per word to be substituted, with the word followed by its synonym,
2261 separated by white space. Blank lines and trailing spaces are ignored.
2262 </para>
2264 <para>
2265 The <literal>synonym</> template also has an optional parameter
2266 <literal>CaseSensitive</>, which defaults to <literal>false</>. When
2267 <literal>CaseSensitive</> is <literal>false</>, words in the synonym file
2268 are folded to lower case, as are input tokens. When it is
2269 <literal>true</>, words and tokens are not folded to lower case,
2270 but are compared as-is.
2271 </para>
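<para>
For example, to make the <literal>my_synonym</> dictionary defined above
compare words without case folding (a sketch; whether this is desirable
depends on the application):

<programlisting>
ALTER TEXT SEARCH DICTIONARY my_synonym ( CaseSensitive = true );
</programlisting>
</para>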
2272 </sect2>
2274 <sect2 id="textsearch-thesaurus">
2275 <title>Thesaurus Dictionary</title>
2277 <para>
2278 A thesaurus dictionary (sometimes abbreviated as <acronym>TZ</acronym>) is
2279 a collection of words that includes information about the relationships
2280 of words and phrases, i.e., broader terms (<acronym>BT</acronym>), narrower
2281 terms (<acronym>NT</acronym>), preferred terms, non-preferred terms, related
2282 terms, etc.
2283 </para>
2285 <para>
2286 Basically a thesaurus dictionary replaces all non-preferred terms by one
2287 preferred term and, optionally, preserves the original terms for indexing
2288 as well. <productname>PostgreSQL</>'s current implementation of the
2289 thesaurus dictionary is an extension of the synonym dictionary with added
2290 <firstterm>phrase</firstterm> support. A thesaurus dictionary requires
2291 a configuration file of the following format:
2293 <programlisting>
2294 # this is a comment
2295 sample word(s) : indexed word(s)
2296 more sample word(s) : more indexed word(s)
2297 ...
2298 </programlisting>
2300 where the colon (<symbol>:</symbol>) acts as a delimiter between a
2301 phrase and its replacement.
2302 </para>
2304 <para>
2305 A thesaurus dictionary uses a <firstterm>subdictionary</firstterm> (which
2306 is specified in the dictionary's configuration) to normalize the input
2307 text before checking for phrase matches. It is only possible to select one
2308 subdictionary. An error is reported if the subdictionary fails to
2309 recognize a word. In that case, you should remove the use of the word or
2310 teach the subdictionary about it. You can place an asterisk
2311 (<symbol>*</symbol>) at the beginning of an indexed word to skip applying
2312 the subdictionary to it, but all sample words <emphasis>must</> be known
2313 to the subdictionary.
2314 </para>
2316 <para>
2317 If multiple phrases match the input, the thesaurus dictionary
2318 chooses the longest match, and ties are broken by using the last
2319 definition.
2320 </para>
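<para>
For illustration, consider a hypothetical thesaurus file containing

<programlisting>
supernovae : sn
supernovae stars : sn star
</programlisting>

Given input containing the phrase <literal>supernovae stars</literal>, the
second, longer definition wins even though the first one also matches; and
if the same sample phrase appeared in two definitions, the later one would
be used.
</para>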
2322 <para>
2323 Specific stop words recognized by the subdictionary cannot be
2324 specified; instead use <literal>?</> to mark the location where any
2325 stop word can appear. For example, assuming that <literal>a</> and
2326 <literal>the</> are stop words according to the subdictionary:
2328 <programlisting>
2329 ? one ? two : swsw
2330 </programlisting>
2332 matches <literal>a one the two</> and <literal>the one a two</>;
2333 both would be replaced by <literal>swsw</>.
2334 </para>
2336 <para>
2337 Since a thesaurus dictionary has the capability to recognize phrases, it
2338 must remember its state and interact with the parser. A thesaurus dictionary
2339 uses the token-type mappings assigned to it in the configuration to decide whether
2340 it should consume the next word or stop phrase accumulation. The thesaurus dictionary must be configured
2341 carefully. For example, if the thesaurus dictionary is assigned to handle
2342 only the <literal>asciiword</literal> token, then a thesaurus dictionary
2343 definition like <literal>one 7</> will not work since token type
2344 <literal>uint</literal> is not assigned to the thesaurus dictionary.
2345 </para>
2347 <caution>
2348 <para>
2349 Thesauruses are used during indexing so any change in the thesaurus
2350 dictionary's parameters <emphasis>requires</emphasis> reindexing.
2351 For most other dictionary types, small changes such as adding or
2352 removing stop words do not force reindexing.
2353 </para>
2354 </caution>
2356 <sect3 id="textsearch-thesaurus-config">
2357 <title>Thesaurus Configuration</title>
2359 <para>
2360 To define a new thesaurus dictionary, use the <literal>thesaurus</>
2361 template. For example:
2363 <programlisting>
2364 CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
2365 TEMPLATE = thesaurus,
2366 DictFile = mythesaurus,
2367 Dictionary = pg_catalog.english_stem
2368 );
2369 </programlisting>
2371 Here:
2372 <itemizedlist spacing="compact" mark="bullet">
2373 <listitem>
2374 <para>
2375 <literal>thesaurus_simple</literal> is the new dictionary's name
2376 </para>
2377 </listitem>
2378 <listitem>
2379 <para>
2380 <literal>mythesaurus</literal> is the base name of the thesaurus
2381 configuration file.
2382 (Its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</>,
2383 where <literal>$SHAREDIR</> means the installation shared-data
2384 directory.)
2385 </para>
2386 </listitem>
2387 <listitem>
2388 <para>
2389 <literal>pg_catalog.english_stem</literal> is the subdictionary (here,
2390 a Snowball English stemmer) to use for thesaurus normalization.
2391 Notice that the subdictionary will have its own
2392 configuration (for example, stop words), which is not shown here.
2393 </para>
2394 </listitem>
2395 </itemizedlist>
2397 Now it is possible to bind the thesaurus dictionary <literal>thesaurus_simple</literal>
2398 to the desired token types in a configuration, for example:
2400 <programlisting>
2401 ALTER TEXT SEARCH CONFIGURATION russian
2402 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart WITH thesaurus_simple;
2403 </programlisting>
2404 </para>
2406 </sect3>
2408 <sect3 id="textsearch-thesaurus-examples">
2409 <title>Thesaurus Example</title>
2411 <para>
2412 Consider a simple astronomical thesaurus <literal>thesaurus_astro</literal>,
2413 which contains some astronomical word combinations:
2415 <programlisting>
2416 supernovae stars : sn
2417 crab nebulae : crab
2418 </programlisting>
2420 Below we create a dictionary and bind some token types to
2421 an astronomical thesaurus and English stemmer:
2423 <programlisting>
2424 CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
2425 TEMPLATE = thesaurus,
2426 DictFile = thesaurus_astro,
2427 Dictionary = english_stem
2428 );
2430 ALTER TEXT SEARCH CONFIGURATION russian
2431 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart WITH thesaurus_astro, english_stem;
2432 </programlisting>
2434 Now we can see how it works.
2435 <function>ts_lexize</function> is not very useful for testing a thesaurus,
2436 because it treats its input as a single token. Instead we can use
2437 <function>plainto_tsquery</function> and <function>to_tsvector</function>
2438 which will break their input strings into multiple tokens:
2440 <programlisting>
2441 SELECT plainto_tsquery('supernova star');
2442 plainto_tsquery
2443 -----------------
2444 'sn'
2446 SELECT to_tsvector('supernova star');
2447 to_tsvector
2448 -------------
2449 'sn':1
2450 </programlisting>
2452 In principle, one can use <function>to_tsquery</function> if you quote
2453 the argument:
2455 <programlisting>
2456 SELECT to_tsquery('''supernova star''');
2457 to_tsquery
2458 ------------
2459 'sn'
2460 </programlisting>
2462 Notice that <literal>supernova star</literal> matches <literal>supernovae
2463 stars</literal> in <literal>thesaurus_astro</literal> because we specified
2464 the <literal>english_stem</literal> stemmer in the thesaurus definition.
2465 The stemmer removed the <literal>e</> and <literal>s</>.
2466 </para>
2468 <para>
2469 To index the original phrase as well as the substitute, just include it
2470 in the right-hand part of the definition:
2472 <programlisting>
2473 supernovae stars : sn supernovae stars
2475 SELECT plainto_tsquery('supernova star');
2476 plainto_tsquery
2477 -----------------------------
2478 'sn' &amp; 'supernova' &amp; 'star'
2479 </programlisting>
2480 </para>
2482 </sect3>
2484 </sect2>
2486 <sect2 id="textsearch-ispell-dictionary">
2487 <title><application>Ispell</> Dictionary</title>
2489 <para>
2490 The <application>Ispell</> dictionary template supports
2491 <firstterm>morphological dictionaries</>, which can normalize many
2492 different linguistic forms of a word into the same lexeme. For example,
2493 an English <application>Ispell</> dictionary can match all declensions and
2494 conjugations of the search term <literal>bank</literal>, e.g.
2495 <literal>banking</>, <literal>banked</>, <literal>banks</>,
2496 <literal>banks'</>, and <literal>bank's</>.
2497 </para>
2499 <para>
2500 The standard <productname>PostgreSQL</productname> distribution does
2501 not include any <application>Ispell</> configuration files.
2502 Dictionaries for a large number of languages are available from <ulink
2503 url="http://ficus-www.cs.ucla.edu/geoff/ispell.html">Ispell</ulink>.
2504 Also, some more modern dictionary file formats are supported &mdash; <ulink
2505 url="http://en.wikipedia.org/wiki/MySpell">MySpell</ulink> (OpenOffice.org &lt; 2.0.1)
2506 and <ulink url="http://sourceforge.net/projects/hunspell">Hunspell</ulink>
2507 (OpenOffice.org &gt;= 2.0.2). A large list of dictionaries is available on the <ulink
2508 url="http://wiki.services.openoffice.org/wiki/Dictionaries">OpenOffice
2509 Wiki</ulink>.
2510 </para>
2512 <para>
2513 To create an <application>Ispell</> dictionary, use the built-in
2514 <literal>ispell</literal> template and specify several parameters:
2515 </para>
2517 <programlisting>
2518 CREATE TEXT SEARCH DICTIONARY english_ispell (
2519 TEMPLATE = ispell,
2520 DictFile = english,
2521 AffFile = english,
2522 StopWords = english
2523 );
2524 </programlisting>
2526 <para>
2527 Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
2528 specify the base names of the dictionary, affixes, and stop-words files.
2529 The stop-words file has the same format explained above for the
2530 <literal>simple</> dictionary type. The format of the other files is
2531 not specified here but is available from the above-mentioned web sites.
2532 </para>
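<para>
Once the corresponding dictionary, affix, and stop-word files are
installed, the dictionary can be tested with
<function>ts_lexize</function>. Assuming the
<literal>english_ispell</literal> dictionary above, output along these
lines can be expected:

<programlisting>
SELECT ts_lexize('english_ispell', 'banked');
 ts_lexize
-----------
 {bank}
</programlisting>
</para>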
2534 <para>
2535 Ispell dictionaries usually recognize a limited set of words, so they
2536 should be followed by another broader dictionary; for
2537 example, a Snowball dictionary, which recognizes everything.
2538 </para>
2540 <para>
2541 Ispell dictionaries support splitting compound words, and
2542 <productname>PostgreSQL</productname> supports this useful feature.
2544 Notice that the affix file should specify a special flag using the
2545 <literal>compoundwords controlled</literal> statement that marks dictionary
2546 words that can participate in compound formation:
2548 <programlisting>
2549 compoundwords controlled z
2550 </programlisting>
2552 Here are some examples for the Norwegian language:
2554 <programlisting>
2555 SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
2556 {over,buljong,terning,pakk,mester,assistent}
2557 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
2558 {sjokoladefabrikk,sjokolade,fabrikk}
2559 </programlisting>
2560 </para>
2562 <note>
2563 <para>
2564 <application>MySpell</> does not support compound words.
2565 <application>Hunspell</> has sophisticated support for compound words. At
2566 present, <productname>PostgreSQL</productname> implements only the basic
2567 compound word operations of Hunspell.
2568 </para>
2569 </note>
2571 </sect2>
2573 <sect2 id="textsearch-snowball-dictionary">
2574 <title><application>Snowball</> Dictionary</title>
2576 <para>
2577 The <application>Snowball</> dictionary template is based on the project
2578 of Martin Porter, inventor of the popular Porter's stemming algorithm
2579 for the English language. Snowball now provides stemming algorithms for
2580 many languages (see the <ulink url="http://snowball.tartarus.org">Snowball
2581 site</ulink> for more information). Each algorithm understands how to
2582 reduce common variant forms of words to a base, or stem, spelling within
2583 its language. A Snowball dictionary requires a <literal>language</>
2584 parameter to identify which stemmer to use, and optionally can specify a
2585 <literal>stopword</> file name that gives a list of words to eliminate.
2586 (<productname>PostgreSQL</productname>'s standard stopword lists are also
2587 provided by the Snowball project.)
2588 For example, there is a built-in definition equivalent to
2590 <programlisting>
2591 CREATE TEXT SEARCH DICTIONARY english_stem (
2592 TEMPLATE = snowball,
2593 Language = english,
2594 StopWords = english
2595 );
2596 </programlisting>
2598 The stopword file format is the same as already explained.
2599 </para>
2601 <para>
2602 A <application>Snowball</> dictionary recognizes everything, whether
2603 or not it is able to simplify the word, so it should be placed
2604 at the end of the dictionary list. It is useless to have it
2605 before any other dictionary because a token will never pass through it to
2606 the next dictionary.
2607 </para>
2609 </sect2>
2611 </sect1>
2613 <sect1 id="textsearch-configuration">
2614 <title>Configuration Example</title>
2616 <para>
2617 A text search configuration specifies all options necessary to transform a
2618 document into a <type>tsvector</type>: the parser to use to break text
2619 into tokens, and the dictionaries to use to transform each token into a
2620 lexeme. Every call of
2621 <function>to_tsvector</function> or <function>to_tsquery</function>
2622 needs a text search configuration to perform its processing.
2623 The configuration parameter
2624 <xref linkend="guc-default-text-search-config">
2625 specifies the name of the default configuration, which is the
2626 one used by text search functions if an explicit configuration
2627 parameter is omitted.
2628 It can be set in <filename>postgresql.conf</filename>, or set for an
2629 individual session using the <command>SET</> command.
2630 </para>
2632 <para>
2633 Several predefined text search configurations are available, and
2634 you can create custom configurations easily. To facilitate management
2635 of text search objects, a set of <acronym>SQL</acronym> commands
2636 is available, and there are several <application>psql</application> commands that display information
2637 about text search objects (<xref linkend="textsearch-psql">).
2638 </para>
2640 <para>
2641 As an example, we will create a configuration
2642 <literal>pg</literal>, starting from a duplicate of the built-in
2643 <literal>english</> configuration.
2645 <programlisting>
2646 CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english );
2647 </programlisting>
2648 </para>
2650 <para>
2651 We will use a PostgreSQL-specific synonym list
2652 and store it in <filename>$SHAREDIR/tsearch_data/pg_dict.syn</filename>.
2653 The file contents look like:
2655 <programlisting>
2656 postgres pg
2657 pgsql pg
2658 postgresql pg
2659 </programlisting>
2661 We define the synonym dictionary like this:
2663 <programlisting>
2664 CREATE TEXT SEARCH DICTIONARY pg_dict (
2665 TEMPLATE = synonym,
2666 SYNONYMS = pg_dict
2667 );
2668 </programlisting>
2670 Next we register the <productname>Ispell</> dictionary
2671 <literal>english_ispell</literal>, which has its own configuration files:
2673 <programlisting>
2674 CREATE TEXT SEARCH DICTIONARY english_ispell (
2675 TEMPLATE = ispell,
2676 DictFile = english,
2677 AffFile = english,
2678 StopWords = english
2679 );
2680 </programlisting>
2682 Now we can set up the mappings for words in configuration
2683 <literal>pg</>:
2685 <programlisting>
2686 ALTER TEXT SEARCH CONFIGURATION pg
2687 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
2688 word, hword, hword_part
2689 WITH pg_dict, english_ispell, english_stem;
2690 </programlisting>
2692 We choose not to index or search some token types that the built-in
2693 configuration does handle:
2695 <programlisting>
2696 ALTER TEXT SEARCH CONFIGURATION pg
2697 DROP MAPPING FOR email, url, url_path, sfloat, float;
2698 </programlisting>
2699 </para>
2701 <para>
2702 Now we can test our configuration:
2704 <programlisting>
2705 SELECT * FROM ts_debug('public.pg', '
2706 PostgreSQL, the highly scalable, SQL compliant, open source object-relational
2707 database management system, is now undergoing beta testing of the next
2708 version of our software.
2709 ');
2710 </programlisting>
2711 </para>
2713 <para>
2714 The next step is to set the session to use the new configuration, which was
2715 created in the <literal>public</> schema:
2717 <programlisting>
2718 =&gt; \dF
2719 List of text search configurations
2720 Schema | Name | Description
2721 ---------+------+-------------
2722 public | pg |
2724 SET default_text_search_config = 'public.pg';
2726 SET
2727 SHOW default_text_search_config;
2728 default_text_search_config
2729 ----------------------------
2730 public.pg
2731 </programlisting>
2732 </para>
2734 </sect1>
2736 <sect1 id="textsearch-debugging">
2737 <title>Testing and Debugging Text Search</title>
2739 <para>
2740 The behavior of a custom text search configuration can easily become
2741 complicated enough to be confusing or undesirable. The functions described
2742 in this section are useful for testing text search objects. You can
2743 test a complete configuration, or test parsers and dictionaries separately.
2744 </para>
2746 <sect2 id="textsearch-configuration-testing">
2747 <title>Configuration Testing</title>
2749 <para>
2750 The function <function>ts_debug</function> allows easy testing of a
2751 text search configuration.
2752 </para>
2754 <indexterm>
2755 <primary>ts_debug</primary>
2756 </indexterm>
2758 <synopsis>
2759 ts_debug(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>,
2760 OUT <replaceable class="PARAMETER">alias</> <type>text</>,
2761 OUT <replaceable class="PARAMETER">description</> <type>text</>,
2762 OUT <replaceable class="PARAMETER">token</> <type>text</>,
2763 OUT <replaceable class="PARAMETER">dictionaries</> <type>regdictionary[]</>,
2764 OUT <replaceable class="PARAMETER">dictionary</> <type>regdictionary</>,
2765 OUT <replaceable class="PARAMETER">lexemes</> <type>text[]</>)
2766 returns setof record
2767 </synopsis>
2769 <para>
2770 <function>ts_debug</> displays information about every token of
2771 <replaceable class="PARAMETER">document</replaceable> as produced by the
2772 parser and processed by the configured dictionaries. It uses the
2773 configuration specified by <replaceable
2774 class="PARAMETER">config</replaceable>,
2775 or <varname>default_text_search_config</varname> if that argument is
2776 omitted.
2777 </para>
2779 <para>
2780 <function>ts_debug</> returns one row for each token identified in the text
2781 by the parser. The columns returned are
2783 <itemizedlist spacing="compact" mark="bullet">
2784 <listitem>
2785 <para>
2786 <replaceable>alias</> <type>text</> &mdash; short name of the token type
2787 </para>
2788 </listitem>
2789 <listitem>
2790 <para>
2791 <replaceable>description</> <type>text</> &mdash; description of the
2792 token type
2793 </para>
2794 </listitem>
2795 <listitem>
2796 <para>
2797 <replaceable>token</> <type>text</> &mdash; text of the token
2798 </para>
2799 </listitem>
2800 <listitem>
2801 <para>
2802 <replaceable>dictionaries</> <type>regdictionary[]</> &mdash; the
2803 dictionaries selected by the configuration for this token type
2804 </para>
2805 </listitem>
2806 <listitem>
2807 <para>
2808 <replaceable>dictionary</> <type>regdictionary</> &mdash; the dictionary
2809 that recognized the token, or <literal>NULL</> if none did
2810 </para>
2811 </listitem>
2812 <listitem>
2813 <para>
2814 <replaceable>lexemes</> <type>text[]</> &mdash; the lexeme(s) produced
2815 by the dictionary that recognized the token, or <literal>NULL</> if
2816 none did; an empty array (<literal>{}</>) means it was recognized as a
2817 stop word
2818 </para>
2819 </listitem>
2820 </itemizedlist>
2821 </para>
2823 <para>
2824 Here is a simple example:
2826 <programlisting>
2827 SELECT * FROM ts_debug('english','a fat cat sat on a mat - it ate a fat rats');
2828 alias | description | token | dictionaries | dictionary | lexemes
2829 -----------+-----------------+-------+----------------+--------------+---------
2830 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
2831 blank | Space symbols | | {} | |
2832 asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat}
2833 blank | Space symbols | | {} | |
2834 asciiword | Word, all ASCII | cat | {english_stem} | english_stem | {cat}
2835 blank | Space symbols | | {} | |
2836 asciiword | Word, all ASCII | sat | {english_stem} | english_stem | {sat}
2837 blank | Space symbols | | {} | |
2838 asciiword | Word, all ASCII | on | {english_stem} | english_stem | {}
2839 blank | Space symbols | | {} | |
2840 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
2841 blank | Space symbols | | {} | |
2842 asciiword | Word, all ASCII | mat | {english_stem} | english_stem | {mat}
2843 blank | Space symbols | | {} | |
2844 blank | Space symbols | - | {} | |
2845 asciiword | Word, all ASCII | it | {english_stem} | english_stem | {}
2846 blank | Space symbols | | {} | |
2847 asciiword | Word, all ASCII | ate | {english_stem} | english_stem | {ate}
2848 blank | Space symbols | | {} | |
2849 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
2850 blank | Space symbols | | {} | |
2851 asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat}
2852 blank | Space symbols | | {} | |
2853 asciiword | Word, all ASCII | rats | {english_stem} | english_stem | {rat}
2854 </programlisting>
2855 </para>
2857 <para>
2858 For a more extensive demonstration, we
2859 first create a <literal>public.english</literal> configuration and
2860 Ispell dictionary for the English language:
2861 </para>
2863 <programlisting>
2864 CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );
2866 CREATE TEXT SEARCH DICTIONARY english_ispell (
2867 TEMPLATE = ispell,
2868 DictFile = english,
2869 AffFile = english,
2870 StopWords = english
2871 );
2873 ALTER TEXT SEARCH CONFIGURATION public.english
2874 ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
2875 </programlisting>
2877 <programlisting>
2878 SELECT * FROM ts_debug('public.english','The Brightest supernovaes');
2879 alias | description | token | dictionaries | dictionary | lexemes
2880 -----------+-----------------+-------------+-------------------------------+----------------+-------------
2881 asciiword | Word, all ASCII | The | {english_ispell,english_stem} | english_ispell | {}
2882 blank | Space symbols | | {} | |
2883 asciiword | Word, all ASCII | Brightest | {english_ispell,english_stem} | english_ispell | {bright}
2884 blank | Space symbols | | {} | |
2885 asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem | {supernova}
2886 </programlisting>
2888 <para>
2889 In this example, the word <literal>Brightest</> was recognized by the
2890 parser as an <literal>ASCII word</literal> (alias <literal>asciiword</literal>).
2891 For this token type the dictionary list is
2892 <literal>english_ispell</> and
2893 <literal>english_stem</literal>. The word was recognized by
2894 <literal>english_ispell</literal>, which normalized it to the word
2895 <literal>bright</literal>. The word <literal>supernovaes</literal> is
2896 unknown to the <literal>english_ispell</literal> dictionary so it
2897 was passed to the next dictionary, and, fortunately, was recognized (in
2898 fact, <literal>english_stem</literal> is a Snowball dictionary which
2899 recognizes everything; that is why it was placed at the end of the
2900 dictionary list).
2901 </para>
2903 <para>
2904 The word <literal>The</literal> was recognized by the
2905 <literal>english_ispell</literal> dictionary as a stop word (<xref
2906 linkend="textsearch-stopwords">) and will not be indexed.
2907 The spaces are discarded too, since the configuration provides no
2908 dictionaries at all for them.
2909 </para>
2911 <para>
2912 You can reduce the volume of output by explicitly specifying which columns
2913 you want to see:
2915 <programlisting>
2916 SELECT alias, token, dictionary, lexemes
2917 FROM ts_debug('public.english','The Brightest supernovaes');
2918 alias | token | dictionary | lexemes
2919 -----------+-------------+----------------+-------------
2920 asciiword | The | english_ispell | {}
2921 blank | | |
2922 asciiword | Brightest | english_ispell | {bright}
2923 blank | | |
2924 asciiword | supernovaes | english_stem | {supernova}
2925 </programlisting>
2926 </para>
2928 </sect2>
2930 <sect2 id="textsearch-parser-testing">
2931 <title>Parser Testing</title>
2933 <para>
2934 The following functions allow direct testing of a text search parser.
2935 </para>
2937 <indexterm>
2938 <primary>ts_parse</primary>
2939 </indexterm>
2941 <synopsis>
2942 ts_parse(<replaceable class="PARAMETER">parser_name</replaceable> <type>text</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
2943 ts_parse(<replaceable class="PARAMETER">parser_oid</replaceable> <type>oid</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
2944 </synopsis>
2946 <para>
2947 <function>ts_parse</> parses the given <replaceable>document</replaceable>
2948 and returns a series of records, one for each token produced by
2949 parsing. Each record includes a <varname>tokid</varname> showing the
2950 assigned token type and a <varname>token</varname> which is the text of the
2951 token. For example:
2953 <programlisting>
2954 SELECT * FROM ts_parse('default', '123 - a number');
2955 tokid | token
2956 -------+--------
2957 22 | 123
2958 12 |
2959 12 | -
2960 1 | a
2961 12 |
2962 1 | number
2963 </programlisting>
2964 </para>
2966 <indexterm>
2967 <primary>ts_token_type</primary>
2968 </indexterm>
2970 <synopsis>
2971 ts_token_type(<replaceable class="PARAMETER">parser_name</> <type>text</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
2972 ts_token_type(<replaceable class="PARAMETER">parser_oid</> <type>oid</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
2973 </synopsis>
2975 <para>
2976 <function>ts_token_type</> returns a table which describes each type of
2977 token the specified parser can recognize. For each token type, the table
2978 gives the integer <varname>tokid</varname> that the parser uses to label a
2979 token of that type, the <varname>alias</varname> that names the token type
2980 in configuration commands, and a short <varname>description</varname>. For
2981 example:
2983 <programlisting>
2984 SELECT * FROM ts_token_type('default');
2985 tokid | alias | description
2986 -------+-----------------+------------------------------------------
2987 1 | asciiword | Word, all ASCII
2988 2 | word | Word, all letters
2989 3 | numword | Word, letters and digits
2990 4 | email | Email address
2991 5 | url | URL
2992 6 | host | Host
2993 7 | sfloat | Scientific notation
2994 8 | version | Version number
2995 9 | hword_numpart | Hyphenated word part, letters and digits
2996 10 | hword_part | Hyphenated word part, all letters
2997 11 | hword_asciipart | Hyphenated word part, all ASCII
2998 12 | blank | Space symbols
2999 13 | tag | XML tag
3000 14 | protocol | Protocol head
3001 15 | numhword | Hyphenated word, letters and digits
3002 16 | asciihword | Hyphenated word, all ASCII
3003 17 | hword | Hyphenated word, all letters
3004 18 | url_path | URL path
3005 19 | file | File or path name
3006 20 | float | Decimal notation
3007 21 | int | Signed integer
3008 22 | uint | Unsigned integer
3009 23 | entity | XML entity
3010 </programlisting>
3011 </para>
3013 </sect2>
3015 <sect2 id="textsearch-dictionary-testing">
3016 <title>Dictionary Testing</title>
3018 <para>
3019 The <function>ts_lexize</> function facilitates dictionary testing.
3020 </para>
3022 <indexterm>
3023 <primary>ts_lexize</primary>
3024 </indexterm>
3026 <synopsis>
3027 ts_lexize(<replaceable class="PARAMETER">dict</replaceable> <type>regdictionary</>, <replaceable class="PARAMETER">token</replaceable> <type>text</>) returns <type>text[]</>
3028 </synopsis>
3030 <para>
3031 <function>ts_lexize</> returns an array of lexemes if the input
3032 <replaceable>token</replaceable> is known to the dictionary,
3033 or an empty array if the token
3034 is known to the dictionary but it is a stop word, or
3035 <literal>NULL</literal> if it is an unknown word.
3036 </para>
3038 <para>
3039 Examples:
3041 <programlisting>
3042 SELECT ts_lexize('english_stem', 'stars');
3043 ts_lexize
3044 -----------
3045 {star}
3047 SELECT ts_lexize('english_stem', 'a');
3048 ts_lexize
3049 -----------
3050  {}
3051 </programlisting>
3052 </para>
3054 <note>
3055 <para>
3056 The <function>ts_lexize</function> function expects a single
3057 <emphasis>token</emphasis>, not text. Here is a case
3058 where this can be confusing:
3060 <programlisting>
3061 SELECT ts_lexize('thesaurus_astro','supernovae stars') is null;
3062 ?column?
3063 ----------
3064  t
3065 </programlisting>
3067 The thesaurus dictionary <literal>thesaurus_astro</literal> does know the
3068 phrase <literal>supernovae stars</literal>, but <function>ts_lexize</>
3069 fails since it does not parse the input text but treats it as a single
3070 token. Use <function>plainto_tsquery</> or <function>to_tsvector</> to
3071 test thesaurus dictionaries, for example:
3073 <programlisting>
3074 SELECT plainto_tsquery('supernovae stars');
3075 plainto_tsquery
3076 -----------------
3077 'sn'
3078 </programlisting>
3079 </para>
3080 </note>
3082 </sect2>
3084 </sect1>
3086 <sect1 id="textsearch-indexes">
3087 <title>GiST and GIN Index Types</title>
3089 <indexterm zone="textsearch-indexes">
3090 <primary>text search</primary>
3091 <secondary>indexes</secondary>
3092 </indexterm>
3094 <para>
3095 There are two kinds of indexes that can be used to speed up full text
3096 searches.
3097 Note that indexes are not mandatory for full text searching, but in
3098 cases where a column is searched on a regular basis, an index will
3099 usually be desirable.
3101 <variablelist>
3103 <varlistentry>
3105 <indexterm zone="textsearch-indexes">
3106 <primary>index</primary>
3107 <secondary>GiST</secondary>
3108 <tertiary>text search</tertiary>
3109 </indexterm>
3111 <term>
3112 <synopsis>
3113 CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gist(<replaceable>column</replaceable>);
3114 </synopsis>
3115 </term>
3117 <listitem>
3118 <para>
3119 Creates a GiST (Generalized Search Tree)-based index.
3120 The <replaceable>column</replaceable> can be of <type>tsvector</> or
3121 <type>tsquery</> type.
3122 </para>
3123 </listitem>
3124 </varlistentry>
3126 <varlistentry>
3128 <indexterm zone="textsearch-indexes">
3129 <primary>index</primary>
3130 <secondary>GIN</secondary>
3131 <tertiary>text search</tertiary>
3132 </indexterm>
3134 <term>
3135 <synopsis>
3136 CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gin(<replaceable>column</replaceable>);
3137 </synopsis>
3138 </term>
3140 <listitem>
3141 <para>
3142 Creates a GIN (Generalized Inverted Index)-based index.
3143 The <replaceable>column</replaceable> must be of <type>tsvector</> type.
3144 </para>
3145 </listitem>
3146 </varlistentry>
3148 </variablelist>
3149 </para>
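 <para>
  As an illustration only (the table and column names here are
  hypothetical), either index type is created on a precomputed
  <type>tsvector</> column and is then usable by <literal>@@</>
  searches:

<programlisting>
CREATE TABLE docs (id serial PRIMARY KEY, body text, textsearchable tsvector);

-- GIN variant; use USING gist(textsearchable) for a GiST index instead
CREATE INDEX docs_textsearch_idx ON docs USING gin(textsearchable);

SELECT id FROM docs WHERE textsearchable @@ to_tsquery('star &amp; supernova');
</programlisting>
 </para>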
3151 <para>
3152 There are substantial performance differences between the two index types,
3153 so it is important to understand which to use.
3154 </para>
3156 <para>
3157 A GiST index is <firstterm>lossy</firstterm>, meaning that the index
3158 may produce false matches, and it is necessary
3159 to check the actual table row to eliminate such false matches.
3160 (<productname>PostgreSQL</productname> does this automatically when needed.)
3161 GiST indexes are lossy because each document is represented in the
3162 index by a fixed-length signature. The signature is generated by hashing
3163 each word into a random bit in an n-bit string, with all these bits OR-ed
3164 together to produce an n-bit document signature. When two words hash to
3165 the same bit position there will be a false match. If all words in
3166 the query have matches (real or false) then the table row must be
3167 retrieved to see if the match is correct.
3168 </para>
3170 <para>
3171 Lossiness causes performance degradation due to useless fetches of table
3172 records that turn out to be false matches. Since random access to table
3173 records is slow, this limits the usefulness of GiST indexes. The
3174 likelihood of false matches depends on several factors, in particular the
3175 number of unique words, so using dictionaries to reduce this number is
3176 recommended.
3177 </para>
3179 <para>
3180 GIN indexes are not lossy for standard queries, but their performance
3181 depends logarithmically on the number of unique words.
3182 (However, GIN indexes store only the words (lexemes) of <type>tsvector</>
3183 values, and not their weight labels. Thus a table row recheck is needed
3184 when using a query that involves weights.)
3185 </para>
3187 <para>
3188 In choosing which index type to use, GiST or GIN, consider these
3189 performance differences:
3191 <itemizedlist spacing="compact" mark="bullet">
3192 <listitem>
3193 <para>
3194 GIN index lookups are about three times faster than GiST
3195 </para>
3196 </listitem>
3197 <listitem>
3198 <para>
3199 GIN indexes take about three times longer to build than GiST
3200 </para>
3201 </listitem>
3202 <listitem>
3203 <para>
3204 GIN indexes are about ten times slower to update than GiST
3205 </para>
3206 </listitem>
3207 <listitem>
3208 <para>
3209 GIN indexes are two-to-three times larger than GiST
3210 </para>
3211 </listitem>
3212 </itemizedlist>
3213 </para>
3215 <para>
3216 As a rule of thumb, <acronym>GIN</acronym> indexes are best for static data
3217 because lookups are faster. <acronym>GiST</acronym> indexes are faster to
3218 update and thus better suited to dynamic data: they remain fast so long as
3219 the number of unique words (lexemes) stays under about 100,000, while
3220 <acronym>GIN</acronym> indexes handle 100,000+ lexemes better but are
3221 slower to update.
3222 </para>
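 <para>
  The number of unique lexemes actually present in a column can be
  checked with the <function>ts_stat</> function; for example, reusing
  the hypothetical table from above:

<programlisting>
SELECT count(*) FROM ts_stat('SELECT textsearchable FROM docs');
</programlisting>
 </para>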
3224 <para>
3225 Note that <acronym>GIN</acronym> index build time can often be improved
3226 by increasing <xref linkend="guc-maintenance-work-mem">, while
3227 <acronym>GiST</acronym> index build time is not sensitive to that
3228 parameter.
3229 </para>
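 <para>
  For example, one might raise the parameter just for the duration of
  the index build (the value shown is only a plausible choice, not a
  recommendation):

<programlisting>
SET maintenance_work_mem = '256MB';
CREATE INDEX docs_textsearch_idx ON docs USING gin(textsearchable);
RESET maintenance_work_mem;
</programlisting>
 </para>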
3231 <para>
3232 Partitioning of big collections and the proper use of GiST and GIN indexes
3233 allow very fast searches with online update.
3234 Partitioning can be done at the database level using table inheritance
3235 and <varname>constraint_exclusion</>, or by distributing documents over
3236 servers and collecting search results using the <filename>contrib/dblink</>
3237 extension module. The latter is possible because ranking functions use
3238 only local information.
3239 </para>
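 <para>
  As a minimal sketch of database-level partitioning (all names here are
  hypothetical), each child table carries a <literal>CHECK</> constraint
  on the partitioning key and its own full text index:

<programlisting>
CREATE TABLE messages (posted date, body text, textsearchable tsvector);

CREATE TABLE messages_2007 (
    CHECK (posted &gt;= '2007-01-01' AND posted &lt; '2008-01-01')
) INHERITS (messages);

CREATE INDEX messages_2007_idx ON messages_2007 USING gin(textsearchable);

-- with constraint_exclusion enabled, a search constrained by the
-- partitioning key scans only the relevant child tables
SET constraint_exclusion = on;
SELECT posted, body FROM messages
WHERE textsearchable @@ to_tsquery('bug &amp; report')
  AND posted &gt;= '2007-01-01' AND posted &lt; '2008-01-01';
</programlisting>
 </para>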
3241 </sect1>
3243 <sect1 id="textsearch-psql">
3244 <title><application>psql</> Support</title>
3246 <para>
3247 Information about text search configuration objects can be obtained
3248 in <application>psql</application> using a set of commands:
3249 <synopsis>
3250 \dF{d,p,t}<optional>+</optional> <optional>PATTERN</optional>
3251 </synopsis>
3252 An optional <literal>+</literal> produces more details.
3253 </para>
3255 <para>
3256 The optional parameter <literal>PATTERN</literal> should be the name of
3257 a text search object, optionally schema-qualified. If
3258 <literal>PATTERN</literal> is omitted then information about all
3259 visible objects will be displayed. <literal>PATTERN</literal> can be a
3260 regular expression and can provide <emphasis>separate</emphasis> patterns
3261 for the schema and object names. The following examples illustrate this:
3263 <programlisting>
3264 =&gt; \dF *fulltext*
3265 List of text search configurations
3266 Schema | Name | Description
3267 --------+--------------+-------------
3268 public | fulltext_cfg |
3269 </programlisting>
3271 <programlisting>
3272 =&gt; \dF *.fulltext*
3273 List of text search configurations
3274 Schema | Name | Description
3275 ----------+--------------+-------------
3276 fulltext | fulltext_cfg |
3277 public | fulltext_cfg |
3278 </programlisting>
3280 The available commands are:
3281 </para>
3283 <variablelist>
3285 <varlistentry>
3286 <term><synopsis>\dF<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3288 <listitem>
3289 <para>
3290 List text search configurations (add <literal>+</> for more detail).
3291 </para>
3293 <para>
3295 <programlisting>
3296 =&gt; \dF russian
3297 List of text search configurations
3298 Schema | Name | Description
3299 ------------+---------+------------------------------------
3300 pg_catalog | russian | configuration for russian language
3302 =&gt; \dF+ russian
3303 Text search configuration "pg_catalog.russian"
3304 Parser: "pg_catalog.default"
3305 Token | Dictionaries
3306 -----------------+--------------
3307 asciihword | english_stem
3308 asciiword | english_stem
3309 email | simple
3310 file | simple
3311 float | simple
3312 host | simple
3313 hword | russian_stem
3314 hword_asciipart | english_stem
3315 hword_numpart | simple
3316 hword_part | russian_stem
3317 int | simple
3318 numhword | simple
3319 numword | simple
3320 sfloat | simple
3321 uint | simple
3322 url | simple
3323 url_path | simple
3324 version | simple
3325 word | russian_stem
3326 </programlisting>
3327 </para>
3328 </listitem>
3329 </varlistentry>
3331 <varlistentry>
3332 <term><synopsis>\dFd<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3333 <listitem>
3334 <para>
3335 List text search dictionaries (add <literal>+</> for more detail).
3336 </para>
3338 <para>
3339 <programlisting>
3340 =&gt; \dFd
3341 List of text search dictionaries
3342 Schema | Name | Description
3343 ------------+-----------------+-----------------------------------------------------------
3344 pg_catalog | danish_stem | snowball stemmer for danish language
3345 pg_catalog | dutch_stem | snowball stemmer for dutch language
3346 pg_catalog | english_stem | snowball stemmer for english language
3347 pg_catalog | finnish_stem | snowball stemmer for finnish language
3348 pg_catalog | french_stem | snowball stemmer for french language
3349 pg_catalog | german_stem | snowball stemmer for german language
3350 pg_catalog | hungarian_stem | snowball stemmer for hungarian language
3351 pg_catalog | italian_stem | snowball stemmer for italian language
3352 pg_catalog | norwegian_stem | snowball stemmer for norwegian language
3353 pg_catalog | portuguese_stem | snowball stemmer for portuguese language
3354 pg_catalog | romanian_stem | snowball stemmer for romanian language
3355 pg_catalog | russian_stem | snowball stemmer for russian language
3356 pg_catalog | simple | simple dictionary: just lower case and check for stopword
3357 pg_catalog | spanish_stem | snowball stemmer for spanish language
3358 pg_catalog | swedish_stem | snowball stemmer for swedish language
3359 pg_catalog | turkish_stem | snowball stemmer for turkish language
3360 </programlisting>
3361 </para>
3362 </listitem>
3363 </varlistentry>
3365 <varlistentry>
3367 <term><synopsis>\dFp<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3368 <listitem>
3369 <para>
3370 List text search parsers (add <literal>+</> for more detail).
3371 </para>
3373 <para>
3374 <programlisting>
3375 =&gt; \dFp
3376 List of text search parsers
3377 Schema | Name | Description
3378 ------------+---------+---------------------
3379 pg_catalog | default | default word parser
3380 =&gt; \dFp+
3381 Text search parser "pg_catalog.default"
3382 Method | Function | Description
3383 -----------------+----------------+-------------
3384 Start parse | prsd_start |
3385 Get next token | prsd_nexttoken |
3386 End parse | prsd_end |
3387 Get headline | prsd_headline |
3388 Get token types | prsd_lextype |
3390 Token types for parser "pg_catalog.default"
3391 Token name | Description
3392 -----------------+------------------------------------------
3393 asciihword | Hyphenated word, all ASCII
3394 asciiword | Word, all ASCII
3395 blank | Space symbols
3396 email | Email address
3397 entity | XML entity
3398 file | File or path name
3399 float | Decimal notation
3400 host | Host
3401 hword | Hyphenated word, all letters
3402 hword_asciipart | Hyphenated word part, all ASCII
3403 hword_numpart | Hyphenated word part, letters and digits
3404 hword_part | Hyphenated word part, all letters
3405 int | Signed integer
3406 numhword | Hyphenated word, letters and digits
3407 numword | Word, letters and digits
3408 protocol | Protocol head
3409 sfloat | Scientific notation
3410 tag | XML tag
3411 uint | Unsigned integer
3412 url | URL
3413 url_path | URL path
3414 version | Version number
3415 word | Word, all letters
3416 (23 rows)
3417 </programlisting>
3418 </para>
3419 </listitem>
3420 </varlistentry>
3422 <varlistentry>
3424 <term><synopsis>\dFt<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3425 <listitem>
3426 <para>
3427 List text search templates (add <literal>+</> for more detail).
3428 </para>
3430 <para>
3431 <programlisting>
3432 =&gt; \dFt
3433 List of text search templates
3434 Schema | Name | Description
3435 ------------+-----------+-----------------------------------------------------------
3436 pg_catalog | ispell | ispell dictionary
3437 pg_catalog | simple | simple dictionary: just lower case and check for stopword
3438 pg_catalog | snowball | snowball stemmer
3439 pg_catalog | synonym | synonym dictionary: replace word by its synonym
3440 pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution
3441 </programlisting>
3442 </para>
3443 </listitem>
3444 </varlistentry>
3446 </variablelist>
3448 </sect1>
3450 <sect1 id="textsearch-limitations">
3451 <title>Limitations</title>
3453 <para>
3454 The current limitations of <productname>PostgreSQL</productname>'s
3455 text search features are:
3456 <itemizedlist spacing="compact" mark="bullet">
3457 <listitem>
3458 <para>The length of each lexeme must be less than 2K bytes</para>
3459 </listitem>
3460 <listitem>
3461 <para>The length of a <type>tsvector</type> (lexemes + positions) must be
3462 less than 1 megabyte</para>
3463 </listitem>
3464 <listitem>
3465 <!-- TODO: number of lexemes in what? This is unclear -->
3466 <para>The number of lexemes must be less than
3467 2<superscript>64</superscript></para>
3468 </listitem>
3469 <listitem>
3470 <para>Position values in <type>tsvector</> must be greater than 0 and
3471 no more than 16,383</para>
3472 </listitem>
3473 <listitem>
3474 <para>No more than 256 positions per lexeme</para>
3475 </listitem>
3476 <listitem>
3477 <para>The number of nodes (lexemes + operators) in a <type>tsquery</type>
3478 must be less than 32,768</para>
3479 </listitem>
3480 </itemizedlist>
3481 </para>
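 <para>
  The built-in functions <function>length(tsvector)</> and
  <function>numnode(tsquery)</> report the number of lexemes and the
  number of nodes, respectively, and so can be used to check values
  against these limits:

<programlisting>
SELECT length(to_tsvector('english', 'a fat cat sat on a mat'));
 length
--------
      4

SELECT numnode(to_tsquery('english', 'fat &amp; (cat | rat)'));
 numnode
---------
       5
</programlisting>
 </para>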
3483 <para>
3484 For comparison, the <productname>PostgreSQL</productname> 8.1 documentation
3485 contained 10,441 unique words, a total of 335,420 words, and the most
3486 frequent word <quote>postgresql</> was mentioned 6,127 times in 655
3487 documents.
3488 </para>
3490 <!-- TODO we need to put a date on these numbers? -->
3491 <para>
3492 Another example &mdash; the <productname>PostgreSQL</productname> mailing
3493 list archives contained 910,989 unique words with 57,491,343 lexemes in
3494 461,020 messages.
3495 </para>
3497 </sect1>
3499 <sect1 id="textsearch-migration">
3500 <title>Migration from Pre-8.3 Text Search</title>
3502 <para>
3503 Applications that used the <filename>contrib/tsearch2</> add-on module
3504 for text searching will need some adjustments to work with the
3505 built-in features:
3506 </para>
3508 <itemizedlist>
3509 <listitem>
3510 <para>
3511 Some functions have been renamed or had small adjustments in their
3512 argument lists, and all of them are now in the <literal>pg_catalog</>
3513 schema, whereas in a previous installation they would have been in
3514 <literal>public</> or another non-system schema. There is a new
3515 version of <filename>contrib/tsearch2</> (see <xref linkend="tsearch2">)
3516 that provides a compatibility layer to solve most problems in this
3517 area.
3518 </para>
3519 </listitem>
3521 <listitem>
3522 <para>
3523 The old <filename>contrib/tsearch2</> functions and other objects
3524 <emphasis>must</> be suppressed when loading <application>pg_dump</>
3525 output from a pre-8.3 database. While many of them won't load anyway,
3526 a few will and then cause problems. One simple way to deal with this
3527 is to load the new <filename>contrib/tsearch2</> module before restoring
3528 the dump; then it will block the old objects from being loaded.
3529 </para>
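 <para>
  For instance, from <application>psql</> connected to the new database
  (the script location and dump file name here are illustrative):

<programlisting>
=&gt; \i <replaceable>SHAREDIR</>/contrib/tsearch2.sql
=&gt; \i olddump.sql
</programlisting>
 </para>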
3530 </listitem>
3532 <listitem>
3533 <para>
3534 Text search configuration setup is completely different now.
3535 Instead of manually inserting rows into configuration tables,
3536 search is configured through the specialized SQL commands shown
3537 earlier in this chapter. There is not currently any automated
3538 support for converting an existing custom configuration for 8.3;
3539 you're on your own here.
3540 </para>
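 <para>
  For example, a configuration that formerly required manual row
  insertions might now be set up like this (the configuration and
  dictionary names are hypothetical):

<programlisting>
CREATE TEXT SEARCH CONFIGURATION public.my_config ( COPY = pg_catalog.english );

ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword WITH my_dict, english_stem;
</programlisting>
 </para>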
3541 </listitem>
3543 <listitem>
3544 <para>
3545 Most types of dictionaries rely on some outside-the-database
3546 configuration files. These are largely compatible with pre-8.3
3547 usage, but note the following differences:
3549 <itemizedlist spacing="compact" mark="bullet">
3550 <listitem>
3551 <para>
3552 Configuration files now must be placed in a single specified
3553 directory (<filename>$SHAREDIR/tsearch_data</>), and must have
3554 a specific extension depending on the type of file, as noted
3555 previously in the descriptions of the various dictionary types.
3556 This restriction was added to forestall security problems.
3557 </para>
3558 </listitem>
3560 <listitem>
3561 <para>
3562 Configuration files must be stored in UTF-8 encoding,
3563 regardless of what database encoding is used.
3564 </para>
3565 </listitem>
3567 <listitem>
3568 <para>
3569 In thesaurus configuration files, stop words must be marked with
3570 <literal>?</>.
3571 </para>
3572 </listitem>
3573 </itemizedlist>
3574 </para>
3575 </listitem>
3577 </itemizedlist>
3579 </sect1>
3581 </chapter>