.. Copyright (C) 2007 Olly Betts

========================================
Xapian 1.0 Term Indexing/Querying Scheme
========================================

.. contents:: Table of contents

Introduction
============

In Xapian 1.0, the default indexing scheme has been changed significantly, to address
lessons learned from observing the old scheme in real-world use.  This document
describes the new scheme, with references to differences from the old.

Stemming
========

The most obvious difference is the handling of stemmed forms.

Previously all words were indexed stemmed without a prefix, and capitalised words were
also indexed unstemmed (but lower cased) with an 'R' prefix.  The rationale for doing this was
that people want to be able to search for exact proper nouns (e.g. the English stemmer
conflates ``Tony`` and ``Toni``).  But of course this also catches words at the start
of sentences and words in titles, and in German all nouns are capitalised, so all of
them get R-prefixed terms too.
Both the normal and R-prefixed terms were indexed with positional information.

Now we index all words lowercased with positional information, and also stemmed with a
'Z' prefix (unless they start with a digit), but without positional information.  By default
a ``Xapian::Stopper`` is used to avoid indexing stemmed forms of stopwords (tests show this shaves
around 1% off the database size).
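
As a rough illustration of the new scheme, here is a toy sketch in Python.  It is
not Xapian's implementation: ``toy_stem``, ``STOPWORDS`` and ``index_text`` are
hypothetical names, the stemmer is a crude suffix stripper, and word splitting is
simplified to ``\w+``.

```python
import re

# Hypothetical stand-ins: a real indexer would use a proper stemmer and stopword list.
STOPWORDS = {"the", "a", "of", "and"}


def toy_stem(word):
    """Strip a few common English suffixes (illustration only)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index_text(text):
    """Return ({unstemmed term: [positions]}, {Z-prefixed stemmed terms})."""
    positional = {}
    stemmed = set()
    for pos, word in enumerate(re.findall(r"\w+", text), start=1):
        term = word.lower()
        # Every word is indexed lowercased, with its position.
        positional.setdefault(term, []).append(pos)
        # No stemmed form for words starting with a digit, or for stopwords.
        if term[0].isdigit() or term in STOPWORDS:
            continue
        # Stemmed forms get a 'Z' prefix and carry no positions.
        stemmed.add("Z" + toy_stem(term))
    return positional, stemmed
```

Calling ``index_text("The Indexing of 3 words")`` gives positions for all five
lowercased words, but stemmed terms only for ``indexing`` and ``words``.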

The new scheme allows exact phrase searching (which the old scheme didn't).  ``NEAR``
now has to operate on unstemmed forms, but that's reasonable enough.  We can also disable
stemming of words which are capitalised in the query, to achieve good results for
proper nouns.  And Omega's ``$topterms`` will now always suggest unstemmed forms!
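
The query side of this can be sketched as follows.  This is a simplified guess at
the idea, not the QueryParser's actual logic: ``query_term`` and ``toy_stem`` are
hypothetical names, and the rule used for capitalised words is an assumption based
on the paragraph above.

```python
def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (illustration only)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def query_term(word):
    """Map one query word to the index term it should match."""
    if word[:1].isupper():
        # Capitalised in the query: match the unstemmed, lowercased term.
        return word.lower()
    if word[:1].isdigit():
        # Words starting with a digit have no Z-prefixed form in the index.
        return word.lower()
    # Otherwise match the stemmed form, which is stored with a 'Z' prefix.
    return "Z" + toy_stem(word.lower())
```

With this sketch, ``Tony`` maps to the unstemmed term ``tony``, so it cannot be
conflated with ``Toni``, while ``searching`` maps to ``Zsearch``.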

The main rationale for prefixing the stemmed forms (rather than the unstemmed ones)
is that there are simply fewer of them!  As a side benefit, it opens the way for
storing stemmed forms for multiple languages (e.g. ``Z:en:``, ``Z:fr:`` or something
like that).

The special handling of a trailing ``.`` in the QueryParser (which would often
mistakenly trigger for pasted text) has been removed.  This feature was there to
support Omega's ``$topterms`` adding stemmed forms, but Omega no longer needs to do this
as it can suggest unstemmed forms instead.

Word Characters
===============

By default, Unicode characters of category CONNECTOR_PUNCTUATION (``_`` and a
handful of others) are now word characters, which provides better indexing of
identifiers, without much degradation of other cases.  Previously cases like
``time_t`` required a phrase search.
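
The effect on identifiers can be seen with a small Python sketch.  It assumes a
simplified notion of word character (``str.isalnum()`` plus Unicode category
``Pc``); Xapian's own definition covers more than this, and ``is_word_char`` and
``split_words`` are hypothetical names.

```python
import unicodedata


def is_word_char(ch):
    """Letters, digits, and connector punctuation (Unicode category Pc)."""
    return ch.isalnum() or unicodedata.category(ch) == "Pc"


def split_words(text):
    """Split text into maximal runs of word characters."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

Here ``split_words("size_t and time_t")`` keeps ``size_t`` and ``time_t`` whole,
whereas a hyphen still separates words.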

Trailing ``+`` and ``#`` are still included on terms (at most 3 such characters), but
``-`` no longer is by default.  The examples it benefits aren't compelling
(``nethack--``, ``Cl-``) and it tends to glue hyphens on to terms.
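
A minimal Python sketch of the suffix rule follows.  The names are hypothetical,
and dropping a suffix longer than three characters (rather than truncating it) is
an assumption; the text above only states the limit.

```python
import re

MAX_SUFFIX = 3  # limit stated in the text


def terms_with_suffix(text):
    """Lowercased words, keeping a short trailing run of '+' or '#'."""
    terms = []
    for match in re.finditer(r"(\w+)([+#]*)", text):
        word, suffix = match.groups()
        if len(suffix) > MAX_SUFFIX:
            # Assumption: an over-long suffix is dropped, not truncated.
            suffix = ""
        terms.append((word + suffix).lower())
    return terms
```

So ``C++`` and ``C#`` become the terms ``c++`` and ``c#``, while the trailing
hyphens of ``nethack--`` are not part of the term.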

A single embedded ``'`` (apostrophe) is now included in a term.
Previously this caused a slow phrase search, and added junk terms to the index
(``didn't`` -> ``didn`` and ``t``, etc).  Various Unicode characters used for apostrophes
are all mapped to the ASCII representation.
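
The apostrophe handling can be sketched in Python like this.  The set of
characters in ``APOSTROPHES`` is an assumption (the text does not list them), the
function names are hypothetical, and "single" is read here as "not doubled", so a
term may contain several separate apostrophes.

```python
import re

# A couple of Unicode apostrophe-like characters; the exact set is an assumption.
APOSTROPHES = "\u2019\u02bc"


def normalise_apostrophes(text):
    """Map Unicode apostrophe variants to the ASCII apostrophe."""
    return text.translate({ord(ch): "'" for ch in APOSTROPHES})


def split_terms(text):
    """Lowercased terms; an apostrophe is kept only between two word characters."""
    text = normalise_apostrophes(text).lower()
    return re.findall(r"\w+(?:'\w+)*", text)
```

With this, ``didn't`` stays a single term whether it is typed with ``'`` or with
U+2019, and quote marks around a word are not included in the term.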

A few other characters (taken from the Unicode definition of a word) are included
in terms if they occur between two word characters, and ``.``, ``,`` and a
few others are included in terms if they occur between two decimal digit characters.