   1 ==============
   2 Omega overview
   3 ==============
   4
   5 If you just want a very quick overview, you might prefer to read the
   6 `quick-start guide <quickstart.html>`_.
   7
   8 Omega operates on a set of databases.  Each database is created and updated
   9 separately using either omindex or `scriptindex <scriptindex.html>`_.  You can
  10 search these databases (or any other Xapian database with suitable contents)
  11 via a web front-end provided by omega, a CGI application.  A search can also be
  12 done over more than one database at once.
  13
  14 There are separate documents covering `CGI parameters <cgiparams.html>`_, the
  15 `Term Prefixes <termprefixes.html>`_ which are conventionally used, and
  16 `OmegaScript <omegascript.html>`_, the language used to define omega's web
  17 interface.  Omega ships with several OmegaScript templates and you can
  18 use these, modify them, or just write your own.  See the "Supplied Templates"
  19 section below for details of the supplied templates.
  20
Omega parses queries using the ``Xapian::QueryParser`` class.  For the supported
syntax, see queryparser.html in the xapian-core documentation, which is
available online at https://xapian.org/docs/queryparser.html
  24
  25 Term construction
  26 =================
  27
Documents within an omega database are indexed by two types of terms: those
used for a weighted search from a parsed query string (the CGI parameter
``P``), and those used for boolean filtering (the CGI parameters ``B`` and
``N``; the latter is a negated variant of ``B`` and was added in Omega 1.3.5).
  32
  33 Boolean terms always start with a prefix which is an initial capital letter (or
  34 multiple capital letters if the first character is `X`) which denotes the
  35 category of the term (e.g. `M` for MIME type).
  36
  37 Parsed query terms may have a prefix, but don't always.  Those from the body of
  38 the document in unstemmed form don't; stemmed terms have a `Z` prefix; terms
  39 from other fields have a prefix to indicate the field, such as `S` for the
  40 document title; stemmed terms from a field have both prefixes, e.g. `ZS`.
  41
  42 The "english" stemmer is used by default - you can configure this for omindex
  43 and scriptindex with ``--stemmer=LANGUAGE`` (use ``--stemmer=none`` to disable
  44 stemming, see omindex ``--help`` for the list of accepted language names).  At
  45 search time you can configure the stemmer by adding ``$set{stemmer,LANGUAGE}``
  46 to the top of your OmegaScript template.
  47
  48 The two term types are used as follows when building the query:
  49
  50 The ``P`` parameter is parsed using `Xapian::QueryParser` to give a
  51 `Xapian::Query` object denoted as `P-terms` below.
  52
There are two ways that ``B`` and ``N`` parameters are handled, depending on
whether the term prefix has been configured as "non-exclusive" or not.  The
default is "exclusive" (and in versions before 1.3.4, this was how all ``B``
parameters were handled).
  57
  58 Exclusive Boolean Prefix
  59 ------------------------
  60
  61 B(oolean) terms from 'B' parameters with the same prefix are ORed together,
  62 like so::
  63
  64
  65                     [   OR   ]
  66                    /    | ... \
  67               B(F,1) B(F,2)...B(F,n)
  68
  69 Where B(F,1) is the first boolean term with prefix F from a 'B' parameter, and
  70 so on.
  71
  72 Non-Exclusive Boolean Prefix
  73 ----------------------------
  74
A prefix can be configured as non-exclusive in the OmegaScript template.  For
example, ``$setmap{nonexclusiveprefix,K,true}`` sets prefix `K` as
non-exclusive, which means that multiple filter terms with that prefix from
'B' parameters will be combined with "AND" instead of "OR", like so::
  78
  79                     [   AND   ]
  80                    /     | ... \
  81               B(K,1) B(K,2)... B(K,m)
  82
  83 Combining the Boolean Filters
  84 -----------------------------
  85
  86 The subqueries for each prefix from "B" parameters are combined with AND,
  87 to make this (which we refer to as "B-filter" below)::
  88
  89                          [     AND     ]
  90                         /       |  ...  \
  91                        /                 \
  92                  [   OR   ]               [   AND  ]
  93                 /    | ... \             /    | ... \
  94            B(F,1) B(F,2)...B(F,n)   B(K,1) B(K,2)...B(K,m)
  95
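The grouping rules above can be sketched in plain Python.  This is only an
illustrative model, not Omega's implementation: queries are represented as
nested tuples, and the helper names ``term_prefix`` and ``build_b_filter`` are
hypothetical.

```python
def term_prefix(term):
    """Return the prefix of a boolean term: a single capital letter, or a run
    of capital letters if the first character is 'X'."""
    if not term.startswith("X"):
        return term[:1]
    end = 1
    while end < len(term) and "A" <= term[end] <= "Z":
        end += 1
    return term[:end]


def build_b_filter(b_terms, nonexclusive=()):
    """Model the B-filter as nested tuples such as ("OR", term1, term2)."""
    groups = {}
    for term in b_terms:
        groups.setdefault(term_prefix(term), []).append(term)
    subqueries = []
    for prefix, terms in groups.items():
        # Terms sharing a prefix are ORed, or ANDed if it is non-exclusive.
        op = "AND" if prefix in nonexclusive else "OR"
        subqueries.append(terms[0] if len(terms) == 1 else (op, *terms))
    if not subqueries:
        return None
    # The per-prefix subqueries are then combined with AND.
    return subqueries[0] if len(subqueries) == 1 else ("AND", *subqueries)


b_filter = build_b_filter(["Ttext/html", "Ttext/plain", "J/press"])
print(b_filter)
```

Passing ``nonexclusive={"K"}`` makes terms with prefix `K` combine with "AND"
within their group, mirroring the ``$setmap{nonexclusiveprefix,K,true}``
setting.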
  96
  97 Negated Boolean Terms
  98 ---------------------
  99
 100 All the terms from all 'N' parameters are combined together with "OR", to
 101 make this (which we refer to as "N-filter" below)::
 102
 103                     [       OR       ]
 104                    / ... |     |  ... \
 105               N(F,1)...N(F,n) N(K,1)...N(K,m)
 106
 107 Putting it all together
 108 -----------------------
 109
The P-terms are filtered by the B-filter using "FILTER" and by the N-filter
using "AND_NOT"::

                        [ AND_NOT ]
                       /           \
                      /             \
            [ FILTER ]             N-filter
             /      \
            /        \
       P-terms      B-filter
 120
 121 The intent here is to allow filtering on arbitrary (and, typically,
 122 orthogonal) characteristics of the document. For instance, by adding
 123 boolean terms "Ttext/html", "Ttext/plain" and "J/press" you would be
 124 filtering the parsed query to only retrieve documents that are both in
 125 the "/press" site *and* which are either of MIME type text/html or
 126 text/plain. (See below for more information about sites.)
 127
If the B-filter or the N-filter is absent, that part of the query is simply
omitted.

If there is no parsed query, the boolean filter is promoted to
be the query, and the weighting scheme is set to boolean.  This has
the effect of applying the boolean filter to the whole database.  If
there is only an N-filter, then ``Query::MatchAll`` is used for the left
side of the "AND_NOT".
 135
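As a rough sketch of how these pieces fit together, the following plain Python
function assembles the final query from its three optional parts.  The
nested-tuple representation, the function name ``build_query`` and the
weighting labels ``"default"`` and ``"boolean"`` are assumptions made for this
example; using boolean weighting when only negated terms are present is also
an assumption.

```python
def build_query(p_query=None, b_filter=None, n_filter=None):
    """Return (query, weighting) following the rules described above.

    Queries are modelled as nested tuples; the string "MatchAll" stands in
    for Xapian's Query::MatchAll.
    """
    if p_query is not None:
        query, weighting = p_query, "default"
        if b_filter is not None:
            query = ("FILTER", query, b_filter)
    elif b_filter is not None:
        # No parsed query: the boolean filter is promoted to be the query.
        query, weighting = b_filter, "boolean"
    elif n_filter is not None:
        # Only negated terms: start from every document in the database.
        query, weighting = "MatchAll", "boolean"
    else:
        return None, None
    if n_filter is not None:
        query = ("AND_NOT", query, n_filter)
    return query, weighting


print(build_query("P-terms", "B-filter", "N-filter"))
print(build_query(None, None, "N-filter"))
```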
 136 In order to add more boolean prefixes, you will need to alter the
 137 ``index_file()`` function in omindex.cc. Currently omindex adds several
 138 useful ones, detailed below.
 139
 140 Parsed query terms are constructed from the title, body and keywords
 141 of a document. (Not all document types support all three areas of
 142 text.) Title terms are stored with position data starting at 0, body
 143 terms starting 100 beyond title terms, and keyword terms starting 100
 144 beyond body terms. This allows queries using positional data without
 145 causing false matches across the different types of term.
 146
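The effect of the position gaps can be illustrated with a small Python sketch.
It is not omindex's code: the function name ``assign_positions`` is
hypothetical, splitting on whitespace stands in for real term generation, and
the exact position numbers omindex stores may differ.

```python
def assign_positions(title, body, keywords, gap=100):
    """Assign term positions field by field, leaving a gap between fields."""
    positions = {}
    pos = 0
    fields = (("title", title), ("body", body), ("keywords", keywords))
    for field, text in fields:
        words = text.split()
        positions[field] = [(word, pos + i) for i, word in enumerate(words)]
        if words:
            # The next field starts `gap` beyond this field's last position.
            pos += len(words) - 1 + gap
    return positions


for field, terms in assign_positions(
        "Big release", "new product today", "press").items():
    print(field, terms)
```

Because adjacent fields are at least ``gap`` positions apart, a phrase or
proximity query with a window smaller than the gap cannot match across the
boundary between, say, the end of the title and the start of the body.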
 147 Sites
 148 =====
 149
 150 Within a database, Omega supports multiple sites. These are recorded
 151 using boolean terms (see 'Term construction', above) to allow
 152 filtering on them.
 153
All documents within a site share a common base URL. For instance, you
might have two sites, one for your press area and one for your product
descriptions:
 157
 158 - \http://example.com/press/index.html
 159 - \http://example.com/press/bigrelease.html
 160 - \http://example.com/products/bigproduct.html
 161 - \http://example.com/products/littleproduct.html
 162
 163 You could index all documents within \http://example.com/press/ using a
 164 site of '/press', and all within \http://example.com/products/ using
 165 '/products'.
 166
 167 Sites are also useful because omindex indexes documents through the
 168 file system, not by fetching from the web server. If you don't have a
 169 URL to file system mapping which puts all documents under one
 170 hierarchy, you'll need to index each separate section as a site.
 171
 172 An obvious example of this is the way that many web servers map URLs
 173 of the form <\http://example.com/~<username>/> to a directory within
 174 that user's home directory (such as ~<username>/pub on a Unix
 175 system). In this case, you can index each user's home page separately,
 176 as a site of the form '/~<username>'. You can then use boolean
 177 filters to allow people to search only a specific home page (or a
 178 group of them), or omit such terms to search everyone's pages.
 179
Note that the site specified when you index is used to build the
complete URL that the results page links to. Thus while you will
typically want sites to be relative to the hostname part of the URL (e.g.
'/site' rather than '\http://example.com/site'), you can use them
to have a single search across several different hostnames. This will
still work if you actually store each distinct hostname in a different
database.
 187
 188 omindex operation
 189 =================
 190
 191 omindex is fairly simple to use, for example::
 192
 193   omindex --db default --url http://example.com/ /var/www/example.com
 194
 195 For a full list of command line options supported, see ``man omindex``
 196 or ``omindex --help``.
 197
You *must* specify the database to index into (it's created if it doesn't
exist, but parent directories must exist).  You will often also want to specify
the base URL, which is used as the site.  It can be relative to the hostname
(starting with '/') or absolute (starting with a scheme, e.g.
'\http://example.com/products/').  If not specified, the base URL defaults to
``/``.
 204
You also need to tell omindex which directory to index. This can be given
either as a single directory (in which case it is taken to be the
directory base of the entire site being indexed), or as two arguments,
the first being the directory base of the site being indexed, and the
second being a relative directory within that to index.
 210
 211 For instance, in the example above, if you separate your products by
 212 size, you might end up with:
 213
 214 - \http://example.com/press/index.html
 215 - \http://example.com/press/bigrelease.html
 216 - \http://example.com/products/large/bigproduct.html
 217 - \http://example.com/products/small/littleproduct.html
 218
 219 If the entire website is stored in the file system under the directory
 220 /www/example, then you would probably index the site in two
 221 passes, one for the '/press' site and one for the '/products' site. You
 222 might use the following commands::
 223
 224 $ omindex -p --db /var/lib/omega/data/default --url /press /www/example/press
 225 $ omindex -p --db /var/lib/omega/data/default --url /products /www/example/products
 226
If you add new large products, but don't want to reindex the whole of
the products section, you could do::
 229
 230 $ omindex -p --db /var/lib/omega/data/default --url /products /www/example/products large
 231
 232 and just the large products will be reindexed. You need to do it like that, and
 233 not as::
 234
 235 $ omindex -p --db /var/lib/omega/data/default --url /products/large /www/example/products/large
 236
 237 because that would make the large products part of a new site,
 238 '/products/large', which is unlikely to be what you want, as large
 239 products would no longer come up in a search of the products
 240 site. (Note that the ``--depth-limit`` option may come in handy if you have
 241 sites '/products' and '/products/large', or similar.)
 242
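The difference between the two invocations above can be seen by working out
the URL and site for one file.  The Python sketch below is illustrative only:
``url_for`` is a hypothetical helper that models the mapping from the base
directory and base URL to a document URL, and is not taken from omindex.

```python
import posixpath


def url_for(base_url, base_dir, file_path):
    """Return the URL for file_path, which must lie below base_dir."""
    rel = posixpath.relpath(file_path, base_dir)
    if rel == ".." or rel.startswith("../"):
        raise ValueError("file is not below the base directory")
    return base_url.rstrip("/") + "/" + rel


doc = "/www/example/products/large/bigproduct.html"

# --url /products /www/example/products large
print("site /products:      ", url_for("/products", "/www/example/products", doc))

# --url /products/large /www/example/products/large
print("site /products/large:", url_for("/products/large",
                                       "/www/example/products/large", doc))
```

Both calls produce the same document URL, but the site (the base URL) differs,
which is why the second form would drop the large products out of a search
restricted to the '/products' site.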
omindex has built-in support for indexing HTML, PHP, text files, CSV
(Comma-Separated Values) files, SVG, Atom feeds, and AbiWord documents.  It can
also index a number of other formats using external programs or libraries.
Filter programs and libraries are run with CPU, time and memory limits to
prevent them from blocking indexing of other files or crashing omindex.  If
both options are available for a format, the library is preferred because it
has better runtime behaviour.
 249
 250 The way omindex decides how to index a file is based around MIME content-types.
 251 First of all omindex will look up a file's extension in its extension to MIME
 252 type map.  If there's no entry, it will then ask libmagic to examine the
 253 contents of the file and try to determine a MIME type.
 254
 255 The following formats are supported as standard (you can tell omindex to use
 256 other filters too - see below):
 257
 258 * HTML (.html, .htm, .shtml, .shtm, .xhtml, .xhtm)
 259 * PHP (.php) - our HTML parser knows to ignore PHP code
 260 * text files (.txt, .text)
 261 * SVG (.svg)
 262 * Compressed SVG (.svgz)
 263 * CSV (Comma-Separated Values) files (.csv)
 264 * PDF (.pdf) if pdftotext (comes with poppler or xpdf) or libpoppler
 265   (in particular libpoppler-glib-dev) are available
 266 * PostScript (.ps, .eps, .ai) if ps2pdf (from ghostscript) and pdftotext (comes
 267   with poppler or xpdf) or libpoppler (in particular libpoppler-glib-dev) are available
 268 * OpenOffice/StarOffice documents (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm,
 269   .sxw, .sxg, .stw) if unzip is available
 270 * OpenDocument format documents (.odt, .ods, .odp, .odg, .odc, .odf, .odb,
 271   .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) if unzip is
 272   available
 273 * MS Word documents (.dot) if antiword is available (.doc files are left to
 274   libmagic, as they may actually be RTF (AbiWord saves RTF when asked to save
 275   as .doc, and Microsoft Word quietly loads RTF files with a .doc extension),
 276   or plain-text).
 277 * MS Excel documents (.xls, .xlb, .xlt, .xlr, .xla) if xls2csv is available
 278   (comes with catdoc)
 279 * MS Powerpoint documents (.ppt, .pps) if catppt is available (comes with
 280   catdoc)
 281 * MS Office 2007 documents (.docx, .docm, .dotx, .dotm, .xlsx, .xlsm, .xltx,
 282   .xltm, .pptx, .pptm, .potx, .potm, .ppsx, .ppsm) if unzip is available
 283 * Wordperfect documents (.wpd) if wpd2text is available (comes with libwpd)
 284 * MS Works documents (.wps, .wpt) if wps2text is available (comes with libwps)
 285 * MS Outlook message (.msg) if perl with Email::Outlook::Message and
 286   HTML::Parser modules is available
 287 * MS Publisher documents (.pub) if pub2xhtml is available (comes with libmspub)
 288 * MS Visio documents (.vsd, .vss, .vst, .vsw, .vsdx, .vssx, .vstx, .vsdm,
 289   .vssm, .vstm) if vsd2xhtml is available (comes with libvisio)
 290 * Apple Keynote documents (.key, .kth, .apxl) if libetonyek is available (it is
 291   also possible to use key2text as an external filter)
 292 * Apple Numbers documents (.numbers) if libetonyek is available (it is
 293   also possible to use numbers2text as an external filter)
 294 * Apple Pages documents (.pages) if libetonyek is available (it is
 295   also possible to use pages2text as an external filter)
 296 * AbiWord documents (.abw, .awt)
 297 * Compressed AbiWord documents (.zabw)
 298 * Rich Text Format documents (.rtf) if unrtf is available
 299 * Perl POD documentation (.pl, .pm, .pod) if pod2text is available
* reStructuredText (.rst, .rest) if rst2html is available (comes with
  docutils)
 302 * Markdown (.md, .markdown) if markdown is available
 303 * TeX DVI files (.dvi) if catdvi is available
 304 * DjVu files (.djv, .djvu) if djvutxt is available
 305 * OpenXPS and XPS files (.oxps, .xps) if unzip is available
 306 * Debian packages (.deb, .udeb) if dpkg-deb is available
 307 * RPM packages (.rpm) if rpm is available
 308 * Atom feeds (.atom)
 309 * MAFF (.maff) if unzip is available
 310 * MHTML (.mhtml, .mht) if perl with MIME::Tools is available
 311 * MIME email messages (.eml) and USENET articles if gmime >= 2.6 or perl with
 312   MIME::Tools and HTML::Parser is available
 313 * vCard files (.vcf, .vcard) if perl with Text::vCard is available
 314 * EPUB if libgepub is available
 315 * FictionBook v.2 files (.fb2) if libe-book is available
 316 * QiOO (mobile format, for java-enabled cellphones) files (.jar) if libe-book is available
 317 * TCR (simple compressed text format) files (.tcr) if libe-book is available
 318 * eReader files (.pdb) if libe-book is available
 319 * Sony eBook files (.lrf) if libe-book is available
 320 * Bitmap image files that contain text (.png, .jpg, .jpeg, .jfif, .jpe, .webp,
 321   .tif, .tiff, .pbm, .gif, .ppm, .pgm) if libtesseract is available
 322 * AppleWorks/ClarisWorks documents (.cwk) if libmwaw is available
 323 * Apple PICT files (.pict, .pct, .pic) if libmwaw is available
 324 * Any format LibreOffice supports reading if LibreOffice is available.  This
 325   is implemented via the ``omindex_libreofficekit`` worker.  No MIME types are
 326   mapped to this worker by default because converting using it tends to be
 327   rather slow and we have alternative filters for supporting most of these
 328   formats.  The advantages of using LibreOffice are that it may successfully
 329   handle more files of some types than other filters (e.g. it handles
 330   "small-block" ``.doc`` files whereas antiword doesn't) and it may extract
 331   more metadata (e.g. with antiword you only get file extension, MIME type and
 332   last modified).  Enable use with ``--worker`` as documented below.
 333
 334 If you have additional extensions that represent one of these types, you can
 335 add an additional MIME mapping using the ``--mime-type`` option.  For
 336 instance, if your press releases are PostScript files with extension
 337 ``.posts`` you can tell omindex this like so::
 338
 339 $ omindex --db /var/lib/omega/data/default --url /press /www/example/press --mime-type posts:application/postscript
 340
The syntax of ``--mime-type`` is ``ext:type``, where ``ext`` is the extension
of a file of that type (everything after the last '.').  The ``type`` can be
any string, but to be useful it must either have a filter set for it (using
``--filter`` or ``--read-filters``), have a worker set for it (using
``--worker`` or ``--read-workers``), or be one of the types understood by
default:
 346
 347 .. include:: inc/mimetypes.rst
 348
 349 You can specify ``*`` as the MIME sub-type for ``--filter`` or ``--worker``
 350 (arbitrary wildcards are not supported, just ``*`` for the entire sub-type).
 351 For example if you have a filter you want to apply to any video files, you
 352 could specify it using ``--filter 'video/*:index-video-file'``.  Note that this
 353 is checked right after checking for the exact MIME type, so will override any
 354 built-in filters which would otherwise match.  Be careful to quote ``*``
 355 to protect it from the shell.  Support for this was added in 1.3.3.
 356
 357 If there's no specific filter, and no subtype wildcard, then ``*/*`` is checked
 358 (assuming the mimetype contains a ``/``), and after that ``*`` (for any
 359 mimetype string).  Combined with filter command ``true`` for indexing by
 360 meta-data only, you can specify a fall back case of indexing by meta-data
 361 only using ``--filter '*:true'``.  Support for this was added in 1.3.4.
 362
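The lookup order described in the last two paragraphs can be sketched as
follows.  This is a simplified model rather than omindex's code: the function
name ``find_filter`` is hypothetical, and keeping built-in and user-specified
filters in one dictionary is an assumption made to keep the example short.

```python
def find_filter(mime_type, filters):
    """Return the filter command for mime_type, or None if nothing matches."""
    candidates = [mime_type]                      # exact MIME type first
    if "/" in mime_type:
        candidates.append(mime_type.split("/", 1)[0] + "/*")  # e.g. video/*
        candidates.append("*/*")
    candidates.append("*")                        # matches any type string
    for key in candidates:
        if key in filters:
            return filters[key]
    return None


filters = {
    "application/pdf": "pdftotext",
    "video/*": "index-video-file",
    "*": "true",
}
for mime_type in ("application/pdf", "video/mp4", "application/x-unknown"):
    print(mime_type, "->", find_filter(mime_type, filters))
```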
 363 There are also two special values that can be specified instead of a MIME
 364 type:
 365
 366 * ignore - tells omindex to quietly ignore such files
 367 * skip - tells omindex to skip such files
 368
 369 By default no extensions are marked as "skip", and the following extensions are
 370 marked as "ignore":
 371
 372 .. include:: inc/ignored.rst
 373
 374 If you wish to remove a MIME mapping, you can do this by omitting the type -
 375 for example if you have ``.dot`` files which are inputs for the graphviz
 376 tool ``dot``, then you may wish to remove the default mapping for ``.dot``
 377 files and let libmagic be used to determine their type, which you can do
 378 using: ``--mime-type=dot:`` (if you want to *ignore* all ``.dot`` files,
 379 instead use ``--mime-type=dot:ignore``).
 380
 381 The lookup of extensions in the MIME mappings is case sensitive, but if an
 382 extension isn't found and includes upper case ASCII letters, they're converted
 383 to lower case and the lookup is repeated, so you effectively get case
 384 insensitive lookup for mappings specified with a lower-case extension, but
 385 you can set different handling for differently cased variants if you need
 386 to.
 387
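A minimal Python sketch of this extension lookup is shown below.  It is an
illustration under stated assumptions, not omindex's implementation:
``lookup_mime_type`` is a hypothetical name, ``mime_map`` is a plain
dictionary from extension to type, and ``sniff`` is a stand-in for the
libmagic check on the file's contents.

```python
import posixpath


def ascii_lower(text):
    """Lower-case only the ASCII letters A-Z."""
    return "".join(c.lower() if "A" <= c <= "Z" else c for c in text)


def lookup_mime_type(filename, mime_map, sniff):
    """Return the mapped type for filename, falling back to sniff()."""
    name = posixpath.basename(filename)
    _, dot, ext = name.rpartition(".")
    if dot:
        if ext in mime_map:                  # case-sensitive lookup first
            return mime_map[ext]
        lowered = ascii_lower(ext)
        if lowered != ext and lowered in mime_map:
            return mime_map[lowered]         # retry with lower-cased extension
    return sniff(filename)                   # no mapping: examine the contents


mime_map = {"html": "text/html", "dot": "application/msword"}
del mime_map["dot"]                          # models --mime-type=dot:
print(lookup_mime_type("/www/Index.HTML", mime_map, lambda f: "sniffed"))
print(lookup_mime_type("/www/graph.dot", mime_map, lambda f: "sniffed"))
```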
 388 You can add support for additional MIME content types (or override existing
 389 ones) using the ``--filter`` and/or ``--read-filters`` options to specify a
 390 command to run.  At present, this command needs to produce output in either
 391 HTML, SVG, or plain text format (as of 1.3.3, you can specify the character
 392 encoding that the output will be in; in earlier versions, plain text output had
 393 to be UTF-8).  Support for SVG output from external commands was added in
 394 1.4.8.
 395
 396 If you need to use a literal ``%`` in the command string, it needs to be
 397 written as ``%%`` (since 1.3.3).
 398
 399 This command can take input in the following ways:
 400
 401 * (Since 1.5.0): If the command string has a ``|`` prefix, then the input file
 402   will be fed to the command on ``stdin``.  This is slightly more efficient as
 403   it often avoids having to open the input file an extra time (omindex needs to
 404   open the input file so it can calculate a checksum of the contents for
 405   duplicate detection, and also may need to use libmagic to find the file's
 406   MIME Content-Type).  In the future it will probably also allow extracting
 407   text from documents attached to emails, in ZIP files, etc without having to
 408   write them to a temporary file to run the filter on them.
 409
 410 * (Since 1.3.3): Any ``%f`` placeholder in the command string will be replaced
 411   with the filename of the file to extract (suitably escaped to protect it from
 412   the shell, so don't put quotes around ``%f``).
 413
* If neither is present (and always in versions before 1.3.3) the filename is
  appended to the command (suitably escaped to protect it from the shell).
 416
 417 Output from the command can be handled in the following ways:
 418
 419 * (Since 1.3.3): Any ``%t`` in this command will be replaced with a filename in
 420   a temporary directory (suitably escaped to protect it from the shell, so
 421   don't put quotes around ``%t``).  The extension of this filename will reflect
 422   the expected output format (either ``.html``, ``.svg`` or ``.txt``).
 423
 424 * If you don't use ``%t`` in the command, then omindex will expect output on
 425   ``stdout`` (prior to 1.3.3, output had to be on ``stdout``).
 426
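The placeholder rules above can be modelled with a short Python function.
This is a sketch, not omindex's code: the name ``expand_filter_command`` is
hypothetical, and ``shlex.quote`` is used here as a stand-in for whatever
shell escaping omindex actually performs.

```python
import shlex


def expand_filter_command(command, filename, tmpfile):
    """Return (shell_command, use_stdin, uses_tmpfile) for a filter command."""
    use_stdin = command.startswith("|")
    if use_stdin:
        command = command[1:]
    out = []
    saw_f = uses_tmpfile = False
    i = 0
    while i < len(command):
        pair = command[i:i + 2]
        if pair == "%%":                     # literal percent sign
            out.append("%")
            i += 2
        elif pair == "%f":                   # input filename, shell-escaped
            out.append(shlex.quote(filename))
            saw_f = True
            i += 2
        elif pair == "%t":                   # temporary output file
            out.append(shlex.quote(tmpfile))
            uses_tmpfile = True
            i += 2
        else:
            out.append(command[i])
            i += 1
    shell_command = "".join(out)
    if not use_stdin and not saw_f:
        # Neither "|" nor %f: append the escaped filename to the command.
        shell_command += " " + shlex.quote(filename)
    return shell_command, use_stdin, uses_tmpfile


print(expand_filter_command("foo2utf16 %f %t", "my file.foo", "/tmp/out.txt"))
print(expand_filter_command("|strings -n8", "data.bin", "/tmp/out.txt"))
```

If ``uses_tmpfile`` is false, the output would be read from the command's
``stdout``.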
For example, if you'd prefer to use AbiWord to extract text from Word documents
(by default, omindex uses antiword), then you can pass the option
``--filter=application/msword:'abiword --to=txt --to-name=fd://1'`` to
omindex.
 431
 432 Another example - if you wanted to handle files of MIME type
 433 ``application/octet-stream`` by piping them into ``strings -n8``, you can
 434 pass the option ``--filter=application/octet-stream:'|strings -n8'`` (since
 435 ``strings`` reads from ``stdin`` if no filename is specified, at least in
 436 the GNU binutils implementation).
 437
 438 A more complex example: to process ``.foo`` files with the (fictional)
 439 ``foo2utf16`` utility which produces UTF-16 text but doesn't support writing
 440 output to stdout, run omindex with ``-Mfoo:text/x-foo
 441 -Ftext/x-foo,,utf-16:'foo2utf16 %f %t'``.
 442
 443 A less contrived example of the use of ``--filter`` makes use of LibreOffice,
 444 via the unoconv script, to extract text from various formats.  First you
 445 need to start a listening instance (if you don't, unoconv will start up
 446 LibreOffice for every file, which is rather inefficient) - the ``&`` tells
 447 the shell to run it in the background::
 448
 449   unoconv --listener &
 450
 451 Then run omindex with options such as
 452 ``--filter=application/msword,html:'unoconv --stdout -f html'`` (you'll want
 453 to repeat this for each format which you want to use LibreOffice on).
 454
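The ``--filter`` values used in the examples above have a MIME type,
optionally followed by an output format and a character encoding, then a
colon and the command.  The Python sketch below splits such a value (after the
shell has removed the quotes).  The function name ``parse_filter_option`` is
hypothetical, and the defaults of ``txt`` and ``utf-8`` for omitted fields are
assumptions; see ``omindex --help`` for the authoritative syntax.

```python
def parse_filter_option(value):
    """Split a --filter value into (mime_type, output, charset, command)."""
    spec, sep, command = value.partition(":")
    if not sep:
        raise ValueError("expected TYPE[,OUTPUT[,CHARSET]]:COMMAND")
    parts = spec.split(",")
    mime_type = parts[0]
    # Empty or missing fields fall back to assumed defaults.
    output = parts[1] if len(parts) > 1 and parts[1] else "txt"
    charset = parts[2] if len(parts) > 2 and parts[2] else "utf-8"
    return mime_type, output, charset, command


for value in (
    "application/msword:abiword --to=txt --to-name=fd://1",
    "text/x-foo,,utf-16:foo2utf16 %f %t",
    "application/msword,html:unoconv --stdout -f html",
):
    print(parse_filter_option(value))
```

Splitting on the first colon only means a command which itself contains a
colon (such as ``fd://1``) is left intact.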
 455 If you specify ``false`` as the command in ``--filter``, omindex will skip
 456 files with the specified MIME type.  (As of 1.2.20 and 1.3.3 ``false`` is
 457 explicitly checked for; in earlier versions this will also work, at least
 458 on Unix where ``false`` is a command which ignores its arguments and exits with
 459 a non-zero status).
 460
If you specify ``true`` as the command in ``--filter``, omindex won't try
to extract text from the file, but will index it such that it can be searched
for via metadata which comes from the file system (filename, extension, MIME
content-type, last modified time, size).  (As of 1.2.22 and 1.3.4 ``true`` is
explicitly checked for; in earlier versions this will also work, at least
on Unix where ``true`` is a command which ignores its arguments and exits with
status zero).
 468
 469 If you know of a reliable filter which can extract text from a file format
 470 which might be of interest to others, please let us know so we can consider
 471 including it as a standard filter.
 472
 473 Since 1.5.0, omindex supports worker modules which provide integrations with
 474 extraction libraries without having to run a command line tool for every
 475 file.  These workers can typically extract metadata that a ``foo2text``
 476 program can't.  The worker runs as a subprocess, and is reused for multiple
 477 files.  This also means bugs in the library can only crash the worker process.
 478
 479 In most cases we default to setting a worker to be used for the types it
 480 supports, but for example the ``omindex_libreofficekit`` worker is not
 481 hooked up by default.  You can explicitly set a MIME type to worker mapping
 482 using ``--worker=TYPE:WORKER`` - e.g.
 483 ``--worker=application/msword:omindex_libreofficekit``.  This also supports
 484 wildcarding of the MIME type like ``--filter`` does.
 485
 486 The ``--duplicates`` option controls how omindex handles documents which map
 487 to a URL which is already in the database.  The default (which can be
 488 explicitly set with ``--duplicates=replace``) is to reindex if the last
 489 modified time of the file is newer than that recorded in the database.
 490 The alternative is ``--duplicates=ignore``, which will never reindex an
 491 existing document.  If you only add documents, this avoids the overhead
 492 of checking the last modified time.  It also allows you to prioritise
 493 adding completely new documents to the database over updating existing ones.
 494
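The decision made for each file under the two ``--duplicates`` modes can be
summarised in a few lines of Python.  This is a sketch only: the function name
``should_index`` is hypothetical, and ``indexed_mtimes`` is an assumed
dictionary mapping each URL already in the database to its recorded last
modified time.

```python
def should_index(url, file_mtime, indexed_mtimes, duplicates="replace"):
    """Decide whether a file mapping to url should be (re)indexed."""
    if url not in indexed_mtimes:
        return True                      # completely new document
    if duplicates == "ignore":
        return False                     # never reindex an existing document
    if duplicates == "replace":
        # Reindex only if the file is newer than the recorded time.
        return file_mtime > indexed_mtimes[url]
    raise ValueError("unknown --duplicates mode: " + duplicates)


indexed = {"/press/index.html": 1000}
print(should_index("/press/index.html", 2000, indexed))
print(should_index("/press/index.html", 2000, indexed, duplicates="ignore"))
print(should_index("/press/new.html", 500, indexed, duplicates="ignore"))
```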
 495 By default, omindex will remove any document in the database which has a URL
 496 that doesn't correspond to a file seen on disk - in other words, it will clear
 497 out everything that doesn't exist any more.  However if you are building up
 498 an omega database with several runs of omindex, this is not
 499 appropriate (as each run would delete the data from the previous run),
 500 so you should use the ``--no-delete`` option.  Note that if you
 501 choose to work like this, it is impossible to prune old documents from
 502 the database using omindex. If this is a problem for you, an
 503 alternative is to index each subsite into a different database, and
 504 merge all the databases together when searching.
 505
 506 ``--depth-limit`` allows you to prevent omindex from descending more than
 507 a certain number of directories.  Specifying ``--depth-limit=0`` means no limit
 508 is imposed on recursion; ``--depth-limit=1`` means don't descend into any
 509 subdirectories of the start directory.
 510
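To make the meaning of the limit values concrete, here is a small Python
sketch of a depth-limited directory walk.  It is not how omindex is
implemented; ``walk_limited`` is a hypothetical helper which treats the start
directory as depth 1, so that a limit of 1 visits only the files directly in
it and a limit of 0 means unlimited recursion.

```python
import os


def walk_limited(start, depth_limit=0):
    """Yield file paths under start, honouring a --depth-limit style value."""
    def visit(directory, depth):
        for entry in sorted(os.scandir(directory), key=lambda e: e.name):
            if entry.is_dir(follow_symlinks=False):
                # Descend only if there is no limit or it is not yet reached.
                if depth_limit == 0 or depth < depth_limit:
                    yield from visit(entry.path, depth + 1)
            elif entry.is_file():
                yield entry.path
    yield from visit(start, 1)


if __name__ == "__main__":
    for path in walk_limited(".", depth_limit=1):
        print(path)
```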
 511 Tracking files which couldn't be indexed
 512 ----------------------------------------
 513
 514 In older versions, omindex only tracked files which it successfully indexed -
 515 if a file couldn't be read, or a filter program failed on it, or it was marked
 516 not to be indexed (e.g. with an HTML meta tag) then it would be retried on
 517 subsequent runs.  Starting from version 1.3.4, omindex now tracks failed
 518 files in the user metadata of the database, along with their sizes and last
 519 modified times, and uses this data to skip files which previously failed and
 520 haven't changed since.
 521
 522 You can force omindex to retry such files using the ``--retry-failed`` option.
 523 One situation in which this is useful is if you've upgraded a filter program
 524 to a newer version which you suspect will index some files which previously
 525 failed.
 526
 527 Currently there's no mechanism for automatically removing failure entries
 528 when the file they refer to is removed or renamed.  These lingering entries are
 529 harmless, except they bloat the database a little.  A simple way to clear them
 530 out is to run periodically with ``--retry-failed`` as this removes any existing
 531 failure entries before indexing starts.
 532
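The bookkeeping described in this section can be modelled as below.  This is
an illustrative sketch: the class ``FailureLog`` and its methods are
hypothetical, a dictionary stands in for the database's user metadata, and
dropping an entry once a file indexes successfully is an assumption.

```python
class FailureLog:
    """Remember files which failed, keyed by path, with size and mtime."""

    def __init__(self, stored=None, retry_failed=False):
        # --retry-failed discards existing entries before indexing starts.
        self.entries = {} if retry_failed else dict(stored or {})

    def should_skip(self, path, size, mtime):
        """Skip a file which failed before and hasn't changed since."""
        return self.entries.get(path) == (size, mtime)

    def record_failure(self, path, size, mtime):
        self.entries[path] = (size, mtime)

    def record_success(self, path):
        self.entries.pop(path, None)


log = FailureLog()
log.record_failure("/www/example/broken.pdf", 4096, 1700000000)
print(log.should_skip("/www/example/broken.pdf", 4096, 1700000000))
print(log.should_skip("/www/example/broken.pdf", 5000, 1700000500))
```

Entries for files which have since been removed or renamed are never looked
up again, which is why they linger until a run with ``--retry-failed``.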
 533 HTML Parsing
 534 ============
 535
 536 The document ``<title>`` tag is used as the document title.  Metadata in various
 537 ``<meta>`` tags is also understood - these values of the ``name`` parameter are
 538 currently handled when found:
 539
 540  * ``author``, ``dcterms.creator``, ``dcterms.contributor``: author(s)
 541  * ``created``, ``dcterms.issued``: document creation date
 542  * ``classification``: document topic
 543  * ``keywords``, ``dcterms.subject``, ``dcterms.description``: indexed as extra
 544    document text (but not stored in the sample)
 545  * ``description``: by default, handled as ``keywords``, as of Omega 1.4.4.
 546    If ``omindex`` is run with ``--sample=description``, then this is used as
 547    the preferred source for the stored sample of document text (HTML documents
 548    with no ``description`` fall back to a sample from the body; if
 549    ``description`` occurs multiple times then second and subsequent are handled
 550    as ``keywords``).  In Omega 1.4.2 and earlier, ``--sample`` wasn't supported
 551    and the behaviour was as if ``--sample=description`` had been specified.  In
 552    Omega 1.4.3, ``--sample`` was added, but the default was
 553    ``--sample=description`` (contrary to the intended and documented behaviour)
 554    - you can use ``--sample=body`` with 1.4.3 and later to store a sample from
 555    the document body.
 556
 557 The HTML parser will look for the 'robots' META tag, and won't index pages
 558 which are marked as ``noindex`` or ``none``, for example any of the following::
 559
 560     <meta name="robots" content="noindex,nofollow">
 561     <meta name="robots" content="noindex">
 562     <meta name="robots" content="none">
 563
 564 The ``omindex`` option ``--ignore-exclusions`` disables this behaviour, so
 565 the files with the above will be indexed anyway.
 566
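A check along these lines can be sketched with Python's standard
``html.parser`` module.  This is not Omega's HTML parser: the names
``RobotsCheck`` and ``may_index`` are hypothetical, and splitting the
``content`` value on commas and ignoring case are assumptions.

```python
from html.parser import HTMLParser


class RobotsCheck(HTMLParser):
    """Record whether a robots META tag forbids indexing."""

    def __init__(self):
        super().__init__()
        self.indexing_allowed = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() != "robots":
            return
        content = attrs.get("content") or ""
        directives = {d.strip().lower() for d in content.split(",")}
        if directives & {"noindex", "none"}:
            self.indexing_allowed = False


def may_index(html, ignore_exclusions=False):
    """Return False if the page asks not to be indexed."""
    if ignore_exclusions:            # models omindex --ignore-exclusions
        return True
    checker = RobotsCheck()
    checker.feed(html)
    return checker.indexing_allowed


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print(may_index(page))
print(may_index(page, ignore_exclusions=True))
```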
 567 Sometimes it is useful to be able to exclude just part of a page from being
 568 indexed (for example you may not want to index navigation links, or a footer
 569 which appears on every page).  To allow this, the parser supports "magic"
 570 comments to mark sections of the document to not index.  Two formats are
 571 supported - htdig_noindex (used by ht://Dig) and UdmComment (used by
 572 mnoGoSearch)::
 573
 574     Index this bit <!--htdig_noindex-->but <b>not</b> this<!--/htdig_noindex-->
 575
 576 ::
 577
 578     <!--UdmComment--><div>Boring copyright notice</div><!--/UdmComment-->
 579
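The effect of these comments can be approximated with a regular expression, as
in the Python sketch below.  Omega's parser handles the comments while parsing
the HTML rather than with a regex, so this is only an illustration; the names
``NOINDEX`` and ``strip_noindex`` are hypothetical, and an unterminated section
is simply left in place here.

```python
import re

# Match an opening magic comment, everything up to the matching closing
# comment of the same kind, and the closing comment itself.
NOINDEX = re.compile(
    r"<!--(htdig_noindex|UdmComment)-->.*?<!--/\1-->",
    re.DOTALL,
)


def strip_noindex(html):
    """Remove sections marked as not to be indexed."""
    return NOINDEX.sub("", html)


print(strip_noindex(
    "Index this bit <!--htdig_noindex-->but <b>not</b> this<!--/htdig_noindex-->"
))
```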
 580 Boolean terms
 581 =============
 582
 583 omindex will create the following boolean terms when it indexes a
 584 document:
 585
 586 E
 587     Extension of the file (e.g. `Epdf`) [since Omega 1.2.5]
 588 T
 589     MIME type
 590
J
    The base URL, omitting any trailing slash (so if the base URL was just
    `/`, the term is just `J`).  If the resulting term would be > 240
    bytes, it's hashed in the same way as `U` prefix terms are.  Mnemonic: the
    Jumping-off point. [since Omega 1.3.4]
 596 H
 597     hostname of site (if supplied - this term won't exist if you index a
 598     site with base URL '/press', for instance).  Since Omega 1.3.4, if the
 599     resulting term would be > 240 bytes, it's hashed in the same way as `U`
 600     prefix terms are.
 601 P
 602     path terms - one term for the directory which the document is in, and one
 603     for each parent directory, with no trailing slashes [since Omega 1.3.4 -
 604     in earlier versions, there was just one `P` term for the path of the site
 605     (i.e. the rest of the site base URL) - this will be amongst the terms Omega
 606     1.3.4 adds].  Since Omega 1.3.4, if the resulting term would be > 240 bytes,
 607     it's hashed in the same way as `U` prefix terms are.
 608 U
 609     full URL of indexed document - if the resulting term would be > 240 bytes,
 610     a hashing scheme is used to avoid overflowing Xapian's term length limit.
 611
 612 If the ``--date-terms`` option is used, then the following additional boolean
 613 terms are added to documents (prior to Omega 1.5.0 they were added unless the
 614 ``--no-date-terms`` option was used; this option was added in 1.4.22, and
 615 before that they were unconditionally added):
 616
 617 D
 618     date (numeric format: YYYYMMDD)
 619
 620     date can also have the magical form "latest" - a document indexed
 621     by the term Dlatest matches any date-range without an end date.
 622     You can index dynamic documents which are always up to date
 623     with Dlatest and they'll match as expected.  (If you use sort by date,
 624     you'll probably also want to set the value containing the timestamp to
 625     a "max" value so dynamic documents match a date in the far future.)
 626 M
 627     month (numeric format: YYYYMM)
 628 Y
 629     year (four digits)
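
If you add documents yourself rather than via omindex, the three date terms
can be derived from a single date.  The sketch below shows one way to do so in
Python; the function name is hypothetical, and only the term formats
(``D`` + YYYYMMDD, ``M`` + YYYYMM, ``Y`` + YYYY) are taken from the list above.

```python
from datetime import date


def date_terms(d):
    """Return the D, M and Y boolean terms for the date d."""
    ymd = f"{d.year:04d}{d.month:02d}{d.day:02d}"
    # M and Y are prefixes of the YYYYMMDD form.
    return ["D" + ymd, "M" + ymd[:6], "Y" + ymd[:4]]
```

For example, ``date_terms(date(2024, 3, 7))`` gives ``D20240307``, ``M202403``
and ``Y2024``.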
 630
 631 omega configuration
 632 ===================
 633
 634 Most of the omega CGI configuration is done dynamically, by setting CGI
 635 parameters.  However, some things must be configured using a
 636 configuration file.  The configuration file is searched for in
 637 various locations:
 638
 639 - Firstly, if the "OMEGA_CONFIG_FILE" environment variable is
 640   set, its value is used as the full path to a configuration file
 641   to read.
 642 - Next (if the environment variable is not set, or the file pointed
 643   to is not present), the file "omega.conf" in the same directory as
 644   the Omega CGI is used.
 645 - Next (if neither of the previous steps found a file), the file
 646   "${sysconfdir}/omega.conf" (e.g. /etc/omega.conf on Linux systems)
 647   is used.
 648 - Finally, if no configuration file is found, default values are used.
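
The search order above can be sketched in Python as follows.  This is an
illustration only, not Omega's own code: the function name and its parameters
are hypothetical, and ``sysconfdir`` stands for whatever value was chosen when
Omega was built.

```python
import os


def find_omega_config(environ, cgi_dir, sysconfdir="/etc"):
    """Return the path of the first configuration file found, or None.

    None means no file was found, so default values would be used.
    """
    candidates = []
    env_path = environ.get("OMEGA_CONFIG_FILE")
    if env_path:
        candidates.append(env_path)
    # Fall back to omega.conf beside the CGI, then in sysconfdir.
    candidates.append(os.path.join(cgi_dir, "omega.conf"))
    candidates.append(os.path.join(sysconfdir, "omega.conf"))
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

A caller might pass ``os.environ`` and the directory containing the CGI
binary; both are assumptions about how such a helper would be used.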
 649
 650 The format of the file is very simple: a line per option, with the
 651 option name followed by its value, separated by whitespace.  Blank
 652 lines are ignored.  If the first non-whitespace character on a line
 653 is a '#', omega treats the line as a comment and ignores it.
 654
 655 The current options are:
 656
 657 - `database_dir`: the directory containing all the Omega databases
 658 - `template_dir`: the directory containing the OmegaScript templates
 659 - `log_dir`: the directory which the OmegaScript `$log` command writes log
 660   files to
 661 - `cdb_dir`: the directory which the OmegaScript `$lookup` command
 662   looks for CDB files in
 663
 664 The default values (used if no configuration file is found) are::
 665
 666  database_dir /var/lib/omega/data
 667  template_dir /var/lib/omega/templates
 668  log_dir /var/log/omega
 669  cdb_dir /var/lib/omega/cdb
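
To illustrate the file format, here is a minimal Python sketch of a reader for
it.  The names are hypothetical, and two behaviours are assumptions rather than
documented facts: options missing from the file keep the default values listed
above, and lines with a name but no value are skipped.

```python
DEFAULTS = {
    "database_dir": "/var/lib/omega/data",
    "template_dir": "/var/lib/omega/templates",
    "log_dir": "/var/log/omega",
    "cdb_dir": "/var/lib/omega/cdb",
}


def parse_omega_conf(text):
    """Parse configuration text into a dict of option name to value."""
    options = dict(DEFAULTS)
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines and '#' comment lines are ignored.
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split(None, 1)
        if len(parts) == 2:  # assumption: skip lines with no value
            name, value = parts
            options[name] = value
    return options
```

This sketch accepts any option name; it does not try to reproduce how Omega
itself treats unknown options.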
 670
 671 Note that, with Apache, environment variables may be set using mod_env, and
 672 with Apache 1.3.7 or later this may be used inside a .htaccess file.  This
 673 makes it reasonably easy to share a single system-installed copy of Omega
 674 between multiple users.
 675
 676 Supplied Templates
 677 ==================
 678
 679 The OmegaScript templates supplied with Omega are:
 680
 681 * query - This is the default template, providing a typical Web search
 682   interface.
 683 * topterms - This is just like query, but provides a "top terms" feature
 684   which suggests terms the user might want to add to their query to
 685   obtain better results.
 686 * godmode - Allows you to inspect a database showing which terms index
 687   each document, and which documents are indexed by each term.
 688 * opensearch - Provides results in OpenSearch format (for more details
 689   see http://www.opensearch.org/).
 690 * xml - Provides results in a custom XML format.
 691 * emptydocs - Shows a list of documents with zero length.  If CGI parameter
 692   TERM is set to a non-empty value, then only documents indexed by that given
 693   term are shown (e.g. TERM=Tapplication/pdf to show PDF files with no text);
 694   otherwise all zero length documents are shown.
 695
 696 There are also "helper fragments" used by the templates above:
 697
 698 * inc/anyalldropbox - Provides a choice of matching "any" or "all" terms
 699   by default as a drop down box.
 700 * inc/anyallradio - Provides a choice of matching "any" or "all" terms
 701   by default as radio buttons.
 702 * toptermsjs - Provides some JavaScript used by the topterms template.
 703
 704 Document data construction
 705 ==========================
 706
 707 This is only useful if you need to inject your own documents into the
 708 database independently of omindex.  For example, you might be indexing
 709 dynamically-generated documents which are served using a server-side
 710 system such as PHP or ASP, but whose contents you can determine in some
 711 way, such as documents generated from reasonably static database
 712 contents.
 713
 714 The document data field stores some summary information about the
 715 document, in the following (sample) format::
 716
 717  url=<baseurl>
 718  sample=<sample>
 719  caption=<title>
 720  type=<mimetype>
 721
 722 Further fields may be added (although omindex doesn't currently add any
 723 others), and may be looked up from OmegaScript using the $field{}
 724 command.
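
As a sketch of this format, the Python below builds a document data string
from named fields and looks a field up again.  The function names are
hypothetical, and the lookup only approximates what ``$field{}`` does; it
assumes that field values contain no newlines.  The resulting string is what
you would store as the document data (for example with Xapian's
``Document::set_data()``).

```python
def make_document_data(fields):
    """Join name=value pairs, one per line, in the order given."""
    return "\n".join(f"{name}={value}" for name, value in fields.items())


def get_field(data, name):
    """Return the value of the first line starting 'name=', or ''."""
    for line in data.split("\n"):
        key, sep, value = line.partition("=")
        if sep and key == name:
            return value
    return ""
```

Extra fields can be added by including more entries in the mapping passed to
``make_document_data()``.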
 725
 726 As of Omega 0.9.3, you can alternatively add something like this near the
 727 start of your OmegaScript template::
 728
 729  $set{fieldnames,$split{caption sample url}}
 730
 731 Then you need only give the field values in the document data, which can
 732 save a lot of space in a large database.  With the setting of fieldnames
 733 above, the first line of document data can be accessed with $field{caption},
 734 the second with $field{sample}, and the third with $field{url}.
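
The compact form can be sketched in the same way.  The Python below assumes
the ``fieldnames`` setting shown above (caption, sample, url, in that order);
the function names are hypothetical and the lookup is only an approximation of
how ``$field{}`` resolves names by line position.

```python
# Must match the order given to $set{fieldnames,...} in the template.
FIELDNAMES = "caption sample url".split()


def make_compact_data(fields):
    """Store only the values, one per line, in FIELDNAMES order."""
    return "\n".join(fields.get(name, "") for name in FIELDNAMES)


def get_compact_field(data, name):
    """Return the line of data at the position of name in FIELDNAMES."""
    lines = data.split("\n")
    index = FIELDNAMES.index(name)
    return lines[index] if index < len(lines) else ""
```

Because the names are no longer stored in each document, the order of the
lines must stay in step with the template's ``fieldnames`` setting.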
 735
 736 Stopword List
 737 =============
 738
 739 At search time, Omega uses a built-in list of stopwords, which are::
 740
 741     a about an and are as at be by en for from how i in is it of on or that the
 742     this to was what when where which who why will with you your