xapian-applications/omega/docs/overview.rst

   1 ==============
   2 Omega overview
   3 ==============
   4
   5 If you just want a very quick overview, you might prefer to read the
   6 `quick-start guide <quickstart.html>`_.
   7
   8 Omega operates on a set of databases.  Each database is created and updated
   9 separately using either omindex or `scriptindex <scriptindex.html>`_.  You can
  10 search these databases (or any other Xapian database with suitable contents)
  11 via a web front-end provided by omega, a CGI application.  A search can also be
  12 done over more than one database at once.
  13
  14 There are separate documents covering `CGI parameters <cgiparams.html>`_, the
  15 `Term Prefixes <termprefixes.html>`_ which are conventionally used, and
  16 `OmegaScript <omegascript.html>`_, the language used to define omega's web
  17 interface.  Omega ships with several OmegaScript templates and you can
  18 use these, modify them, or just write your own.  See the "Supplied Templates"
  19 section below for details of the supplied templates.
  20
  21 Omega parses queries using the ``Xapian::QueryParser`` class.  For the supported
  22 syntax, see queryparser.html in the xapian-core documentation, which is also
  23 available online at https://xapian.org/docs/queryparser.html.
  24
  25 Term construction
  26 =================
  27
  28 Documents within an omega database are indexed by two types of terms: those
  29 used for a weighted search from a parsed query string (the CGI parameter
  30 ``P``), and those used for boolean filtering (the CGI parameters ``B`` and
  31 ``N`` - the latter is a negated variant of ``B`` and was added in Omega 1.3.5).
  32
  33 Boolean terms always start with a prefix which is an initial capital letter (or
  34 multiple capital letters if the first character is `X`) which denotes the
  35 category of the term (e.g. `M` for MIME type).
  36
  37 Parsed query terms may have a prefix, but don't always.  Those from the body of
  38 the document in unstemmed form don't; stemmed terms have a `Z` prefix; terms
  39 from other fields have a prefix to indicate the field, such as `S` for the
  40 document title; stemmed terms from a field have both prefixes, e.g. `ZS`.
  41
  42 The "english" stemmer is used by default - you can configure this for omindex
  43 and scriptindex with ``--stemmer=LANGUAGE`` (use ``--stemmer=none`` to disable
  44 stemming, see omindex ``--help`` for the list of accepted language names).  At
  45 search time you can configure the stemmer by adding ``$set{stemmer,LANGUAGE}``
  46 to the top of your OmegaScript template.
  47
  48 The two term types are used as follows when building the query:
  49
  50 The ``P`` parameter is parsed using `Xapian::QueryParser` to give a
  51 `Xapian::Query` object denoted as `P-terms` below.
  52
  53 There are two ways that ``B`` and ``N`` parameters are handled, depending on
  54 whether the term-prefix has been configured as "non-exclusive" or not.  The
  55 default is "exclusive" (and in versions before 1.3.4, this was how all ``B``
  56 parameters were handled).
  57
  58 Exclusive Boolean Prefix
  59 ------------------------
  60
  61 B(oolean) terms from 'B' parameters with the same prefix are ORed together,
  62 like so::
  63
  64
  65                     [   OR   ]
  66                    /    | ... \
  67               B(F,1) B(F,2)...B(F,n)
  68
  69 Where B(F,1) is the first boolean term with prefix F from a 'B' parameter, and
  70 so on.
  71
  72 Non-Exclusive Boolean Prefix
  73 ----------------------------
  74
  75 For example, ``$setmap{nonexclusiveprefix,K,true}`` sets prefix `K` as
  76 non-exclusive, which means that multiple filter terms from 'B' parameters will
  77 be combined with "AND" instead of "OR", like so::
  78
  79                     [   AND   ]
  80                    /     | ... \
  81               B(K,1) B(K,2)... B(K,m)
  82
  83 Combining the Boolean Filters
  84 -----------------------------
  85
  86 The subqueries for each prefix from "B" parameters are combined with AND,
  87 to make this (which we refer to as "B-filter" below)::
  88
  89                          [     AND     ]
  90                         /       |  ...  \
  91                        /                 \
  92                  [   OR   ]               [   AND  ]
  93                 /    | ... \             /    | ... \
  94            B(F,1) B(F,2)...B(F,n)   B(K,1) B(K,2)...B(K,m)
  95
  96
  97 Negated Boolean Terms
  98 ---------------------
  99
 100 All the terms from all 'N' parameters are combined together with "OR", to
 101 make this (which we refer to as "N-filter" below)::
 102
 103                     [       OR       ]
 104                    / ... |     |  ... \
 105               N(F,1)...N(F,n) N(K,1)...N(K,m)
 106
 107 Putting it all together
 108 -----------------------
 109
 110 The P-terms are filtered by the B-filter using "FILTER" and by the N-filter
 111 using "AND_NOT"::
 112
 113                         [ AND_NOT ]
 114                        /           \
 115                       /             \
 116             [ FILTER ]             N-filter
 117              /      \
 118             /        \
 119        P-terms      B-filter
 120
 121 The intent here is to allow filtering on arbitrary (and, typically,
 122 orthogonal) characteristics of the document. For instance, by adding
 123 boolean terms "Ttext/html", "Ttext/plain" and "J/press" you would be
 124 filtering the parsed query to only retrieve documents that are both in
 125 the "/press" site *and* which are either of MIME type text/html or
 126 text/plain. (See below for more information about sites.)
 127
 128 If the B-filter or the N-filter is absent, that part of the query is simply omitted.
 129
 130 If there is no parsed query, the boolean filter is promoted to
 131 be the query, and the weighting scheme is set to boolean.  This has
 132 the effect of applying the boolean filter to the whole database.  If
 133 there are only N-terms, then ``Query::MatchAll`` is used for the left
 134 side of the "AND_NOT".
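
The combination rules in this section can be modelled in a few lines of
Python.  This is only an illustrative sketch, not Omega's implementation: the
query tree is represented as nested tuples rather than ``Xapian::Query``
objects, and the names ``term_prefix``, ``combine`` and ``build_query`` are
hypothetical.  Sorting the prefixes is also an assumption made purely to keep
the output deterministic.

```python
from collections import defaultdict


def term_prefix(term):
    """Return a term's prefix: one capital letter, or X plus further capitals."""
    end = 1
    if term.startswith("X"):
        while end < len(term) and "A" <= term[end] <= "Z":
            end += 1
    return term[:end]


def combine(op, parts):
    """Return None for no parts, the part itself for one, else (op, parts)."""
    parts = list(parts)
    if len(parts) <= 1:
        return parts[0] if parts else None
    return (op, parts)


def build_query(p_query, b_terms, n_terms, nonexclusive=frozenset()):
    """Model the query tree as nested tuples; p_query may be None."""
    groups = defaultdict(list)
    for term in b_terms:
        groups[term_prefix(term)].append(term)
    # OR within an exclusive prefix, AND within a non-exclusive one.
    per_prefix = [
        combine("AND" if prefix in nonexclusive else "OR", terms)
        for prefix, terms in sorted(groups.items())
    ]
    b_filter = combine("AND", per_prefix)
    n_filter = combine("OR", n_terms)

    if p_query is not None and b_filter is not None:
        query = ("FILTER", [p_query, b_filter])
    elif p_query is not None:
        query = p_query
    elif b_filter is not None:
        query = b_filter  # the boolean filter is promoted to be the query
    elif n_filter is not None:
        query = "MatchAll"  # stands in for Query::MatchAll
    else:
        return None
    if n_filter is not None:
        query = ("AND_NOT", [query, n_filter])
    return query
```

For the example above, ``build_query("P", ["Ttext/html", "Ttext/plain",
"J/press"], [])`` yields a ``FILTER`` node whose right-hand side ANDs
``J/press`` with an OR of the two MIME type terms.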
 135
 136 In order to add more boolean prefixes, you will need to alter the
 137 ``index_file()`` function in omindex.cc. Currently omindex adds several
 138 useful ones, detailed below.
 139
 140 Parsed query terms are constructed from the title, body and keywords
 141 of a document. (Not all document types support all three areas of
 142 text.) Title terms are stored with position data starting at 0, body
 143 terms starting 100 beyond title terms, and keyword terms starting 100
 144 beyond body terms. This allows queries using positional data without
 145 causing false matches across the different types of term.
 146
 147 Sites
 148 =====
 149
 150 Within a database, Omega supports multiple sites. These are recorded
 151 using boolean terms (see 'Term construction', above) to allow
 152 filtering on them.
 153
 154 All documents within a site share a common base URL.  For instance, you
 155 might have two sites, one for your press area and one for your product
 156 descriptions:
 157
 158 - \http://example.com/press/index.html
 159 - \http://example.com/press/bigrelease.html
 160 - \http://example.com/products/bigproduct.html
 161 - \http://example.com/products/littleproduct.html
 162
 163 You could index all documents within \http://example.com/press/ using a
 164 site of '/press', and all within \http://example.com/products/ using
 165 '/products'.
 166
 167 Sites are also useful because omindex indexes documents through the
 168 file system, not by fetching from the web server. If you don't have a
 169 URL to file system mapping which puts all documents under one
 170 hierarchy, you'll need to index each separate section as a site.
 171
 172 An obvious example of this is the way that many web servers map URLs
 173 of the form <\http://example.com/~<username>/> to a directory within
 174 that user's home directory (such as ~<username>/pub on a Unix
 175 system). In this case, you can index each user's home page separately,
 176 as a site of the form '/~<username>'. You can then use boolean
 177 filters to allow people to search only a specific home page (or a
 178 group of them), or omit such terms to search everyone's pages.
 179
 180 Note that the site specified when you index is used to build the
 181 complete URL that the results page links to. Thus while sites will
 182 typically want to be relative to the hostname part of the URL (e.g.
 183 '/site' rather than '\http://example.com/site'), you can use them
 184 to have a single search across several different hostnames. This will
 185 still work if you actually store each distinct hostname in a different
 186 database.
 187
 188 omindex operation
 189 =================
 190
 191 omindex is fairly simple to use, for example::
 192
 193   omindex --db default --url http://example.com/ /var/www/example.com
 194
 195 For a full list of command line options supported, see ``man omindex``
 196 or ``omindex --help``.
 197
 198 You *must* specify the database to index into (it's created if it doesn't
 199 exist, but parent directories must exist).  You will often also want to specify
 200 the base URL (which is used as the site, and can be relative to the hostname -
 201 starts with '/' - or absolute - starts with a scheme, e.g.
 202 '\http://example.com/products/').  If not specified, the base URL defaults to
 203 ``/``.
 204
 205 You also need to tell omindex which directory to index. This should be
 206 either a single directory (in which case it is taken to be the
 207 directory base of the entire site being indexed), or two arguments,
 208 the first being the directory base of the site being indexed, and the
 209 second being a relative directory within that to index.
 210
 211 For instance, in the example above, if you separate your products by
 212 size, you might end up with:
 213
 214 - \http://example.com/press/index.html
 215 - \http://example.com/press/bigrelease.html
 216 - \http://example.com/products/large/bigproduct.html
 217 - \http://example.com/products/small/littleproduct.html
 218
 219 If the entire website is stored in the file system under the directory
 220 /www/example, then you would probably index the site in two
 221 passes, one for the '/press' site and one for the '/products' site. You
 222 might use the following commands::
 223
 224 $ omindex -p --db /var/lib/omega/data/default --url /press /www/example/press
 225 $ omindex -p --db /var/lib/omega/data/default --url /products /www/example/products
 226
 227 If you add a new large product, but don't want to reindex the whole of
 228 the products section, you could do::
 229
 230 $ omindex -p --db /var/lib/omega/data/default --url /products /www/example/products large
 231
 232 and just the large products will be reindexed. You need to do it like that, and
 233 not as::
 234
 235 $ omindex -p --db /var/lib/omega/data/default --url /products/large /www/example/products/large
 236
 237 because that would make the large products part of a new site,
 238 '/products/large', which is unlikely to be what you want, as large
 239 products would no longer come up in a search of the products
 240 site. (Note that the ``--depth-limit`` option may come in handy if you have
 241 sites '/products' and '/products/large', or similar.)
 242
 243 omindex has built-in support for indexing HTML, PHP, text files, CSV
 244 (Comma-Separated Values) files, SVG, Atom feeds, and AbiWord documents.  It can
 245 also index a number of other formats using external programs.  Filter programs
 246 are run with CPU, time and memory limits to prevent a runaway filter from
 247 blocking indexing of other files.
 248
 249 The way omindex decides how to index a file is based around MIME content-types.
 250 First of all omindex will look up a file's extension in its extension to MIME
 251 type map.  If there's no entry, it will then ask libmagic to examine the
 252 contents of the file and try to determine a MIME type.
 253
 254 The following formats are supported as standard (you can tell omindex to use
 255 other filters too - see below):
 256
 257 * HTML (.html, .htm, .shtml, .shtm, .xhtml, .xhtm)
 258 * PHP (.php) - our HTML parser knows to ignore PHP code
 259 * text files (.txt, .text)
 260 * SVG (.svg)
 261 * CSV (Comma-Separated Values) files (.csv)
 262 * PDF (.pdf) if pdftotext is available (comes with poppler or xpdf)
 263 * PostScript (.ps, .eps, .ai) if ps2pdf (from ghostscript) and pdftotext (comes
 264   with poppler or xpdf) are available
 265 * OpenOffice/StarOffice documents (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm,
 266   .sxw, .sxg, .stw) if unzip is available
 267 * OpenDocument format documents (.odt, .ods, .odp, .odg, .odc, .odf, .odb,
 268   .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) if unzip is
 269   available
 270 * MS Word documents (.dot) if antiword is available (.doc files are left to
 271   libmagic, as they may actually be RTF (AbiWord saves RTF when asked to save
 272   as .doc, and Microsoft Word quietly loads RTF files with a .doc extension),
 273   or plain-text).
 274 * MS Excel documents (.xls, .xlb, .xlt, .xlr, .xla) if xls2csv is available
 275   (comes with catdoc)
 276 * MS PowerPoint documents (.ppt, .pps) if catppt is available (comes with
 277   catdoc)
 278 * MS Office 2007 documents (.docx, .docm, .dotx, .dotm, .xlsx, .xlsm, .xltx,
 279   .xltm, .pptx, .pptm, .potx, .potm, .ppsx, .ppsm) if unzip is available
 280 * WordPerfect documents (.wpd) if wpd2text is available (comes with libwpd)
 281 * MS Works documents (.wps, .wpt) if wps2text is available (comes with libwps)
 282 * MS Outlook message (.msg) if perl with Email::Outlook::Message and
 283   HTML::Parser modules is available
 284 * MS Publisher documents (.pub) if pub2xhtml is available (comes with libmspub)
 285 * MS Visio documents (.vsd, .vss, .vst, .vsw, .vsdx, .vssx, .vstx, .vsdm,
 286   .vssm, .vstm) if vsd2xhtml is available (comes with libvisio)
 287 * AbiWord documents (.abw)
 288 * Compressed AbiWord documents (.zabw)
 289 * Rich Text Format documents (.rtf) if unrtf is available
 290 * Perl POD documentation (.pl, .pm, .pod) if pod2text is available
 291 * reStructuredText (.rst, .rest) if rst2html is available (comes with
 292   docutils)
 293 * Markdown (.md, .markdown) if markdown is available
 294 * TeX DVI files (.dvi) if catdvi is available
 295 * DjVu files (.djv, .djvu) if djvutxt is available
 296 * XPS files (.xps) if unzip is available
 297 * Debian packages (.deb, .udeb) if dpkg-deb is available
 298 * RPM packages (.rpm) if rpm is available
 299 * Atom feeds (.atom)
 300 * MAFF (.maff) if unzip is available
 301 * MHTML (.mhtml, .mht) if perl with MIME::Tools is available
 302 * MIME email messages (.eml) and USENET articles if perl with MIME::Tools and
 303   HTML::Parser is available
 304 * vCard files (.vcf, .vcard) if perl with Text::vCard is available
 305
 306 If you have additional extensions that represent one of these types, you can
 307 add an additional MIME mapping using the ``--mime-type`` option.  For
 308 instance, if your press releases are PostScript files with extension
 309 ``.posts`` you can tell omindex this like so::
 310
 311 $ omindex --db /var/lib/omega/data/default --url /press /www/example/press --mime-type posts:application/postscript
 312
 313 The syntax of ``--mime-type`` is 'ext:type', where ext is the extension of
 314 a file of that type (everything after the last '.').  The ``type`` can be any
 315 string, but to be useful there needs to be a filter set for that type,
 316 either using ``--filter`` or by ``type`` being understood by default:
 317
 318 .. include:: inc/mimetypes.rst
 319
 320 You can specify ``*`` as the MIME sub-type for ``--filter``, for example if you
 321 have a filter you want to apply to any video files, you could specify it using
 322 ``--filter 'video/*:index-video-file'``.  Note that this is checked right after
 323 checking for the exact MIME type, so will override any built-in filters which
 324 would otherwise match.  Also you can't use arbitrary wildcards, just ``*`` for
 325 the entire sub-type.  And be careful to quote ``*`` to protect it from the
 326 shell.  Support for this was added in 1.3.3.
 327
 328 If there's no specific filter, and no subtype wildcard, then ``*/*`` is checked
 329 (assuming the mimetype contains a ``/``), and after that ``*`` (for any
 330 mimetype string).  Combined with filter command ``true`` for indexing by
 331 meta-data only, you can specify a fallback case of indexing by meta-data
 332 only using ``--filter '*:true'``.  Support for this was added in 1.3.4.
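
The lookup order for ``--filter`` entries described above (exact type, then
sub-type wildcard, then ``*/*``, then ``*``) can be sketched as follows.  This
models only the order of the lookups; the ``filters`` dictionary and the
function name ``find_filter`` are hypothetical, and omindex's handling of its
built-in filters is not represented.

```python
def find_filter(mimetype, filters):
    """Return the command registered for mimetype, or None if nothing matches."""
    candidates = [mimetype]
    if "/" in mimetype:
        main_type = mimetype.split("/", 1)[0]
        candidates.append(main_type + "/*")  # sub-type wildcard, e.g. video/*
        candidates.append("*/*")
    candidates.append("*")  # matches any mimetype string
    for key in candidates:
        if key in filters:
            return filters[key]
    return None
```

With ``filters = {"video/*": "index-video-file", "*": "true"}``, a file of
type ``video/mp4`` would use ``index-video-file`` and every other type would
fall back to ``true``.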
 333
 334 There are also two special values that can be specified instead of a MIME
 335 type:
 336
 337 * ignore - tells omindex to quietly ignore such files
 338 * skip - tells omindex to skip such files
 339
 340 By default no extensions are marked as "skip", and the following extensions are
 341 marked as "ignore":
 342
 343 .. include:: inc/ignored.rst
 344
 345 If you wish to remove a MIME mapping, you can do this by omitting the type -
 346 for example if you have ``.dot`` files which are inputs for the graphviz
 347 tool ``dot``, then you may wish to remove the default mapping for ``.dot``
 348 files and let libmagic be used to determine their type, which you can do
 349 using: ``--mime-type=dot:`` (if you want to *ignore* all ``.dot`` files,
 350 instead use ``--mime-type=dot:ignore``).
 351
 352 The lookup of extensions in the MIME mappings is case sensitive, but if an
 353 extension isn't found and includes upper case ASCII letters, they're converted
 354 to lower case and the lookup is repeated, so you effectively get case
 355 insensitive lookup for mappings specified with a lower-case extension, but
 356 you can set different handling for differently cased variants if you need
 357 to.
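
A minimal sketch of this two-step lookup, assuming the mappings are held in a
plain dictionary (``mime_map`` and ``lookup_mime_type`` are hypothetical names
used only for illustration):

```python
import string

# Only ASCII upper case letters are folded, as described above.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def lookup_mime_type(ext, mime_map):
    """Case-sensitive lookup first, then retry with ASCII letters lowered."""
    if ext in mime_map:
        return mime_map[ext]
    lowered = ext.translate(_ASCII_LOWER)
    if lowered != ext:
        return mime_map.get(lowered)
    return None  # no mapping found
```

So a mapping for ``pdf`` also covers ``PDF`` and ``Pdf``, while an explicit
mapping for ``PS`` takes precedence over one for ``ps`` when the file's
extension is ``PS``.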
 358
 359 You can add support for additional MIME content types (or override existing
 360 ones) using the ``--filter`` option to specify a command to run.  At present,
 361 this command needs to produce output in either HTML, SVG, or plain text format
 362 (as of 1.3.3, you can specify the character encoding that the output will be
 363 in; in earlier versions, plain text output had to be UTF-8).  Support for SVG
 364 output from external commands was added in 1.4.8.
 365
 366 As of 1.3.3, the command can include certain placeholders which are substituted
 367 by omindex:
 368
 369 * Any ``%f`` in this command will be replaced with the filename of the file to
 370   extract (suitably escaped to protect it from the shell, so don't put quotes
 371   around ``%f``).
 372
 373   If you don't include ``%f`` in the command, then the filename of the file to
 374   be extracted will be appended to the command, separated by a space.
 375
 376 * Any ``%t`` in this command will be replaced with a filename in a temporary
 377   directory (suitably escaped to protect it from the shell, so don't put
 378   quotes around ``%t``).  The extension of this filename will reflect the
 379   expected output format (either ``.html``, ``.svg`` or ``.txt``).  If you
 380   don't use ``%t`` in the command, then omindex will expect output on
 381   ``stdout`` (prior to 1.3.3, output had to be on ``stdout``).
 382
 383 * ``%%`` can be used should you need a literal ``%`` in the command.
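
The substitution rules above can be illustrated with a short Python sketch.
It is not omindex's code: ``expand_filter_command`` is a hypothetical name,
``shlex.quote`` stands in for whatever shell escaping omindex performs, and
leaving unrecognised ``%`` sequences untouched is an assumption.

```python
import re
import shlex


def expand_filter_command(command, filename, tmpfile):
    """Substitute %f, %t and %% in a filter command template."""
    saw_filename = False

    def replace(match):
        nonlocal saw_filename
        code = match.group(1)
        if code == "%":
            return "%"
        if code == "f":
            saw_filename = True
            return shlex.quote(filename)
        if code == "t":
            return shlex.quote(tmpfile)
        return match.group(0)  # assumption: unknown sequences are left alone

    expanded = re.sub(r"%(.)", replace, command)
    if not saw_filename:
        # No %f in the template: append the filename, separated by a space.
        expanded += " " + shlex.quote(filename)
    return expanded
```

For instance, the template ``foo2utf16 %f %t`` with the input file
``my file.foo`` expands to a command in which the filename is quoted as a
single shell word.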
 384
 385 For example, if you'd prefer to use AbiWord to extract text from Word documents
 386 (by default, omindex uses antiword), then you can pass the option
 387 ``--filter=application/msword:'abiword --to=txt --to-name=fd://1'`` to
 388 omindex.
 389
 390 Another example - if you wanted to handle files of MIME type
 391 ``application/octet-stream`` by running them through ``strings -n8``, you can
 392 pass the option ``--filter=application/octet-stream:'strings -n8'``.
 393
 394 A more complex example: to process ``.foo`` files with the (fictional)
 395 ``foo2utf16`` utility which produces UTF-16 text but doesn't support writing
 396 output to stdout, run omindex with ``-Mfoo:text/x-foo
 397 -Ftext/x-foo,,utf-16:'foo2utf16 %f %t'``.
 398
 399 A less contrived example of the use of ``--filter`` makes use of LibreOffice,
 400 via the unoconv script, to extract text from various formats.  First you
 401 need to start a listening instance (if you don't, unoconv will start up
 402 LibreOffice for every file, which is rather inefficient) - the ``&`` tells
 403 the shell to run it in the background::
 404
 405   unoconv --listener &
 406
 407 Then run omindex with options such as
 408 ``--filter=application/msword,html:'unoconv --stdout -f html'`` (you'll want
 409 to repeat this for each format which you want to use LibreOffice on).
 410
 411 If you specify ``false`` as the command in ``--filter``, omindex will skip
 412 files with the specified MIME type.  (As of 1.2.20 and 1.3.3 ``false`` is
 413 explicitly checked for; in earlier versions this will also work, at least
 414 on Unix where ``false`` is a command which ignores its arguments and exits with
 415 a non-zero status.)
 416
 417 If you specify ``true`` as the command in ``--filter``, omindex won't try
 418 to extract text from the file, but will index it such that it can be searched
 419 for via metadata which comes from the file system (filename, extension, MIME
 420 content-type, last modified time, size).  (As of 1.2.22 and 1.3.4 ``true`` is
 421 explicitly checked for; in earlier versions this will also work, at least
 422 on Unix where ``true`` is a command which ignores its arguments and exits with
 423 a status of zero.)
 424
 425 If you know of a reliable filter which can extract text from a file format
 426 which might be of interest to others, please let us know so we can consider
 427 including it as a standard filter.
 428
 429 The ``--duplicates`` option controls how omindex handles documents which map
 430 to a URL which is already in the database.  The default (which can be
 431 explicitly set with ``--duplicates=replace``) is to reindex if the last
 432 modified time of the file is newer than that recorded in the database.
 433 The alternative is ``--duplicates=ignore``, which will never reindex an
 434 existing document.  If you only add documents, this avoids the overhead
 435 of checking the last modified time.  It also allows you to prioritise
 436 adding completely new documents to the database over updating existing ones.
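
The decision described above amounts to a small rule, sketched here under the
assumption that the stored last modified time is available for documents
already in the database (``should_reindex`` and its parameters are
hypothetical names, not part of omindex):

```python
def should_reindex(file_mtime, stored_mtime, duplicates="replace"):
    """Decide whether a file should be (re)indexed.

    stored_mtime is None when the URL is not yet in the database.
    """
    if stored_mtime is None:
        return True  # a completely new document is always indexed
    if duplicates == "ignore":
        return False  # never reindex an existing document
    if duplicates == "replace":
        return file_mtime > stored_mtime  # reindex only if the file is newer
    raise ValueError("unknown --duplicates value: " + duplicates)
```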
 437
 438 By default, omindex will remove any document in the database which has a URL
 439 that doesn't correspond to a file seen on disk - in other words, it will clear
 440 out everything that doesn't exist any more.  However if you are building up
 441 an omega database with several runs of omindex, this is not
 442 appropriate (as each run would delete the data from the previous run),
 443 so you should use the ``--no-delete`` option.  Note that if you
 444 choose to work like this, it is impossible to prune old documents from
 445 the database using omindex. If this is a problem for you, an
 446 alternative is to index each subsite into a different database, and
 447 merge all the databases together when searching.
 448
 449 ``--depth-limit`` allows you to prevent omindex from descending more than
 450 a certain number of directories.  Specifying ``--depth-limit=0`` means no limit
 451 is imposed on recursion; ``--depth-limit=1`` means don't descend into any
 452 subdirectories of the start directory.
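
As a rough illustration of these semantics, the following sketch walks a
directory tree with a depth limit.  The function name is hypothetical, and the
treatment of limits greater than 1 (each increment allows one more level of
subdirectories) is an assumption extrapolated from the two cases stated above.

```python
import os


def walk_with_depth_limit(start, depth_limit=0):
    """Yield paths of files under start; depth_limit=0 means no limit."""

    def walk(directory, depth):
        with os.scandir(directory) as entries:
            for entry in sorted(entries, key=lambda e: e.name):
                if entry.is_dir(follow_symlinks=False):
                    # The start directory is depth 1, so a limit of 1
                    # never descends into its subdirectories.
                    if depth_limit == 0 or depth < depth_limit:
                        yield from walk(entry.path, depth + 1)
                elif entry.is_file():
                    yield entry.path

    yield from walk(start, 1)
```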
 453
 454 Tracking files which couldn't be indexed
 455 ----------------------------------------
 456
 457 In older versions, omindex only tracked files which it successfully indexed -
 458 if a file couldn't be read, or a filter program failed on it, or it was marked
 459 not to be indexed (e.g. with an HTML meta tag) then it would be retried on
 460 subsequent runs.  Starting from version 1.3.4, omindex now tracks failed
 461 files in the user metadata of the database, along with their sizes and last
 462 modified times, and uses this data to skip files which previously failed and
 463 haven't changed since.
 464
 465 You can force omindex to retry such files using the ``--retry-failed`` option.
 466 One situation in which this is useful is if you've upgraded a filter program
 467 to a newer version which you suspect will index some files which previously
 468 failed.
 469
 470 Currently there's no mechanism for automatically removing failure entries
 471 when the file they refer to is removed or renamed.  These lingering entries are
 472 harmless, except they bloat the database a little.  A simple way to clear them
 473 out is to run periodically with ``--retry-failed`` as this removes any existing
 474 failure entries before indexing starts.
 475
 476 HTML Parsing
 477 ============
 478
 479 The document ``<title>`` tag is used as the document title.  Metadata in various
 480 ``<meta>`` tags is also understood - these values of the ``name`` parameter are
 481 currently handled when found:
 482
 483  * ``author``, ``dcterms.creator``, ``dcterms.contributor``: author(s)
 484  * ``created``, ``dcterms.issued``: document creation date
 485  * ``classification``: document topic
 486  * ``keywords``, ``dcterms.subject``, ``dcterms.description``: indexed as extra
 487    document text (but not stored in the sample)
 488  * ``description``: by default, handled as ``keywords``, as of Omega 1.4.4.
 489    If ``omindex`` is run with ``--sample=description``, then this is used as
 490    the preferred source for the stored sample of document text (HTML documents
 491    with no ``description`` fall back to a sample from the body; if
 492    ``description`` occurs multiple times then second and subsequent are handled
 493    as ``keywords``).  In Omega 1.4.2 and earlier, ``--sample`` wasn't supported
 494    and the behaviour was as if ``--sample=description`` had been specified.  In
 495    Omega 1.4.3, ``--sample`` was added, but the default was
 496    ``--sample=description`` (contrary to the intended and documented behaviour)
 497    - you can use ``--sample=body`` with 1.4.3 and later to store a sample from
 498    the document body.
 499
 500 The HTML parser will look for the 'robots' META tag, and won't index pages
 501 which are marked as ``noindex`` or ``none``, for example any of the following::
 502
 503     <meta name="robots" content="noindex,nofollow">
 504     <meta name="robots" content="noindex">
 505     <meta name="robots" content="none">
 506
 507 The ``omindex`` option ``--ignore-exclusions`` disables this behaviour, so
 508 the files with the above will be indexed anyway.
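
The check can be illustrated with Python's standard ``html.parser`` module.
This is a sketch of the rule as described here, not Omega's own HTML parser;
the class and function names are hypothetical, and matching the tag name and
directives case-insensitively is an assumption.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether a robots META tag forbids indexing."""

    def __init__(self):
        super().__init__()
        self.indexing_allowed = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        directives = {d.strip().lower() for d in content.split(",")}
        if directives & {"noindex", "none"}:
            self.indexing_allowed = False


def should_index(html, ignore_exclusions=False):
    """Return False if the page opts out, unless exclusions are ignored."""
    if ignore_exclusions:
        return True
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.indexing_allowed
```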
 509
 510 Sometimes it is useful to be able to exclude just part of a page from being
 511 indexed (for example you may not want to index navigation links, or a footer
 512 which appears on every page).  To allow this, the parser supports "magic"
 513 comments to mark sections of the document to not index.  Two formats are
 514 supported - htdig_noindex (used by ht://Dig) and UdmComment (used by
 515 mnoGoSearch)::
 516
 517     Index this bit <!--htdig_noindex-->but <b>not</b> this<!--/htdig_noindex-->
 518
 519 ::
 520
 521     <!--UdmComment--><div>Boring copyright notice</div><!--/UdmComment-->
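
A simplified model of how such comments suppress text, again using Python's
``html.parser`` rather than Omega's parser (the names below are hypothetical,
and treating the two comment styles as interchangeable start and end markers
is a simplification):

```python
from html.parser import HTMLParser


class NoindexTextExtractor(HTMLParser):
    """Collect text, skipping anything between noindex marker comments."""

    START = {"htdig_noindex", "UdmComment"}
    END = {"/htdig_noindex", "/UdmComment"}

    def __init__(self):
        super().__init__()
        self.indexing = True
        self.chunks = []

    def handle_comment(self, data):
        marker = data.strip()
        if marker in self.START:
            self.indexing = False
        elif marker in self.END:
            self.indexing = True

    def handle_data(self, data):
        if self.indexing:
            self.chunks.append(data)


def extract_indexable_text(html):
    """Return the text of html that lies outside noindex sections."""
    parser = NoindexTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```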
 522
 523 Boolean terms
 524 =============
 525
 526 omindex will create the following boolean terms when it indexes a
 527 document:
 528
 529 E
 530     Extension of the file (e.g. `Epdf`) [since Omega 1.2.5]
 531 T
 532     MIME type
 533
 534 J
 535     The base URL, omitting any trailing slash (so if the base URL was just
 536     `/`, the term is just `J`).  If the resulting term would be > 240
 537     bytes, it's hashed in the same way as `U` prefix terms are.  Mnemonic: the
 538     Jumping-off point. [since Omega 1.3.4]
 539 H
 540     hostname of site (if supplied - this term won't exist if you index a
 541     site with base URL '/press', for instance).  Since Omega 1.3.4, if the
 542     resulting term would be > 240 bytes, it's hashed in the same way as `U`
 543     prefix terms are.
 544 P
 545     path terms - one term for the directory which the document is in, and one
 546     for each parent directory, with no trailing slashes [since Omega 1.3.4 -
 547     in earlier versions, there was just one `P` term for the path of site (i.e.
 548     the rest of the site base URL) - this will be amongst the terms Omega 1.3.4
 549     adds].  Since Omega 1.3.4, if the resulting term would be > 240 bytes, it's
 550     hashed in the same way as `U` prefix terms are.
 551 U
 552     full URL of indexed document - if the resulting term would be > 240 bytes,
 553     a hashing scheme is used to avoid overflowing Xapian's term length limit.
 554
 555 D
 556     date (numeric format: YYYYMMDD)
 557
 558     date can also have the magical form "latest" - a document indexed
 559     by the term Dlatest matches any date-range without an end date.
 560     You can index dynamic documents which are always up to date
 561     with Dlatest and they'll match as expected.  (If you use sort by date,
 562     you'll probably also want to set the value containing the timestamp to
 563     a "max" value so dynamic documents match a date in the far future).
 564 M
 565     month (numeric format: YYYYMM)
 566 Y
 567     year (four digits)
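
If you generate such terms yourself (for example when indexing outside of
omindex), the date and base URL terms listed above could be built along these
lines.  The function names are hypothetical, and the hashing applied to terms
longer than 240 bytes is deliberately not modelled.

```python
from datetime import date


def date_terms(day):
    """Return the D, M and Y boolean terms for a datetime.date."""
    return [
        f"D{day.year:04d}{day.month:02d}{day.day:02d}",  # YYYYMMDD
        f"M{day.year:04d}{day.month:02d}",  # YYYYMM
        f"Y{day.year:04d}",  # four digit year
    ]


def base_url_term(base_url):
    """Return the J term: the base URL without any trailing slash."""
    if base_url.endswith("/"):
        base_url = base_url[:-1]
    return "J" + base_url
```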
 568
 569 omega configuration
 570 ===================
 571
 572 Most of the omega CGI configuration is dynamic, by setting CGI
 573 parameters. However some things must be configured using a
 574 configuration file.  The configuration file is searched for in
 575 various locations:
 576
 577 - Firstly, if the "OMEGA_CONFIG_FILE" environment variable is
 578   set, its value is used as the full path to a configuration file
 579   to read.
 580 - Next (if the environment variable is not set, or the file pointed
 581   to is not present), the file "omega.conf" in the same directory as
 582   the Omega CGI is used.
 583 - Next (if neither of the previous steps found a file), the file
 584   "${sysconfdir}/omega.conf" (e.g. /etc/omega.conf on Linux systems)
 585   is used.
 586 - Finally, if no configuration file is found, default values are used.
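
The search order above can be summarised in code.  This is a sketch of the
documented order only; ``find_config_file`` is a hypothetical helper, and
``sysconfdir`` is passed in because its real value is fixed when Omega is
built.

```python
import os


def find_config_file(cgi_dir, sysconfdir="/etc", environ=None):
    """Return the first configuration file found, or None to use defaults."""
    environ = os.environ if environ is None else environ
    candidates = []
    env_path = environ.get("OMEGA_CONFIG_FILE")
    if env_path:
        candidates.append(env_path)
    candidates.append(os.path.join(cgi_dir, "omega.conf"))
    candidates.append(os.path.join(sysconfdir, "omega.conf"))
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```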
 587
 588 The format of the file is very simple: a line per option, with the
 589 option name followed by its value, separated by whitespace.  Blank
 590 lines are ignored.  If the first non-whitespace character on a line
 591 is a '#', omega treats the line as a comment and ignores it.
 592
 593 The current options are:
 594
 595 - `database_dir`: the directory containing all the Omega databases
 596 - `template_dir`: the directory containing the OmegaScript templates
 597 - `log_dir`: the directory which the OmegaScript `$log` command writes log
 598   files to
 599 - `cdb_dir`: the directory which the OmegaScript `$lookup` command
 600   looks for CDB files in
 601
 602 The default values (used if no configuration file is found) are::
 603
 604  database_dir /var/lib/omega/data
 605  template_dir /var/lib/omega/templates
 606  log_dir /var/log/omega
 607  cdb_dir /var/lib/omega/cdb
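
A reader for this format might look like the following sketch.  It assumes
that options missing from the file keep their default values, which the text
above does not state explicitly, and it stores unrecognised option names
rather than rejecting them; ``parse_omega_conf`` is a hypothetical name.

```python
DEFAULTS = {
    "database_dir": "/var/lib/omega/data",
    "template_dir": "/var/lib/omega/templates",
    "log_dir": "/var/log/omega",
    "cdb_dir": "/var/lib/omega/cdb",
}


def parse_omega_conf(text):
    """Parse 'name value' lines, skipping blank lines and # comments."""
    config = dict(DEFAULTS)
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            config[parts[0]] = parts[1]
    return config
```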
 608
 609 Note that, with Apache, environment variables may be set using mod_env, and
 610 with Apache 1.3.7 or later this may be used inside a .htaccess file.  This
 611 makes it reasonably easy to share a single system installed copy of Omega
 612 between multiple users.
 613
 614 Supplied Templates
 615 ==================
 616
 617 The OmegaScript templates supplied with Omega are:
 618
 619 * query - This is the default template, providing a typical Web search
 620   interface.
 621 * topterms - This is just like query, but provides a "top terms" feature
 622   which suggests terms the user might want to add to their query to
 623   obtain better results.
 624 * godmode - Allows you to inspect a database showing which terms index
 625   each document, and which documents are indexed by each term.
 626 * opensearch - Provides results in OpenSearch format (for more details
 627   see http://www.opensearch.org/).
 628 * xml - Provides results in a custom XML format.
 629 * emptydocs - Shows a list of documents with zero length.  If CGI parameter
 630   TERM is set to a non-empty value, then only documents indexed by that given
 631   term are shown (e.g. TERM=Tapplication/pdf to show PDF files with no text);
 632   otherwise all zero length documents are shown.
 633
 634 There are also "helper fragments" used by the templates above:
 635
 636 * inc/anyalldropbox - Provides a choice of matching "any" or "all" terms
 637   by default as a drop down box.
 638 * inc/anyallradio - Provides a choice of matching "any" or "all" terms
 639   by default as radio buttons.
 640 * toptermsjs - Provides some JavaScript used by the topterms template.
 641
 642 Document data construction
 643 ==========================
 644
 645 This is only useful if you need to inject your own documents into the
 646 database independently of omindex, such as if you are indexing
 647 dynamically-generated documents that are served using a server-side
 648 system such as PHP or ASP, but which you can determine the contents of
 649 in some way, such as documents generated from reasonably static
 650 database contents.
 651
 652 The document data field stores some summary information about the
 653 document, in the following (sample) format::
 654
 655  url=<baseurl>
 656  sample=<sample>
 657  caption=<title>
 658  type=<mimetype>
 659
 660 Further fields may be added (although omindex doesn't currently add any
 661 others), and may be looked up from OmegaScript using the $field{}
 662 command.
 663
 664 As of Omega 0.9.3, you can alternatively add something like this near the
 665 start of your OmegaScript template::
 666
 667 $set{fieldnames,$split{caption sample url}}
 668
 669 Then you need only give the field values in the document data, which can
 670 save a lot of space in a large database.  With the setting of fieldnames
 671 above, the first line of document data can be accessed with $field{caption},
 672 the second with $field{sample}, and the third with $field{url}.
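
To make the two layouts concrete, here is a small Python sketch which reads
document data in either form.  It only mirrors the description above;
``parse_document_data`` is a hypothetical helper, not an Omega or Xapian API.

```python
def parse_document_data(data, fieldnames=None):
    """Return a dict of fields from a document data string.

    Without fieldnames, each line is expected to be 'name=value'.
    With fieldnames, line N holds the value of the Nth named field.
    """
    lines = data.splitlines()
    if fieldnames is not None:
        return dict(zip(fieldnames, lines))
    fields = {}
    for line in lines:
        name, sep, value = line.partition("=")
        if sep:
            fields[name] = value
    return fields
```

With ``fieldnames`` set to ``["caption", "sample", "url"]``, the three-line
data ``Press\nHello\n/press/index.html`` gives the same ``caption``,
``sample`` and ``url`` values as the longer ``name=value`` form.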
 673
 674 Stopword List
 675 =============
 676
 677 At search time, Omega uses a built-in list of stopwords, which are::
 678
 679     a about an and are as at be by en for from how i in is it of on or that the
 680     this to was what when where which who why will with you your