The basic format is one or more field names followed by a colon, followed by
one or more actions. Some actions take an optional or required parameter.

Since Omega 1.4.6, the parameter value can be enclosed in double quotes,
which is necessary if it contains whitespace; it's also needed for
parameter values containing a comma for actions which support multiple
parameters (such as ``split``) since there unquoted commas are interpreted
as separating parameters.

Since Omega 1.4.8, the following C-like escape sequences are supported
for parameter values enclosed in double quotes: ``\\``, ``\"``, ``\0``, ``\t``,
``\n``, ``\r``, and ``\x`` followed by two hex digits.

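
The escape sequences listed above can be illustrated with a minimal Python
sketch. This is not Omega's parser: the function name ``decode_quoted`` is
made up for this example, rejecting unlisted escapes is an assumption, and
``\x`` escapes are mapped to a single character rather than a raw byte.

```python
import re

# The simple escapes listed in the text; treating anything else as an
# error is an assumption of this sketch.
_SIMPLE = {"\\": "\\", '"': '"', "0": "\0", "t": "\t", "n": "\n", "r": "\r"}

# A backslash followed by either "x" plus two hex digits or any one character.
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|.)", re.DOTALL)


def decode_quoted(body):
    """Decode the text found between the double quotes of a parameter value."""

    def repl(match):
        esc = match.group(1)
        if esc[0] == "x" and len(esc) == 3:
            # \xHH: two hex digits (mapped to one character here).
            return chr(int(esc[1:], 16))
        if esc in _SIMPLE:
            return _SIMPLE[esc]
        raise ValueError("unsupported escape: \\" + esc)

    return _ESCAPE.sub(repl, body)
```

For instance, ``decode_quoted(r'a\tb')`` yields ``a``, a tab, then ``b``.
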
The actions are applied in the specified order to each field listed, and
a field can be listed on several lines. For example::

    desc1 : unhtml index truncate=200 field=sample
    desc2 desc3 desc4 : unhtml index
    name : field=caption weight=3 index
    ref : field=ref boolean=Q unique=Q
    type : field=type boolean=XT

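
To make the rule layout concrete, here is a small Python sketch which breaks
one rule line into field names and actions. It is only an illustration: the
name ``parse_rule`` is hypothetical, and the sketch does not handle quoted
parameter values, escape sequences, or spaces around ``=``.

```python
def parse_rule(line):
    """Split 'field1 field2 : action1 action2=param' into its parts."""
    fields_part, colon, actions_part = line.partition(":")
    if not colon:
        raise ValueError("rule has no colon: " + line)
    fields = fields_part.split()
    actions = []
    for token in actions_part.split():
        # An action is either NAME or NAME=PARAMETER (unquoted only here).
        name, equals, param = token.partition("=")
        actions.append((name, param if equals else None))
    return fields, actions
```

Actions without a parameter are returned with ``None`` in place of the value.
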
Don't put spaces around the ``=`` separating an action and its argument.
Current versions allow spaces here (though this was never documented as
supported), but a missing argument then quietly swallows the next
action rather than using an empty value or giving an error. For example, this
takes ``hash`` as the field name, which is unlikely to be what was intended::

    url : field= hash boolean=Q unique=Q

Since 1.4.6 a deprecation warning is emitted for spaces before or after the
``=``.

boolean=PREFIX
    index the text as a single boolean term (with prefix PREFIX). If
    there's no text, no term is added. Omega expects certain prefixes to
    be used for certain purposes: those starting "X" are reserved for user
    applications, and Q is reserved for a unique ID term.

date=FORMAT
    generate terms for date range searching. If FORMAT is "unix", then the
    value is interpreted as a Unix time_t (seconds since 1970). If
    FORMAT is "yyyymmdd", then the value is interpreted as an 8 digit
    string, e.g. 20021221 for 21st December 2002. Unknown formats
    and invalid values are ignored at present.

field[=FIELDNAME]
    add as a field to the Xapian record. FIELDNAME defaults to the field
    name in the dumpfile. It is valid to have more than one instance of
    a given field: all instances will be processed and stored in the
    Xapian record.

hash[=LENGTH]
    Xapian has a limit on the length of a term. To handle arbitrarily
    long URLs as terms, omindex implements a scheme where the end of
    a long URL is hashed (short URLs are left as-is). You can use this
    same scheme in scriptindex. LENGTH defaults to 239, which if you
    index with prefix "U" produces url terms compatible with omindex.
    If specified, LENGTH must be at least 6 (because the hash takes 6
    characters).

hextobin
    converts pairs of hex digits to binary byte values (providing a way
    to specify arbitrary binary strings, e.g. for use in a document value
    slot). The input should have an even length and be composed entirely
    of hex digits (if it isn't, an error is reported and the value is
    unchanged).

    ``hextobin`` was added in Omega 1.4.6.

index[=PREFIX]
    split text into words and index (with prefix PREFIX if specified).

indexnopos[=PREFIX]
    split text into words and index (with prefix PREFIX if specified), but
    don't include positional information in the database. This makes the
    database smaller, but phrase searching won't work.

load
    reads the contents of the file using the current text as the filename
    and then sets the current text to the contents. If the file can't be
    loaded (not found, wrong permissions, etc) then a diagnostic message is
    sent to stderr and the current text is set to empty. If the next
    action is ``truncate``, then scriptindex only loads as much of the start
    of a large file as it needs.

lower
    lowercase the text (useful for generating boolean terms)

parsedate=FORMAT
    parse the text as a date string using ``strptime()`` with the format
    specified by ``FORMAT``, and set the text to the result as a Unix
    ``time_t`` (seconds since 1970), which can then be fed into ``date``
    or ``valuepacked``, for example::

        last_update : parsedate="%Y%m%d %T" field=lastmod valuepacked=0

    ``parsedate`` was added in Omega 1.4.6.

spell
    Generate spelling correction data for any ``index`` or ``indexnopos``
    actions in the remainder of this list of actions.

split=DELIMITER[,OPERATION]
    Split the text at each occurrence of ``DELIMITER``, discard any empty
    strings, perform ``OPERATION`` on the resulting list, and then for each
    entry perform all the actions which follow ``split`` in the current rule.

    ``OPERATION`` can be ``dedup`` (remove second and subsequent
    occurrences from the list of any value), ``sort`` (sort), or ``none``.

    If you want to specify ``,`` for delimiter, you need to quote it, e.g.
    ``split=",",dedup``.

truncate=LENGTH
    truncate to at most LENGTH bytes, but avoid chopping off a word (useful
    for sample and title fields)

unique[=PREFIX]
    use the value in this field for a unique ID. If the value is empty,
    a warning is issued but nothing else is done. Only one record with
    each value of the ID may be present in the index: adding a new record
    with an ID which is already present will cause the old record to be
    replaced (or deleted if the new record is otherwise empty). You should
    also index the field as a boolean field using the same prefix so that
    the old record can be found. In Omega, Q is reserved for use as the
    prefix of a unique term. You can use ``unique`` at most once in each
    index script (this is only enforced since Omega 1.4.5, but older
    versions didn't handle multiple instances usefully).

value=VALUESLOT
    add as a Xapian document value in slot VALUESLOT. Values can be used
    for collapsing equivalent documents, sorting the MSet, etc. If you
    want to perform numeric sorting, use the ``valuenumeric`` action instead.

valuenumeric=VALUESLOT
    Like ``value=VALUESLOT``, this adds as a Xapian document value in slot
    VALUESLOT, but it first encodes for numeric sorting using
    ``Xapian::sortable_serialise()``. Values set with this action can be
    used for numeric sorting of the MSet.

valuepacked=VALUESLOT
    Like ``value=VALUESLOT``, this adds as a Xapian document value in slot
    VALUESLOT, but it first encodes as a 4 byte big-endian binary string.
    If the input is a Unix time_t value, the resulting slot can be used for
    date range filtering and to sort the MSet by date. Can be used in
    combination with ``parsedate``, for example::

        last_update : parsedate="%Y%m%d %T" field=lastmod valuepacked=0

    ``valuepacked`` was added in Omega 1.4.6.

weight=FACTOR
    set the weighting factor to FACTOR (an integer) for any ``index`` or
    ``indexnopos`` actions in the remainder of this list of actions. The
    default is 1. Use this to add extra weight to titles, keyword fields,
    etc, so that words in them are regarded as more important by searches.

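
As a rough illustration of what three of the actions above do to the current
text, here is a minimal Python sketch of ``split``, ``hextobin`` and
``valuepacked``. It is not Omega's implementation: the function names are
made up for this example, ``none`` is assumed to leave the list unchanged,
``sort`` is assumed to be a plain string sort, and the packed value is
assumed to be an unsigned 32-bit integer.

```python
import string
import struct


def split_action(text, delimiter, operation="none"):
    """Split at each delimiter, drop empty strings, then apply the operation."""
    parts = [part for part in text.split(delimiter) if part]
    if operation == "dedup":
        # Keep only the first occurrence of each value, preserving order.
        parts = list(dict.fromkeys(parts))
    elif operation == "sort":
        parts.sort()
    elif operation != "none":
        raise ValueError("unknown operation: " + operation)
    return parts


def hextobin(text):
    """Convert pairs of hex digits to bytes, rejecting malformed input."""
    if len(text) % 2 != 0 or any(c not in string.hexdigits for c in text):
        raise ValueError("expected an even number of hex digits")
    return bytes.fromhex(text)


def pack_value(number):
    """Encode an integer as a 4 byte big-endian binary string."""
    # ">I" is big-endian unsigned 32-bit (the signedness is an assumption).
    return struct.pack(">I", number)
```

Because the packed form is big-endian and fixed width, comparing two packed
values byte by byte gives the same order as comparing the original numbers,
which is what makes such a slot usable for range filtering and sorting.
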
The data to be indexed is read in from one or more files. Each file has
records separated by a blank line. Each record contains one or more fields of
the form "name=value". If a value contains newlines, these must be escaped by
inserting an equals sign ('=') after each newline. Here's an example record::

    value=This is a multi-line
    =value. Note how each newline

See mbox2omega and mbox2omega.script for an example of how you can generate a
dump file from an external source and write an index script to be used with it.
Try ``mbox2omega --help`` for more information.
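
If you generate dump files from your own program, the record format described
above is simple to produce. The following Python sketch shows one way to do
it; the function names ``format_record`` and ``format_dump`` are hypothetical,
and field names are assumed to be valid as given.

```python
def format_record(fields):
    """Render a list of (name, value) pairs as one dump-file record."""
    lines = []
    for name, value in fields:
        # Escape embedded newlines by inserting '=' after each one.
        lines.append(name + "=" + value.replace("\n", "\n="))
    return "\n".join(lines) + "\n"


def format_dump(records):
    """Render several records, separated by a blank line."""
    return "\n".join(format_record(fields) for fields in records)
```

A list of pairs is used rather than a dictionary so that a record can contain
more than one instance of a given field.
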