manual/luatex-languages.tex

   1 % language=uk
   2
   3 \environment luatex-style
   4 \environment luatex-logos
   5
   6 \startcomponent luatex-languages
   7
   8 \startchapter[reference=languages,title={Languages, characters, fonts and glyphs}]
   9
  10 \LUATEX's internal handling of the characters and glyphs that eventually become
  11 typeset is quite different from the way \TEX82 handles those same objects. The
  12 easiest way to explain the difference is to focus on unrestricted horizontal mode
  13 (i.e.\ paragraphs) and hyphenation first. Later on, it will be easy to deal
  14 with the differences that occur in horizontal and math modes.
  15
  16 In \TEX82, the characters you type are converted into \type {char_node} records
  17 when they are encountered by the main control loop. \TEX\ attaches and processes
  18 the font information while creating those records, so that the resulting \quote
  19 {horizontal list} contains the final forms of ligatures and implicit kerning.
  20 This packaging is needed because we may want to get the effective width of, for
  21 instance, a horizontal box.
  22
  23 When it becomes necessary to hyphenate words in a paragraph, \TEX\ converts (one
  24 word at a time) the \type {char_node} records into a string by replacing ligatures
  25 with their components and ignoring the kerning. Then it runs the hyphenation
  26 algorithm on this string, and converts the hyphenated result back into a \quote
  27 {horizontal list} that is consecutively spliced back into the paragraph stream.
  28 Keep in mind that the paragraph may contain unboxed horizontal material, which
  29 then already contains ligatures and kerns and the words therein are part of the
  30 hyphenation process.
  31
  32 Those \type {char_node} records are somewhat misnamed, as they are glyph
  33 positions in specific fonts, and therefore not really \quote {characters} in the
  34 linguistic sense. There is no language information inside the \type {char_node}
  35 records at all. Instead, language information is passed along using \type
  36 {language whatsit} records inside the horizontal list.
  37
  38 In \LUATEX, the situation is quite different. The characters you type are always
  39 converted into \type {glyph_node} records with a special subtype to identify them
  40 as being intended as linguistic characters. \LUATEX\ stores the needed language
  41 information in those records, but does not do any font|-|related processing at
  42 the time of node creation. It only stores the index of the current font and a
  43 reference to a character in that font.
  44
  45 When it becomes necessary to typeset a paragraph, \LUATEX\ first inserts all
  46 hyphenation points right into the whole node list. Next, it processes all the
  47 font information in the whole list (creating ligatures and adjusting kerning),
  48 and finally it adjusts all the subtype identifiers so that the records are \quote
  49 {glyph nodes} from now on.
  50
  51 \section[charsandglyphs]{Characters and glyphs}
  52
  53 \TEX82 (including \PDFTEX) differentiates between \type {char_node}s and \type
  54 {lig_node}s. The former are simple items that contained nothing but a \quote
  55 {character} and a \quote {font} field, and they lived in the same memory as
  56 tokens did. The latter also contained a list of components, and a subtype
  57 indicating whether this ligature was the result of a word boundary, and it was
  58 stored in the same place as other nodes like boxes and kerns and glues.
  59
  60 In \LUATEX, these two types are merged into one, somewhat larger structure called
  61 a \type {glyph_node}. Besides having the old character, font, and component
  62 fields, and the new special fields like \quote {attr} (see~\in {section}
  63 [glyphnodes]), these nodes also contain:
  64
  65 \startitemize
  66
  67 \startitem A subtype, split into four main types:
  68
  69     \startitemize
  70         \startitem
  71             \type {character}, for characters to be hyphenated: the lowest bit
  72             (bit 0) is set to 1.
  73         \stopitem
  74         \startitem
  75             \type {glyph}, for specific font glyphs: the lowest bit (bit 0) is
  76             not set.
  77         \stopitem
  78         \startitem
  79             \type {ligature}, for ligatures (bit 1 is set)
  80         \stopitem
  81         \startitem
  82             \type {ghost}, for \quote {ghost objects} (bit 2 is set)
  83         \stopitem
  84     \stopitemize
  85
  86     The latter two make further use of two extra fields (bits 3 and 4):
  87
  88     \startitemize
  89         \startitem
  90             \type {left}, for ligatures created from a left word boundary and for
  91             ghosts created from \type {\leftghost}
  92         \stopitem
  93         \startitem
  94             \type {right}, for ligatures created from a right word boundary and
  95             for ghosts created from \type {\rightghost}
  96         \stopitem
  97    \stopitemize
  98
  99    For ligatures, both bits can be set at the same time (in case of a
 100    single|-|glyph word).
 101
 102 \stopitem
 103
 104 \startitem
 105     \type {glyph_node}s of type \quote {character} also contain language data,
 106     split into four items that were current when the node was created: the
 107     \type {\setlanguage} (15 bits), \type {\lefthyphenmin} (8 bits), \type
 108     {\righthyphenmin} (8 bits), and \type {\uchyph} (1 bit).
 109 \stopitem
 110
 111 \stopitemize
 112
 113 Incidentally, \LUATEX\ allows 16383 separate languages, and words can be 256
 114 characters long. The language is stored with each character. You can set
 115 \type {\firstvalidlanguage} to~1, for instance, and thereby make language~0
 116 an ignored hyphenation language.
 117
 118 The new primitive \type {\hyphenationmin} can be used to signal the minimal length
 119 of a word. This value is stored with the (current) language.
 120
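     As a small illustration of these two settings, the following sketch reserves
     language~0 as a non|-|hyphenating language and asks for a minimal word length
     in language~1. The language number and the threshold are arbitrary choices
     made for this example only.

     \starttyping
     % language 0 is now skipped by the hyphenator
     \firstvalidlanguage 1

     % hypothetical setup: the minimal word length is stored with language 1
     \language 1
     \hyphenationmin 6
     \stoptyping
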
 121 Because the \type {\uchyph} value is saved in the actual nodes, its handling is
 122 subtly different from \TEX82: changes to \type {\uchyph} become effective
 123 immediately, not at the end of the current partial paragraph.
 124
 125 Typeset boxes now always have their language information embedded in the nodes
 126 themselves, so there is no longer a possible dependency on the surrounding
 127 language settings. In \TEX82, a mid-paragraph statement like \type {\unhbox0} would
 128 process the box using the current paragraph language unless there was a
 129 \type {\setlanguage} issued inside the box. In \LUATEX, all language variables are
 130 already frozen.
 131
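     The following sketch shows the consequence for unboxed material. The language
     numbers~1 and~2 are placeholders for whatever languages the format at hand
     has loaded.

     \starttyping
     % the glyph nodes in box 0 are created while language 2 is current
     \setbox0\hbox{\language 2 internationalization}

     % the surrounding paragraph uses language 1, but the unboxed word
     % keeps the language 2 settings that are stored in its nodes
     \language 1
     Some text \unhbox0\ and more text.\par
     \stoptyping
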
 132 In traditional \TEX\ the process of hyphenation is driven by \type {lccode}s. In
 133 \LUATEX\ we made this dependency less strong. There are several strategies
 134 possible. When you do nothing, the current \type {lccode}s are used when
 135 loading patterns, setting exceptions or hyphenating a list.
 136
 137 When you set \type {\savinghyphcodes} to a value larger than zero the current set
 138 of \type {lccode}s will be saved with the language. In that case changing a \type
 139 {lccode} afterwards has no effect. However, you can adapt the set with:
 140
 141 \starttyping
 142 \hjcode`a=`a
 143 \stoptyping
 144
 145 This change is global, which makes sense if you keep in mind that hyphenation
 146 (normally) happens at the moment the paragraph or a horizontal box is
 147 constructed. If \type {\savinghyphcodes} was zero when the language got
 148 initialized you start out with nothing, otherwise you already have a set.
 149
 150 When a \type {\hjcode} is larger than $0$ but smaller than $32$ it indicates the
 151 length to be used. In the following example we map a character (\type {x}) onto
 152 another one in the patterns and tell the engine that \type {œ} counts as two
 153 characters. Because traditionally zero itself is reserved for inhibiting
 154 hyphenation, a value of $32$ counts as zero.
 155
 156 \starttyping
 157 % assuming french patterns:
 158 foobar % foo-bar
 159
 160 \hjcode`x=`o
 161
 162 fxxbar % fxx-bar
 163
 164 \lefthyphenmin3
 165
 166 œdipus % œdi-pus
 167
 168 \lefthyphenmin4
 169
 170 œdipus % œdipus
 171
 172 \hjcode`œ=2
 173
 174 œdipus % œdi-pus
 175
 176 \hjcode`i=32
 177 \hjcode`d=32
 178
 179 œdipus % œdipus
 180 \stoptyping
 181
 182 Carrying all this information with each glyph would give too much overhead and
 183 also make the process of setting up these codes more complex. A solution with
 184 \type {hjcode} sets was considered but rejected because in practice the current
 185 approach is sufficient and it would not be compatible anyway.
 186
 187 Beware: the values are always saved in the format, independent of the setting
 188 of \type {\savinghyphcodes} at the moment the format is dumped.
 189
 190 A boundary node would normally mark the end of a word, which interferes with, for
 191 instance, discretionary injection. For this you can use \type {\wordboundary}
 192 as a trigger. Here are a few examples of usage:
 193
 194 \startbuffer
 195     discrete---discrete
 196 \stopbuffer
 197 \typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
 198 \startbuffer
 199     discrete\discretionary{}{}{---}discrete
 200 \stopbuffer
 201 \typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
 202 \startbuffer
 203     discrete\wordboundary\discretionary{}{}{---}discrete
 204 \stopbuffer
 205 \typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
 206 \startbuffer
 207     discrete\wordboundary\discretionary{}{}{---}\wordboundary discrete
 208 \stopbuffer
 209 \typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
 210 \startbuffer
 211     discrete\wordboundary\discretionary{---}{}{}\wordboundary discrete
 212 \stopbuffer
 213 \typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
 214
 215 We only accept an explicit hyphen when there is a preceding glyph, and we skip a
 216 sequence of explicit hyphens because that normally indicates a \type {--} or \type
 217 {---} ligature. Accepting those could, in the worst case, produce bad node lists
 218 later on due to messed up ligature building, as these dashes are ligatures in base
 219 fonts. This is a side effect of separating the hyphenation, ligaturing and
 220 kerning steps.
 221
 222 The start and end of a sequence of characters is signalled by a glue, penalty, kern
 223 or boundary node. By default, an hlist, vlist, rule, dir, whatsit, ins or adjust node
 224 also indicates a start or end. You can omit the last set from the test by setting
 225 \type {\hyphenationbounds} to a non|-|zero value:
 226
 227 \starttabulate[|Tl|l|]
 228 \NC 0 \NC not strict \NC \NR
 229 \NC 1 \NC strict start \NC \NR
 230 \NC 2 \NC strict end \NC \NR
 231 \NC 3 \NC strict start and strict end \NC \NR
 232 \stoptabulate
 233
 234 The word start is determined as follows:
 235
 236 \starttabulate[|Bl|l|]
 237 \NC boundary  \NC yes when wordboundary \NC \NR
 238 \NC hlist     \NC when hyphenationbounds 1 or 3 \NC \NR
 239 \NC vlist     \NC when hyphenationbounds 1 or 3 \NC \NR
 240 \NC rule      \NC when hyphenationbounds 1 or 3 \NC \NR
 241 \NC dir       \NC when hyphenationbounds 1 or 3 \NC \NR
 242 \NC whatsit   \NC when hyphenationbounds 1 or 3 \NC \NR
 243 \NC glue      \NC yes \NC \NR
 244 \NC math      \NC skipped \NC \NR
 245 \NC glyph     \NC exhyphenchar (one only) : yes (so no -- ---) \NC \NR
 246 \NC otherwise \NC yes \NC \NR
 247 \stoptabulate
 248
 249 The word end is determined as follows:
 250
 251 \starttabulate[|Bl|l|]
 252 \NC boundary  \NC yes \NC \NR
 253 \NC glyph     \NC yes when different language \NC \NR
 254 \NC glue      \NC yes \NC \NR
 255 \NC penalty   \NC yes \NC \NR
 256 \NC kern      \NC yes when not italic (for some historic reason) \NC \NR
 257 \NC hlist     \NC when hyphenationbounds 2 or 3 \NC \NR
 258 \NC vlist     \NC when hyphenationbounds 2 or 3 \NC \NR
 259 \NC rule      \NC when hyphenationbounds 2 or 3 \NC \NR
 260 \NC dir       \NC when hyphenationbounds 2 or 3 \NC \NR
 261 \NC whatsit   \NC when hyphenationbounds 2 or 3 \NC \NR
 262 \NC ins       \NC when hyphenationbounds 2 or 3 \NC \NR
 263 \NC adjust    \NC when hyphenationbounds 2 or 3 \NC \NR
 264 \stoptabulate
 265
 266 (Future versions of \LUATEX\ might provide more granularity.)
 267
 268 \section{The main control loop}
 269
 270 In \LUATEX's main loop, almost all input characters that are to be typeset are
 271 converted into \type {glyph} node records with subtype \quote {character}, but
 272 there are a few exceptions.
 273
 274 First, the \type {\accent} primitive creates nodes with subtype \quote {glyph}
 275 instead of \quote {character}: one for the actual accent and one for the
 276 accentee. The primary reason for this is that \type {\accent} in \TEX82 is
 277 explicitly dependent on the current font encoding, so it would not make much
 278 sense to attach a new meaning to the primitive's name, as that would invalidate
 279 many old documents and macro packages. \footnote {Of course, modern packages will
 280 not use the \type {\accent} primitive at all but try to map directly on composed
 281 characters.} A secondary reason is that in \TEX82, \type {\accent} prohibits
 282 hyphenation of the current word. Since in \LUATEX\ hyphenation only takes place
 283 on \quote {character} nodes, it is possible to achieve the same effect.
 284
 285 This change of meaning did happen with \type {\char}, that now generates \quote
 286 {glyph} nodes with a character subtype. In traditional \TEX\ there was a strong
 287 relationship between the 8|-|bit input encoding, hyphenation and glyphs taken
 288 from a font. In \LUATEX\ we have \UTF\ input, and in most cases this maps
 289 directly to a character in a font, apart from glyph replacement in the font
 290 engine. If you want to access arbitrary glyphs in a font directly you can always
 291 use \LUA\ to do so, because fonts are available as \LUA\ tables.
 292
 293 Second, all the results of processing in math mode eventually become nodes with
 294 \quote {glyph} subtypes.
 295
 296 Third, the \ALEPH|-|derived commands \type {\leftghost} and \type {\rightghost}
 297 create nodes of a third subtype: \quote {ghost}. These nodes are ignored
 298 completely by all further processing until the stage where inter|-|glyph kerning
 299 is added.
 300
 301 Fourth, automatic discretionaries are handled differently. \TEX82 inserts an
 302 empty discretionary after sensing an input character that matches the \type
 303 {\hyphenchar} in the current font. This test is wrong in our opinion: whether or
 304 not hyphenation takes place should not depend on the current font, it is a
 305 language property. \footnote {When \TEX\ showed up we didn't have \UNICODE\ yet
 306 and being limited to eight bits meant that one sometimes had to compromise
 307 between supporting character input, glyph rendering and hyphenation.}
 308
 309 In \LUATEX, it works like this: if \LUATEX\ senses a string of input characters
 310 that matches the value of the new integer parameter \type {\exhyphenchar}, it will
 311 insert an explicit discretionary after that series of nodes. Initex sets \type
 312 {\exhyphenchar=`\-}. Incidentally, this is a global parameter instead of a
 313 language-specific one because it may be useful to change the value depending on
 314 the document structure instead of the text language.
 315
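     A minimal sketch of changing the parameter for a part of a document follows.
     Treating the slash as the explicit hyphen character is only an assumption
     made for the sake of the example. The \type {\par} is kept inside the group
     so that the changed value is still current when the paragraph is processed.

     \starttyping
     % initex default: a break is allowed after a run of explicit hyphens
     \exhyphenchar=`\-

     % hypothetical change for a passage with many slashes
     \begingroup
       \exhyphenchar=`\/
       input/output \par
     \endgroup
     \stoptyping
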
 316 The insertion of discretionaries after a sequence of explicit hyphens happens at
 317 the same time as the other hyphenation processing, {\it not\/} inside the main
 318 control loop.
 319
 320 The only use \LUATEX\ has for \type {\hyphenchar} is in the check whether a word
 321 should be considered for hyphenation at all. If the \type {\hyphenchar} of the
 322 font attached to the first character node in a word is negative, then hyphenation
 323 of that word is abandoned immediately. This behaviour is added for backward
 324 compatibility only, and \type {\hyphenchar=-1} should not be used as a means of
 325 preventing hyphenation in new \LUATEX\ documents.
 326
 327 Fifth, \type {\setlanguage} no longer creates whatsits. The meaning of \type
 328 {\setlanguage} is changed so that it is now an integer parameter like all others.
 329 That integer parameter is used in \type {glyph_node} creation to add language
 330 information to the glyph nodes. In conjunction, the \type {\language} primitive is
 331 extended so that it always also updates the value of \type {\setlanguage}.
 332
 333 Sixth, the \type {\noboundary} command (that prohibits word boundary processing
 334 where that would normally take place) now does create nodes. These nodes are
 335 needed because the exact place of the \type {\noboundary} command in the input
 336 stream has to be retained until after the ligature and font processing stages.
 337
 338 Finally, there is no longer a \type {main_loop} label in the code. Remember that
 339 \TEX82 did quite a lot of processing while adding \type {char_nodes} to the
 340 horizontal list? For speed reasons, it handled that processing code outside of
 341 the \quote {main control} loop, and only the first character of any \quote {word}
 342 was handled by that \quote {main control} loop. In \LUATEX, there is no longer a
 343 need for that (all hard work is done later), and the (now very small) bits of
 344 character|-|handling code have been moved back inline. When \type
 345 {\tracingcommands} is on, this is visible because the full word is reported,
 346 instead of just the initial character.
 347
 348 \section[patternsexceptions]{Loading patterns and exceptions}
 349
 350 The hyphenation algorithm in \LUATEX\ is quite different from the one in \TEX82,
 351 although it uses essentially the same user input.
 352
 353 After expansion, the argument for \type {\patterns} has to be proper \UTF8 with
 354 individual patterns separated by spaces, no \type {\char} or \type {\chardef}d
 355 commands are allowed. The current implementation is quite strict and will reject all
 356 non|-|\UNICODE\ characters.
 357
 358 Likewise, the expanded argument for \type {\hyphenation} also has to be proper
 359 \UTF8, but here a bit of extra syntax is provided:
 360
 361 \startitemize[n]
 362 \startitem
 363     Three sets of arguments in curly braces (\type {{}{}{}}) indicate a desired
 364     complex discretionary, with arguments as in the \type {\discretionary} command in
 365     normal document input.
 366 \stopitem
 367 \startitem
 368     A \type {-} indicates a desired simple discretionary, cf.\ \type {\-} and \type
 369     {\discretionary{-}{}{}} in normal document input.
 370 \stopitem
 371 \startitem
 372     Internal command names are ignored. This rule is provided especially for \type
 373     {\discretionary}, but it also helps to deal with \type {\relax} commands that
 374     may sneak in.
 375 \stopitem
 376 \startitem
 377     An \type {=} indicates a (non|-|discretionary) hyphen in the document input.
 378 \stopitem
 379 \stopitemize
 380
 381 The expanded argument is first converted back to a space-separated string while
 382 dropping the internal command names. This string is then converted into a
 383 dictionary by a routine that creates key|-|value pairs by converting the other
 384 listed items. It is important to note that the keys in an exception dictionary
 385 can always be generated from the values. Here are a few examples:
 386
 387 \starttabulate[|l|l|l|]
 388 \NC \bf value \NC \bf implied key (input) \NC \bf effect \NC\NR
 389 \NC \type {ta-ble} \NC table \NC \type {ta\-ble} ($=$ \type {ta\discretionary{-}{}{}ble}) \NC\NR
 390 \NC \type {ba{k-}{}{c}ken} \NC backen \NC \type {ba\discretionary{k-}{}{c}ken} \NC\NR
 391 \stoptabulate
 392
 393 The resultant patterns and exception dictionary will be stored under the language
 394 code that is the present value of \type {\language}.
 395
 396 In the last line of the table, you see there is no \type {\discretionary} command
 397 in the value: the command is optional in the \TEX-based input syntax. The
 398 underlying reason for that is that it is conceivable that a whole dictionary of
 399 words is stored as a plain text file and loaded into \LUATEX\ using one of the
 400 functions in the \LUA\ \type {lang} library. This loading method is quite a bit
 401 faster than going through the \TEX\ language primitives, but some (most?) of that
 402 speed gain would be lost if it had to interpret command sequences while doing so.
 403
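     A minimal sketch of the \LUA\ route, reusing the two exceptions from the table
     above. It assumes that allocating a fresh language with \type {lang.new} is
     acceptable in the macro package at hand; the variable name \type {l} is
     arbitrary. Comments are written as \TEX\ comments on lines of their own,
     because \type {\directlua} joins its argument into a single line.

     \starttyping
     \directlua {
         % comment lines like this one are removed by TeX before Lua runs
         local l = lang.new()
         % same syntax as in the hyphenation primitive, without command names
         lang.hyphenation(l, "ta-ble ba{k-}{}{c}ken")
         % make the new language current for the text that follows
         tex.language = lang.id(l)
     }
     \stoptyping
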
 404 It is possible to specify extra hyphenation points in compound words by using
 405 \type {{-}{}{-}} for the explicit hyphen character (replace \type {-} by the
 406 actual explicit hyphen character if needed). For example, this matches the word
 407 \quote {multi|-|word|-|boundaries} and allows an extra break in between \quote
 408 {boun} and \quote {daries}:
 409
 410 \starttyping
 411 \hyphenation{multi{-}{}{-}word{-}{}{-}boun-daries}
 412 \stoptyping
 413
 414 The motivation behind the \ETEX\ extension \type {\savinghyphcodes} was that
 415 hyphenation heavily depended on font encodings. This is no longer true in
 416 \LUATEX, and the corresponding primitive is basically ignored. Because we now
 417 have \type {hjcode}, the case|-|related codes can be used exclusively for \type
 418 {\uppercase} and \type {\lowercase}.
 419
 420 \section{Applying hyphenation}
 421
 422 The internal structures \LUATEX\ uses for the insertion of discretionaries in
 423 words are very different from the ones in \TEX82, and that means there are some
 424 noticeable differences in handling as well.
 425
 426 First and foremost, there is no \quote {compressed trie} involved in hyphenation.
 427 The algorithm still reads \PATGEN-generated pattern files, but \LUATEX\ uses a
 428 finite state hash to match the patterns against the word to be hyphenated. This
 429 algorithm is based on the \quote {libhnj} library used by \OPENOFFICE, which in
 430 turn is inspired by \TEX.
 431
 432 There are a few differences between \LUATEX\ and \TEX82 that are a direct result
 433 of the implementation:
 434
 435 \startitemize
 436 \startitem
 437     \LUATEX\ happily hyphenates the full \UNICODE\ character range.
 438 \stopitem
 439 \startitem
 440     Pattern and exception dictionary size is limited by the available memory
 441     only, all allocations are done dynamically. The trie|-|related settings in
 442     \type {texmf.cnf} are ignored.
 443 \stopitem
 444 \startitem
 445     Because there is no \quote {trie preparation} stage, language patterns never
 446     become frozen. This means that the primitive \type {\patterns} (and its \LUA\
 447     counterpart \type {lang.patterns}) can be used at any time, not only in
 448     ini\TEX.
 449 \stopitem
 450 \startitem
 451     Only the string representation of \type {\patterns} and \type {\hyphenation} is
 452     stored in the format file. At format load time, they are simply
 453     re|-|evaluated. It follows that there is no real reason to preload languages
 454     in the format file. In fact, it is usually not a good idea to do so. It is
 455     much smarter to load patterns no sooner than the first time they are actually
 456     needed.
 457 \stopitem
 458 \startitem
 459     \LUATEX\ uses the language-specific variables \type {\prehyphenchar} and \type
 460     {\posthyphenchar} in the creation of implicit discretionaries, instead of
 461     \TEX82's \type {\hyphenchar}, and the values of the language|-|specific variables
 462     \type {\preexhyphenchar} and \type {\postexhyphenchar} for explicit
 463     discretionaries (instead of \TEX82's empty discretionary).
 464 \stopitem
 465 \startitem
 466     The values of the two counters related to hyphenation, \type {\hyphenpenalty}
 467     and \type {\exhyphenpenalty}, are now stored in the discretionary nodes. This
 468     permits a local overload for explicit \type {\discretionary} commands. The
 469     value current when the hyphenation pass is applied is used. When no callbacks
 470     are used this is compatible with traditional \TEX. When you apply the \LUA\
 471     \type {lang.hyphenate} function the current values are used.
 472 \stopitem
 473 \stopitemize
 474
 475 Because we store penalties in the disc node the \type {\discretionary} command has
 476 been extended to accept an optional penalty specification, so you can do the
 477 following:
 478
 479 \startbuffer
 480 \hsize1mm
 481 1:foo{\hyphenpenalty 10000\discretionary{}{}{}}bar\par
 482 2:foo\discretionary penalty 10000 {}{}{}bar\par
 483 3:foo\discretionary{}{}{}bar\par
 484 \stopbuffer
 485
 486 \typebuffer
 487
 488 This results in:
 489
 490 \blank \start \getbuffer \stop \blank
 491
 492 Inserted characters and ligatures inherit their attributes from the nearest glyph
 493 node item (usually the preceding one, but the following one for the items
 494 inserted at the left-hand side of a word).
 495
 496 Word boundaries are no longer implied by font switches, but by language switches.
 497 One word can have two separate fonts and still be hyphenated correctly (but it
 498 cannot have two different languages: the \type {\setlanguage} command forces a
 499 word boundary).
 500
 501 All languages start out with \type {\prehyphenchar=`\-}, \type {\posthyphenchar=0},
 502 \type {\preexhyphenchar=0} and \type {\postexhyphenchar=0}. When you assign a
 503 value to one of these four parameters, you are actually changing the settings
 504 for the current \type {\language}. This behaviour is compatible with \type {\patterns}
 505 and \type {\hyphenation}.
 506
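     As a sketch: some typographic traditions repeat the hyphen of a compound word
     at the start of the next line. Assuming that language~1 stands for such a
     language (the number is hypothetical), the four parameters could be set up
     as follows.

     \starttyping
     \language 1

     % implicit breaks: a hyphen at the end of the line only (the defaults)
     \prehyphenchar  = `\-
     \posthyphenchar = 0

     % explicit hyphens: repeat the hyphen at the start of the next line
     \preexhyphenchar  = 0
     \postexhyphenchar = `\-
     \stoptyping
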
 507 \LUATEX\ also hyphenates the first word in a paragraph. Words can be up to 256
 508 characters long (up from 64 in \TEX82). Longer words generate an error right now,
 509 but eventually either the limitation will be removed or perhaps it will become
 510 possible to silently ignore the excess characters (this is what happens in
 511 \TEX82, but there the behaviour cannot be controlled).
 512
 513 If you are using the \LUA\ function \type {lang.hyphenate}, you should be aware
 514 that this function expects to receive a list of \quote {character} nodes. It will
 515 not operate properly in the presence of \quote {glyph}, \quote {ligature}, or
 516 \quote {ghost} nodes, nor does it know how to deal with kerning.
 517
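     One place where the list still consists of \quote {character} nodes is the
     \type {hyphenate} callback. The sketch below calls the function from there.
     It assumes a format that permits direct use of \type {callback.register}, as
     plain \LUATEX\ does; other macro packages may wrap or block this call.

     \starttyping
     \directlua {
         % replace the built-in hyphenation pass by an explicit call
         callback.register("hyphenate", function(head, tail)
             % room for preprocessing the character nodes
             lang.hyphenate(head, tail)
         end)
     }
     \stoptyping
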
 518 The hyphenation exception dictionary is maintained as a key|-|value hash, and that
 519 is also dynamic, so the \type {hyph_size} setting is not used either.
 520
 521 \section{Applying ligatures and kerning}
 522
 523 After all possible hyphenation points have been inserted in the list, \LUATEX\
 524 will process the list to convert the \quote {character} nodes into \quote {glyph}
 525 and \quote {ligature} nodes. This is actually done in two stages: first all
 526 ligatures are processed, then all kerning information is applied to the result
 527 list. But those two stages are somewhat dependent on each other: If the used font
 528 makes it possible to do so, the ligaturing stage adds virtual \quote {character}
 529 nodes to the word boundaries in the list. While doing so, it removes and
 530 interprets \type {\noboundary} nodes. The kerning stage deletes those word
 531 boundary items after it is done with them, and it does the same for \quote
 532 {ghost} nodes. Finally, at the end of the kerning stage, all remaining \quote
 533 {character} nodes are converted to \quote {glyph} nodes.
 534
 535 This work separation is worth mentioning because, if you overrule from \LUA\ only
 536 one of the two callbacks related to font handling, then you have to make sure you
 537 perform the tasks normally done by \LUATEX\ itself in order to make sure that the
 538 other, non|-|overruled, routine continues to function properly.
 539
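     If only one stage needs special treatment, a cautious approach is to overload
     the callback and still call the built|-|in helper, so that the boundary and
     ghost bookkeeping described above stays intact. The sketch below does this for
     both stages. It assumes that \type {callback.register} can be used directly, as
     in plain \LUATEX; the places marked for custom work are left empty.

     \starttyping
     \directlua {
         callback.register("ligaturing", function(head, tail)
             % custom work on the list could go here
             node.ligaturing(head, tail)
         end)
         callback.register("kerning", function(head, tail)
             % custom work on the list could go here
             node.kerning(head, tail)
         end)
     }
     \stoptyping
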
 540 Work in this area is not yet complete, but most of the possible cases are handled
 541 by our rewritten ligaturing engine. At some point all of the possible inputs will
 542 become supported. \footnote {Not all of this makes sense because we nowadays have
 543 \OPENTYPE\ fonts and ligature building can happen in many different ways there.}
 544
 545 For example, take the word \type {office}, hyphenated \type {of-fice}, using a
 546 \quote {normal} font with all the \type {f}-\type {f} and \type {f}-\type {i}
 547 type ligatures:
 548
 549 \starttabulate[|l|l|]
 550 \NC Initial:               \NC \type {{o}{f}{f}{i}{c}{e}}             \NC\NR
 551 \NC After hyphenation:     \NC \type {{o}{f}{{-},{},{}}{f}{i}{c}{e}}  \NC\NR
 552 \NC First ligature stage:  \NC \type {{o}{{f-},{f},{<ff>}}{i}{c}{e}}  \NC\NR
 553 \NC Final result:          \NC \type {{o}{{f-},{<fi>},{<ffi>}}{c}{e}} \NC\NR
 554 \stoptabulate
 555
 556 That's bad enough, but let us assume that there is also a hyphenation point
 557 between the \type {f} and the \type {i}, to create \type {of-f-ice}. Then the
 558 final result should be:
 559
 560 \starttyping
 561 {o}{{f-},
 562     {{f-},
 563      {i},
 564      {<fi>}},
 565     {{<ff>-},
 566      {i},
 567      {<ffi>}}}{c}{e}
 568 \stoptyping
 569
 570 with discretionaries in the post-break text as well as in the replacement text of
 571 the top-level discretionary that resulted from the first hyphenation point.
 572
 573 Here is that nested solution again, in a different representation:
 574
 575 \starttabulate[|l|l|l|l|]
 576 \NC         \NC pre              \NC post          \NC replace           \NC \NR
 577 \NC topdisc \NC \type {f-}$^1$   \NC sub1          \NC sub2              \NC \NR
 578 \NC sub1    \NC \type {f-}$^2$   \NC \type {i}$^3$ \NC \type {<fi>}$^4$  \NC \NR
 579 \NC sub2    \NC \type {<ff>-}$^5$\NC \type {i}$^6$ \NC \type {<ffi>}$^7$ \NC \NR
 580 \stoptabulate
 581
 582 When line breaking is choosing its breakpoints, the following fields will
 583 eventually be selected:
 584
 585 \starttabulate[|l|l|l|]
 586 \NC \type {of-f-ice} \NC \type {f-}$^1$    \NC \NR
 587 \NC                  \NC \type {f-}$^2$    \NC \NR
 588 \NC                  \NC \type {i}$^3$     \NC \NR
 589 \NC \type {of-fice}  \NC \type {f-}$^1$    \NC \NR
 590 \NC                  \NC \type {<fi>}$^4$  \NC \NR
 591 \NC \type {off-ice}  \NC \type {<ff>-}$^5$ \NC \NR
 592 \NC                  \NC \type {i}$^6$     \NC \NR
 593 \NC \type {office}   \NC \type {<ffi>}$^7$ \NC \NR
 594 \stoptabulate
 595
 596 The current solution in \LUATEX\ is not able to handle nested discretionaries,
 597 but it is in fact smart enough to handle this fictional \type {of-f-ice} example.
 598 It does so by combining two sequential discretionary nodes as if they were a
 599 single object (where the second discretionary node is treated as an extension of
 600 the first node).
 601
 602 One can observe that the \type {of-f-ice} and \type {off-ice} cases both end with
 603 the same actual post replacement list (\type {i}), and that this would be the
 604 case even if that \type {i} was the first item of a potential following ligature
 605 like \type {ic}. This allows \LUATEX\ to do away with one of the fields, and thus
 606 make the whole structure fit into just two discretionary nodes.
 607
 608 The mapping of the seven list fields to the six fields in this discretionary node
 609 pair is as follows:
 610
 611 \starttabulate[|l|p|]
 612 \NC \bf field            \NC \bf description \NC \NR
 613 \NC \type {disc1.pre}     \NC \type {f-}$^1$  \NC \NR
 614 \NC \type {disc1.post}    \NC \type {<fi>}$^4$  \NC \NR
 615 \NC \type {disc1.replace} \NC \type {<ffi>}$^7$ \NC \NR
 616 \NC \type {disc2.pre}     \NC \type {f-}$^2$  \NC \NR
 617 \NC \type {disc2.post}    \NC \type {i}$^{3{,}6}$\NC \NR
 618 \NC \type {disc2.replace} \NC \type {<ff>-}$^5$\NC \NR
 619 \stoptabulate
 620
 621 What is actually generated after ligaturing has been applied is therefore:
 622
 623 \starttyping
 624 {o}{{f-},
 625     {<fi>},
 626     {<ffi>}}
 627    {{f-},
 628     {i},
 629     {<ff>-}}{c}{e}
 630 \stoptyping
 631
 632 The two discretionaries have different subtypes from a discretionary appearing on
 633 its own: the first has subtype 4, and the second has subtype 5. The need for
 634 these special subtypes stems from the fact that not all of the fields appear in
 635 their \quote {normal} location. The second discretionary especially looks odd,
 636 with things like the \type {<ff>-} appearing in \type {disc2.replace}. The fact
 637 that some of the fields have different meanings (and different processing code
 638 internally) is what makes it necessary to have different subtypes: this enables
 639 \LUATEX\ to distinguish this sequence of two joined discretionary nodes from the
 640 case of two standalone discretionaries appearing in a row.
 641
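As a rough sketch (not taken from the \LUATEX\ code base), the following \LUA\
function walks a node list and reports such joined pairs by testing the subtype
values mentioned above. The name \type {report_joined_discs} is made up for this
example, and only the top level of the list is inspected.

\starttyping
local disc_id = node.id("disc")

-- hypothetical helper: report the two halves of a joined discretionary pair
local function report_joined_discs(head)
    for d in node.traverse_id(disc_id, head) do
        if d.subtype == 4 then
            texio.write_nl("first discretionary of a joined pair")
        elseif d.subtype == 5 then
            texio.write_nl("second discretionary of a joined pair")
        end
    end
end
\stoptyping
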
 642 Of course there is still that relationship with fonts: ligatures can be implemented by
 643 mapping a sequence of glyphs onto one glyph, but also by selective replacement and
 644 kerning. This means that the above examples represent only the traditional
 645 approach.
 646
 647 \section{Breaking paragraphs into lines}
 648
 649 This code is still almost unchanged, but because of the above|-|mentioned changes
 650 with respect to discretionaries and ligatures, line breaking will potentially be
 651 different from traditional \TEX. The actual line breaking code is still based on
 652 the \TEX82 algorithms, and it does not expect there to be discretionaries inside
 653 of discretionaries.
 654
 655 But that situation is now fairly common in \LUATEX, due to the changes to the
 656 ligaturing mechanism. In addition, the \LUATEX\ discretionary nodes are implemented
 657 slightly differently from the \TEX82 nodes: the \type {no_break} text is now
 658 embedded inside the disc node, where previously these nodes kept their place in
 659 the horizontal list. In traditional \TEX\ the discretionary node contains a
 660 counter indicating how many nodes to skip, but in \LUATEX\ we store the pre, post
 661 and replace text in the discretionary node.
 662
 663 The combined effect of these two differences is that \LUATEX\ does not always use
 664 all of the potential breakpoints in a paragraph, especially when fonts with many
 665 ligatures are used. Of course kerning also complicates matters here.
 666
 667 \section{The \type {lang} library}
 668
 669 This library provides the interface to \LUATEX's structure
 670 representing a language, and the associated functions.
 671
 672 \startfunctioncall
 673 <language> l = lang.new()
 674 <language> l = lang.new(<number> id)
 675 \stopfunctioncall
 676
 677 This function creates a new userdata object. An object of type \type {<language>}
 678 is the first argument to most of the other functions in the \type {lang}
 679 library. These functions can also be used as if they were object methods, using
 680 the colon syntax.
 681
 682 Without an argument, the next available internal id number will be assigned to
 683 this object. With an argument, an object will be created that links to the internal
 684 language with that id number.
 685
 686 \startfunctioncall
 687 <number> n = lang.id(<language> l)
 688 \stopfunctioncall
 689
 690 Returns the internal \type {\language} id number this object refers to.
 691
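A minimal sketch of the two calls described so far might look as follows. It
assumes that a language with id~0 is acceptable as a target, and the variable
names are arbitrary.

\starttyping
local l = lang.new()      -- gets the next available internal id
local e = lang.new(0)     -- links to the internal language with id 0

texio.write_nl("new language id: " .. lang.id(l))
texio.write_nl("same value via method syntax: " .. l:id())
\stoptyping
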
 692 \startfunctioncall
 693 <string> n = lang.hyphenation(<language> l)
 694 lang.hyphenation(<language> l, <string> n)
 695 \stopfunctioncall
 696
 697 Either returns the current hyphenation exceptions for this language, or adds new
 698 ones. The syntax of the string is explained in~\in {section}
 699 [patternsexceptions].
 700
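For illustration only, exceptions could be added and queried like this. The two
words are arbitrary examples, separated by a space as in \type {\hyphenation}.

\starttyping
local l = lang.new()

-- add two exceptions to this language
lang.hyphenation(l, "ta-ble of-fice")

-- query what is currently stored for this language
texio.write_nl(lang.hyphenation(l))
\stoptyping
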
 701 \startfunctioncall
 702 lang.clear_hyphenation(<language> l)
 703 \stopfunctioncall
 704
 705 Clears the exception dictionary (string) for this language.
 706
 707 \startfunctioncall
 708 <string> n = lang.clean(<language> l, <string> o)
 709 <string> n = lang.clean(<string> o)
 710 \stopfunctioncall
 711
 712 Creates a hyphenation key from the supplied hyphenation value. The syntax of the
 713 argument string is explained in~\in {section} [patternsexceptions]. This function
 714 is useful if you want to do something else based on the words in a dictionary
 715 file, like spell|-|checking.
 716
 717 \startfunctioncall
 718 <string> n = lang.patterns(<language> l)
 719 lang.patterns(<language> l, <string> n)
 720 \stopfunctioncall
 721
 722 Adds additional patterns for this language object, or returns the current set.
 723 The syntax of this string is explained in~\in {section} [patternsexceptions].
 724
 725 \startfunctioncall
 726 lang.clear_patterns(<language> l)
 727 \stopfunctioncall
 728
 729 Clears the pattern dictionary for this language.
 730
 731 \startfunctioncall
 732 <number> n = lang.prehyphenchar(<language> l)
 733 lang.prehyphenchar(<language> l, <number> n)
 734 \stopfunctioncall
 735
 736 Gets or sets the \quote {pre|-|break} hyphen character for implicit hyphenation
 737 in this language (initially the hyphen, decimal 45).
 738
 739 \startfunctioncall
 740 <number> n = lang.posthyphenchar(<language> l)
 741 lang.posthyphenchar(<language> l, <number> n)
 742 \stopfunctioncall
 743
 744 Gets or sets the \quote {post|-|break} hyphen character for implicit hyphenation
 745 in this language (initially null, decimal~0, indicating emptiness).
 746
 747 \startfunctioncall
 748 <number> n = lang.preexhyphenchar(<language> l)
 749 lang.preexhyphenchar(<language> l, <number> n)
 750 \stopfunctioncall
 751
 752 Gets or sets the \quote {pre|-|break} hyphen character for explicit hyphenation
 753 in this language (initially null, decimal~0, indicating emptiness).
 754
 755 \startfunctioncall
 756 <number> n = lang.postexhyphenchar(<language> l)
 757 lang.postexhyphenchar(<language> l, <number> n)
 758 \stopfunctioncall
 759
 760 Gets or sets the \quote {post|-|break} hyphen character for explicit hyphenation
 761 in this language (initially null, decimal~0, indicating emptiness).
 762
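The four functions above share the same get and set pattern. A small sketch,
assuming that the font in use actually has a glyph in slot \type {0x2010} (the
\UNICODE\ hyphen), could be:

\starttyping
local l = lang.new()

-- query the current pre-break character for implicit hyphenation
texio.write_nl("pre-break char: " .. lang.prehyphenchar(l))

-- assumption for this example: slot 0x2010 exists in the current font
lang.prehyphenchar(l, 0x2010)
\stoptyping
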
 763 \startfunctioncall
 764 <boolean> success = lang.hyphenate(<node> head)
 765 <boolean> success = lang.hyphenate(<node> head, <node> tail)
 766 \stopfunctioncall
 767
 768 Inserts hyphenation points (discretionary nodes) in a node list. If \type {tail}
 769 is given as argument, processing stops on that node. Currently, \type {success}
 770 is always true if \type {head} (and \type {tail}, if specified) are proper nodes,
 771 regardless of possible other errors.
 772
 773 Hyphenation works only on \quote {characters}, that is, glyph nodes whose
 774 subtype has the value \type {1}. Glyph nodes with different subtypes are not
 775 processed. See \in {section~} [charsandglyphs] for
 776 more details.
 777
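One place where this function is typically called is the \type {hyphenate}
callback. The following sketch assumes a plain \LUATEX\ setup in which \type
{callback.register} is available to the user. It simply reinstates the built|-|in
behaviour, so any custom processing would go before or after the call.

\starttyping
-- the hyphenate callback receives head and tail and returns nothing
callback.register("hyphenate", function(head, tail)
    lang.hyphenate(head, tail)
end)
\stoptyping
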
 778 The following two commands can be used to set or query hj codes:
 779
 780 \startfunctioncall
 781 lang.sethjcode(<language> l, <number> char, <number> usedchar)
 782 <number> usedchar = lang.gethjcode(<language> l, <number> char)
 783 \stopfunctioncall
 784
 785 When you set a hjcode the current sets get initialized unless the set was already
 786 initialized due to \type {\savinghyphcodes} being larger than zero.
 787
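As a hedged example, the following maps the right single quote onto the
apostrophe for hyphenation purposes. The choice of characters is only an
illustration.

\starttyping
local l = lang.new()

-- treat U+2019 (right single quote) like U+0027 (apostrophe)
lang.sethjcode(l, 0x2019, 0x27)

texio.write_nl("hjcode: " .. lang.gethjcode(l, 0x2019))
\stoptyping
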
 788 \stopchapter
 789
 790 \stopcomponent
 791
 792 % \parindent0pt \hsize=1.1cm
 793 % 12-34-56 \par
 794 % 12-34-\hbox{56} \par
 795 % 12-34-\vrule width 1em height 1.5ex \par
 796 % 12-\hbox{34}-56 \par
 797 % 12-\vrule width 1em height 1.5ex-56 \par
 798 % \hjcode`\1=`\1 \hjcode`\2=`\2 \hjcode`\3=`\3 \hjcode`\4=`\4 \vskip.5cm
 799 % 12-34-56 \par
 800 % 12-34-\hbox{56} \par
 801 % 12-34-\vrule width 1em height 1.5ex \par
 802 % 12-\hbox{34}-56 \par
 803 % 12-\vrule width 1em height 1.5ex-56 \par
 804