manual/luatex-languages.tex

   1 \environment luatex-style
   2 \environment luatex-logos
   3
   4 \startcomponent luatex-languages
   5
   6 \startchapter[reference=languages,title={Languages and characters, fonts and glyphs}]
   7
   8 \LUATEX's internal handling of the characters and glyphs that eventually become
   9 typeset is quite different from the way \TEX82 handles those same objects. The
  10 easiest way to explain the difference is to focus on unrestricted horizontal mode
  11 (i.e.\ paragraphs) and hyphenation first. Later on, it will be easy to deal
  12 with the differences that occur in horizontal and math modes.
  13
  14 In \TEX82, the characters you type are converted into \type {char_node} records
  15 when they are encountered by the main control loop. \TEX\ attaches and processes
  16 the font information while creating those records, so that the resulting \quote
  17 {horizontal list} contains the final forms of ligatures and implicit kerning.
This packaging is needed because we may want to get the effective width of, for
instance, a horizontal box.
  20
When it becomes necessary to hyphenate words in a paragraph, \TEX\ converts (one
word at a time) the \type {char_node} records into a string array by replacing
  23 ligatures with their components and ignoring the kerning. Then it runs the
  24 hyphenation algorithm on this string, and converts the hyphenated result back
  25 into a \quote {horizontal list} that is consecutively spliced back into the
  26 paragraph stream. Keep in mind that the paragraph may contain unboxed horizontal
  27 material, which then already contains ligatures and kerns and the words therein
  28 are part of the hyphenation process.
  29
  30 The \type {char_node} records are somewhat misnamed, as they are glyph positions
  31 in specific fonts, and therefore not really \quote {characters} in the linguistic
  32 sense. There is no language information inside the \type {char_node} records.
  33 Instead, language information is passed along using \type {language whatsit}
  34 records inside the horizontal list.
  35
  36 In \LUATEX, the situation is quite different. The characters you type are always
  37 converted into \type {glyph_node} records with a special subtype to identify them
  38 as being intended as linguistic characters. \LUATEX\ stores the needed language
  39 information in those records, but does not do any font|-|related processing at
  40 the time of node creation. It only stores the index of the current font.
  41
  42 When it becomes necessary to typeset a paragraph, \LUATEX\ first inserts all
  43 hyphenation points right into the whole node list. Next, it processes all the
  44 font information in the whole list (creating ligatures and adjusting kerning),
  45 and finally it adjusts all the subtype identifiers so that the records are \quote
  46 {glyph nodes} from now on.
  47
  48 That was the broad overview. The rest of this chapter will deal with the minutiae
  49 of the new process.
  50
  51 \section[charsandglyphs]{Characters and glyphs}
  52
\TEX82 (including \PDFTEX) differentiates between \type {char_node}s and \type
{lig_node}s. The former are simple items that contain nothing but a \quote
{character} and a \quote {font} field, and they live in the same memory as
tokens do. The latter also contain a list of components and a subtype
indicating whether the ligature is the result of a word boundary, and they are
stored in the same place as other nodes like boxes, kerns and glues.
  59
  60 In \LUATEX, these two types are merged into one, somewhat larger structure called
  61 a \type {glyph_node}. Besides having the old character, font, and component
  62 fields, and the new special fields like \quote {attr}
  63 (see~\in{section}[glyphnodes]), these nodes also contain:
  64
  65 \startitemize
  66
  67 \startitem A subtype, split into four main types:
  68
  69     \startitemize
  70         \startitem
  71             \type {character}, for characters to be hyphenated: the lowest bit
  72             (bit 0) is set to 1.
  73         \stopitem
  74         \startitem
  75             \type {glyph}, for specific font glyphs: the lowest bit (bit 0) is
  76             not set.
  77         \stopitem
  78         \startitem
  79             \type {ligature}, for ligatures (bit 1 is set)
  80         \stopitem
  81         \startitem
  82             \type {ghost}, for \quote {ghost objects} (bit 2 is set)
  83         \stopitem
  84     \stopitemize
  85
  86     The latter two make further use of two extra fields (bits 3 and 4):
  87
  88     \startitemize
  89         \startitem
  90             \type {left}, for ligatures created from a left word boundary and for
  91             ghosts created from \type {\leftghost}
  92         \stopitem
  93         \startitem
  94             \type {right}, for ligatures created from a right word boundary and
  95             for ghosts created from \type {\rightghost}
  96         \stopitem
  97    \stopitemize
  98
  99    For ligatures, both bits can be set at the same time (in case of a
 100    single|-|glyph word).
 101
 102 \stopitem
 103
 104 \startitem
 105     \type {glyph_node}s of type \quote {character} also contain language data,
 106     split into four items that were current when the node was created: the
 107     \type {\setlanguage} (15 bits), \type {\lefthyphenmin} (8 bits), \type
 108     {\righthyphenmin} (8 bits), and \type {\uchyph} (1 bit).
 109 \stopitem
 110
 111 \stopitemize
 112
Incidentally, \LUATEX\ allows 16383 separate languages, and words can be 256
characters long. The language is stored with each character. You can set
\type {\firstvalidlanguage} to, for instance, 1 and thereby make language~0
an ignored hyphenation language.
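
As a minimal sketch of this setting (the language numbers are only illustrative,
and language~1 is assumed to have patterns loaded):

\starttyping
\firstvalidlanguage=1 % languages below 1 are ignored by the hyphenator

\language=0 Text in language 0 is left unhyphenated.\par
\language=1 Text in language 1 is hyphenated as usual.\par
\stoptyping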
 117
The new primitive \type {\hyphenationmin} can be used to signal the minimal length
of a word. This value is stored with the (current) language.
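
For instance, assuming a language number~1 chosen purely for illustration, the
following would presumably leave words of fewer than six characters unhyphenated
in that language:

\starttyping
\language=1
\hyphenationmin=6 % stored with the current language (here 1)
\stoptyping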
 120
 121 Because the \type {\uchyph} value is saved in the actual nodes, its handling is
 122 subtly different from \TEX82: changes to \type {\uchyph} become effective
 123 immediately, not at the end of the current partial paragraph.
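
A small illustration, assuming patterns that permit a break in the sample word:
within one paragraph the first occurrence below is created while \type {\uchyph}
is zero and is therefore not a candidate for hyphenation, while the second one is.

\starttyping
\uchyph=0 Constantinople is not hyphenated here,
\uchyph=1 but Constantinople can be hyphenated here.\par
\stoptyping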
 124
 125 Typeset boxes now always have their language information embedded in the nodes
 126 themselves, so there is no longer a possible dependency on the surrounding
 127 language settings. In \TEX82, a mid-paragraph statement like \type {\unhbox0} would
 128 process the box using the current paragraph language unless there was a
 129 \type {\setlanguage} issued inside the box. In \LUATEX, all language variables are
 130 already frozen.
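
A sketch of the difference, assuming hypothetical language numbers 1 and~2 that
both have patterns loaded:

\starttyping
\setbox0\hbox{\language=2 Donaudampfschiff}
\language=1 Some text, then \unhbox0, and more text.\par
\stoptyping

In \LUATEX\ the unboxed word keeps language~2, because that value was stored in
its nodes when the box was built; the surrounding \type {\language=1} has no
influence on it.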
 131
In traditional \TEX\ the process of hyphenation is driven by so|-|called lccodes.
In \LUATEX\ we made this dependency less strong. There are several strategies
possible. When you do nothing, the current lccodes are used when loading
patterns, setting exceptions or hyphenating a list.
 136
When you set \type {\savinghyphcodes} to a value larger than zero, the current set of
lccodes will be saved with the language. In that case changing an lccode afterwards
has no effect. However, you can adapt the set with:
 140
 141 \starttyping
 142 \hjcode`a=`a
 143 \stoptyping
 144
 145 This change is global which makes sense if you keep in mind that the moment that
 146 hyphenation happens is (normally) when the paragraph or a horizontal box is
 147 constructed. When \type {\savinghyphcodes} was zero when the language got
 148 initialized you start out with nothing, otherwise you already have a set.
 149
 150 Carrying all this information with each glyph would give too much overhead and
 151 also make the definition more complex. A solution with hj codesets was considered
 152 but rejected because in practice the current approach is sufficient and it would
 153 not be compatible anyway.
 154
Beware: the values are always saved in the format, independent of the setting
of \type {\savinghyphcodes} at the moment the format is dumped.
 157
 158 \section{The main control loop}
 159
 160 In \LUATEX's main loop, almost all input characters that are to be typeset are
 161 converted into \type {glyph} node records with subtype \quote {character}, but
 162 there are a few exceptions.
 163
First, the \type {\accent} primitive creates nodes with subtype \quote {glyph}
 165 instead of \quote {character}: one for the actual accent and one for the
 166 accentee. The primary reason for this is that \type {\accent} in \TEX82 is
 167 explicitly dependent on the current font encoding, so it would not make much
 168 sense to attach a new meaning to the primitive's name, as that would invalidate
 169 many old documents and macro packages. A secondary reason is that in \TEX82,
 170 \type {\accent} prohibits hyphenation of the current word. Since in \LUATEX\
 171 hyphenation only takes place on \quote {character} nodes, it is possible to
 172 achieve the same effect.
 173
This change of meaning did happen with \type {\char}, which now generates \quote
{glyph} nodes with a character subtype. In traditional \TEX\ there was a strong
relationship between the 8|-|bit input encoding, hyphenation and glyphs taken
from a font. In \LUATEX\ we have \UTF\ input, and in most cases this maps
directly to a character in a font, apart from glyph replacement in the font
engine. If you want to access arbitrary glyphs in a font directly you can always
use \LUA\ to do so, because fonts are available as \LUA\ tables.
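
The contrast with \type {\accent} can be sketched as follows (the slot numbers
are \UNICODE\ values and assume a font that provides them):

\starttyping
caf\char"E9     % a glyph node with subtype character
caf\accent"B4 e % two glyph nodes with subtype glyph: accent and accentee
\stoptyping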
 181
 182 Second, all the results of processing in math mode eventually become nodes with
 183 \quote {glyph} subtypes.
 184
 185 Third, the \ALEPH|-|derived commands \type {\leftghost} and \type {\rightghost}
 186 create nodes of a third subtype: \quote {ghost}. These nodes are ignored
 187 completely by all further processing until the stage where inter|-|glyph kerning
 188 is added.
 189
 190 Fourth, automatic discretionaries are handled differently. \TEX82 inserts an
 191 empty discretionary after sensing an input character that matches the \type
 192 {\hyphenchar} in the current font. This test is wrong, in our opinion: whether or
 193 not hyphenation takes place should not depend on the current font, it is a
 194 language property.
 195
 196 In \LUATEX, it works like this: if \LUATEX\ senses a string of input characters
 197 that matches the value of the new integer parameter \type {\exhyphenchar}, it will
 198 insert an explicit discretionary after that series of nodes. Initex sets the \type
 199 {\exhyphenchar=`\-}. Incidentally, this is a global parameter instead of a
 200 language-specific one because it may be useful to change the value depending on
 201 the document structure instead of the text language.
 202
 203 The insertion of discretionaries after a sequence of explicit hyphens happens at
 204 the same time as the other hyphenation processing, {\it not\/} inside the main
 205 control loop.
 206
The only use \LUATEX\ has for \type {\hyphenchar} is in the check whether a word
should be considered for hyphenation at all. If the \type {\hyphenchar} of the font
attached to the first character node in a word is negative, then hyphenation of
that word is abandoned immediately. {\bf This behavior is added for backward
compatibility only, and \type {\hyphenchar=-1} should not be used as a means of
preventing hyphenation in new \LUATEX\ documents.}
 213
 214 Fifth, \type {\setlanguage} no longer creates whatsits. The meaning of \type
 215 {\setlanguage} is changed so that it is now an integer parameter like all others.
That integer parameter is used in \type {glyph_node} creation to add language
 217 information to the glyph nodes. In conjunction, the \type {\language} primitive is
 218 extended so that it always also updates the value of \type {\setlanguage}.
 219
 220 Sixth, the \type {\noboundary} command (this command prohibits word boundary
 221 processing where that would normally take place) now does create whatsits. These
 222 whatsits are needed because the exact place of the \type {\noboundary} command in
 223 the input stream has to be retained until after the ligature and font processing
 224 stages.
 225
 226 Finally, there is no longer a \type {main_loop} label in the code. Remember that
 227 \TEX82 did quite a lot of processing while adding \type {char_nodes} to the
 228 horizontal list? For speed reasons, it handled that processing code outside of
 229 the \quote {main control} loop, and only the first character of any \quote {word}
 230 was handled by that \quote {main control} loop. In \LUATEX, there is no longer a
 231 need for that (all hard work is done later), and the (now very small) bits of
 232 character|-|handling code have been moved back inline. When \type
 233 {\tracingcommands} is on, this is visible because the full word is reported,
 234 instead of just the initial character.
 235
 236 \section[patternsexceptions]{Loading patterns and exceptions}
 237
 238 The hyphenation algorithm in \LUATEX\ is quite different from the one in \TEX82,
 239 although it uses essentially the same user input.
 240
 241 After expansion, the argument for \type {\patterns} has to be proper \UTF8 with
individual patterns separated by spaces; no \type {\char} or \type {\chardef}d
 243 commands are allowed. The current implementation is even more strict, and will
 244 reject all non|-|\UNICODE\ characters, but that will be changed in the future.
 245 For now, the generated errors are a valuable tool in discovering font-encoding
 246 specific pattern files.
 247
 248 Likewise, the expanded argument for \type {\hyphenation} also has to be proper
 249 \UTF8, but here a tiny little bit of extra syntax is provided:
 250
 251 \startitemize[n]
 252 \startitem
    Three sets of arguments in curly braces (\type {{}{}{}}) indicate a desired
    complex discretionary, with arguments as in the \type {\discretionary} command in
    normal document input.
 256 \stopitem
 257 \startitem
 258     A \type {-} indicates a desired simple discretionary, cf.\ \type {\-} and \type
 259     {\discretionary{-}{}{}} in normal document input.
 260 \stopitem
 261 \startitem
 262     Internal command names are ignored. This rule is provided especially for \type
 263     {\discretionary}, but it also helps to deal with \type {\relax} commands that
 264     may sneak in.
 265 \stopitem
 266 \startitem
 267     An \type {=} indicates a (non|-|discretionary) hyphen in the document input.
 268 \stopitem
 269 \stopitemize
 270
 271 The expanded argument is first converted back to a space-separated string while
 272 dropping the internal command names. This string is then converted into a
 273 dictionary by a routine that creates key|-|value pairs by converting the other
 274 listed items. It is important to note that the keys in an exception dictionary
 275 can always be generated from the values. Here are a few examples:
 276
 277 \starttabulate[|l|l|l|]
 278 \NC \ssbf value \NC \ssbf implied key (input) \NC \ssbf effect \NC\NR
 279 \NC \type {ta-ble} \NC table \NC \type {ta\-ble} ($=$ \type {ta\discretionary{-}{}{}ble}) \NC\NR
 280 \NC \type {ba{k-}{}{c}ken} \NC backen \NC \type {ba\discretionary{k-}{}{c}ken} \NC\NR
 281 \stoptabulate
 282
 283 The resultant patterns and exception dictionary will be stored under the language
 284 code that is the present value of \type {\language}.
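
For example, the two table entries could be registered for a language as
follows (the language number is hypothetical):

\starttyping
\language=1
\hyphenation{ta-ble ba{k-}{}{c}ken}
\stoptyping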
 285
 286 In the last line of the table, you see there is no \type {\discretionary} command
 287 in the value: the command is optional in the \TEX-based input syntax. The
 288 underlying reason for that is that it is conceivable that a whole dictionary of
 289 words is stored as a plain text file and loaded into \LUATEX\ using one of the
 290 functions in the \LUA\ \type {lang} library. This loading method is quite a bit
 291 faster than going through the \TEX\ language primitives, but some (most?) of that
 292 speed gain would be lost if it had to interpret command sequences while doing so.
 293
 294 It is possible to specify extra hyphenation points in compound words by using
 295 \type {{-}{}{-}} for the explicit hyphen character (replace \type {-} by the
 296 actual explicit hyphen character if needed). For example, this matches the word
\quote {multi|-|word|-|boundaries} and allows an extra break in between \quote
 298 {boun} and \quote {daries}:
 299
 300 \starttyping
 301 \hyphenation{multi{-}{}{-}word{-}{}{-}boun-daries}
 302 \stoptyping
 303
 304 The motivation behind the \ETEX\ extension \type {\savinghyphcodes} was that
 305 hyphenation heavily depended on font encodings. This is no longer true in
 306 \LUATEX, and the corresponding primitive is ignored pending complete removal. The
 307 future semantics of \type {\uppercase} and \type {\lowercase} are still under
 308 consideration, no changes have taken place yet.
 309
 310 \section{Applying hyphenation}
 311
The internal structures \LUATEX\ uses for the insertion of discretionaries in
words are very different from the ones in \TEX82, and that means there are some
 314 noticeable differences in handling as well.
 315
 316 First and foremost, there is no \quote {compressed trie} involved in hyphenation.
 317 The algorithm still reads \PATGEN-generated pattern files, but \LUATEX\ uses a
 318 finite state hash to match the patterns against the word to be hyphenated. This
 319 algorithm is based on the \quote {libhnj} library used by \OPENOFFICE, which in
 320 turn is inspired by \TEX. The memory allocation for this new implementation is
 321 completely dynamic, so the \WEBC\ setting for \type {trie_size} is ignored.
 322
 323 Differences between \LUATEX\ and \TEX82 that are a direct result of that:
 324
 325 \startitemize
 326 \startitem
 327     \LUATEX\ happily hyphenates the full \UNICODE\ character range.
 328 \stopitem
 329 \startitem
 330     Pattern and exception dictionary size is limited by the available memory
 331     only, all allocations are done dynamically. The trie|-|related settings in
 332     \type {texmf.cnf} are ignored.
 333 \stopitem
 334 \startitem
 335     Because there is no \quote {trie preparation} stage, language patterns never
 336     become frozen. This means that the primitive \type {\patterns} (and its \LUA\
 337     counterpart \type {lang.patterns}) can be used at any time, not only in
 338     ini\TEX.
 339 \stopitem
 340 \startitem
 341     Only the string representation of \type {\patterns} and \type {\hyphenation} is
 342     stored in the format file. At format load time, they are simply
 343     re|-|evaluated. It follows that there is no real reason to preload languages
 344     in the format file. In fact, it is usually not a good idea to do so. It is
 345     much smarter to load patterns no sooner than the first time they are actually
 346     needed.
 347 \stopitem
 348 \startitem
 349     \LUATEX\ uses the language-specific variables \type {\prehyphenchar} and \type
 350     {\posthyphenchar} in the creation of implicit discretionaries, instead of
 351     \TEX82's \type {\hyphenchar}, and the values of the language|-|specific variables
 352     \type {\preexhyphenchar} and \type {\postexhyphenchar} for explicit
 353     discretionaries (instead of \TEX82's empty discretionary).
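    The values of the two counters related to hyphenation, \type {\hyphenpenalty}
    and \type {\exhyphenpenalty}, are now stored in the discretionary nodes. This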
 356     The value of the two counters related to hyphenation, \type {hyphenpenalty}
 357     and \type {exhyphenpenalty}, are now stored in the discretionary nodes. This
 358     permits a local overload for explicit \type {\discretionary} commands. The
 359     value current when the hyphenation pass is applied is used. When no callbacks
 360     are used this is compatible with traditional \TEX. When you apply the \LUA\
 361     \type {lang.hyphenate} function the current values are used.
 362 \stopitem
 363 \stopitemize
 364
 365 Because we store penalties in the disc node the \type {\discretionary} command has
 366 been extended to accept an optional penalty specification, so you can do the
 367 following:
 368
 369 \startbuffer
 370 \hsize1mm
 371 1:foo{\hyphenpenalty 10000\discretionary{}{}{}}bar\par
 372 2:foo\discretionary penalty 10000 {}{}{}bar\par
 373 3:foo\discretionary{}{}{}bar\par
 374 \stopbuffer
 375
 376 \typebuffer
 377
 378 This results in:
 379
 380 \blank \start \getbuffer \stop \blank
 381
 382 Inserted characters and ligatures inherit their attributes from the nearest glyph
 383 node item (usually the preceding one, but the following one for the items
 384 inserted at the left-hand side of a word).
 385
Word boundaries are no longer implied by font switches, but by language switches.
One word can have two separate fonts and still be hyphenated correctly (but it
cannot have two different languages: the \type {\setlanguage} command forces a
word boundary).
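
A minimal sketch, assuming a macro package that defines \type {\bf} as a font
switch:

\starttyping
{\bf type}setting % two fonts, but still hyphenated as one word
\stoptyping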
 390
All languages start out with \type {\prehyphenchar=`\-}, \type {\posthyphenchar=0},
\type {\preexhyphenchar=0} and \type {\postexhyphenchar=0}. When you assign a
value to one of these four parameters, you are actually changing the settings
for the current \type {\language}. This behavior is compatible with \type {\patterns}
and \type {\hyphenation}.
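
As an illustration, a language in which an explicit hyphen at a line break is
repeated at the start of the next line (as is customary in Polish, for
example) could be set up along these lines; the language number is hypothetical:

\starttyping
\language=1
\prehyphenchar=`\-    % the default, shown for completeness
\postexhyphenchar=`\- % repeat the hyphen after a break at an explicit hyphen
\stoptyping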
 396
 397 \LUATEX\ also hyphenates the first word in a paragraph. Words can be up to 256
 398 characters long (up from 64 in \TEX82). Longer words generate an error right now,
 399 but eventually either the limitation will be removed or perhaps it will become
 400 possible to silently ignore the excess characters (this is what happens in
 401 \TEX82, but there the behavior cannot be controlled).
 402
 403 If you are using the \LUA\ function \type {lang.hyphenate}, you should be aware
 404 that this function expects to receive a list of \quote {character} nodes. It will
 405 not operate properly in the presence of \quote {glyph}, \quote {ligature}, or
 406 \quote {ghost} nodes, nor does it know how to deal with kerning. In the near
 407 future, it will be able to skip over \quote {ghost} nodes, and we may add a less
 408 fuzzy function you can call as well.
 409
The hyphenation exception dictionary is maintained as a key|-|value hash, and that
is also dynamic, so the \type {hyph_size} setting is not used either.
 412
 413 \section{Applying ligatures and kerning}
 414
 415 After all possible hyphenation points have been inserted in the list, \LUATEX\
 416 will process the list to convert the \quote {character} nodes into \quote {glyph}
 417 and \quote {ligature} nodes. This is actually done in two stages: first all
 418 ligatures are processed, then all kerning information is applied to the result
 419 list. But those two stages are somewhat dependent on each other: If the used font
 420 makes it possible to do so, the ligaturing stage adds virtual \quote {character}
 421 nodes to the word boundaries in the list. While doing so, it removes and
 422 interprets \type {noboundary} nodes. The kerning stage deletes those word
 423 boundary items after it is done with them, and it does the same for \quote
 424 {ghost} nodes. Finally, at the end of the kerning stage, all remaining \quote
 425 {character} nodes are converted to \quote {glyph} nodes.
 426
 427 This work separation is worth mentioning because, if you overrule from \LUA\ only
 428 one of the two callbacks related to font handling, then you have to make sure you
 429 perform the tasks normally done by \LUATEX\ itself in order to make sure that the
 430 other, non|-|overruled, routine continues to function properly.
 431
 432 Work in this area is not yet complete, but most of the possible cases are handled
 433 by our rewritten ligaturing engine. We are working hard to make sure all of the
 434 possible inputs will become supported soon.
 435
 436 For example, take the word \type {office}, hyphenated \type {of-fice}, using a
 437 \quote {normal} font with all the \type {f}-\type {f} and \type {f}-\type {i}
 438 type ligatures:
 439
 440 \starttabulate[|l|l|]
 441 \NC Initial:               \NC \type {{o}{f}{f}{i}{c}{e}}             \NC\NR
 442 \NC After hyphenation:     \NC \type {{o}{f}{{-},{},{}}{f}{i}{c}{e}}  \NC\NR
 443 \NC First ligature stage:  \NC \type {{o}{{f-},{f},{<ff>}}{i}{c}{e}}  \NC\NR
 444 \NC Final result:          \NC \type {{o}{{f-},{<fi>},{<ffi>}}{c}{e}} \NC\NR
 445 \stoptabulate
 446
 447 That's bad enough, but let us assume that there is also a hyphenation point
 448 between the \type {f} and the \type {i}, to create \type {of-f-ice}. Then the
 449 final result should be:
 450
 451 \starttyping
 452 {o}{{f-},
 453     {{f-},
 454      {i},
 455      {<fi>}},
 456     {{<ff>-},
 457      {i},
 458      {<ffi>}}}{c}{e}
 459 \stoptyping
 460
 461 with discretionaries in the post-break text as well as in the replacement text of
 462 the top-level discretionary that resulted from the first hyphenation point.
 463
 464 Here is that nested solution again, in a different representation:
 465
 466 \starttabulate[|l|l|l|l|]
 467 \NC         \NC pre              \NC post          \NC replace           \NC \NR
 468 \NC topdisc \NC \type {f-}$^1$   \NC sub1          \NC sub2              \NC \NR
 469 \NC sub1    \NC \type {f-}$^2$   \NC \type {i}$^3$ \NC \type {<fi>}$^4$  \NC \NR
 470 \NC sub2    \NC \type {<ff>-}$^5$\NC \type {i}$^6$ \NC \type {<ffi>}$^7$ \NC \NR
 471 \stoptabulate
 472
 473 When line breaking is choosing its breakpoints, the following fields will
 474 eventually be selected:
 475
 476 \starttabulate[|l|l|l|]
 477 \NC \type {of-f-ice} \NC \type {f-}$^1$    \NC \NR
 478 \NC                  \NC \type {f-}$^2$    \NC \NR
 479 \NC                  \NC \type {i}$^3$     \NC \NR
 480 \NC \type {of-fice}  \NC \type {f-}$^1$    \NC \NR
 481 \NC                  \NC \type {<fi>}$^4$  \NC \NR
 482 \NC \type {off-ice}  \NC \type {<ff>-}$^5$ \NC \NR
 483 \NC                  \NC \type {i}$^6$     \NC \NR
 484 \NC \type {office}   \NC \type {<ffi>}$^7$ \NC \NR
 485 \stoptabulate
 486
 487 The current solution in \LUATEX\ is not able to handle nested discretionaries,
 488 but it is in fact smart enough to handle this fictional \type {of-f-ice} example.
 489 It does so by combining two sequential discretionary nodes as if they were a
 490 single object (where the second discretionary node is treated as an extension of
 491 the first node).
 492
 493 One can observe that the \type {of-f-ice} and \type {off-ice} cases both end with
 494 the same actual post replacement list (\type {i}), and that this would be the
 495 case even if that \type {i} was the first item of a potential following ligature
 496 like \type {ic}. This allows \LUATEX\ to do away with one of the fields, and thus
make the whole structure fit into just two discretionary nodes.
 498
 499 The mapping of the seven list fields to the six fields in this discretionary node
 500 pair is as follows:
 501
 502 \starttabulate[|l|p|]
 503 \NC \bf field            \NC \bf description \NC \NR
 504 \NC \type {disc1.pre}     \NC \type {f-}$^1$  \NC \NR
 505 \NC \type {disc1.post}    \NC \type {<fi>}$^4$  \NC \NR
 506 \NC \type {disc1.replace} \NC \type {<ffi>}$^7$ \NC \NR
 507 \NC \type {disc2.pre}     \NC \type {f-}$^2$  \NC \NR
 508 \NC \type {disc2.post}    \NC \type {i}$^{3{,}6}$\NC \NR
 509 \NC \type {disc2.replace} \NC \type {<ff>-}$^5$\NC \NR
 510 \stoptabulate
 511
 512 What is actually generated after ligaturing has been applied is therefore:
 513
 514 \starttyping
 515 {o}{{f-},
 516     {<fi>},
 517     {<ffi>}}
 518    {{f-},
 519     {i},
 520     {<ff>-}}{c}{e}
 521 \stoptyping
 522
 523 The two discretionaries have different subtypes from a discretionary appearing on
 524 its own: the first has subtype 4, and the second has subtype 5. The need for
 525 these special subtypes stems from the fact that not all of the fields appear in
 526 their \quote {normal} location. The second discretionary especially looks odd,
 527 with things like the \type {<ff>-} appearing in \type {disc2.replace}. The fact
 528 that some of the fields have different meanings (and different processing code
 529 internally) is what makes it necessary to have different subtypes: this enables
 530 \LUATEX\ to distinguish this sequence of two joined discretionary nodes from the
 531 case of two standalone discretionaries appearing in a row.
 532
 533 Of course there is still that relationship with fonts: ligatures can be implemented by
 534 mapping a sequence of glyphs onto one glyph, but also by selective replacement and
 535 kerning. This means that the above examples are just representing the traditional
 536 approach.
 537
 538 \section{Breaking paragraphs into lines}
 539
 540 This code is still almost unchanged, but because of the above|-|mentioned changes
 541 with respect to discretionaries and ligatures, line breaking will potentially be
 542 different from traditional \TEX. The actual line breaking code is still based on
 543 the \TEX82 algorithms, and it does not expect there to be discretionaries inside
 544 of discretionaries.
 545
 546 But that situation is now fairly common in \LUATEX, due to the changes to the
 547 ligaturing mechanism. And also, the \LUATEX\ discretionary nodes are implemented
slightly differently from the \TEX82 nodes: the \type {no_break} text is now
 549 embedded inside the disc node, where previously these nodes kept their place in
 550 the horizontal list (the discretionary node contained a counter indicating how
 551 many nodes to skip).
 552
 553 The combined effect of these two differences is that \LUATEX\ does not always use
 554 all of the potential breakpoints in a paragraph, especially when fonts with many
 555 ligatures are used.
 556
 557 \stopchapter
 558
 559 \stopcomponent