1 % language=uk
3 \environment luatex-style
4 \environment luatex-logos
6 \startcomponent luatex-languages
8 \startchapter[reference=languages,title={Languages, characters, fonts and glyphs}]
10 \LUATEX's internal handling of the characters and glyphs that eventually become
11 typeset is quite different from the way \TEX82 handles those same objects. The
12 easiest way to explain the difference is to focus on unrestricted horizontal mode
13 (i.e.\ paragraphs) and hyphenation first. Later on, it will be easy to deal
14 with the differences that occur in horizontal and math modes.
16 In \TEX82, the characters you type are converted into \type {char_node} records
17 when they are encountered by the main control loop. \TEX\ attaches and processes
18 the font information while creating those records, so that the resulting \quote
19 {horizontal list} contains the final forms of ligatures and implicit kerning.
20 This packaging is needed because we may want to get the effective width of for
21 instance a horizontal box.
23 When it becomes necessary to hyphenate words in a paragraph, \TEX\ converts (one
24 word at a time) the \type {char_node} records into a string by replacing ligatures
25 with their components and ignoring the kerning. Then it runs the hyphenation
26 algorithm on this string, and converts the hyphenated result back into a \quote
27 {horizontal list} that is consecutively spliced back into the paragraph stream.
28 Keep in mind that the paragraph may contain unboxed horizontal material, which
29 then already contains ligatures and kerns and the words therein are part of the
30 hyphenation process.
32 Those \type {char_node} records are somewhat misnamed, as they are glyph
33 positions in specific fonts, and therefore not really \quote {characters} in the
34 linguistic sense. There is no language information inside the \type {char_node}
35 records at all. Instead, language information is passed along using \type
36 {language whatsit} records inside the horizontal list.
38 In \LUATEX, the situation is quite different. The characters you type are always
39 converted into \type {glyph_node} records with a special subtype to identify them
40 as being intended as linguistic characters. \LUATEX\ stores the needed language
41 information in those records, but does not do any font|-|related processing at
42 the time of node creation. It only stores the index of the current font and a
43 reference to a character in that font.
45 When it becomes necessary to typeset a paragraph, \LUATEX\ first inserts all
46 hyphenation points right into the whole node list. Next, it processes all the
47 font information in the whole list (creating ligatures and adjusting kerning),
48 and finally it adjusts all the subtype identifiers so that the records are \quote
49 {glyph nodes} from now on.
51 \section[charsandglyphs]{Characters and glyphs}
53 \TEX82 (including \PDFTEX) differentiates between \type {char_node}s and \type
54 {lig_node}s. The former were simple items that contained nothing but a \quote
55 {character} and a \quote {font} field, and they lived in the same memory as
56 tokens did. The latter also contained a list of components, and a subtype
57 indicating whether this ligature was the result of a word boundary, and it was
58 stored in the same place as other nodes like boxes and kerns and glues.
60 In \LUATEX, these two types are merged into one, somewhat larger structure called
61 a \type {glyph_node}. Besides having the old character, font, and component
62 fields, and the new special fields like \quote {attr} (see~\in {section}
63 [glyphnodes]), these nodes also contain:
65 \startitemize
67 \startitem A subtype, split into four main types:
69 \startitemize
70 \startitem
71 \type {character}, for characters to be hyphenated: the lowest bit
72 (bit 0) is set to 1.
73 \stopitem
74 \startitem
75 \type {glyph}, for specific font glyphs: the lowest bit (bit 0) is
76 not set.
77 \stopitem
78 \startitem
79 \type {ligature}, for ligatures (bit 1 is set)
80 \stopitem
81 \startitem
82 \type {ghost}, for \quote {ghost objects} (bit 2 is set)
83 \stopitem
84 \stopitemize
86 The latter two make further use of two extra fields (bits 3 and 4):
88 \startitemize
89 \startitem
90 \type {left}, for ligatures created from a left word boundary and for
91 ghosts created from \type {\leftghost}
92 \stopitem
93 \startitem
94 \type {right}, for ligatures created from a right word boundary and
95 for ghosts created from \type {\rightghost}
96 \stopitem
97 \stopitemize
99 For ligatures, both bits can be set at the same time (in case of a
100 single|-|glyph word).
102 \stopitem
104 \startitem
105 \type {glyph_node}s of type \quote {character} also contain language data,
106 split into four items that were current when the node was created: the
107 \type {\setlanguage} (15 bits), \type {\lefthyphenmin} (8 bits), \type
108 {\righthyphenmin} (8 bits), and \type {\uchyph} (1 bit).
109 \stopitem
111 \stopitemize
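The subtype layout listed above can be illustrated with a small \LUA\ sketch. The
helper names \type {bit_is_set} and \type {decode_subtype} are made up for this
illustration and are not part of any \LUATEX\ library; only the bit positions are
taken from the list. Plain arithmetic is used instead of a bit library so that
the code does not depend on a particular \LUA\ version.

\starttyping
-- Hypothetical helper: test whether bit n (counting from 0) is set in value.
local function bit_is_set(value, n)
    return math.floor(value / 2^n) % 2 == 1
end

-- Hypothetical helper: split a glyph subtype into the flags described above.
local function decode_subtype(subtype)
    return {
        character = bit_is_set(subtype, 0), -- bit 0: character to be hyphenated
        ligature  = bit_is_set(subtype, 1), -- bit 1: ligature
        ghost     = bit_is_set(subtype, 2), -- bit 2: ghost object
        left      = bit_is_set(subtype, 3), -- bit 3: left boundary or \leftghost
        right     = bit_is_set(subtype, 4), -- bit 4: right boundary or \rightghost
    }
end
\stoptyping

With these definitions a subtype value of 10 (bits 1 and 3 set) would be reported
as a ligature created from a left word boundary.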
113 Incidentally, \LUATEX\ allows 16383 separate languages, and words can be 256
114 characters long. The language is stored with each character. You can set
115 \type {\firstvalidlanguage} to for instance~1 and thereby make language~0
116 an ignored hyphenation language.
118 The new primitive \type {\hyphenationmin} can be used to signal the minimal length
119 of a word. This value is stored with the (current) language.
121 Because the \type {\uchyph} value is saved in the actual nodes, its handling is
122 subtly different from \TEX82: changes to \type {\uchyph} become effective
123 immediately, not at the end of the current partial paragraph.
125 Typeset boxes now always have their language information embedded in the nodes
126 themselves, so there is no longer a possible dependency on the surrounding
127 language settings. In \TEX82, a mid-paragraph statement like \type {\unhbox0} would
128 process the box using the current paragraph language unless there was a
129 \type {\setlanguage} issued inside the box. In \LUATEX, all language variables are
130 already frozen.
132 In traditional \TEX\ the process of hyphenation is driven by \type {lccode}s. In
133 \LUATEX\ we made this dependency less strong. There are several strategies
134 possible. When you do nothing, the currently used \type {lccode}s are used, when
135 loading patterns, setting exceptions or hyphenating a list.
137 When you set \type {\savinghyphcodes} to a value larger than zero the current set
138 of \type {lccode}s will be saved with the language. In that case changing a \type
139 {lccode} afterwards has no effect. However, you can adapt the set with:
141 \starttyping
142 \hjcode`a=`a
143 \stoptyping
145 This change is global which makes sense if you keep in mind that the moment that
146 hyphenation happens is (normally) when the paragraph or a horizontal box is
147 constructed. When \type {\savinghyphcodes} was zero when the language got
148 initialized you start out with nothing, otherwise you already have a set.
150 Carrying all this information with each glyph would give too much overhead and
151 also make the process of setting up these codes more complex. A solution with
152 \type {hjcode} sets was considered but rejected because in practice the current
153 approach is sufficient and it would not be compatible anyway.
155 Beware: the values are always saved in the format, independent of the setting
156 of \type {\savinghyphcodes} at the moment the format is dumped.
158 A boundary node normally would mark the end of a word which interferes with for
159 instance discretionary injection. For this you can use \type {\wordboundary}
160 as a trigger. Here are a few examples of usage:
162 \startbuffer
163 discrete---discrete
164 \stopbuffer
165 \typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
166 \startbuffer
167 discrete\discretionary{}{}{---}discrete
168 \stopbuffer
169 \typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
170 \startbuffer
171 discrete\wordboundary\discretionary{}{}{---}discrete
172 \stopbuffer
173 \typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
174 \startbuffer
175 discrete\wordboundary\discretionary{}{}{---}\wordboundary discrete
176 \stopbuffer
177 \typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
178 \startbuffer
179 discrete\wordboundary\discretionary{---}{}{}\wordboundary discrete
180 \stopbuffer
181 \typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
183 \section{The main control loop}
185 In \LUATEX's main loop, almost all input characters that are to be typeset are
186 converted into \type {glyph} node records with subtype \quote {character}, but
187 there are a few exceptions.
189 First, the \type {\accent} primitive creates nodes with subtype \quote {glyph}
190 instead of \quote {character}: one for the actual accent and one for the
191 accentee. The primary reason for this is that \type {\accent} in \TEX82 is
192 explicitly dependent on the current font encoding, so it would not make much
193 sense to attach a new meaning to the primitive's name, as that would invalidate
194 many old documents and macro packages. \footnote {Of course, modern packages will
195 not use the \type {\accent} primitive at all but try to map directly on composed
196 characters.} A secondary reason is that in \TEX82, \type {\accent} prohibits
197 hyphenation of the current word. Since in \LUATEX\ hyphenation only takes place
198 on \quote {character} nodes, it is possible to achieve the same effect.
200 This change of meaning did happen with \type {\char}, which now generates \quote
201 {glyph} nodes with a character subtype. In traditional \TEX\ there was a strong
202 relationship between the 8|-|bit input encoding, hyphenation and glyphs taken
203 from a font. In \LUATEX\ we have \UTF\ input, and in most cases this maps
204 directly to a character in a font, apart from glyph replacement in the font
205 engine. If you want to access arbitrary glyphs in a font directly you can always
206 use \LUA\ to do so, because fonts are available as \LUA\ tables.
208 Second, all the results of processing in math mode eventually become nodes with
209 \quote {glyph} subtypes.
211 Third, the \ALEPH|-|derived commands \type {\leftghost} and \type {\rightghost}
212 create nodes of a third subtype: \quote {ghost}. These nodes are ignored
213 completely by all further processing until the stage where inter|-|glyph kerning
214 is added.
216 Fourth, automatic discretionaries are handled differently. \TEX82 inserts an
217 empty discretionary after sensing an input character that matches the \type
218 {\hyphenchar} in the current font. This test is wrong in our opinion: whether or
219 not hyphenation takes place should not depend on the current font, it is a
220 language property. \footnote {When \TEX\ showed up we didn't have \UNICODE\ yet
221 and being limited to eight bits meant that one sometimes had to compromise
222 between supporting character input, glyph rendering, and hyphenation.}
224 In \LUATEX, it works like this: if \LUATEX\ senses a string of input characters
225 that matches the value of the new integer parameter \type {\exhyphenchar}, it will
226 insert an explicit discretionary after that series of nodes. Initex sets the \type
227 {\exhyphenchar=`\-}. Incidentally, this is a global parameter instead of a
228 language-specific one because it may be useful to change the value depending on
229 the document structure instead of the text language.
231 The insertion of discretionaries after a sequence of explicit hyphens happens at
232 the same time as the other hyphenation processing, {\it not\/} inside the main
233 control loop.
235 The only use \LUATEX\ has for \type {\hyphenchar} is at the check whether a word
236 should be considered for hyphenation at all. If the \type {\hyphenchar} of the
237 font attached to the first character node in a word is negative, then hyphenation
238 of that word is abandoned immediately. This behaviour is added for backward
239 compatibility only, and \type {\hyphenchar=-1} should not be used as a means of
240 preventing hyphenation in new \LUATEX\ documents.
242 Fifth, \type {\setlanguage} no longer creates whatsits. The meaning of \type
243 {\setlanguage} is changed so that it is now an integer parameter like all others.
244 That integer parameter is used in \type {glyph_node} creation to add language
245 information to the glyph nodes. In conjunction, the \type {\language} primitive is
246 extended so that it always also updates the value of \type {\setlanguage}.
248 Sixth, the \type {\noboundary} command (that prohibits word boundary processing
249 where that would normally take place) now does create nodes. These nodes are
250 needed because the exact place of the \type {\noboundary} command in the input
251 stream has to be retained until after the ligature and font processing stages.
253 Finally, there is no longer a \type {main_loop} label in the code. Remember that
254 \TEX82 did quite a lot of processing while adding \type {char_nodes} to the
255 horizontal list? For speed reasons, it handled that processing code outside of
256 the \quote {main control} loop, and only the first character of any \quote {word}
257 was handled by that \quote {main control} loop. In \LUATEX, there is no longer a
258 need for that (all hard work is done later), and the (now very small) bits of
259 character|-|handling code have been moved back inline. When \type
260 {\tracingcommands} is on, this is visible because the full word is reported,
261 instead of just the initial character.
263 \section[patternsexceptions]{Loading patterns and exceptions}
265 The hyphenation algorithm in \LUATEX\ is quite different from the one in \TEX82,
266 although it uses essentially the same user input.
268 After expansion, the argument for \type {\patterns} has to be proper \UTF8 with
269 individual patterns separated by spaces, no \type {\char} or \type {\chardef}d
270 commands are allowed. The current implementation is quite strict and will reject all
271 non|-|\UNICODE\ characters.
273 Likewise, the expanded argument for \type {\hyphenation} also has to be proper
274 \UTF8, but here a bit of extra syntax is provided:
276 \startitemize[n]
277 \startitem
278 Three sets of arguments in curly braces (\type {{}{}{}}) indicate a desired
279 complex discretionary, with arguments as in the \type {\discretionary} command in
280 normal document input.
281 \stopitem
282 \startitem
283 A \type {-} indicates a desired simple discretionary, cf.\ \type {\-} and \type
284 {\discretionary{-}{}{}} in normal document input.
285 \stopitem
286 \startitem
287 Internal command names are ignored. This rule is provided especially for \type
288 {\discretionary}, but it also helps to deal with \type {\relax} commands that
289 may sneak in.
290 \stopitem
291 \startitem
292 An \type {=} indicates a (non|-|discretionary) hyphen in the document input.
293 \stopitem
294 \stopitemize
296 The expanded argument is first converted back to a space-separated string while
297 dropping the internal command names. This string is then converted into a
298 dictionary by a routine that creates key|-|value pairs by converting the other
299 listed items. It is important to note that the keys in an exception dictionary
300 can always be generated from the values. Here are a few examples:
302 \starttabulate[|l|l|l|]
303 \NC \bf value \NC \bf implied key (input) \NC \bf effect \NC\NR
304 \NC \type {ta-ble} \NC table \NC \type {ta\-ble} ($=$ \type {ta\discretionary{-}{}{}ble}) \NC\NR
305 \NC \type {ba{k-}{}{c}ken} \NC backen \NC \type {ba\discretionary{k-}{}{c}ken} \NC\NR
306 \stoptabulate
308 The resultant patterns and exception dictionary will be stored under the language
309 code that is the present value of \type {\language}.
311 In the last line of the table, you see there is no \type {\discretionary} command
312 in the value: the command is optional in the \TEX-based input syntax. The
313 underlying reason for that is that it is conceivable that a whole dictionary of
314 words is stored as a plain text file and loaded into \LUATEX\ using one of the
315 functions in the \LUA\ \type {lang} library. This loading method is quite a bit
316 faster than going through the \TEX\ language primitives, but some (most?) of that
317 speed gain would be lost if it had to interpret command sequences while doing so.
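A minimal sketch of such a loader is shown here. It assumes a plain text file
with exception values separated by whitespace; the file name \type
{exceptions.txt} and the function name \type {load_exceptions} are hypothetical.
Apart from standard \LUA\ file handling, only \type {lang.new} and \type
{lang.hyphenation} from the \type {lang} library are used.

\starttyping
-- Hypothetical helper: feed a whitespace separated word list to a language.
local function load_exceptions(language, filename)
    local f = assert(io.open(filename, "r"))
    local words = f:read("*a")
    f:close()
    -- the exception syntax is space separated, so normalize line breaks
    words = words:gsub("%s+", " ")
    lang.hyphenation(language, words)
end

local l = lang.new()                    -- next available language id
load_exceptions(l, "exceptions.txt")    -- assumed to exist
\stoptyping

Each entry in the file uses the value syntax from the table above, for instance
\type {ta-ble} or \type {ba{k-}{}{c}ken}.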
319 It is possible to specify extra hyphenation points in compound words by using
320 \type {{-}{}{-}} for the explicit hyphen character (replace \type {-} by the
321 actual explicit hyphen character if needed). For example, this matches the word
322 \quote {multi|-|word|-|boundaries} and allows an extra break in between \quote
323 {boun} and \quote {daries}:
325 \starttyping
326 \hyphenation{multi{-}{}{-}word{-}{}{-}boun-daries}
327 \stoptyping
329 The motivation behind the \ETEX\ extension \type {\savinghyphcodes} was that
330 hyphenation heavily depended on font encodings. This is no longer true in
331 \LUATEX, and the corresponding primitive is basically ignored. Because we now
332 have \type {hjcode}, the case|-|related codes can be used exclusively for \type
333 {\uppercase} and \type {\lowercase}.
335 \section{Applying hyphenation}
337 The internal structures \LUATEX\ uses for the insertion of discretionaries in
338 words are very different from the ones in \TEX82, and that means there are some
339 noticeable differences in handling as well.
341 First and foremost, there is no \quote {compressed trie} involved in hyphenation.
342 The algorithm still reads \PATGEN-generated pattern files, but \LUATEX\ uses a
343 finite state hash to match the patterns against the word to be hyphenated. This
344 algorithm is based on the \quote {libhnj} library used by \OPENOFFICE, which in
345 turn is inspired by \TEX.
347 There are a few differences between \LUATEX\ and \TEX82 that are a direct result
348 of the implementation:
350 \startitemize
351 \startitem
352 \LUATEX\ happily hyphenates the full \UNICODE\ character range.
353 \stopitem
354 \startitem
355 Pattern and exception dictionary size is limited by the available memory
356 only, all allocations are done dynamically. The trie|-|related settings in
357 \type {texmf.cnf} are ignored.
358 \stopitem
359 \startitem
360 Because there is no \quote {trie preparation} stage, language patterns never
361 become frozen. This means that the primitive \type {\patterns} (and its \LUA\
362 counterpart \type {lang.patterns}) can be used at any time, not only in
363 ini\TEX.
364 \stopitem
365 \startitem
366 Only the string representation of \type {\patterns} and \type {\hyphenation} is
367 stored in the format file. At format load time, they are simply
368 re|-|evaluated. It follows that there is no real reason to preload languages
369 in the format file. In fact, it is usually not a good idea to do so. It is
370 much smarter to load patterns no sooner than the first time they are actually
371 needed.
372 \stopitem
373 \startitem
374 \LUATEX\ uses the language-specific variables \type {\prehyphenchar} and \type
375 {\posthyphenchar} in the creation of implicit discretionaries, instead of
376 \TEX82's \type {\hyphenchar}, and the values of the language|-|specific variables
377 \type {\preexhyphenchar} and \type {\postexhyphenchar} for explicit
378 discretionaries (instead of \TEX82's empty discretionary).
379 \stopitem
380 \startitem
381 The values of the two counters related to hyphenation, \type {\hyphenpenalty}
382 and \type {\exhyphenpenalty}, are now stored in the discretionary nodes. This
383 permits a local overload for explicit \type {\discretionary} commands. The
384 value current when the hyphenation pass is applied is used. When no callbacks
385 are used this is compatible with traditional \TEX. When you apply the \LUA\
386 \type {lang.hyphenate} function the current values are used.
387 \stopitem
388 \stopitemize
390 Because we store penalties in the disc node the \type {\discretionary} command has
391 been extended to accept an optional penalty specification, so you can do the
392 following:
394 \startbuffer
395 \hsize1mm
396 1:foo{\hyphenpenalty 10000\discretionary{}{}{}}bar\par
397 2:foo\discretionary penalty 10000 {}{}{}bar\par
398 3:foo\discretionary{}{}{}bar\par
399 \stopbuffer
401 \typebuffer
403 This results in:
405 \blank \start \getbuffer \stop \blank
407 Inserted characters and ligatures inherit their attributes from the nearest glyph
408 node item (usually the preceding one, but the following one for the items
409 inserted at the left-hand side of a word).
411 Word boundaries are no longer implied by font switches, but by language switches.
412 One word can have two separate fonts and still be hyphenated correctly (but it
413 can not have two different languages, the \type {\setlanguage} command forces a
414 word boundary).
416 All languages start out with \type {\prehyphenchar=`\-}, \type {\posthyphenchar=0},
417 \type {\preexhyphenchar=0} and \type {\postexhyphenchar=0}. When you assign the
418 values of one of these four parameters, you are actually changing the settings
419 for the current \type {\language}, this behaviour is compatible with \type {\patterns}
420 and \type {\hyphenation}.
422 \LUATEX\ also hyphenates the first word in a paragraph. Words can be up to 256
423 characters long (up from 64 in \TEX82). Longer words generate an error right now,
424 but eventually either the limitation will be removed or perhaps it will become
425 possible to silently ignore the excess characters (this is what happens in
426 \TEX82, but there the behaviour cannot be controlled).
428 If you are using the \LUA\ function \type {lang.hyphenate}, you should be aware
429 that this function expects to receive a list of \quote {character} nodes. It will
430 not operate properly in the presence of \quote {glyph}, \quote {ligature}, or
431 \quote {ghost} nodes, nor does it know how to deal with kerning.
433 The hyphenation exception dictionary is maintained as a key|-|value hash, and that
434 is also dynamic, so the \type {hyph_size} setting is not used either.
436 \section{Applying ligatures and kerning}
438 After all possible hyphenation points have been inserted in the list, \LUATEX\
439 will process the list to convert the \quote {character} nodes into \quote {glyph}
440 and \quote {ligature} nodes. This is actually done in two stages: first all
441 ligatures are processed, then all kerning information is applied to the result
442 list. But those two stages are somewhat dependent on each other: If the used font
443 makes it possible to do so, the ligaturing stage adds virtual \quote {character}
444 nodes to the word boundaries in the list. While doing so, it removes and
445 interprets \type {\noboundary} nodes. The kerning stage deletes those word
446 boundary items after it is done with them, and it does the same for \quote
447 {ghost} nodes. Finally, at the end of the kerning stage, all remaining \quote
448 {character} nodes are converted to \quote {glyph} nodes.
450 This work separation is worth mentioning because, if you overrule from \LUA\ only
451 one of the two callbacks related to font handling, then you have to make sure you
452 perform the tasks normally done by \LUATEX\ itself in order to make sure that the
453 other, non|-|overruled, routine continues to function properly.
455 Work in this area is not yet complete, but most of the possible cases are handled
456 by our rewritten ligaturing engine. At some point all of the possible inputs will
457 become supported. \footnote {Not all of this makes sense because we nowadays have
458 \OPENTYPE\ fonts and ligature building can happen in many different ways there.}
460 For example, take the word \type {office}, hyphenated \type {of-fice}, using a
461 \quote {normal} font with all the \type {f}-\type {f} and \type {f}-\type {i}
462 type ligatures:
464 \starttabulate[|l|l|]
465 \NC Initial: \NC \type {{o}{f}{f}{i}{c}{e}} \NC\NR
466 \NC After hyphenation: \NC \type {{o}{f}{{-},{},{}}{f}{i}{c}{e}} \NC\NR
467 \NC First ligature stage: \NC \type {{o}{{f-},{f},{<ff>}}{i}{c}{e}} \NC\NR
468 \NC Final result: \NC \type {{o}{{f-},{<fi>},{<ffi>}}{c}{e}} \NC\NR
469 \stoptabulate
471 That's bad enough, but let us assume that there is also a hyphenation point
472 between the \type {f} and the \type {i}, to create \type {of-f-ice}. Then the
473 final result should be:
475 \starttyping
476 {o}{{f-},
477 {{f-},
478 {i},
479 {<fi>}},
480 {{<ff>-},
481 {i},
482 {<ffi>}}}{c}{e}
483 \stoptyping
485 with discretionaries in the post-break text as well as in the replacement text of
486 the top-level discretionary that resulted from the first hyphenation point.
488 Here is that nested solution again, in a different representation:
490 \starttabulate[|l|l|l|l|]
491 \NC \NC pre \NC post \NC replace \NC \NR
492 \NC topdisc \NC \type {f-}$^1$ \NC sub1 \NC sub2 \NC \NR
493 \NC sub1 \NC \type {f-}$^2$ \NC \type {i}$^3$ \NC \type {<fi>}$^4$ \NC \NR
494 \NC sub2 \NC \type {<ff>-}$^5$\NC \type {i}$^6$ \NC \type {<ffi>}$^7$ \NC \NR
495 \stoptabulate
497 When line breaking is choosing its breakpoints, the following fields will
498 eventually be selected:
500 \starttabulate[|l|l|l|]
501 \NC \type {of-f-ice} \NC \type {f-}$^1$ \NC \NR
502 \NC \NC \type {f-}$^2$ \NC \NR
503 \NC \NC \type {i}$^3$ \NC \NR
504 \NC \type {of-fice} \NC \type {f-}$^1$ \NC \NR
505 \NC \NC \type {<fi>}$^4$ \NC \NR
506 \NC \type {off-ice} \NC \type {<ff>-}$^5$ \NC \NR
507 \NC \NC \type {i}$^6$ \NC \NR
508 \NC \type {office} \NC \type {<ffi>}$^7$ \NC \NR
509 \stoptabulate
511 The current solution in \LUATEX\ is not able to handle nested discretionaries,
512 but it is in fact smart enough to handle this fictional \type {of-f-ice} example.
513 It does so by combining two sequential discretionary nodes as if they were a
514 single object (where the second discretionary node is treated as an extension of
515 the first node).
517 One can observe that the \type {of-f-ice} and \type {off-ice} cases both end with
518 the same actual post replacement list (\type {i}), and that this would be the
519 case even if that \type {i} was the first item of a potential following ligature
520 like \type {ic}. This allows \LUATEX\ to do away with one of the fields, and thus
521 make the whole thing fit into just two discretionary nodes.
523 The mapping of the seven list fields to the six fields in this discretionary node
524 pair is as follows:
526 \starttabulate[|l|p|]
527 \NC \bf field \NC \bf description \NC \NR
528 \NC \type {disc1.pre} \NC \type {f-}$^1$ \NC \NR
529 \NC \type {disc1.post} \NC \type {<fi>}$^4$ \NC \NR
530 \NC \type {disc1.replace} \NC \type {<ffi>}$^7$ \NC \NR
531 \NC \type {disc2.pre} \NC \type {f-}$^2$ \NC \NR
532 \NC \type {disc2.post} \NC \type {i}$^{3{,}6}$\NC \NR
533 \NC \type {disc2.replace} \NC \type {<ff>-}$^5$\NC \NR
534 \stoptabulate
536 What is actually generated after ligaturing has been applied is therefore:
538 \starttyping
539 {o}{{f-},
540 {<fi>},
541 {<ffi>}}
542 {{f-},
543 {i},
544 {<ff>-}}{c}{e}
545 \stoptyping
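How the fields of such a pair combine for each of the four breaking choices can
be sketched in plain \LUA. This merely restates the two tables above with
strings; it is not how \LUATEX\ implements line breaking, and the names \type
{disc1}, \type {disc2} and \type {typeset} are chosen for this illustration only.

\starttyping
-- The six fields of the discretionary pair, as strings.
local disc1 = { pre = "f-", post = "<fi>", replace = "<ffi>" }
local disc2 = { pre = "f-", post = "i",    replace = "<ff>-" }

-- Return the resulting line fragments for the word "office", depending on
-- whether a break is taken at the first and/or the second hyphenation point.
local function typeset(break_first, break_second)
    if break_first and break_second then       -- of-f-ice
        return { "o" .. disc1.pre, disc2.pre, disc2.post .. "ce" }
    elseif break_first then                    -- of-fice
        return { "o" .. disc1.pre, disc1.post .. "ce" }
    elseif break_second then                   -- off-ice
        return { "o" .. disc2.replace, disc2.post .. "ce" }
    else                                       -- office
        return { "o" .. disc1.replace .. "ce" }
    end
end
\stoptyping

Note that \type {disc2.post} serves both the \type {of-f-ice} and the \type
{off-ice} case, which is the sharing that makes six fields sufficient.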
547 The two discretionaries have different subtypes from a discretionary appearing on
548 its own: the first has subtype 4, and the second has subtype 5. The need for
549 these special subtypes stems from the fact that not all of the fields appear in
550 their \quote {normal} location. The second discretionary especially looks odd,
551 with things like the \type {<ff>-} appearing in \type {disc2.replace}. The fact
552 that some of the fields have different meanings (and different processing code
553 internally) is what makes it necessary to have different subtypes: this enables
554 \LUATEX\ to distinguish this sequence of two joined discretionary nodes from the
555 case of two standalone discretionaries appearing in a row.
557 Of course there is still that relationship with fonts: ligatures can be implemented by
558 mapping a sequence of glyphs onto one glyph, but also by selective replacement and
559 kerning. This means that the above examples are just representing the traditional
560 approach.
562 \section{Breaking paragraphs into lines}
564 This code is still almost unchanged, but because of the above|-|mentioned changes
565 with respect to discretionaries and ligatures, line breaking will potentially be
566 different from traditional \TEX. The actual line breaking code is still based on
567 the \TEX82 algorithms, and it does not expect there to be discretionaries inside
568 of discretionaries.
570 But that situation is now fairly common in \LUATEX, due to the changes to the
571 ligaturing mechanism. And also, the \LUATEX\ discretionary nodes are implemented
572 slightly differently from the \TEX82 nodes: the \type {no_break} text is now
573 embedded inside the disc node, where previously these nodes kept their place in
574 the horizontal list. In traditional \TEX\ the discretionary node contains a
575 counter indicating how many nodes to skip, but in \LUATEX\ we store the pre, post
576 and replace text in the discretionary node.
578 The combined effect of these two differences is that \LUATEX\ does not always use
579 all of the potential breakpoints in a paragraph, especially when fonts with many
580 ligatures are used. Of course kerning also complicates matters here.
582 \section{The \type {lang} library}
584 This library provides the interface to \LUATEX's structure
585 representing a language, and the associated functions.
587 \startfunctioncall
588 <language> l = lang.new()
589 <language> l = lang.new(<number> id)
590 \stopfunctioncall
592 This function creates a new userdata object. An object of type \type {<language>}
593 is the first argument to most of the other functions in the \type {lang}
594 library. These functions can also be used as if they were object methods, using
595 the colon syntax.
597 Without an argument, the next available internal id number will be assigned to
598 this object. With an argument, an object will be created that links to the internal
599 language with that id number.
601 \startfunctioncall
602 <number> n = lang.id(<language> l)
603 \stopfunctioncall
605 Returns the internal \type {\language} id number this object refers to.
607 \startfunctioncall
608 <string> n = lang.hyphenation(<language> l)
609 lang.hyphenation(<language> l, <string> n)
610 \stopfunctioncall
612 Either returns the current hyphenation exceptions for this language, or adds new
613 ones. The syntax of the string is explained in~\in {section}
614 [patternsexceptions].
616 \startfunctioncall
617 lang.clear_hyphenation(<language> l)
618 \stopfunctioncall
620 Clears the exception dictionary (string) for this language.
622 \startfunctioncall
623 <string> n = lang.clean(<language> l, <string> o)
624 <string> n = lang.clean(<string> o)
625 \stopfunctioncall
627 Creates a hyphenation key from the supplied hyphenation value. The syntax of the
628 argument string is explained in~\in {section} [patternsexceptions]. This function
629 is useful if you want to do something else based on the words in a dictionary
630 file, like spell|-|checking.
632 \startfunctioncall
633 <string> n = lang.patterns(<language> l)
634 lang.patterns(<language> l, <string> n)
635 \stopfunctioncall
637 Adds additional patterns for this language object, or returns the current set.
638 The syntax of this string is explained in~\in {section} [patternsexceptions].
640 \startfunctioncall
641 lang.clear_patterns(<language> l)
642 \stopfunctioncall
644 Clears the pattern dictionary for this language.
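As a sketch of how the pattern functions fit together: the two patterns below are
merely sample input in the usual \PATGEN\ notation, and a fresh language object
is used so that no existing language is disturbed.

\starttyping
local l = lang.new()               -- next available language id

lang.patterns(l, ".ach4 .ad4der")  -- add two sample patterns
local current = lang.patterns(l)   -- query the current set as a string
texio.write_nl("patterns now: " .. tostring(current))

lang.clear_patterns(l)             -- start over with an empty set
\stoptyping

Because patterns never become frozen, calls like these can be made at any moment
in a run, not only while building a format.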
646 \startfunctioncall
647 <number> n = lang.prehyphenchar(<language> l)
648 lang.prehyphenchar(<language> l, <number> n)
649 \stopfunctioncall
651 Gets or sets the \quote {pre|-|break} hyphen character for implicit hyphenation
652 in this language (initially the hyphen, decimal 45).
654 \startfunctioncall
655 <number> n = lang.posthyphenchar(<language> l)
656 lang.posthyphenchar(<language> l, <number> n)
657 \stopfunctioncall
659 Gets or sets the \quote {post|-|break} hyphen character for implicit hyphenation
660 in this language (initially null, decimal~0, indicating emptiness).
662 \startfunctioncall
663 <number> n = lang.preexhyphenchar(<language> l)
664 lang.preexhyphenchar(<language> l, <number> n)
665 \stopfunctioncall
667 Gets or sets the \quote {pre|-|break} hyphen character for explicit hyphenation
668 in this language (initially null, decimal~0, indicating emptiness).
670 \startfunctioncall
671 <number> n = lang.postexhyphenchar(<language> l)
672 lang.postexhyphenchar(<language> l, <number> n)
673 \stopfunctioncall
675 Gets or sets the \quote {post|-|break} hyphen character for explicit hyphenation
676 in this language (initially null, decimal~0, indicating emptiness).
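The four accessors follow the same get or set convention, which the following
sketch illustrates. The choice of \UNICODE\ slot \type {0x2010} as the new
pre|-|break character is only an example and assumes that the font in use
provides that character.

\starttyping
local l = lang.new()

-- query the initial settings of this language
local pre    = lang.prehyphenchar(l)
local post   = lang.posthyphenchar(l)
local preex  = lang.preexhyphenchar(l)
local postex = lang.postexhyphenchar(l)
texio.write_nl(string.format("initial: %d %d %d %d", pre, post, preex, postex))

-- use U+2010 (HYPHEN) for implicit hyphenation in this language (assumption:
-- the font has a glyph in that slot)
lang.prehyphenchar(l, 0x2010)
\stoptyping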
678 \startfunctioncall
679 <boolean> success = lang.hyphenate(<node> head)
680 <boolean> success = lang.hyphenate(<node> head, <node> tail)
681 \stopfunctioncall
683 Inserts hyphenation points (discretionary nodes) in a node list. If \type {tail}
684 is given as argument, processing stops on that node. Currently, \type {success}
685 is always true if \type {head} (and \type {tail}, if specified) are proper nodes,
686 regardless of possible other errors.
688 Hyphenation works only on \quote {characters}, a special subtype of all the glyph
689 nodes with the node subtype having the value \type {1}. Glyph nodes with
690 different subtypes are not processed. See \in {section~} [charsandglyphs] for
691 more details.
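The following sketch shows one way to call \type {lang.hyphenate} on a hand
built list. It is an illustration under several assumptions: the word is plain
\ASCII, the values 2 and 3 stand in for \type {\lefthyphenmin} and \type
{\righthyphenmin}, the lowercase codes are set up as in a normal format, and the
helper \type {ascii_word_to_list} is a made|-|up name. An exception is used
instead of patterns to keep the example small.

\starttyping
local glyph_id = node.id("glyph")
local disc_id  = node.id("disc")

-- Hypothetical helper: build a list of "character" glyph nodes for a word.
local function ascii_word_to_list(word, language_id)
    local head, tail
    for i = 1, #word do
        local n = node.new(glyph_id, 1)  -- subtype 1: "character"
        n.char  = word:byte(i)
        n.font  = font.current()
        n.lang  = language_id
        n.left  = 2                      -- assumed \lefthyphenmin
        n.right = 3                      -- assumed \righthyphenmin
        if tail then
            tail.next = n
            n.prev    = tail
        else
            head = n
        end
        tail = n
    end
    return head
end

local l = lang.new()
lang.hyphenation(l, "of-fice")

local head = ascii_word_to_list("office", lang.id(l))
lang.hyphenate(head)

-- count the discretionary nodes that were inserted
local count = 0
for _ in node.traverse_id(disc_id, head) do
    count = count + 1
end
texio.write_nl("discretionaries inserted: " .. count)

node.flush_list(head)
\stoptyping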
693 The following two commands can be used to set or query hj codes:
695 \startfunctioncall
696 lang.sethjcode(<language> l, <number> char, <number> usedchar)
697 <number> usedchar = lang.gethjcode(<language> l, <number> char)
698 \stopfunctioncall
700 When you set a hjcode the current sets get initialized unless the set was already
701 initialized due to \type {\savinghyphcodes} being larger than zero.
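A short sketch of these two functions: here the uppercase letter \type {A} (65)
is mapped onto the lowercase \type {a} (97) for hyphenation purposes. The mapping
itself is only an example.

\starttyping
local l = lang.new()

lang.sethjcode(l, 65, 97)           -- treat "A" as "a" when hyphenating
local used = lang.gethjcode(l, 65)  -- query the code that is now in effect
texio.write_nl("hjcode of 65: " .. tostring(used))
\stoptyping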
703 \stopchapter
705 \stopcomponent