Update copyright year to 2015
[emacs.git] / src / bidi.c
blobef0092f3d93d2460fe28bc0d8277e290e1c1a198
1 /* Low-level bidirectional buffer/string-scanning functions for GNU Emacs.
2 Copyright (C) 2000-2001, 2004-2005, 2009-2015 Free Software
3 Foundation, Inc.
5 This file is part of GNU Emacs.
7 GNU Emacs is free software: you can redistribute it and/or modify
8 it under the terms of the GNU General Public License as published by
9 the Free Software Foundation, either version 3 of the License, or
10 (at your option) any later version.
12 GNU Emacs is distributed in the hope that it will be useful,
13 but WITHOUT ANY WARRANTY; without even the implied warranty of
14 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 GNU General Public License for more details.
17 You should have received a copy of the GNU General Public License
18 along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */
20 /* Written by Eli Zaretskii <eliz@gnu.org>.
22 A sequential implementation of the Unicode Bidirectional algorithm,
23 (UBA) as per UAX#9, a part of the Unicode Standard.
25 Unlike the Reference Implementation and most other implementations,
26 this one is designed to be called once for every character in the
27 buffer or string. That way, we can leave intact the design of the
28 Emacs display engine, whereby an iterator object is used to
29 traverse buffer or string text character by character, and generate
30 the necessary data for displaying each character in 'struct glyph'
31 objects. (See xdisp.c for the details of that iteration.) The
32 functions on this file replace the original linear iteration in the
33 logical order of the text with a non-linear iteration in the visual
34 order, i.e. in the order characters should be shown on display.
36 The main entry point is bidi_move_to_visually_next. Each time it
37 is called, it finds the next character in the visual order, and
38 returns its information in a special structure. The caller is then
39 expected to process this character for display or any other
40 purposes, and call bidi_move_to_visually_next for the next
41 character. See the comments in bidi_move_to_visually_next for more
42 details about its algorithm that finds the next visual-order
43 character by resolving their levels on the fly.
45 Two other entry points are bidi_paragraph_init and
46 bidi_mirror_char. The first determines the base direction of a
47 paragraph, while the second returns the mirrored version of its
48 argument character.
50 A few auxiliary entry points are used to initialize the bidi
51 iterator for iterating an object (buffer or string), push and pop
52 the bidi iterator state, and save and restore the state of the bidi
53 cache.
55 If you want to understand the code, you will have to read it
56 together with the relevant portions of UAX#9. The comments include
57 references to UAX#9 rules, for that very reason.
59 A note about references to UAX#9 rules: if the reference says
60 something like "X9/Retaining", it means that you need to refer to
61 rule X9 and to its modifications described in the "Implementation
62 Notes" section of UAX#9, under "Retaining Format Codes".
64 Here's the overview of the design of the reordering engine
65 implemented by this file.
67 Basic implementation structure
68 ------------------------------
70 The sequential processing steps described by UAX#9 are implemented
71 as recursive levels of processing, all of which examine the next
72 character in the logical order. This hierarchy of processing looks
73 as follows, from the innermost (deepest) to the outermost level,
74 omitting some subroutines used by each level:
76 bidi_fetch_char -- fetch next character
77 bidi_resolve_explicit -- resolve explicit levels and directions
78 bidi_resolve_weak -- resolve weak types
79 bidi_resolve_brackets -- resolve "paired brackets" neutral types
80 bidi_resolve_neutral -- resolve neutral types
81 bidi_level_of_next_char -- resolve implicit levels
83 Each level calls the level below it, and works on the result
84 returned by the lower level, including all of its sub-levels.
86 Unlike all the levels below it, bidi_level_of_next_char can return
87 the information about either the next or previous character in the
88 logical order, depending on the current direction of scanning the
89 buffer or string. For the next character, it calls all the levels
90 below it; for the previous character, it uses the cache, described
91 below.
93 Thus, the result of calling bidi_level_of_next_char is the resolved
94 level of the next or the previous character in the logical order.
95 Based on this information, the function bidi_move_to_visually_next
96 finds the next character in the visual order and updates the
97 direction in which the buffer is scanned, either forward or
98 backward, to find the next character to be displayed. (Text is
99 scanned backwards when it needs to be reversed for display, i.e. if
100 the visual order is the inverse of the logical order.) This
101 implements the last, reordering steps of the UBA, by successively
102 calling bidi_level_of_next_char until the character of the required
103 embedding level is found; the scan direction is dynamically updated
104 as a side effect. See the commentary before the 'while' loop in
105 bidi_move_to_visually_next, for the details.
107 Fetching characters
108 -------------------
110 In a nutshell, fetching the next character boils down to calling
111 STRING_CHAR_AND_LENGTH, passing it the address of a buffer or
112 string position. See bidi_fetch_char. However, if the next
113 character is "covered" by a display property of some kind,
114 bidi_fetch_char returns the u+FFFC "object replacement character"
115 that represents the entire run of text covered by the display
116 property. (The ch_len and nchars members of 'struct bidi_it'
117 reflect the length in bytes and characters of that text.) This is
118 so we reorder text on both sides of the display property as
119 appropriate for an image or embedded string. Similarly, text
120 covered by a display spec of the form '(space ...)', is replaced
121 with the u+2029 paragraph separator character, so such display
122 specs produce the same effect as a TAB under UBA. Both these
123 special characters are not actually displayed -- the display
124 property is displayed instead -- but just used to compute the
125 embedding level of the surrounding text so as to produce the
126 required effect.
128 Bidi iterator states
129 --------------------
131 The UBA is highly context dependent in some of its parts,
132 i.e. results of processing a character can generally depend on
133 characters very far away. The UAX#9 description of the UBA
134 prescribes a stateful processing of each character, whereby the
135 results of this processing depend on various state variables, such
136 as the current embedding level, level stack, and directional
137 override status. In addition, the UAX#9 description includes many
138 passages like this (from rule W2 in this case):
140 Search backward from each instance of a European number until the
141 first strong type (R, L, AL, or sos) is found. If an AL is found,
142 change the type of the European number to Arabic number.
144 To support this, we use a bidi iterator object, 'struct bidi_it',
145 which is a sub-structure of 'struct it' used by xdisp.c (see
146 dispextern.h for the definition of both of these structures). The
147 bidi iterator holds the entire state of the iteration required by
148 the UBA, and is updated as the text is traversed. In particular,
149 the embedding level of the current character being resolved is
150 recorded in the iterator state. To avoid costly searches backward
151 in support of rules like W2 above, the necessary character types
152 are also recorded in the iterator state as they are found during
153 the forward scan, and then used when such rules need to be applied.
154 (Forward scans cannot be avoided in this way; they need to be
155 performed at least once, and the results recorded in the iterator
156 state, to be reused until the forward scan oversteps the recorded
157 position.)
159 In this manner, the iterator state acts as a mini-cache of
160 contextual information required for resolving the level of the
161 current character by various UBA rules.
163 Caching of bidi iterator states
164 -------------------------------
166 As described above, the reordering engine uses the information
167 recorded in the bidi iterator state in order to resolve the
168 embedding level of the current character. When the reordering
169 engine needs to process the next character in the logical order, it
170 fetches it and applies to it all the UBA levels, updating the
171 iterator state as it goes. But when the buffer or string is
172 scanned backwards, i.e. in the reverse order of buffer/string
173 positions, the scanned characters were already processed during the
174 preceding forward scan (see bidi_find_other_level_edge). To avoid
175 costly re-processing of characters that were already processed
176 during the forward scan, the iterator states computed while
177 scanning forward are cached.
179 The cache is just a linear array of 'struct bidi_it' objects, which
180 is dynamically allocated and reallocated as needed, since the size
181 of the cache depends on the text being processed. We only need the
182 cache while processing embedded levels higher than the base
183 paragraph embedding level, because these higher levels require
184 changes in scan direction. Therefore, as soon as we are back to
185 the base embedding level, we can free the cache; see the calls to
186 bidi_cache_reset and bidi_cache_shrink, for the conditions to do
187 this.
189 The cache maintains the index of the next unused cache slot -- this
190 is where the next iterator state will be cached. The function
191 bidi_cache_iterator_state saves an instance of the state in the
192 cache and increments the unused slot index. The companion function
193 bidi_cache_find looks up a cached state that corresponds to a given
194 buffer/string position. All of the cached states must correspond
195 1:1 to the buffer or string region whose processing they reflect;
196 bidi.c will abort if it finds cache slots that violate this 1:1
197 correspondence.
199 When the parent iterator 'struct it' is pushed (see push_it in
200 xdisp.c) to pause the current iteration and start iterating over a
201 different object (e.g., a 'display' string that covers some buffer
202 text), the bidi iterator cache needs to be "pushed" as well, so
203 that a new empty cache could be used while iterating over the new
204 object. Later, when the new object is exhausted, and xdisp.c calls
205 pop_it, we need to "pop" the bidi cache as well and return to the
206 original cache. See bidi_push_it and bidi_pop_it for how this is
207 done.
209 Some functions of the display engine save copies of 'struct it' in
210 local variables, and restore them later. For examples, see
211 pos_visible_p and move_it_in_display_line_to in xdisp.c, and
212 window_scroll_pixel_based in window.c. When this happens, we need
213 to save and restore the bidi cache as well, because conceptually
214 the cache is part of the 'struct it' state, and needs to be in
215 perfect sync with the portion of the buffer/string that is being
216 processed. This saving and restoring of the cache state is handled
217 by bidi_shelve_cache and bidi_unshelve_cache, and the helper macros
218 SAVE_IT and RESTORE_IT defined on xdisp.c.
220 Note that, because reordering is implemented below the level in
221 xdisp.c that breaks glyphs into screen lines, we are violating
222 paragraph 3.4 of UAX#9. which mandates that line breaking shall be
223 done before reordering each screen line separately. However,
224 following UAX#9 to the letter in this matter goes against the basic
225 design of the Emacs display engine, and so we choose here this
226 minor deviation from the UBA letter in preference to redesign of
227 the display engine. The effect of this is only seen in continued
228 lines that are broken into screen lines in the middle of a run
229 whose direction is opposite to the paragraph's base direction.
231 Important design and implementation note: when the code needs to
232 scan far ahead, be sure to avoid such scans as much as possible
233 when the buffer/string doesn't contain any RTL characters. Users
234 of left-to-right scripts will never forgive you if you introduce
235 some slow-down due to bidi in situations that don't involve any
236 bidirectional text. See the large comment near the beginning of
237 bidi_resolve_neutral, for one situation where such shortcut was
238 necessary. */
240 #include <config.h>
241 #include <stdio.h>
243 #include "lisp.h"
244 #include "character.h"
245 #include "buffer.h"
246 #include "dispextern.h"
247 #include "region-cache.h"
249 static bool bidi_initialized = 0;
251 static Lisp_Object bidi_type_table, bidi_mirror_table, bidi_brackets_table;
253 #define BIDI_EOB (-1)
255 /* Data type for describing the bidirectional character categories. */
256 typedef enum {
257 UNKNOWN_BC,
258 NEUTRAL,
259 WEAK,
260 STRONG,
261 EXPLICIT_FORMATTING
262 } bidi_category_t;
264 static Lisp_Object paragraph_start_re, paragraph_separate_re;
265 static Lisp_Object Qparagraph_start, Qparagraph_separate;
268 /***********************************************************************
269 Utilities
270 ***********************************************************************/
272 /* Return the bidi type of a character CH, subject to the current
273 directional OVERRIDE. */
274 static bidi_type_t
275 bidi_get_type (int ch, bidi_dir_t override)
277 bidi_type_t default_type;
279 if (ch == BIDI_EOB)
280 return NEUTRAL_B;
281 if (ch < 0 || ch > MAX_CHAR)
282 emacs_abort ();
284 default_type = (bidi_type_t) XINT (CHAR_TABLE_REF (bidi_type_table, ch));
285 /* Every valid character code, even those that are unassigned by the
286 UCD, have some bidi-class property, according to
287 DerivedBidiClass.txt file. Therefore, if we ever get UNKNOWN_BT
288 (= zero) code from CHAR_TABLE_REF, that's a bug. */
289 if (default_type == UNKNOWN_BT)
290 emacs_abort ();
292 switch (default_type)
294 case WEAK_BN:
295 case NEUTRAL_B:
296 case LRE:
297 case LRO:
298 case RLE:
299 case RLO:
300 case PDF:
301 case LRI:
302 case RLI:
303 case FSI:
304 case PDI:
305 return default_type;
306 default:
307 if (override == L2R)
308 return STRONG_L;
309 else if (override == R2L)
310 return STRONG_R;
311 else
312 return default_type;
316 static void
317 bidi_check_type (bidi_type_t type)
319 eassert (UNKNOWN_BT <= type && type <= NEUTRAL_ON);
322 /* Given a bidi TYPE of a character, return its category. */
323 static bidi_category_t
324 bidi_get_category (bidi_type_t type)
326 switch (type)
328 case UNKNOWN_BT:
329 return UNKNOWN_BC;
330 case STRONG_L:
331 case STRONG_R:
332 case STRONG_AL:
333 return STRONG;
334 case WEAK_EN:
335 case WEAK_ES:
336 case WEAK_ET:
337 case WEAK_AN:
338 case WEAK_CS:
339 case WEAK_NSM:
340 case WEAK_BN:
341 return WEAK;
342 case NEUTRAL_B:
343 case NEUTRAL_S:
344 case NEUTRAL_WS:
345 case NEUTRAL_ON:
346 return NEUTRAL;
347 case LRE:
348 case LRO:
349 case RLE:
350 case RLO:
351 case PDF:
352 case LRI:
353 case RLI:
354 case FSI:
355 case PDI:
356 return EXPLICIT_FORMATTING;
357 default:
358 emacs_abort ();
362 static bool
363 bidi_isolate_fmt_char (bidi_type_t ch_type)
365 return (ch_type == LRI || ch_type == RLI || ch_type == PDI || ch_type == FSI);
368 /* Return the mirrored character of C, if it has one. If C has no
369 mirrored counterpart, return C.
370 Note: The conditions in UAX#9 clause L4 regarding the surrounding
371 context must be tested by the caller. */
373 bidi_mirror_char (int c)
375 Lisp_Object val;
377 if (c == BIDI_EOB)
378 return c;
379 if (c < 0 || c > MAX_CHAR)
380 emacs_abort ();
382 val = CHAR_TABLE_REF (bidi_mirror_table, c);
383 if (INTEGERP (val))
385 int v;
387 /* When debugging, check before assigning to V, so that the check
388 isn't broken by undefined behavior due to int overflow. */
389 eassert (CHAR_VALID_P (XINT (val)));
391 v = XINT (val);
393 /* Minimal test we must do in optimized builds, to prevent weird
394 crashes further down the road. */
395 if (v < 0 || v > MAX_CHAR)
396 emacs_abort ();
398 return v;
401 return c;
404 /* Return the Bidi_Paired_Bracket_Type property of the character C. */
405 static bidi_bracket_type_t
406 bidi_paired_bracket_type (int c)
408 if (c == BIDI_EOB)
409 return BIDI_BRACKET_NONE;
410 if (c < 0 || c > MAX_CHAR)
411 emacs_abort ();
413 return (bidi_bracket_type_t) XINT (CHAR_TABLE_REF (bidi_brackets_table, c));
416 /* Determine the start-of-sequence (sos) directional type given the two
417 embedding levels on either side of the run boundary. Also, update
418 the saved info about previously seen characters, since that info is
419 generally valid for a single level run. */
420 static void
421 bidi_set_sos_type (struct bidi_it *bidi_it, int level_before, int level_after)
423 int higher_level = (level_before > level_after ? level_before : level_after);
425 /* FIXME: should the default sos direction be user selectable? */
426 bidi_it->sos = ((higher_level & 1) != 0 ? R2L : L2R); /* X10 */
428 bidi_it->prev.type = UNKNOWN_BT;
429 bidi_it->last_strong.type = bidi_it->last_strong.orig_type = UNKNOWN_BT;
430 bidi_it->prev_for_neutral.type = (bidi_it->sos == R2L ? STRONG_R : STRONG_L);
431 bidi_it->prev_for_neutral.charpos = bidi_it->charpos;
432 bidi_it->next_for_neutral.type
433 = bidi_it->next_for_neutral.orig_type = UNKNOWN_BT;
436 #define ISOLATE_STATUS(BIDI_IT, IDX) ((BIDI_IT)->level_stack[IDX].flags & 1)
437 #define OVERRIDE(BIDI_IT, IDX) (((BIDI_IT)->level_stack[IDX].flags >> 1) & 3)
439 /* Push the current embedding level and override status; reset the
440 current level to LEVEL and the current override status to OVERRIDE. */
441 static void
442 bidi_push_embedding_level (struct bidi_it *bidi_it,
443 int level, bidi_dir_t override, bool isolate_status)
445 struct bidi_stack *st;
446 int prev_level = bidi_it->level_stack[bidi_it->stack_idx].level;
448 bidi_it->stack_idx++;
449 eassert (bidi_it->stack_idx < BIDI_MAXDEPTH+2+1);
450 st = &bidi_it->level_stack[bidi_it->stack_idx];
451 eassert (level <= (1 << 7));
452 st->level = level;
453 st->flags = (((override & 3) << 1) | (isolate_status != 0));
454 if (isolate_status)
456 st->last_strong_type = bidi_it->last_strong.type;
457 st->prev_for_neutral_type = bidi_it->prev_for_neutral.type;
458 st->next_for_neutral_type = bidi_it->next_for_neutral.type;
459 st->next_for_neutral_pos = bidi_it->next_for_neutral.charpos;
460 st->flags |= ((bidi_it->sos == L2R ? 0 : 1) << 3);
462 /* We've got a new isolating sequence, compute the directional type
463 of sos and initialize per-sequence variables (UAX#9, clause X10). */
464 bidi_set_sos_type (bidi_it, prev_level, level);
467 /* Pop from the stack the embedding level, the directional override
468 status, and optionally saved information for the isolating run
469 sequence. Return the new level. */
470 static int
471 bidi_pop_embedding_level (struct bidi_it *bidi_it)
473 int level;
475 /* UAX#9 says to ignore invalid PDFs (X7, last bullet)
476 and PDIs (X6a, 2nd bullet). */
477 if (bidi_it->stack_idx > 0)
479 bool isolate_status = ISOLATE_STATUS (bidi_it, bidi_it->stack_idx);
480 int old_level = bidi_it->level_stack[bidi_it->stack_idx].level;
482 struct bidi_stack st;
484 st = bidi_it->level_stack[bidi_it->stack_idx];
485 if (isolate_status)
487 bidi_dir_t sos = ((st.flags >> 3) & 1);
488 /* PREV is used in W1 for resolving WEAK_NSM. By the time
489 we get to an NSM, we must have gotten past at least one
490 character: the PDI that ends the isolate from which we
491 are popping here. So PREV will have been filled up by
492 the time we first use it. We initialize it here to
493 UNKNOWN_BT to be able to catch any blunders in this
494 logic. */
495 bidi_it->prev.orig_type = bidi_it->prev.type = UNKNOWN_BT;
496 bidi_it->last_strong.type = st.last_strong_type;
497 bidi_it->prev_for_neutral.type = st.prev_for_neutral_type;
498 bidi_it->next_for_neutral.type = st.next_for_neutral_type;
499 bidi_it->next_for_neutral.charpos = st.next_for_neutral_pos;
500 bidi_it->sos = (sos == 0 ? L2R : R2L);
502 else
503 bidi_set_sos_type (bidi_it, old_level,
504 bidi_it->level_stack[bidi_it->stack_idx - 1].level);
506 bidi_it->stack_idx--;
508 level = bidi_it->level_stack[bidi_it->stack_idx].level;
509 eassert (0 <= level && level <= BIDI_MAXDEPTH + 1);
510 return level;
513 /* Record in SAVED_INFO the information about the current character. */
514 static void
515 bidi_remember_char (struct bidi_saved_info *saved_info,
516 struct bidi_it *bidi_it, bool from_type)
518 saved_info->charpos = bidi_it->charpos;
519 if (from_type)
520 saved_info->type = bidi_it->type;
521 else
522 saved_info->type = bidi_it->type_after_wn;
523 bidi_check_type (saved_info->type);
524 saved_info->orig_type = bidi_it->orig_type;
525 bidi_check_type (saved_info->orig_type);
528 /* Copy the bidi iterator from FROM to TO. To save cycles, this only
529 copies the part of the level stack that is actually in use. */
530 static void
531 bidi_copy_it (struct bidi_it *to, struct bidi_it *from)
533 /* Copy everything from the start through the active part of
534 the level stack. */
535 memcpy (to, from,
536 (offsetof (struct bidi_it, level_stack[1])
537 + from->stack_idx * sizeof from->level_stack[0]));
541 /***********************************************************************
542 Caching the bidi iterator states
543 ***********************************************************************/
545 /* We allocate and de-allocate the cache in chunks of this size (in
546 characters). 200 was chosen as an upper limit for reasonably-long
547 lines in a text file/buffer. */
548 #define BIDI_CACHE_CHUNK 200
549 /* Maximum size we allow the cache to become, per iterator stack slot,
550 in units of struct bidi_it size. If we allow unlimited growth, we
551 could run out of memory for pathologically long bracketed text or
552 very long text lines that need to be reordered. This is aggravated
553 when word-wrap is in effect, since then functions display_line and
554 move_it_in_display_line_to need to keep up to 4 copies of the
555 cache.
557 This limitation means there can be no more than that amount of
558 contiguous RTL text on any single physical line in a LTR paragraph,
559 and similarly with contiguous LTR + numeric text in a RTL
560 paragraph. (LTR text in a LTR paragraph and RTL text in a RTL
561 paragraph are not reordered, and so don't need the cache, and
562 cannot hit this limit.) More importantly, no single line can have
563 text longer than this inside paired brackets (because bracket pairs
564 resolution uses the cache). If the limit is exceeded, the fallback
565 code will produce visual order that will be incorrect if there are
566 RTL characters in the offending line of text. */
567 /* Do we need to allow customization of this limit? */
568 #define BIDI_CACHE_MAX_ELTS_PER_SLOT 50000
569 #if BIDI_CACHE_CHUNK >= BIDI_CACHE_MAX_ELTS_PER_SLOT
570 # error BIDI_CACHE_CHUNK must be less than BIDI_CACHE_MAX_ELTS_PER_SLOT
571 #endif
572 static ptrdiff_t bidi_cache_max_elts = BIDI_CACHE_MAX_ELTS_PER_SLOT;
573 static struct bidi_it *bidi_cache;
574 static ptrdiff_t bidi_cache_size = 0;
575 enum { elsz = sizeof (struct bidi_it) };
576 static ptrdiff_t bidi_cache_idx; /* next unused cache slot */
577 static ptrdiff_t bidi_cache_last_idx; /* slot of last cache hit */
578 static ptrdiff_t bidi_cache_start = 0; /* start of cache for this
579 "stack" level */
581 /* 5-slot stack for saving the start of the previous level of the
582 cache. xdisp.c maintains a 5-slot stack for its iterator state,
583 and we need the same size of our stack. */
584 static ptrdiff_t bidi_cache_start_stack[IT_STACK_SIZE];
585 static int bidi_cache_sp;
587 /* Size of header used by bidi_shelve_cache. */
588 enum
590 bidi_shelve_header_size
591 = (sizeof (bidi_cache_idx) + sizeof (bidi_cache_start_stack)
592 + sizeof (bidi_cache_sp) + sizeof (bidi_cache_start)
593 + sizeof (bidi_cache_last_idx) + sizeof (bidi_cache_max_elts))
596 /* Effectively remove the cached states beyond the Nth state from the
597 part of the cache relevant to iteration of the current object
598 (buffer or string). */
599 static void
600 bidi_cache_reset_to (int n)
602 bidi_cache_idx = bidi_cache_start + n;
603 bidi_cache_last_idx = -1;
606 /* Reset the cache state to the empty state. We only reset the part
607 of the cache relevant to iteration of the current object. Previous
608 objects, which are pushed on the display iterator's stack, are left
609 intact. This is called when the cached information is no more
610 useful for the current iteration, e.g. when we were reseated to a
611 new position on the same object. */
612 static void
613 bidi_cache_reset (void)
615 bidi_cache_reset_to (0);
618 /* Shrink the cache to its minimal size. Called when we init the bidi
619 iterator for reordering a buffer or a string that does not come
620 from display properties, because that means all the previously
621 cached info is of no further use. */
622 static void
623 bidi_cache_shrink (void)
625 if (bidi_cache_size > BIDI_CACHE_CHUNK)
627 bidi_cache = xrealloc (bidi_cache, BIDI_CACHE_CHUNK * elsz);
628 bidi_cache_size = BIDI_CACHE_CHUNK;
630 bidi_cache_reset ();
631 bidi_cache_max_elts = BIDI_CACHE_MAX_ELTS_PER_SLOT;
634 static void
635 bidi_cache_fetch_state (ptrdiff_t idx, struct bidi_it *bidi_it)
637 int current_scan_dir = bidi_it->scan_dir;
639 if (idx < bidi_cache_start || idx >= bidi_cache_idx)
640 emacs_abort ();
642 bidi_copy_it (bidi_it, &bidi_cache[idx]);
643 bidi_it->scan_dir = current_scan_dir;
644 bidi_cache_last_idx = idx;
647 /* Find a cached state with a given CHARPOS and resolved embedding
648 level less or equal to LEVEL. If LEVEL is -1, disregard the
649 resolved levels in cached states. DIR, if non-zero, means search
650 in that direction from the last cache hit.
652 Value is the index of the cached state, or -1 if not found. */
653 static ptrdiff_t
654 bidi_cache_search (ptrdiff_t charpos, int level, int dir)
656 ptrdiff_t i, i_start;
658 if (bidi_cache_idx > bidi_cache_start)
660 if (bidi_cache_last_idx == -1)
661 bidi_cache_last_idx = bidi_cache_idx - 1;
662 if (charpos < bidi_cache[bidi_cache_last_idx].charpos)
664 dir = -1;
665 i_start = bidi_cache_last_idx - 1;
667 else if (charpos > (bidi_cache[bidi_cache_last_idx].charpos
668 + bidi_cache[bidi_cache_last_idx].nchars - 1))
670 dir = 1;
671 i_start = bidi_cache_last_idx + 1;
673 else if (dir)
674 i_start = bidi_cache_last_idx;
675 else
677 dir = -1;
678 i_start = bidi_cache_idx - 1;
681 if (dir < 0)
683 /* Linear search for now; FIXME! */
684 for (i = i_start; i >= bidi_cache_start; i--)
685 if (bidi_cache[i].charpos <= charpos
686 && charpos < bidi_cache[i].charpos + bidi_cache[i].nchars
687 && (level == -1 || bidi_cache[i].resolved_level <= level))
688 return i;
690 else
692 for (i = i_start; i < bidi_cache_idx; i++)
693 if (bidi_cache[i].charpos <= charpos
694 && charpos < bidi_cache[i].charpos + bidi_cache[i].nchars
695 && (level == -1 || bidi_cache[i].resolved_level <= level))
696 return i;
700 return -1;
703 /* Find a cached state where the resolved level changes to a value
704 that is lower than LEVEL, and return its cache slot index. DIR is
705 the direction to search, starting with the last used cache slot.
706 If DIR is zero, we search backwards from the last occupied cache
707 slot. BEFORE means return the index of the slot that
708 is ``before'' the level change in the search direction. That is,
709 given the cached levels like this:
711 1122333442211
712 AB C
714 and assuming we are at the position cached at the slot marked with
715 C, searching backwards (DIR = -1) for LEVEL = 2 will return the
716 index of slot B or A, depending whether BEFORE is, respectively,
717 true or false. */
718 static ptrdiff_t
719 bidi_cache_find_level_change (int level, int dir, bool before)
721 if (bidi_cache_idx)
723 ptrdiff_t i = dir ? bidi_cache_last_idx : bidi_cache_idx - 1;
724 int incr = before ? 1 : 0;
726 if (i < 0) /* cache overflowed? */
727 i = 0;
729 if (!dir)
730 dir = -1;
731 else if (!incr)
732 i += dir;
734 if (dir < 0)
736 while (i >= bidi_cache_start + incr)
738 if (bidi_cache[i - incr].resolved_level >= 0
739 && bidi_cache[i - incr].resolved_level < level)
740 return i;
741 i--;
744 else
746 while (i < bidi_cache_idx - incr)
748 if (bidi_cache[i + incr].resolved_level >= 0
749 && bidi_cache[i + incr].resolved_level < level)
750 return i;
751 i++;
756 return -1;
759 static void
760 bidi_cache_ensure_space (ptrdiff_t idx)
762 /* Enlarge the cache as needed. */
763 if (idx >= bidi_cache_size)
765 ptrdiff_t chunk_size = BIDI_CACHE_CHUNK;
767 if (bidi_cache_size > bidi_cache_max_elts - chunk_size)
768 chunk_size = bidi_cache_max_elts - bidi_cache_size;
770 if (max (idx + 1,
771 bidi_cache_size + chunk_size) <= bidi_cache_max_elts)
773 /* The bidi cache cannot be larger than the largest Lisp
774 string or buffer. */
775 ptrdiff_t string_or_buffer_bound
776 = max (BUF_BYTES_MAX, STRING_BYTES_BOUND);
778 /* Also, it cannot be larger than what C can represent. */
779 ptrdiff_t c_bound
780 = (min (PTRDIFF_MAX, SIZE_MAX) - bidi_shelve_header_size) / elsz;
781 ptrdiff_t max_elts = bidi_cache_max_elts;
783 max_elts = min (max_elts, min (string_or_buffer_bound, c_bound));
785 /* Force xpalloc not to over-allocate by passing it MAX_ELTS
786 as its 4th argument. */
787 bidi_cache = xpalloc (bidi_cache, &bidi_cache_size,
788 max (chunk_size, idx - bidi_cache_size + 1),
789 max_elts, elsz);
790 eassert (bidi_cache_size > idx);
795 static int
796 bidi_cache_iterator_state (struct bidi_it *bidi_it, bool resolved,
797 bool update_only)
799 ptrdiff_t idx;
801 /* We should never cache on backward scans. */
802 if (bidi_it->scan_dir == -1)
803 emacs_abort ();
804 idx = bidi_cache_search (bidi_it->charpos, -1, 1);
806 if (idx < 0 && update_only)
807 return 0;
809 if (idx < 0)
811 idx = bidi_cache_idx;
812 bidi_cache_ensure_space (idx);
813 /* Character positions should correspond to cache positions 1:1.
814 If we are outside the range of cached positions, the cache is
815 useless and must be reset. */
816 if (bidi_cache_start < idx && idx < bidi_cache_size
817 && (bidi_it->charpos > (bidi_cache[idx - 1].charpos
818 + bidi_cache[idx - 1].nchars)
819 || bidi_it->charpos < bidi_cache[bidi_cache_start].charpos))
821 bidi_cache_reset ();
822 idx = bidi_cache_start;
824 if (bidi_it->nchars <= 0)
825 emacs_abort ();
826 /* Don't cache if no available space in the cache. */
827 if (bidi_cache_size > idx)
829 bidi_copy_it (&bidi_cache[idx], bidi_it);
830 if (!resolved)
831 bidi_cache[idx].resolved_level = -1;
834 else
836 /* Copy only the members which could have changed, to avoid
837 costly copying of the entire struct. */
838 bidi_cache[idx].type = bidi_it->type;
839 bidi_check_type (bidi_it->type);
840 bidi_cache[idx].type_after_wn = bidi_it->type_after_wn;
841 bidi_check_type (bidi_it->type_after_wn);
842 if (resolved)
843 bidi_cache[idx].resolved_level = bidi_it->resolved_level;
844 else
845 bidi_cache[idx].resolved_level = -1;
846 bidi_cache[idx].invalid_levels = bidi_it->invalid_levels;
847 bidi_cache[idx].next_for_neutral = bidi_it->next_for_neutral;
848 bidi_cache[idx].next_for_ws = bidi_it->next_for_ws;
849 bidi_cache[idx].disp_pos = bidi_it->disp_pos;
850 bidi_cache[idx].disp_prop = bidi_it->disp_prop;
851 bidi_cache[idx].bracket_pairing_pos = bidi_it->bracket_pairing_pos;
852 bidi_cache[idx].bracket_enclosed_type = bidi_it->bracket_enclosed_type;
855 if (bidi_cache_size > idx)
857 bidi_cache_last_idx = idx;
858 if (idx >= bidi_cache_idx)
859 bidi_cache_idx = idx + 1;
860 return 1;
862 else
864 /* The cache overflowed. */
865 bidi_cache_last_idx = -1;
866 return 0;
870 /* Look for a cached iterator state that corresponds to CHARPOS. If
871 found, copy the cached state into BIDI_IT and return the type of
872 the cached entry. If not found, return UNKNOWN_BT. RESOLVED_ONLY
873 zero means it is OK to return cached states that were not fully
874 resolved yet. This can happen if the state was cached before it
875 was resolved in bidi_resolve_neutral. */
876 static bidi_type_t
877 bidi_cache_find (ptrdiff_t charpos, bool resolved_only, struct bidi_it *bidi_it)
879 ptrdiff_t i = bidi_cache_search (charpos, -1, bidi_it->scan_dir);
881 if (i >= bidi_cache_start
882 && (!resolved_only
883 /* Callers that want only fully resolved states (and set
884 resolved_only = true) need to be sure that there's enough
885 info in the cached state to return the state as final,
886 and if not, they don't want the cached state. */
887 || bidi_cache[i].resolved_level >= 0))
889 bidi_dir_t current_scan_dir = bidi_it->scan_dir;
891 bidi_copy_it (bidi_it, &bidi_cache[i]);
892 bidi_cache_last_idx = i;
893 /* Don't let scan direction from the cached state override
894 the current scan direction. */
895 bidi_it->scan_dir = current_scan_dir;
896 return bidi_it->type;
899 return UNKNOWN_BT;
902 static int
903 bidi_peek_at_next_level (struct bidi_it *bidi_it)
905 if (bidi_cache_idx == bidi_cache_start)
906 emacs_abort ();
907 /* If the cache overflowed, return the level of the last cached
908 character. */
909 if (bidi_cache_last_idx == -1
910 || (bidi_cache_last_idx >= bidi_cache_idx - 1 && bidi_it->scan_dir > 0))
911 return bidi_cache[bidi_cache_idx - 1].resolved_level;
912 return bidi_cache[bidi_cache_last_idx + bidi_it->scan_dir].resolved_level;
916 /***********************************************************************
917 Pushing and popping the bidi iterator state
918 ***********************************************************************/
920 /* Push the bidi iterator state in preparation for reordering a
921 different object, e.g. display string found at certain buffer
922 position. Pushing the bidi iterator boils down to saving its
923 entire state on the cache and starting a new cache "stacked" on top
924 of the current cache. */
925 void
926 bidi_push_it (struct bidi_it *bidi_it)
928 /* Give this stack slot its cache room. */
929 bidi_cache_max_elts += BIDI_CACHE_MAX_ELTS_PER_SLOT;
930 /* Save the current iterator state in its entirety after the last
931 used cache slot. */
932 bidi_cache_ensure_space (bidi_cache_idx);
933 bidi_cache[bidi_cache_idx++] = *bidi_it;
935 /* Push the current cache start onto the stack. */
936 eassert (bidi_cache_sp < IT_STACK_SIZE);
937 bidi_cache_start_stack[bidi_cache_sp++] = bidi_cache_start;
939 /* Start a new level of cache, and make it empty. */
940 bidi_cache_start = bidi_cache_idx;
941 bidi_cache_last_idx = -1;
944 /* Restore the iterator state saved by bidi_push_it and return the
945 cache to the corresponding state. */
946 void
947 bidi_pop_it (struct bidi_it *bidi_it)
949 if (bidi_cache_start <= 0)
950 emacs_abort ();
952 /* Reset the next free cache slot index to what it was before the
953 call to bidi_push_it. */
954 bidi_cache_idx = bidi_cache_start - 1;
956 /* Restore the bidi iterator state saved in the cache. */
957 *bidi_it = bidi_cache[bidi_cache_idx];
959 /* Pop the previous cache start from the stack. */
960 if (bidi_cache_sp <= 0)
961 emacs_abort ();
962 bidi_cache_start = bidi_cache_start_stack[--bidi_cache_sp];
964 /* Invalidate the last-used cache slot data. */
965 bidi_cache_last_idx = -1;
967 bidi_cache_max_elts -= BIDI_CACHE_MAX_ELTS_PER_SLOT;
968 eassert (bidi_cache_max_elts > 0);
971 static ptrdiff_t bidi_cache_total_alloc;
973 /* Stash away a copy of the cache and its control variables. */
974 void *
975 bidi_shelve_cache (void)
977 unsigned char *databuf;
978 ptrdiff_t alloc;
980 /* Empty cache. */
981 if (bidi_cache_idx == 0)
982 return NULL;
984 alloc = (bidi_shelve_header_size
985 + bidi_cache_idx * sizeof (struct bidi_it));
986 databuf = xmalloc (alloc);
987 bidi_cache_total_alloc += alloc;
989 memcpy (databuf, &bidi_cache_idx, sizeof (bidi_cache_idx));
990 memcpy (databuf + sizeof (bidi_cache_idx),
991 bidi_cache, bidi_cache_idx * sizeof (struct bidi_it));
992 memcpy (databuf + sizeof (bidi_cache_idx)
993 + bidi_cache_idx * sizeof (struct bidi_it),
994 bidi_cache_start_stack, sizeof (bidi_cache_start_stack));
995 memcpy (databuf + sizeof (bidi_cache_idx)
996 + bidi_cache_idx * sizeof (struct bidi_it)
997 + sizeof (bidi_cache_start_stack),
998 &bidi_cache_sp, sizeof (bidi_cache_sp));
999 memcpy (databuf + sizeof (bidi_cache_idx)
1000 + bidi_cache_idx * sizeof (struct bidi_it)
1001 + sizeof (bidi_cache_start_stack) + sizeof (bidi_cache_sp),
1002 &bidi_cache_start, sizeof (bidi_cache_start));
1003 memcpy (databuf + sizeof (bidi_cache_idx)
1004 + bidi_cache_idx * sizeof (struct bidi_it)
1005 + sizeof (bidi_cache_start_stack) + sizeof (bidi_cache_sp)
1006 + sizeof (bidi_cache_start),
1007 &bidi_cache_last_idx, sizeof (bidi_cache_last_idx));
1008 memcpy (databuf + sizeof (bidi_cache_idx)
1009 + bidi_cache_idx * sizeof (struct bidi_it)
1010 + sizeof (bidi_cache_start_stack) + sizeof (bidi_cache_sp)
1011 + sizeof (bidi_cache_start) + sizeof (bidi_cache_last_idx),
1012 &bidi_cache_max_elts, sizeof (bidi_cache_max_elts));
1014 return databuf;
1017 /* Restore the cache state from a copy stashed away by
1018 bidi_shelve_cache, and free the buffer used to stash that copy.
1019 JUST_FREE means free the buffer, but don't restore the
1020 cache; used when the corresponding iterator is discarded instead of
1021 being restored. */
1022 void
1023 bidi_unshelve_cache (void *databuf, bool just_free)
1025 unsigned char *p = databuf;
1027 if (!p)
1029 if (!just_free)
1031 /* A NULL pointer means an empty cache. */
1032 bidi_cache_start = 0;
1033 bidi_cache_sp = 0;
1034 bidi_cache_max_elts = BIDI_CACHE_MAX_ELTS_PER_SLOT;
1035 bidi_cache_reset ();
1038 else
1040 if (just_free)
1042 ptrdiff_t idx;
1044 memcpy (&idx, p, sizeof (bidi_cache_idx));
1045 bidi_cache_total_alloc
1046 -= bidi_shelve_header_size + idx * sizeof (struct bidi_it);
1048 else
1050 memcpy (&bidi_cache_idx, p, sizeof (bidi_cache_idx));
1051 bidi_cache_ensure_space (bidi_cache_idx);
1052 memcpy (bidi_cache, p + sizeof (bidi_cache_idx),
1053 bidi_cache_idx * sizeof (struct bidi_it));
1054 memcpy (bidi_cache_start_stack,
1055 p + sizeof (bidi_cache_idx)
1056 + bidi_cache_idx * sizeof (struct bidi_it),
1057 sizeof (bidi_cache_start_stack));
1058 memcpy (&bidi_cache_sp,
1059 p + sizeof (bidi_cache_idx)
1060 + bidi_cache_idx * sizeof (struct bidi_it)
1061 + sizeof (bidi_cache_start_stack),
1062 sizeof (bidi_cache_sp));
1063 memcpy (&bidi_cache_start,
1064 p + sizeof (bidi_cache_idx)
1065 + bidi_cache_idx * sizeof (struct bidi_it)
1066 + sizeof (bidi_cache_start_stack) + sizeof (bidi_cache_sp),
1067 sizeof (bidi_cache_start));
1068 memcpy (&bidi_cache_last_idx,
1069 p + sizeof (bidi_cache_idx)
1070 + bidi_cache_idx * sizeof (struct bidi_it)
1071 + sizeof (bidi_cache_start_stack) + sizeof (bidi_cache_sp)
1072 + sizeof (bidi_cache_start),
1073 sizeof (bidi_cache_last_idx));
1074 memcpy (&bidi_cache_max_elts,
1075 p + sizeof (bidi_cache_idx)
1076 + bidi_cache_idx * sizeof (struct bidi_it)
1077 + sizeof (bidi_cache_start_stack) + sizeof (bidi_cache_sp)
1078 + sizeof (bidi_cache_start) + sizeof (bidi_cache_last_idx),
1079 sizeof (bidi_cache_max_elts));
1080 bidi_cache_total_alloc
1081 -= (bidi_shelve_header_size
1082 + bidi_cache_idx * sizeof (struct bidi_it));
1085 xfree (p);
1090 /***********************************************************************
1091 Initialization
1092 ***********************************************************************/
1093 static void
1094 bidi_initialize (void)
1096 bidi_type_table = uniprop_table (intern ("bidi-class"));
1097 if (NILP (bidi_type_table))
1098 emacs_abort ();
1099 staticpro (&bidi_type_table);
1101 bidi_mirror_table = uniprop_table (intern ("mirroring"));
1102 if (NILP (bidi_mirror_table))
1103 emacs_abort ();
1104 staticpro (&bidi_mirror_table);
1106 bidi_brackets_table = uniprop_table (intern ("bracket-type"));
1107 if (NILP (bidi_brackets_table))
1108 emacs_abort ();
1109 staticpro (&bidi_brackets_table);
1111 DEFSYM (Qparagraph_start, "paragraph-start");
1112 paragraph_start_re = Fsymbol_value (Qparagraph_start);
1113 if (!STRINGP (paragraph_start_re))
1114 paragraph_start_re = build_string ("\f\\|[ \t]*$");
1115 staticpro (&paragraph_start_re);
1116 DEFSYM (Qparagraph_separate, "paragraph-separate");
1117 paragraph_separate_re = Fsymbol_value (Qparagraph_separate);
1118 if (!STRINGP (paragraph_separate_re))
1119 paragraph_separate_re = build_string ("[ \t\f]*$");
1120 staticpro (&paragraph_separate_re);
1122 bidi_cache_sp = 0;
1123 bidi_cache_total_alloc = 0;
1124 bidi_cache_max_elts = BIDI_CACHE_MAX_ELTS_PER_SLOT;
1126 bidi_initialized = 1;
1129 /* Do whatever UAX#9 clause X8 says should be done at paragraph's
1130 end. */
1131 static void
1132 bidi_set_paragraph_end (struct bidi_it *bidi_it)
1134 bidi_it->invalid_levels = 0;
1135 bidi_it->invalid_isolates = 0;
1136 bidi_it->stack_idx = 0;
1137 bidi_it->resolved_level = bidi_it->level_stack[0].level;
1140 /* Initialize the bidi iterator from buffer/string position CHARPOS. */
1141 void
1142 bidi_init_it (ptrdiff_t charpos, ptrdiff_t bytepos, bool frame_window_p,
1143 struct bidi_it *bidi_it)
1145 if (! bidi_initialized)
1146 bidi_initialize ();
1147 if (charpos >= 0)
1148 bidi_it->charpos = charpos;
1149 if (bytepos >= 0)
1150 bidi_it->bytepos = bytepos;
1151 bidi_it->frame_window_p = frame_window_p;
1152 bidi_it->nchars = -1; /* to be computed in bidi_resolve_explicit */
1153 bidi_it->first_elt = 1;
1154 bidi_set_paragraph_end (bidi_it);
1155 bidi_it->new_paragraph = 1;
1156 bidi_it->separator_limit = -1;
1157 bidi_it->type = NEUTRAL_B;
1158 bidi_it->type_after_wn = NEUTRAL_B;
1159 bidi_it->orig_type = NEUTRAL_B;
1160 /* FIXME: Review this!!! */
1161 bidi_it->prev.type = bidi_it->prev.orig_type = UNKNOWN_BT;
1162 bidi_it->last_strong.type = bidi_it->last_strong.orig_type = UNKNOWN_BT;
1163 bidi_it->next_for_neutral.charpos = -1;
1164 bidi_it->next_for_neutral.type
1165 = bidi_it->next_for_neutral.orig_type = UNKNOWN_BT;
1166 bidi_it->prev_for_neutral.charpos = -1;
1167 bidi_it->prev_for_neutral.type
1168 = bidi_it->prev_for_neutral.orig_type = UNKNOWN_BT;
1169 bidi_it->bracket_pairing_pos = -1;
1170 bidi_it->sos = L2R; /* FIXME: should it be user-selectable? */
1171 bidi_it->disp_pos = -1; /* invalid/unknown */
1172 bidi_it->disp_prop = 0;
1173 /* We can only shrink the cache if we are at the bottom level of its
1174 "stack". */
1175 if (bidi_cache_start == 0)
1176 bidi_cache_shrink ();
1177 else
1178 bidi_cache_reset ();
1181 /* Perform initializations for reordering a new line of bidi text. */
1182 static void
1183 bidi_line_init (struct bidi_it *bidi_it)
1185 bidi_it->scan_dir = 1; /* FIXME: do we need to have control on this? */
1186 bidi_it->stack_idx = 0;
1187 bidi_it->resolved_level = bidi_it->level_stack[0].level;
1188 bidi_it->level_stack[0].flags = 0; /* NEUTRAL_DIR, false per X1 */
1189 bidi_it->invalid_levels = 0;
1190 bidi_it->isolate_level = 0; /* X1 */
1191 bidi_it->invalid_isolates = 0; /* X1 */
1192 /* Setting this to zero will force its recomputation the first time
1193 we need it for W5. */
1194 bidi_it->next_en_pos = 0;
1195 bidi_it->next_en_type = UNKNOWN_BT;
1196 bidi_it->next_for_ws.charpos = -1;
1197 bidi_it->next_for_ws.type = UNKNOWN_BT;
1198 bidi_it->bracket_pairing_pos = -1;
1199 bidi_set_sos_type (bidi_it,
1200 (bidi_it->paragraph_dir == R2L ? 1 : 0),
1201 bidi_it->level_stack[0].level); /* X10 */
1203 bidi_cache_reset ();
1207 /***********************************************************************
1208 Fetching characters
1209 ***********************************************************************/
1211 /* Count bytes in string S between BEG/BEGBYTE and END. BEG and END
1212 are zero-based character positions in S, BEGBYTE is byte position
1213 corresponding to BEG. UNIBYTE means S is a unibyte string. */
1214 static ptrdiff_t
1215 bidi_count_bytes (const unsigned char *s, ptrdiff_t beg,
1216 ptrdiff_t begbyte, ptrdiff_t end, bool unibyte)
1218 ptrdiff_t pos = beg;
1219 const unsigned char *p = s + begbyte, *start = p;
1221 if (unibyte)
1222 p = s + end;
1223 else
1225 if (!CHAR_HEAD_P (*p))
1226 emacs_abort ();
1228 while (pos < end)
1230 p += BYTES_BY_CHAR_HEAD (*p);
1231 pos++;
1235 return p - start;
1238 /* Fetch and return the character at byte position BYTEPOS. If S is
1239 non-NULL, fetch the character from string S; otherwise fetch the
1240 character from the current buffer. UNIBYTE means S is a
1241 unibyte string. */
1242 static int
1243 bidi_char_at_pos (ptrdiff_t bytepos, const unsigned char *s, bool unibyte)
1245 if (s)
1247 s += bytepos;
1248 if (unibyte)
1249 return *s;
1251 else
1252 s = BYTE_POS_ADDR (bytepos);
1253 return STRING_CHAR (s);
1256 /* Fetch and return the character at CHARPOS/BYTEPOS. If that
1257 character is covered by a display string, treat the entire run of
1258 covered characters as a single character, either u+2029 or u+FFFC,
1259 and return their combined length in CH_LEN and NCHARS. DISP_POS
1260 specifies the character position of the next display string, or -1
1261 if not yet computed. When the next character is at or beyond that
1262 position, the function updates DISP_POS with the position of the
1263 next display string. *DISP_PROP non-zero means that there's really
1264 a display string at DISP_POS, as opposed to when we searched till
1265 DISP_POS without finding one. If *DISP_PROP is 2, it means the
1266 display spec is of the form `(space ...)', which is replaced with
1267 u+2029 to handle it as a paragraph separator. STRING->s is the C
1268 string to iterate, or NULL if iterating over a buffer or a Lisp
1269 string; in the latter case, STRING->lstring is the Lisp string. */
1270 static int
1271 bidi_fetch_char (ptrdiff_t charpos, ptrdiff_t bytepos, ptrdiff_t *disp_pos,
1272 int *disp_prop, struct bidi_string_data *string,
1273 struct window *w,
1274 bool frame_window_p, ptrdiff_t *ch_len, ptrdiff_t *nchars)
1276 int ch;
1277 ptrdiff_t endpos
1278 = (string->s || STRINGP (string->lstring)) ? string->schars : ZV;
1279 struct text_pos pos;
1280 int len;
1282 /* If we got past the last known position of display string, compute
1283 the position of the next one. That position could be at CHARPOS. */
1284 if (charpos < endpos && charpos > *disp_pos)
1286 SET_TEXT_POS (pos, charpos, bytepos);
1287 *disp_pos = compute_display_string_pos (&pos, string, w, frame_window_p,
1288 disp_prop);
1291 /* Fetch the character at BYTEPOS. */
1292 if (charpos >= endpos)
1294 ch = BIDI_EOB;
1295 *ch_len = 1;
1296 *nchars = 1;
1297 *disp_pos = endpos;
1298 *disp_prop = 0;
1300 else if (charpos >= *disp_pos && *disp_prop)
1302 ptrdiff_t disp_end_pos;
1304 /* We don't expect to find ourselves in the middle of a display
1305 property. Hopefully, it will never be needed. */
1306 if (charpos > *disp_pos)
1307 emacs_abort ();
1308 /* Text covered by `display' properties and overlays with
1309 display properties or display strings is handled as a single
1310 character that represents the entire run of characters
1311 covered by the display property. */
1312 if (*disp_prop == 2)
1314 /* `(space ...)' display specs are handled as paragraph
1315 separators for the purposes of the reordering; see UAX#9
1316 section 3 and clause HL1 in section 4.3 there. */
1317 ch = 0x2029;
1319 else
1321 /* All other display specs are handled as the Unicode Object
1322 Replacement Character. */
1323 ch = 0xFFFC;
1325 disp_end_pos = compute_display_string_end (*disp_pos, string);
1326 if (disp_end_pos < 0)
1328 /* Somebody removed the display string from the buffer
1329 behind our back. Recover by processing this buffer
1330 position as if no display property were present there to
1331 begin with. */
1332 *disp_prop = 0;
1333 goto normal_char;
1335 *nchars = disp_end_pos - *disp_pos;
1336 if (*nchars <= 0)
1337 emacs_abort ();
1338 if (string->s)
1339 *ch_len = bidi_count_bytes (string->s, *disp_pos, bytepos,
1340 disp_end_pos, string->unibyte);
1341 else if (STRINGP (string->lstring))
1342 *ch_len = bidi_count_bytes (SDATA (string->lstring), *disp_pos,
1343 bytepos, disp_end_pos, string->unibyte);
1344 else
1345 *ch_len = CHAR_TO_BYTE (disp_end_pos) - bytepos;
1347 else
1349 normal_char:
1350 if (string->s)
1353 if (!string->unibyte)
1355 ch = STRING_CHAR_AND_LENGTH (string->s + bytepos, len);
1356 *ch_len = len;
1358 else
1360 ch = UNIBYTE_TO_CHAR (string->s[bytepos]);
1361 *ch_len = 1;
1364 else if (STRINGP (string->lstring))
1366 if (!string->unibyte)
1368 ch = STRING_CHAR_AND_LENGTH (SDATA (string->lstring) + bytepos,
1369 len);
1370 *ch_len = len;
1372 else
1374 ch = UNIBYTE_TO_CHAR (SREF (string->lstring, bytepos));
1375 *ch_len = 1;
1378 else
1380 ch = STRING_CHAR_AND_LENGTH (BYTE_POS_ADDR (bytepos), len);
1381 *ch_len = len;
1383 *nchars = 1;
1386 /* If we just entered a run of characters covered by a display
1387 string, compute the position of the next display string. */
1388 if (charpos + *nchars <= endpos && charpos + *nchars > *disp_pos
1389 && *disp_prop)
1391 SET_TEXT_POS (pos, charpos + *nchars, bytepos + *ch_len);
1392 *disp_pos = compute_display_string_pos (&pos, string, w, frame_window_p,
1393 disp_prop);
1396 return ch;
1399 /* Like bidi_fetch_char, but ignore any text between an isolate
1400 initiator and its matching PDI or, if it has no matching PDI, the
1401 end of the paragraph. If isolates were skipped, CH_LEN and NCHARS
1402 are set to the number of bytes and characters between BYTEPOS/CHARPOS
1403 and the character that was fetched after skipping the isolates. */
1404 static int
1405 bidi_fetch_char_skip_isolates (ptrdiff_t charpos, ptrdiff_t bytepos,
1406 ptrdiff_t *disp_pos, int *disp_prop,
1407 struct bidi_string_data *string,
1408 struct window *w, bool frame_window_p,
1409 ptrdiff_t *ch_len, ptrdiff_t *nchars)
1411 ptrdiff_t orig_charpos = charpos, orig_bytepos = bytepos;
1412 int ch = bidi_fetch_char (charpos, bytepos, disp_pos, disp_prop, string, w,
1413 frame_window_p, ch_len, nchars);
1414 bidi_type_t ch_type = bidi_get_type (ch, NEUTRAL_DIR);
1415 ptrdiff_t level = 0;
1417 if (ch_type == LRI || ch_type == RLI || ch_type == FSI)
1419 level++;
1420 while (level > 0 && ch_type != NEUTRAL_B)
1422 charpos += *nchars;
1423 bytepos += *ch_len;
1424 ch = bidi_fetch_char (charpos, bytepos, disp_pos, disp_prop, string,
1425 w, frame_window_p, ch_len, nchars);
1426 ch_type = bidi_get_type (ch, NEUTRAL_DIR);
1427 /* A Note to P2 says to ignore max_depth limit. */
1428 if (ch_type == LRI || ch_type == RLI || ch_type == FSI)
1429 level++;
1430 else if (ch_type == PDI)
1431 level--;
1435 /* Communicate to the caller how much did we skip, so it could get
1436 past the last character position we examined. */
1437 *nchars += charpos - orig_charpos;
1438 *ch_len += bytepos - orig_bytepos;
1439 return ch;
1444 /***********************************************************************
1445 Determining paragraph direction
1446 ***********************************************************************/
1448 /* Check if buffer position CHARPOS/BYTEPOS is the end of a paragraph.
1449 Value is the non-negative length of the paragraph separator
1450 following the buffer position, -1 if position is at the beginning
1451 of a new paragraph, or -2 if position is neither at beginning nor
1452 at end of a paragraph. */
1453 static ptrdiff_t
1454 bidi_at_paragraph_end (ptrdiff_t charpos, ptrdiff_t bytepos)
1456 Lisp_Object sep_re;
1457 Lisp_Object start_re;
1458 ptrdiff_t val;
1460 sep_re = paragraph_separate_re;
1461 start_re = paragraph_start_re;
1463 val = fast_looking_at (sep_re, charpos, bytepos, ZV, ZV_BYTE, Qnil);
1464 if (val < 0)
1466 if (fast_looking_at (start_re, charpos, bytepos, ZV, ZV_BYTE, Qnil) >= 0)
1467 val = -1;
1468 else
1469 val = -2;
1472 return val;
1475 /* If the user has requested the long scans caching, make sure that
1476 BIDI cache is enabled. Otherwise, make sure it's disabled. */
1478 static struct region_cache *
1479 bidi_paragraph_cache_on_off (void)
1481 struct buffer *cache_buffer = current_buffer;
1482 bool indirect_p = false;
1484 /* For indirect buffers, make sure to use the cache of their base
1485 buffer. */
1486 if (cache_buffer->base_buffer)
1488 cache_buffer = cache_buffer->base_buffer;
1489 indirect_p = true;
1492 /* Don't turn on or off the cache in the base buffer, if the value
1493 of cache-long-scans of the base buffer is inconsistent with that.
1494 This is because doing so will just make the cache pure overhead,
1495 since if we turn it on via indirect buffer, it will be
1496 immediately turned off by its base buffer. */
1497 if (NILP (BVAR (current_buffer, cache_long_scans)))
1499 if (!indirect_p
1500 || NILP (BVAR (cache_buffer, cache_long_scans)))
1502 if (cache_buffer->bidi_paragraph_cache)
1504 free_region_cache (cache_buffer->bidi_paragraph_cache);
1505 cache_buffer->bidi_paragraph_cache = 0;
1508 return NULL;
1510 else
1512 if (!indirect_p
1513 || !NILP (BVAR (cache_buffer, cache_long_scans)))
1515 if (!cache_buffer->bidi_paragraph_cache)
1516 cache_buffer->bidi_paragraph_cache = new_region_cache ();
1518 return cache_buffer->bidi_paragraph_cache;
1522 /* On my 2005-vintage machine, searching back for paragraph start
1523 takes ~1 ms per line. And bidi_paragraph_init is called 4 times
1524 when user types C-p. The number below limits each call to
1525 bidi_paragraph_init to about 10 ms. */
1526 #define MAX_PARAGRAPH_SEARCH 7500
1528 /* Find the beginning of this paragraph by looking back in the buffer.
1529 Value is the byte position of the paragraph's beginning, or
1530 BEGV_BYTE if paragraph_start_re is still not found after looking
1531 back MAX_PARAGRAPH_SEARCH lines in the buffer. */
1532 static ptrdiff_t
1533 bidi_find_paragraph_start (ptrdiff_t pos, ptrdiff_t pos_byte)
1535 Lisp_Object re = paragraph_start_re;
1536 ptrdiff_t limit = ZV, limit_byte = ZV_BYTE;
1537 struct region_cache *bpc = bidi_paragraph_cache_on_off ();
1538 ptrdiff_t n = 0, oldpos = pos, next;
1539 struct buffer *cache_buffer = current_buffer;
1541 if (cache_buffer->base_buffer)
1542 cache_buffer = cache_buffer->base_buffer;
1544 while (pos_byte > BEGV_BYTE
1545 && n++ < MAX_PARAGRAPH_SEARCH
1546 && fast_looking_at (re, pos, pos_byte, limit, limit_byte, Qnil) < 0)
1548 /* FIXME: What if the paragraph beginning is covered by a
1549 display string? And what if a display string covering some
1550 of the text over which we scan back includes
1551 paragraph_start_re? */
1552 DEC_BOTH (pos, pos_byte);
1553 if (bpc && region_cache_backward (cache_buffer, bpc, pos, &next))
1555 pos = next, pos_byte = CHAR_TO_BYTE (pos);
1556 break;
1558 else
1559 pos = find_newline_no_quit (pos, pos_byte, -1, &pos_byte);
1561 if (n >= MAX_PARAGRAPH_SEARCH)
1562 pos = BEGV, pos_byte = BEGV_BYTE;
1563 if (bpc)
1564 know_region_cache (cache_buffer, bpc, pos, oldpos);
1565 /* Positions returned by the region cache are not limited to
1566 BEGV..ZV range, so we limit them here. */
1567 pos_byte = clip_to_bounds (BEGV_BYTE, pos_byte, ZV_BYTE);
1568 return pos_byte;
1571 /* On a 3.4 GHz machine, searching forward for a strong directional
1572 character in a long paragraph full of weaks or neutrals takes about
1573 1 ms for each 20K characters. The number below limits each call to
1574 bidi_paragraph_init to less than 10 ms even on slow machines. */
1575 #define MAX_STRONG_CHAR_SEARCH 100000
1577 /* Starting from POS, find the first strong (L, R, or AL) character,
1578 while skipping over any characters between an isolate initiator and
1579 its matching PDI. STOP_AT_PDI non-zero means stop at the PDI that
1580 matches the isolate initiator at POS. Return the bidi type of the
1581 character where the search stopped. Give up if after examining
1582 MAX_STRONG_CHAR_SEARCH buffer or string positions no strong
1583 character was found. */
1584 static bidi_type_t
1585 find_first_strong_char (ptrdiff_t pos, ptrdiff_t bytepos, ptrdiff_t end,
1586 ptrdiff_t *disp_pos, int *disp_prop,
1587 struct bidi_string_data *string, struct window *w,
1588 bool string_p, bool frame_window_p,
1589 ptrdiff_t *ch_len, ptrdiff_t *nchars, bool stop_at_pdi)
1591 ptrdiff_t pos1;
1592 bidi_type_t type;
1593 int ch;
1595 if (stop_at_pdi)
1597 /* If STOP_AT_PDI is non-zero, we must have been called with FSI
1598 at POS. Get past it. */
1599 #ifdef ENABLE_CHECKING
1600 ch = bidi_fetch_char (pos, bytepos, disp_pos, disp_prop, string, w,
1601 frame_window_p, ch_len, nchars);
1602 type = bidi_get_type (ch, NEUTRAL_DIR);
1603 eassert (type == FSI /* || type == LRI || type == RLI */);
1604 #endif
1605 pos += *nchars;
1606 bytepos += *ch_len;
1608 ch = bidi_fetch_char_skip_isolates (pos, bytepos, disp_pos, disp_prop, string,
1609 w, frame_window_p, ch_len, nchars);
1610 type = bidi_get_type (ch, NEUTRAL_DIR);
1612 pos1 = pos;
1613 for (pos += *nchars, bytepos += *ch_len;
1614 bidi_get_category (type) != STRONG
1615 /* If requested to stop at first PDI, stop there. */
1616 && !(stop_at_pdi && type == PDI)
1617 /* Stop when searched too far into an abnormally large
1618 paragraph full of weak or neutral characters. */
1619 && pos - pos1 < MAX_STRONG_CHAR_SEARCH;
1620 type = bidi_get_type (ch, NEUTRAL_DIR))
1622 if (pos >= end)
1624 /* Pretend there's a paragraph separator at end of
1625 buffer/string. */
1626 type = NEUTRAL_B;
1627 break;
1629 if (!string_p
1630 && type == NEUTRAL_B
1631 && bidi_at_paragraph_end (pos, bytepos) >= -1)
1632 break;
1633 /* Fetch next character and advance to get past it. */
1634 ch = bidi_fetch_char_skip_isolates (pos, bytepos, disp_pos, disp_prop,
1635 string, w, frame_window_p,
1636 ch_len, nchars);
1637 pos += *nchars;
1638 bytepos += *ch_len;
1640 return type;
1643 /* Determine the base direction, a.k.a. base embedding level, of the
1644 paragraph we are about to iterate through. If DIR is either L2R or
1645 R2L, just use that. Otherwise, determine the paragraph direction
1646 from the first strong directional character of the paragraph.
1648 NO_DEFAULT_P means don't default to L2R if the paragraph
1649 has no strong directional characters and both DIR and
1650 bidi_it->paragraph_dir are NEUTRAL_DIR. In that case, search back
1651 in the buffer until a paragraph is found with a strong character,
1652 or until hitting BEGV. In the latter case, fall back to L2R. This
1653 flag is used in current-bidi-paragraph-direction.
1655 Note that this function gives the paragraph separator the same
1656 direction as the preceding paragraph, even though Emacs generally
1657 views the separator as not belonging to any paragraph. */
1658 void
1659 bidi_paragraph_init (bidi_dir_t dir, struct bidi_it *bidi_it, bool no_default_p)
1661 ptrdiff_t bytepos = bidi_it->bytepos;
1662 bool string_p = bidi_it->string.s || STRINGP (bidi_it->string.lstring);
1663 ptrdiff_t pstartbyte;
1664 /* Note that begbyte is a byte position, while end is a character
1665 position. Yes, this is ugly, but we are trying to avoid costly
1666 calls to BYTE_TO_CHAR and its ilk. */
1667 ptrdiff_t begbyte = string_p ? 0 : BEGV_BYTE;
1668 ptrdiff_t end = string_p ? bidi_it->string.schars : ZV;
1670 /* Special case for an empty buffer. */
1671 if (bytepos == begbyte && bidi_it->charpos == end)
1672 dir = L2R;
1673 /* We should never be called at EOB or before BEGV. */
1674 else if (bidi_it->charpos >= end || bytepos < begbyte)
1675 emacs_abort ();
1677 if (dir == L2R)
1679 bidi_it->paragraph_dir = L2R;
1680 bidi_it->new_paragraph = 0;
1682 else if (dir == R2L)
1684 bidi_it->paragraph_dir = R2L;
1685 bidi_it->new_paragraph = 0;
1687 else if (dir == NEUTRAL_DIR) /* P2 */
1689 ptrdiff_t ch_len, nchars;
1690 ptrdiff_t pos, disp_pos = -1;
1691 int disp_prop = 0;
1692 bidi_type_t type;
1693 const unsigned char *s;
1695 if (!bidi_initialized)
1696 bidi_initialize ();
1698 /* If we are inside a paragraph separator, we are just waiting
1699 for the separator to be exhausted; use the previous paragraph
1700 direction. But don't do that if we have been just reseated,
1701 because we need to reinitialize below in that case. */
1702 if (!bidi_it->first_elt
1703 && bidi_it->charpos < bidi_it->separator_limit)
1704 return;
1706 /* If we are on a newline, get past it to where the next
1707 paragraph might start. But don't do that at BEGV since then
1708 we are potentially in a new paragraph that doesn't yet
1709 exist. */
1710 pos = bidi_it->charpos;
1711 s = (STRINGP (bidi_it->string.lstring)
1712 ? SDATA (bidi_it->string.lstring)
1713 : bidi_it->string.s);
1714 if (bytepos > begbyte
1715 && bidi_char_at_pos (bytepos, s, bidi_it->string.unibyte) == '\n')
1717 bytepos++;
1718 pos++;
1721 /* We are either at the beginning of a paragraph or in the
1722 middle of it. Find where this paragraph starts. */
1723 if (string_p)
1725 /* We don't support changes of paragraph direction inside a
1726 string. It is treated as a single paragraph. */
1727 pstartbyte = 0;
1729 else
1730 pstartbyte = bidi_find_paragraph_start (pos, bytepos);
1731 bidi_it->separator_limit = -1;
1732 bidi_it->new_paragraph = 0;
1734 /* The following loop is run more than once only if NO_DEFAULT_P,
1735 and only if we are iterating on a buffer. */
1736 do {
1737 bytepos = pstartbyte;
1738 if (!string_p)
1739 pos = BYTE_TO_CHAR (bytepos);
1740 type = find_first_strong_char (pos, bytepos, end, &disp_pos, &disp_prop,
1741 &bidi_it->string, bidi_it->w,
1742 string_p, bidi_it->frame_window_p,
1743 &ch_len, &nchars, false);
1744 if (type == STRONG_R || type == STRONG_AL) /* P3 */
1745 bidi_it->paragraph_dir = R2L;
1746 else if (type == STRONG_L)
1747 bidi_it->paragraph_dir = L2R;
1748 if (!string_p
1749 && no_default_p && bidi_it->paragraph_dir == NEUTRAL_DIR)
1751 /* If this paragraph is at BEGV, default to L2R. */
1752 if (pstartbyte == BEGV_BYTE)
1753 bidi_it->paragraph_dir = L2R; /* P3 and HL1 */
1754 else
1756 ptrdiff_t prevpbyte = pstartbyte;
1757 ptrdiff_t p = BYTE_TO_CHAR (pstartbyte), pbyte = pstartbyte;
1759 /* Find the beginning of the previous paragraph, if any. */
1760 while (pbyte > BEGV_BYTE && prevpbyte >= pstartbyte)
1762 /* FXIME: What if p is covered by a display
1763 string? See also a FIXME inside
1764 bidi_find_paragraph_start. */
1765 DEC_BOTH (p, pbyte);
1766 prevpbyte = bidi_find_paragraph_start (p, pbyte);
1768 pstartbyte = prevpbyte;
1771 } while (!string_p
1772 && no_default_p && bidi_it->paragraph_dir == NEUTRAL_DIR);
1774 else
1775 emacs_abort ();
1777 /* Contrary to UAX#9 clause P3, we only default the paragraph
1778 direction to L2R if we have no previous usable paragraph
1779 direction. This is allowed by the HL1 clause. */
1780 if (bidi_it->paragraph_dir != L2R && bidi_it->paragraph_dir != R2L)
1781 bidi_it->paragraph_dir = L2R; /* P3 and HL1 ``higher-level protocols'' */
1782 if (bidi_it->paragraph_dir == R2L)
1783 bidi_it->level_stack[0].level = 1;
1784 else
1785 bidi_it->level_stack[0].level = 0;
1787 bidi_line_init (bidi_it);
1791 /***********************************************************************
1792 Resolving explicit and implicit levels.
1793 The rest of this file constitutes the core of the UBA implementation.
1794 ***********************************************************************/
1796 static bool
1797 bidi_explicit_dir_char (int ch)
1799 bidi_type_t ch_type;
1801 if (!bidi_initialized)
1802 emacs_abort ();
1803 ch_type = (bidi_type_t) XINT (CHAR_TABLE_REF (bidi_type_table, ch));
1804 return (ch_type == LRE || ch_type == LRO
1805 || ch_type == RLE || ch_type == RLO
1806 || ch_type == PDF);
1809 /* Given an iterator state in BIDI_IT, advance one character position
1810 in the buffer/string to the next character (in the logical order),
1811 resolve any explicit embeddings, directional overrides, and isolate
1812 initiators and terminators, and return the embedding level of the
1813 character after resolving these explicit directives. */
1814 static int
1815 bidi_resolve_explicit (struct bidi_it *bidi_it)
1817 int curchar;
1818 bidi_type_t type, typ1, prev_type = UNKNOWN_BT;
1819 int current_level;
1820 int new_level;
1821 bidi_dir_t override;
1822 bool isolate_status;
1823 bool string_p = bidi_it->string.s || STRINGP (bidi_it->string.lstring);
1824 ptrdiff_t ch_len, nchars, disp_pos, end;
1825 int disp_prop;
1826 ptrdiff_t eob
1827 = ((bidi_it->string.s || STRINGP (bidi_it->string.lstring))
1828 ? bidi_it->string.schars : ZV);
1830 /* Record the info about the previous character. */
1831 if (bidi_it->type_after_wn != WEAK_BN /* W1/Retaining */
1832 && bidi_it->type != WEAK_BN)
1834 /* This special case is needed in support of Unicode 8.0
1835 correction to N0, as implemented in bidi_resolve_weak/W1
1836 below. */
1837 if (bidi_it->type_after_wn == NEUTRAL_ON
1838 && bidi_get_category (bidi_it->type) == STRONG
1839 && bidi_paired_bracket_type (bidi_it->ch) == BIDI_BRACKET_CLOSE)
1840 bidi_remember_char (&bidi_it->prev, bidi_it, 1);
1841 else
1842 bidi_remember_char (&bidi_it->prev, bidi_it, 0);
1844 if (bidi_it->type_after_wn == STRONG_R
1845 || bidi_it->type_after_wn == STRONG_L
1846 || bidi_it->type_after_wn == STRONG_AL)
1847 bidi_remember_char (&bidi_it->last_strong, bidi_it, 0);
1848 if (bidi_it->type == STRONG_R || bidi_it->type == STRONG_L
1849 || bidi_it->type == WEAK_EN || bidi_it->type == WEAK_AN)
1850 bidi_remember_char (&bidi_it->prev_for_neutral, bidi_it, 1);
1852 /* If we overstepped the characters used for resolving neutrals
1853 and whitespace, invalidate their info in the iterator. */
1854 if (bidi_it->charpos >= bidi_it->next_for_neutral.charpos)
1856 bidi_it->next_for_neutral.type = UNKNOWN_BT;
1857 /* If needed, reset the "magical" value of pairing bracket
1858 position, so that bidi_resolve_brackets will resume
1859 resolution of brackets according to BPA. */
1860 if (bidi_it->bracket_pairing_pos == eob)
1861 bidi_it->bracket_pairing_pos = -1;
1863 if (bidi_it->next_en_pos >= 0
1864 && bidi_it->charpos >= bidi_it->next_en_pos)
1866 bidi_it->next_en_pos = 0;
1867 bidi_it->next_en_type = UNKNOWN_BT;
1870 /* Reset the bracket resolution info, unless we previously decided
1871 (in bidi_find_bracket_pairs) that brackets in this level run
1872 should be resolved as neutrals. */
1873 if (bidi_it->bracket_pairing_pos != eob)
1875 bidi_it->bracket_pairing_pos = -1;
1876 bidi_it->bracket_enclosed_type = UNKNOWN_BT;
1879 /* If reseat()'ed, don't advance, so as to start iteration from the
1880 position where we were reseated. bidi_it->bytepos can be less
1881 than BEGV_BYTE after reseat to BEGV. */
1882 if (bidi_it->bytepos < (string_p ? 0 : BEGV_BYTE)
1883 || bidi_it->first_elt)
1885 bidi_it->first_elt = 0;
1886 if (string_p)
1888 const unsigned char *p
1889 = (STRINGP (bidi_it->string.lstring)
1890 ? SDATA (bidi_it->string.lstring)
1891 : bidi_it->string.s);
1893 if (bidi_it->charpos < 0)
1894 bidi_it->charpos = bidi_it->bytepos = 0;
1895 eassert (bidi_it->bytepos == bidi_count_bytes (p, 0, 0,
1896 bidi_it->charpos,
1897 bidi_it->string.unibyte));
1899 else
1901 if (bidi_it->charpos < BEGV)
1903 bidi_it->charpos = BEGV;
1904 bidi_it->bytepos = BEGV_BYTE;
1906 eassert (bidi_it->bytepos == CHAR_TO_BYTE (bidi_it->charpos));
1908 /* Determine the original bidi type of the previous character,
1909 which is needed for handling isolate initiators and PDF. The
1910 type of the previous character will be non-trivial only if
1911 our caller moved through some previous text in
1912 get_visually_first_element, in which case bidi_it->prev holds
1913 the information we want. */
1914 if (bidi_it->first_elt && bidi_it->prev.type != UNKNOWN_BT)
1916 eassert (bidi_it->prev.charpos == bidi_it->charpos - 1);
1917 prev_type = bidi_it->prev.orig_type;
1918 if (prev_type == FSI)
1919 prev_type = bidi_it->type_after_wn;
1922 /* Don't move at end of buffer/string. */
1923 else if (bidi_it->charpos < (string_p ? bidi_it->string.schars : ZV))
1925 /* Advance to the next character, skipping characters covered by
1926 display strings (nchars > 1). */
1927 if (bidi_it->nchars <= 0)
1928 emacs_abort ();
1929 bidi_it->charpos += bidi_it->nchars;
1930 if (bidi_it->ch_len == 0)
1931 emacs_abort ();
1932 bidi_it->bytepos += bidi_it->ch_len;
1933 prev_type = bidi_it->orig_type;
1934 if (prev_type == FSI)
1935 prev_type = bidi_it->type_after_wn;
1937 else /* EOB or end of string */
1938 prev_type = NEUTRAL_B;
1940 current_level = bidi_it->level_stack[bidi_it->stack_idx].level; /* X1 */
1941 isolate_status = ISOLATE_STATUS (bidi_it, bidi_it->stack_idx);
1942 override = OVERRIDE (bidi_it, bidi_it->stack_idx);
1943 new_level = current_level;
1945 if (bidi_it->charpos >= (string_p ? bidi_it->string.schars : ZV))
1947 curchar = BIDI_EOB;
1948 bidi_it->ch_len = 1;
1949 bidi_it->nchars = 1;
1950 bidi_it->disp_pos = (string_p ? bidi_it->string.schars : ZV);
1951 bidi_it->disp_prop = 0;
1953 else
1955 /* LRI, RLI, and FSI increment, and PDF decrements, the
1956 embedding level of the _following_ characters, so we must
1957 first look at the type of the previous character to support
1958 that. */
1959 switch (prev_type)
1961 case RLI: /* X5a */
1962 if (current_level < BIDI_MAXDEPTH
1963 && bidi_it->invalid_levels == 0
1964 && bidi_it->invalid_isolates == 0)
1966 new_level = ((current_level + 1) & ~1) + 1;
1967 bidi_it->isolate_level++;
1968 bidi_push_embedding_level (bidi_it, new_level,
1969 NEUTRAL_DIR, true);
1971 else
1972 bidi_it->invalid_isolates++;
1973 break;
1974 case LRI: /* X5b */
1975 if (current_level < BIDI_MAXDEPTH - 1
1976 && bidi_it->invalid_levels == 0
1977 && bidi_it->invalid_isolates == 0)
1979 new_level = ((current_level + 2) & ~1);
1980 bidi_it->isolate_level++;
1981 bidi_push_embedding_level (bidi_it, new_level,
1982 NEUTRAL_DIR, true);
1984 else
1985 bidi_it->invalid_isolates++;
1986 break;
1987 case PDF: /* X7 */
1988 if (!bidi_it->invalid_isolates)
1990 if (bidi_it->invalid_levels)
1991 bidi_it->invalid_levels--;
1992 else if (!isolate_status && bidi_it->stack_idx >= 1)
1993 new_level = bidi_pop_embedding_level (bidi_it);
1995 break;
1996 default:
1997 eassert (prev_type != FSI);
1998 /* Nothing. */
1999 break;
2001 /* Fetch the character at BYTEPOS. If it is covered by a
2002 display string, treat the entire run of covered characters as
2003 a single character u+FFFC. */
2004 curchar = bidi_fetch_char (bidi_it->charpos, bidi_it->bytepos,
2005 &bidi_it->disp_pos, &bidi_it->disp_prop,
2006 &bidi_it->string, bidi_it->w,
2007 bidi_it->frame_window_p,
2008 &bidi_it->ch_len, &bidi_it->nchars);
2010 bidi_it->ch = curchar;
2011 bidi_it->resolved_level = new_level;
2013 /* Don't apply directional override here, as all the types we handle
2014 below will not be affected by the override anyway, and we need
2015 the original type unaltered. The override will be applied in
2016 bidi_resolve_weak. */
2017 type = bidi_get_type (curchar, NEUTRAL_DIR);
2018 bidi_it->orig_type = type;
2019 bidi_check_type (bidi_it->orig_type);
2021 bidi_it->type_after_wn = UNKNOWN_BT;
2023 switch (type)
2025 case RLE: /* X2 */
2026 case RLO: /* X4 */
2027 bidi_it->type_after_wn = type;
2028 bidi_check_type (bidi_it->type_after_wn);
2029 type = WEAK_BN; /* X9/Retaining */
2030 if (new_level < BIDI_MAXDEPTH
2031 && bidi_it->invalid_levels == 0
2032 && bidi_it->invalid_isolates == 0)
2034 /* Compute the least odd embedding level greater than
2035 the current level. */
2036 new_level = ((new_level + 1) & ~1) + 1;
2037 if (bidi_it->type_after_wn == RLE)
2038 override = NEUTRAL_DIR;
2039 else
2040 override = R2L;
2041 bidi_push_embedding_level (bidi_it, new_level, override, false);
2042 bidi_it->resolved_level = new_level;
2044 else
2046 if (bidi_it->invalid_isolates == 0)
2047 bidi_it->invalid_levels++;
2049 break;
2050 case LRE: /* X3 */
2051 case LRO: /* X5 */
2052 bidi_it->type_after_wn = type;
2053 bidi_check_type (bidi_it->type_after_wn);
2054 type = WEAK_BN; /* X9/Retaining */
2055 if (new_level < BIDI_MAXDEPTH - 1
2056 && bidi_it->invalid_levels == 0
2057 && bidi_it->invalid_isolates == 0)
2059 /* Compute the least even embedding level greater than
2060 the current level. */
2061 new_level = ((new_level + 2) & ~1);
2062 if (bidi_it->type_after_wn == LRE)
2063 override = NEUTRAL_DIR;
2064 else
2065 override = L2R;
2066 bidi_push_embedding_level (bidi_it, new_level, override, false);
2067 bidi_it->resolved_level = new_level;
2069 else
2071 if (bidi_it->invalid_isolates == 0)
2072 bidi_it->invalid_levels++;
2074 break;
2075 case FSI: /* X5c */
2076 end = string_p ? bidi_it->string.schars : ZV;
2077 disp_pos = bidi_it->disp_pos;
2078 disp_prop = bidi_it->disp_prop;
2079 nchars = bidi_it->nchars;
2080 ch_len = bidi_it->ch_len;
2081 typ1 = find_first_strong_char (bidi_it->charpos,
2082 bidi_it->bytepos, end,
2083 &disp_pos, &disp_prop,
2084 &bidi_it->string, bidi_it->w,
2085 string_p, bidi_it->frame_window_p,
2086 &ch_len, &nchars, true);
2087 if (typ1 != STRONG_R && typ1 != STRONG_AL)
2089 type = LRI;
2090 goto fsi_as_lri;
2092 else
2093 type = RLI;
2094 /* FALLTHROUGH */
2095 case RLI: /* X5a */
2096 if (override == NEUTRAL_DIR)
2097 bidi_it->type_after_wn = type;
2098 else /* Unicode 8.0 correction. */
2099 bidi_it->type_after_wn = (override == L2R ? STRONG_L : STRONG_R);
2100 bidi_check_type (bidi_it->type_after_wn);
2101 break;
2102 case LRI: /* X5b */
2103 fsi_as_lri:
2104 if (override == NEUTRAL_DIR)
2105 bidi_it->type_after_wn = type;
2106 else /* Unicode 8.0 correction. */
2107 bidi_it->type_after_wn = (override == L2R ? STRONG_L : STRONG_R);
2108 bidi_check_type (bidi_it->type_after_wn);
2109 break;
2110 case PDI: /* X6a */
2111 if (bidi_it->invalid_isolates)
2112 bidi_it->invalid_isolates--;
2113 else if (bidi_it->isolate_level > 0)
2115 bidi_it->invalid_levels = 0;
2116 while (!ISOLATE_STATUS (bidi_it, bidi_it->stack_idx))
2117 bidi_pop_embedding_level (bidi_it);
2118 eassert (bidi_it->stack_idx > 0);
2119 new_level = bidi_pop_embedding_level (bidi_it);
2120 bidi_it->isolate_level--;
2122 bidi_it->resolved_level = new_level;
2123 /* Unicode 8.0 correction. */
2125 bidi_dir_t stack_override = OVERRIDE (bidi_it, bidi_it->stack_idx);
2126 if (stack_override == L2R)
2127 bidi_it->type_after_wn = STRONG_L;
2128 else if (stack_override == R2L)
2129 bidi_it->type_after_wn = STRONG_R;
2130 else
2131 bidi_it->type_after_wn = type;
2133 break;
2134 case PDF: /* X7 */
2135 bidi_it->type_after_wn = type;
2136 bidi_check_type (bidi_it->type_after_wn);
2137 type = WEAK_BN; /* X9/Retaining */
2138 break;
2139 default:
2140 /* Nothing. */
2141 break;
2144 bidi_it->type = type;
2145 bidi_check_type (bidi_it->type);
2147 if (bidi_it->type == NEUTRAL_B) /* X8 */
2149 bidi_set_paragraph_end (bidi_it);
2150 /* This is needed by bidi_resolve_weak below, and in L1. */
2151 bidi_it->type_after_wn = bidi_it->type;
2154 eassert (bidi_it->resolved_level >= 0);
2155 return bidi_it->resolved_level;
2158 /* Advance in the buffer/string, resolve weak types and return the
2159 type of the next character after weak type resolution. */
2160 static bidi_type_t
2161 bidi_resolve_weak (struct bidi_it *bidi_it)
2163 bidi_type_t type;
2164 bidi_dir_t override;
2165 int prev_level = bidi_it->level_stack[bidi_it->stack_idx].level;
2166 int new_level = bidi_resolve_explicit (bidi_it);
2167 int next_char;
2168 bidi_type_t type_of_next;
2169 struct bidi_it saved_it;
2170 ptrdiff_t eob
2171 = ((STRINGP (bidi_it->string.lstring) || bidi_it->string.s)
2172 ? bidi_it->string.schars : ZV);
2174 type = bidi_it->type;
2175 override = OVERRIDE (bidi_it, bidi_it->stack_idx);
2177 eassert (!(type == UNKNOWN_BT
2178 || type == LRE
2179 || type == LRO
2180 || type == RLE
2181 || type == RLO
2182 || type == PDF));
2184 eassert (prev_level >= 0);
2185 if (bidi_it->type == NEUTRAL_B)
2187 /* We've got a new isolating sequence, compute the directional
2188 type of sos and initialize per-run variables (UAX#9, clause
2189 X10). */
2190 bidi_set_sos_type (bidi_it, prev_level, new_level);
2192 if (type == NEUTRAL_S || type == NEUTRAL_WS
2193 || type == WEAK_BN || type == STRONG_AL)
2194 bidi_it->type_after_wn = type; /* needed in L1 */
2195 bidi_check_type (bidi_it->type_after_wn);
2197 /* Level and directional override status are already recorded in
2198 bidi_it, and do not need any change; see X6. */
2199 if (override == R2L) /* X6 */
2200 type = STRONG_R;
2201 else if (override == L2R)
2202 type = STRONG_L;
2203 else
2205 if (type == WEAK_NSM) /* W1 */
2207 /* Note that we don't need to consider the case where the
2208 prev character has its type overridden by an RLO or LRO,
2209 because then either the type of this NSM would have been
2210 also overridden, or the previous character is outside the
2211 current level run, and thus not relevant to this NSM.
2212 This is why NSM gets the type_after_wn of the previous
2213 character. */
2214 /* bidi_set_sos_type sets type_after_wn to UNKNOWN_BT. */
2215 if (bidi_it->prev.type != UNKNOWN_BT
2216 /* If type_after_wn is NEUTRAL_B, this NSM is at sos. */
2217 && bidi_it->prev.type != NEUTRAL_B)
2219 if (bidi_isolate_fmt_char (bidi_it->prev.type))
2221 /* From W1: "Note that in an isolating run sequence,
2222 an isolate initiator followed by an NSM or any
2223 type other than PDI must be an overflow isolate
2224 initiator." */
2225 eassert (bidi_it->invalid_isolates > 0);
2226 type = NEUTRAL_ON;
2228 else
2230 /* This includes the Unicode 8.0 correction for N0,
2231 due to how we set prev.type in bidi_resolve_explicit,
2232 which see. */
2233 type = bidi_it->prev.type;
2236 else if (bidi_it->sos == R2L)
2237 type = STRONG_R;
2238 else if (bidi_it->sos == L2R)
2239 type = STRONG_L;
2240 else /* shouldn't happen! */
2241 emacs_abort ();
2243 if (type == WEAK_EN /* W2 */
2244 && bidi_it->last_strong.type == STRONG_AL)
2245 type = WEAK_AN;
2246 else if (type == STRONG_AL) /* W3 */
2247 type = STRONG_R;
2248 else if ((type == WEAK_ES /* W4 */
2249 && bidi_it->prev.type == WEAK_EN
2250 && bidi_it->prev.orig_type == WEAK_EN)
2251 || (type == WEAK_CS
2252 && ((bidi_it->prev.type == WEAK_EN
2253 && bidi_it->prev.orig_type == WEAK_EN)
2254 || bidi_it->prev.type == WEAK_AN)))
2256 const unsigned char *s
2257 = (STRINGP (bidi_it->string.lstring)
2258 ? SDATA (bidi_it->string.lstring)
2259 : bidi_it->string.s);
2261 next_char = (bidi_it->charpos + bidi_it->nchars >= eob
2262 ? BIDI_EOB
2263 : bidi_char_at_pos (bidi_it->bytepos + bidi_it->ch_len,
2264 s, bidi_it->string.unibyte));
2265 type_of_next = bidi_get_type (next_char, override);
2267 if (type_of_next == WEAK_BN
2268 || bidi_explicit_dir_char (next_char))
2270 bidi_copy_it (&saved_it, bidi_it);
2271 while (bidi_resolve_explicit (bidi_it) == new_level
2272 && bidi_it->type == WEAK_BN)
2273 type_of_next = bidi_it->type;
2274 bidi_copy_it (bidi_it, &saved_it);
2277 /* If the next character is EN, but the last strong-type
2278 character is AL, that next EN will be changed to AN when
2279 we process it in W2 above. So in that case, this ES
2280 should not be changed into EN. */
2281 if (type == WEAK_ES
2282 && type_of_next == WEAK_EN
2283 && bidi_it->last_strong.type != STRONG_AL)
2284 type = WEAK_EN;
2285 else if (type == WEAK_CS)
2287 if (bidi_it->prev.type == WEAK_AN
2288 && (type_of_next == WEAK_AN
2289 /* If the next character is EN, but the last
2290 strong-type character is AL, EN will be later
2291 changed to AN when we process it in W2 above.
2292 So in that case, this ES should not be
2293 changed into EN. */
2294 || (type_of_next == WEAK_EN
2295 && bidi_it->last_strong.type == STRONG_AL)))
2296 type = WEAK_AN;
2297 else if (bidi_it->prev.type == WEAK_EN
2298 && type_of_next == WEAK_EN
2299 && bidi_it->last_strong.type != STRONG_AL)
2300 type = WEAK_EN;
2303 else if (type == WEAK_ET /* W5: ET with EN before or after it */
2304 || type == WEAK_BN) /* W5/Retaining */
2306 if (bidi_it->prev.type == WEAK_EN) /* ET/BN w/EN before it */
2307 type = WEAK_EN;
2308 else if (bidi_it->next_en_pos > bidi_it->charpos
2309 && bidi_it->next_en_type != WEAK_BN)
2311 if (bidi_it->next_en_type == WEAK_EN) /* ET/BN with EN after it */
2312 type = WEAK_EN;
2314 else if (bidi_it->next_en_pos >=0)
2316 /* We overstepped the last known position for ET
2317 resolution but there could be other such characters
2318 in this paragraph (when we are sure there are no more
2319 such positions, we set next_en_pos to a negative
2320 value). Try to find the next position for ET
2321 resolution. */
2322 ptrdiff_t en_pos = bidi_it->charpos + bidi_it->nchars;
2323 const unsigned char *s = (STRINGP (bidi_it->string.lstring)
2324 ? SDATA (bidi_it->string.lstring)
2325 : bidi_it->string.s);
2327 if (bidi_it->nchars <= 0)
2328 emacs_abort ();
2329 next_char
2330 = (bidi_it->charpos + bidi_it->nchars >= eob
2331 ? BIDI_EOB
2332 : bidi_char_at_pos (bidi_it->bytepos + bidi_it->ch_len, s,
2333 bidi_it->string.unibyte));
2334 type_of_next = bidi_get_type (next_char, override);
2336 if (type_of_next == WEAK_ET
2337 || type_of_next == WEAK_BN
2338 || bidi_explicit_dir_char (next_char))
2340 bidi_copy_it (&saved_it, bidi_it);
2341 while (bidi_resolve_explicit (bidi_it) == new_level
2342 && (bidi_it->type == WEAK_BN
2343 || bidi_it->type == WEAK_ET))
2344 type_of_next = bidi_it->type;
2345 if (type == WEAK_BN
2346 && bidi_it->charpos == saved_it.charpos + saved_it.nchars)
2348 /* If we entered the above loop with a BN that
2349 changes the level, the type of next
2350 character, which is in a different level, is
2351 not relevant to resolving this series of ET
2352 and BN. */
2353 en_pos = saved_it.charpos;
2354 type_of_next = type;
2356 else
2357 en_pos = bidi_it->charpos;
2358 bidi_copy_it (bidi_it, &saved_it);
2360 /* Remember this position, to speed up processing of the
2361 next ETs. */
2362 bidi_it->next_en_pos = en_pos;
2363 if (type_of_next == WEAK_EN)
2365 /* If the last strong character is AL, the EN we've
2366 found will become AN when we get to it (W2). */
2367 if (bidi_it->last_strong.type == STRONG_AL)
2368 type_of_next = WEAK_AN;
2369 else if (type == WEAK_BN)
2370 type = NEUTRAL_ON; /* W6/Retaining */
2371 else
2372 type = WEAK_EN;
2374 else if (type_of_next == NEUTRAL_B)
2375 /* Record the fact that there are no more ENs from
2376 here to the end of paragraph, to avoid entering the
2377 loop above ever again in this paragraph. */
2378 bidi_it->next_en_pos = -1;
2379 /* Record the type of the character where we ended our search. */
2380 bidi_it->next_en_type = type_of_next;
2385 if (type == WEAK_ES || type == WEAK_ET || type == WEAK_CS /* W6 */
2386 || (type == WEAK_BN
2387 && (bidi_it->prev.type == WEAK_CS /* W6/Retaining */
2388 || bidi_it->prev.type == WEAK_ES
2389 || bidi_it->prev.type == WEAK_ET)))
2390 type = NEUTRAL_ON;
2392 /* Store the type we've got so far, before we clobber it with strong
2393 types in W7 and while resolving neutral types. But leave alone
2394 the original types that were recorded above, because we will need
2395 them for the L1 clause. */
2396 if (bidi_it->type_after_wn == UNKNOWN_BT)
2397 bidi_it->type_after_wn = type;
2398 bidi_check_type (bidi_it->type_after_wn);
2400 if (type == WEAK_EN) /* W7 */
2402 if ((bidi_it->last_strong.type == STRONG_L)
2403 || (bidi_it->last_strong.type == UNKNOWN_BT && bidi_it->sos == L2R))
2404 type = STRONG_L;
2407 bidi_it->type = type;
2408 bidi_check_type (bidi_it->type);
2409 return type;
2412 /* Resolve the type of a neutral character according to the type of
2413 surrounding strong text and the current embedding level. */
2414 static bidi_type_t
2415 bidi_resolve_neutral_1 (bidi_type_t prev_type, bidi_type_t next_type, int lev)
2417 /* N1: "European and Arabic numbers act as if they were R in terms
2418 of their influence on NIs." */
2419 if (next_type == WEAK_EN || next_type == WEAK_AN)
2420 next_type = STRONG_R;
2421 if (prev_type == WEAK_EN || prev_type == WEAK_AN)
2422 prev_type = STRONG_R;
2424 if (next_type == prev_type) /* N1 */
2425 return next_type;
2426 else if ((lev & 1) == 0) /* N2 */
2427 return STRONG_L;
2428 else
2429 return STRONG_R;
2432 #define FLAG_EMBEDDING_INSIDE 1
2433 #define FLAG_OPPOSITE_INSIDE 2
2435 /* A data type used in the stack maintained by
2436 bidi_find_bracket_pairs below. */
2437 typedef struct bpa_stack_entry {
2438 int close_bracket_char;
2439 int open_bracket_idx;
2440 #ifdef ENABLE_CHECKING
2441 ptrdiff_t open_bracket_pos;
2442 #endif
2443 unsigned flags : 2;
2444 } bpa_stack_entry;
2446 /* With MAX_ALLOCA of 16KB, this should allow at least 1K slots in the
2447 BPA stack, which should be more than enough for actual bidi text. */
2448 #define MAX_BPA_STACK ((int)max (MAX_ALLOCA / sizeof (bpa_stack_entry), 1))
2450 /* UAX#9 says to match opening brackets with the matching closing
2451 brackets or their canonical equivalents. As of Unicode 7.0, there
2452 are only 2 bracket characters that have canonical equivalence
2453 decompositions: u+2329 and u+232A. So instead of accessing the
2454 table in uni-decomposition.el, we just handle these 2 characters
2455 with this simple macro. Note that ASCII characters don't have
2456 canonical equivalents by definition. */
2458 /* To find all the characters that need to be processed by
2459 CANONICAL_EQU, first find all the characters which have
2460 decompositions in UnicodeData.txt, with this Awk script:
2462 awk -F ";" " {if ($6 != \"\") print $1, $6}" UnicodeData.txt
2464 Then produce a list of all the bracket characters in BidiBrackets.txt:
2466 awk -F "[ ;]" " {if ($1 != \"#\" && $1 != \"\") print $1}" BidiBrackets.txt
2468 And finally, cross-reference these two:
2470 fgrep -w -f brackets.txt decompositions.txt
2472 where "decompositions.txt" was produced by the 1st script, and
2473 "brackets.txt" by the 2nd script. In the output of fgrep, look
2474 only for decompositions that don't begin with some compatibility
2475 formatting tag, such as "<compat>". Only decompositions that
2476 consist solely of character codepoints are relevant to bidi
2477 brackets processing. */
2479 #define CANONICAL_EQU(c) \
2480 ( ASCII_CHAR_P (c) ? c \
2481 : (c) == 0x2329 ? 0x3008 \
2482 : (c) == 0x232a ? 0x3009 \
2483 : c )
2485 #ifdef ENABLE_CHECKING
2486 # define STORE_BRACKET_CHARPOS \
2487 bpa_stack[bpa_sp].open_bracket_pos = bidi_it->charpos
2488 #else
2489 # define STORE_BRACKET_CHARPOS /* nothing */
2490 #endif
2492 #define PUSH_BPA_STACK \
2493 do { \
2494 int ch; \
2495 if (bpa_sp < MAX_BPA_STACK - 1) \
2497 bpa_sp++; \
2498 ch = CANONICAL_EQU (bidi_it->ch); \
2499 bpa_stack[bpa_sp].close_bracket_char = bidi_mirror_char (ch); \
2500 bpa_stack[bpa_sp].open_bracket_idx = bidi_cache_last_idx; \
2501 bpa_stack[bpa_sp].flags = 0; \
2502 STORE_BRACKET_CHARPOS; \
2504 } while (0)
2507 /* This function implements BPA, the Bidi Parenthesis Algorithm,
2508 described in BD16 and N0 of UAX#9. It finds all the bracket pairs
2509 in the current isolating sequence, and records the enclosed type
2510 and the position of the matching bracket in the cache. It returns
2511 non-zero if called with the iterator on the opening bracket which
2512 has a matching closing bracket in the current isolating sequence,
2513 zero otherwise. */
2514 static bool
2515 bidi_find_bracket_pairs (struct bidi_it *bidi_it)
2517 bidi_bracket_type_t btype;
2518 bidi_type_t type = bidi_it->type;
2519 bool retval = false;
2521 /* When scanning backwards, we don't expect any unresolved bidi
2522 bracket characters. */
2523 if (bidi_it->scan_dir != 1)
2524 emacs_abort ();
2526 btype = bidi_paired_bracket_type (bidi_it->ch);
2527 if (btype == BIDI_BRACKET_OPEN)
2529 bpa_stack_entry bpa_stack[MAX_BPA_STACK];
2530 int bpa_sp = -1;
2531 struct bidi_it saved_it;
2532 int base_level = bidi_it->level_stack[0].level;
2533 int embedding_level = bidi_it->level_stack[bidi_it->stack_idx].level;
2534 int maxlevel = embedding_level;
2535 bidi_type_t embedding_type = (embedding_level & 1) ? STRONG_R : STRONG_L;
2536 struct bidi_it tem_it;
2537 bool l2r_seen = false, r2l_seen = false;
2538 ptrdiff_t pairing_pos;
2539 int idx_at_entry = bidi_cache_idx;
2541 eassert (MAX_BPA_STACK >= 100);
2542 bidi_copy_it (&saved_it, bidi_it);
2543 /* bidi_cache_iterator_state refuses to cache on backward scans,
2544 and bidi_cache_fetch_state doesn't bring scan_dir from the
2545 cache, so we must initialize this explicitly. */
2546 tem_it.scan_dir = 1;
2548 while (1)
2550 int old_sidx, new_sidx;
2551 int current_level = bidi_it->level_stack[bidi_it->stack_idx].level;
2553 if (maxlevel < current_level)
2554 maxlevel = current_level;
2555 /* Mark every opening bracket character we've traversed by
2556 putting its own position into bracket_pairing_pos. This
2557 is examined in bidi_resolve_brackets to distinguish
2558 brackets that were already resolved to stay NEUTRAL_ON,
2559 and those that were not yet processed by this function
2560 (because they were skipped when we skip higher embedding
2561 levels below). */
2562 if (btype == BIDI_BRACKET_OPEN && bidi_it->bracket_pairing_pos == -1)
2563 bidi_it->bracket_pairing_pos = bidi_it->charpos;
2564 if (!bidi_cache_iterator_state (bidi_it, type == NEUTRAL_B, 0))
2566 /* No more space in cache -- give up and let the opening
2567 bracket that started this be processed as a
2568 NEUTRAL_ON. */
2569 bidi_cache_reset_to (idx_at_entry - bidi_cache_start);
2570 bidi_copy_it (bidi_it, &saved_it);
2571 goto give_up;
2573 if (btype == BIDI_BRACKET_OPEN)
2574 PUSH_BPA_STACK;
2575 else if (btype == BIDI_BRACKET_CLOSE)
2577 int sp = bpa_sp;
2578 int curchar = CANONICAL_EQU (bidi_it->ch);
2580 eassert (sp >= 0);
2581 while (sp >= 0 && bpa_stack[sp].close_bracket_char != curchar)
2582 sp--;
2583 if (sp >= 0)
2585 /* Update and cache the corresponding opening bracket. */
2586 bidi_cache_fetch_state (bpa_stack[sp].open_bracket_idx,
2587 &tem_it);
2588 #ifdef ENABLE_CHECKING
2589 eassert (bpa_stack[sp].open_bracket_pos == tem_it.charpos);
2590 #endif
2591 /* Determine the enclosed type for this bracket
2592 pair's type resolution according to N0. */
2593 if (bpa_stack[sp].flags & FLAG_EMBEDDING_INSIDE)
2594 tem_it.bracket_enclosed_type = embedding_type; /* N0b */
2595 else if (bpa_stack[sp].flags & FLAG_OPPOSITE_INSIDE)
2596 tem_it.bracket_enclosed_type /* N0c */
2597 = (embedding_type == STRONG_L ? STRONG_R : STRONG_L);
2598 else /* N0d */
2599 tem_it.bracket_enclosed_type = UNKNOWN_BT;
2601 /* Record the position of the matching closing
2602 bracket, and update the cache. */
2603 tem_it.bracket_pairing_pos = bidi_it->charpos;
2604 bidi_cache_iterator_state (&tem_it, 0, 1);
2606 /* Pop the BPA stack. */
2607 bpa_sp = sp - 1;
2609 if (bpa_sp < 0)
2611 retval = true;
2612 break;
2615 else if (bidi_get_category (bidi_it->type_after_wn) != NEUTRAL)
2617 unsigned flag = 0;
2618 int sp;
2620 /* Whenever we see a strong type, update the flags of
2621 all the slots on the stack. */
2622 switch (bidi_it->type)
2624 case STRONG_L:
2625 flag = ((embedding_level & 1) == 0
2626 ? FLAG_EMBEDDING_INSIDE
2627 : FLAG_OPPOSITE_INSIDE);
2628 l2r_seen = true;
2629 break;
2630 case STRONG_R:
2631 case WEAK_EN:
2632 case WEAK_AN:
2633 flag = ((embedding_level & 1) == 1
2634 ? FLAG_EMBEDDING_INSIDE
2635 : FLAG_OPPOSITE_INSIDE);
2636 r2l_seen = true;
2637 break;
2638 default:
2639 break;
2641 if (flag)
2643 for (sp = bpa_sp; sp >= 0; sp--)
2644 bpa_stack[sp].flags |= flag;
2647 old_sidx = bidi_it->stack_idx;
2648 type = bidi_resolve_weak (bidi_it);
2649 /* Skip level runs excluded from this isolating run sequence. */
2650 new_sidx = bidi_it->stack_idx;
2651 if (bidi_it->level_stack[new_sidx].level > current_level
2652 && (ISOLATE_STATUS (bidi_it, new_sidx)
2653 || (new_sidx > old_sidx + 1
2654 && ISOLATE_STATUS (bidi_it, new_sidx - 1))))
2656 while (bidi_it->level_stack[bidi_it->stack_idx].level
2657 > current_level)
2659 if (maxlevel < bidi_it->level_stack[bidi_it->stack_idx].level)
2660 maxlevel = bidi_it->level_stack[bidi_it->stack_idx].level;
2661 if (!bidi_cache_iterator_state (bidi_it,
2662 type == NEUTRAL_B, 0))
2664 /* No more space in cache -- give up and let the
2665 opening bracket that started this be
2666 processed as any other NEUTRAL_ON. */
2667 bidi_cache_reset_to (idx_at_entry - bidi_cache_start);
2668 bidi_copy_it (bidi_it, &saved_it);
2669 goto give_up;
2671 type = bidi_resolve_weak (bidi_it);
2674 if (type == NEUTRAL_B
2675 || (bidi_it->level_stack[bidi_it->stack_idx].level
2676 != current_level))
2678 /* We've marched all the way to the end of this
2679 isolating run sequence, and didn't find matching
2680 closing brackets for some opening brackets. Leave
2681 their type unchanged. */
2682 pairing_pos = bidi_it->charpos;
2683 break;
2685 if (bidi_it->type_after_wn == NEUTRAL_ON) /* Unicode 8.0 correction */
2686 btype = bidi_paired_bracket_type (bidi_it->ch);
2687 else
2688 btype = BIDI_BRACKET_NONE;
2691 /* Restore bidi_it from the cache, which should have the bracket
2692 resolution members set as determined by the above loop. */
2693 type = bidi_cache_find (saved_it.charpos, 0, bidi_it);
2694 eassert (type == NEUTRAL_ON);
2696 /* The following is an optimization for bracketed text that has
2697 only one level which is equal to the paragraph's base
2698 embedding level. That is, only L2R and weak/neutral
2699 characters in a L2R paragraph, or only R2L and weak/neutral
2700 characters in a R2L paragraph. Such brackets can be resolved
2701 by bidi_resolve_neutral, which has a further shortcut for
2702 this case. So we pretend we did not resolve the brackets in
2703 this case, set up next_for_neutral for the entire bracketed
2704 text, and reset the cache to the character before the opening
2705 bracket. The upshot is to allow bidi_move_to_visually_next
2706 reset the cache when it returns this opening bracket, thus
2707 cutting significantly on the size of the cache, which is
2708 important with long lines, especially if word-wrap is non-nil
2709 (which requires the display engine to copy the cache back and
2710 forth many times). */
2711 if (maxlevel == base_level
2712 && ((base_level == 0 && !r2l_seen)
2713 || (base_level == 1 && !l2r_seen)))
2715 ptrdiff_t eob
2716 = ((bidi_it->string.s || STRINGP (bidi_it->string.lstring))
2717 ? bidi_it->string.schars : ZV);
2719 if (retval)
2720 pairing_pos = bidi_it->bracket_pairing_pos;
2722 /* This special value (which cannot possibly happen when
2723 brackets are resolved, since there's no character at ZV)
2724 will be noticed by bidi_resolve_explicit, and will be
2725 copied to the following iterator states, instead of being
2726 reset to -1. */
2727 bidi_it->bracket_pairing_pos = eob;
2728 /* This type value will be used for resolving the outermost
2729 closing bracket in bidi_resolve_brackets. */
2730 bidi_it->bracket_enclosed_type = embedding_type;
2731 /* bidi_cache_last_idx is set to the index of the current
2732 state, because we just called bidi_cache_find above.
2733 That state describes the outermost opening bracket, the
2734 one with which we entered this function. Force the cache
2735 to "forget" all the cached states starting from that state. */
2736 bidi_cache_reset_to (bidi_cache_last_idx - bidi_cache_start);
2737 /* Set up the next_for_neutral member, to help
2738 bidi_resolve_neutral. */
2739 bidi_it->next_for_neutral.type = embedding_type;
2740 bidi_it->next_for_neutral.charpos = pairing_pos;
2741 /* Pretend we didn't resolve this bracket. */
2742 retval = false;
2746 give_up:
2747 return retval;
2750 static void
2751 bidi_record_type_for_neutral (struct bidi_saved_info *info, int level,
2752 bool nextp)
2754 int idx;
2756 for (idx = bidi_cache_last_idx + 1; idx < bidi_cache_idx; idx++)
2758 int lev = bidi_cache[idx].level_stack[bidi_cache[idx].stack_idx].level;
2760 if (lev <= level)
2762 eassert (lev == level);
2763 if (nextp)
2764 bidi_cache[idx].next_for_neutral = *info;
2765 else
2766 bidi_cache[idx].prev_for_neutral = *info;
2767 break;
2772 static bidi_type_t
2773 bidi_resolve_brackets (struct bidi_it *bidi_it)
2775 int prev_level = bidi_it->level_stack[bidi_it->stack_idx].level;
2776 bool resolve_bracket = false;
2777 bidi_type_t type = UNKNOWN_BT;
2778 int ch;
2779 struct bidi_saved_info prev_for_neutral, next_for_neutral;
2780 ptrdiff_t eob
2781 = ((bidi_it->string.s || STRINGP (bidi_it->string.lstring))
2782 ? bidi_it->string.schars : ZV);
2784 /* Record the prev_for_neutral type either from the previous
2785 character, if it was a strong or AN/EN, or from the
2786 prev_for_neutral information recorded previously. */
2787 if (bidi_it->type == STRONG_L || bidi_it->type == STRONG_R
2788 || bidi_it->type == WEAK_AN || bidi_it->type == WEAK_EN)
2789 bidi_remember_char (&prev_for_neutral, bidi_it, 1);
2790 else
2791 prev_for_neutral = bidi_it->prev_for_neutral;
2792 /* Record the next_for_neutral type information. */
2793 if (bidi_it->next_for_neutral.charpos > bidi_it->charpos)
2794 next_for_neutral = bidi_it->next_for_neutral;
2795 else
2796 next_for_neutral.charpos = -1;
2797 if (!bidi_it->first_elt)
2799 type = bidi_cache_find (bidi_it->charpos + bidi_it->nchars, 0, bidi_it);
2800 ch = bidi_it->ch;
2802 if (type == UNKNOWN_BT)
2804 type = bidi_resolve_weak (bidi_it);
2805 if (type == NEUTRAL_ON)
2807 /* bracket_pairing_pos == eob means this bracket does not
2808 need to be resolved as a bracket, but as a neutral, see
2809 the optimization trick we play near the end of
2810 bidi_find_bracket_pairs. */
2811 if (bidi_it->bracket_pairing_pos == eob)
2813 /* If this is the outermost closing bracket of a run of
2814 characters in which we decided to resolve brackets as
2815 neutrals, use the embedding level's type, recorded in
2816 bracket_enclosed_type, to resolve the bracket. */
2817 if (bidi_it->next_for_neutral.charpos == bidi_it->charpos
2818 && bidi_paired_bracket_type (bidi_it->ch) == BIDI_BRACKET_CLOSE)
2819 type = bidi_it->bracket_enclosed_type;
2821 else if (bidi_find_bracket_pairs (bidi_it))
2822 resolve_bracket = true;
2825 else if (bidi_it->bracket_pairing_pos != eob)
2827 eassert (bidi_it->resolved_level == -1);
2828 /* If the cached state shows an increase of embedding level due
2829 to an isolate initiator, we need to update the 1st cached
2830 state of the next run of the current isolating sequence with
2831 the prev_for_neutral and next_for_neutral information, so
2832 that it will be picked up when we advance to that next run. */
2833 if (bidi_it->level_stack[bidi_it->stack_idx].level > prev_level
2834 && ISOLATE_STATUS (bidi_it, bidi_it->stack_idx))
2836 bidi_record_type_for_neutral (&prev_for_neutral, prev_level, 0);
2837 bidi_record_type_for_neutral (&next_for_neutral, prev_level, 1);
2839 if (type == NEUTRAL_ON
2840 && bidi_paired_bracket_type (ch) == BIDI_BRACKET_OPEN)
2842 if (bidi_it->bracket_pairing_pos > bidi_it->charpos)
2844 /* A cached opening bracket that wasn't completely
2845 resolved yet. */
2846 resolve_bracket = true;
2848 else if (bidi_it->bracket_pairing_pos == -1)
2850 /* Higher levels were not BPA-resolved yet, even if
2851 cached by bidi_find_bracket_pairs. Force application
2852 of BPA to the new level now. */
2853 if (bidi_find_bracket_pairs (bidi_it))
2854 resolve_bracket = true;
2857 /* Keep track of the prev_for_neutral and next_for_neutral
2858 types, needed for resolving brackets below and for resolving
2859 neutrals in bidi_resolve_neutral. */
2860 if (bidi_it->level_stack[bidi_it->stack_idx].level == prev_level)
2862 bidi_it->prev_for_neutral = prev_for_neutral;
2863 if (next_for_neutral.charpos > 0)
2864 bidi_it->next_for_neutral = next_for_neutral;
2868 /* If needed, resolve the bracket type according to N0. */
2869 if (resolve_bracket)
2871 int embedding_level = bidi_it->level_stack[bidi_it->stack_idx].level;
2872 bidi_type_t embedding_type = (embedding_level & 1) ? STRONG_R : STRONG_L;
2874 eassert (bidi_it->prev_for_neutral.type != UNKNOWN_BT);
2875 eassert (bidi_it->bracket_pairing_pos > bidi_it->charpos);
2876 if (bidi_it->bracket_enclosed_type == embedding_type) /* N0b */
2877 type = embedding_type;
2878 else
2880 switch (bidi_it->prev_for_neutral.type)
2882 case STRONG_R:
2883 case WEAK_EN:
2884 case WEAK_AN:
2885 type =
2886 (bidi_it->bracket_enclosed_type == STRONG_R) /* N0c */
2887 ? STRONG_R /* N0c1 */
2888 : embedding_type; /* N0c2 */
2889 break;
2890 case STRONG_L:
2891 type =
2892 (bidi_it->bracket_enclosed_type == STRONG_L) /* N0c */
2893 ? STRONG_L /* N0c1 */
2894 : embedding_type; /* N0c2 */
2895 break;
2896 default:
2897 /* N0d: Do not set the type for that bracket pair. */
2898 break;
2901 eassert (type == STRONG_L || type == STRONG_R || type == NEUTRAL_ON);
2903 /* Update the type of the paired closing bracket to the same
2904 type as for the resolved opening bracket. */
2905 if (type != NEUTRAL_ON)
2907 ptrdiff_t idx = bidi_cache_search (bidi_it->bracket_pairing_pos,
2908 -1, 1);
2910 if (idx < bidi_cache_start)
2911 emacs_abort ();
2912 bidi_cache[idx].type = type;
2916 return type;
2919 static bidi_type_t
2920 bidi_resolve_neutral (struct bidi_it *bidi_it)
2922 bidi_type_t type = bidi_resolve_brackets (bidi_it);
2923 int current_level;
2924 bool is_neutral;
2926 eassert (type == STRONG_R
2927 || type == STRONG_L
2928 || type == WEAK_BN
2929 || type == WEAK_EN
2930 || type == WEAK_AN
2931 || type == NEUTRAL_B
2932 || type == NEUTRAL_S
2933 || type == NEUTRAL_WS
2934 || type == NEUTRAL_ON
2935 || type == LRI
2936 || type == RLI
2937 || type == PDI);
2939 current_level = bidi_it->level_stack[bidi_it->stack_idx].level;
2940 eassert (current_level >= 0);
2941 is_neutral = bidi_get_category (type) == NEUTRAL;
2943 if ((type != NEUTRAL_B /* Don't risk entering the long loop below if
2944 we are already at paragraph end. */
2945 && (is_neutral || bidi_isolate_fmt_char (type)))
2946 /* N1-N2/Retaining */
2947 || (type == WEAK_BN && bidi_explicit_dir_char (bidi_it->ch)))
2949 if (bidi_it->next_for_neutral.type != UNKNOWN_BT)
2951 /* Make sure the data for resolving neutrals we are
2952 about to use is valid. */
2953 eassert (bidi_it->next_for_neutral.charpos > bidi_it->charpos
2954 /* PDI defines an eos, so it's OK for it to
2955 serve as its own next_for_neutral. */
2956 || (bidi_it->next_for_neutral.charpos == bidi_it->charpos
2957 && bidi_it->type == PDI));
2958 type = bidi_resolve_neutral_1 (bidi_it->prev_for_neutral.type,
2959 bidi_it->next_for_neutral.type,
2960 current_level);
2962 /* The next two "else if" clauses are shortcuts for the
2963 important special case when we have a long sequence of
2964 neutral or WEAK_BN characters, such as whitespace or nulls or
2965 other control characters, on the base embedding level of the
2966 paragraph, and that sequence goes all the way to the end of
2967 the paragraph and follows a character whose resolved
2968 directionality is identical to the base embedding level.
2969 (This is what happens in a buffer with plain L2R text that
2970 happens to include long sequences of control characters.) By
2971 virtue of N1, the result of examining this long sequence will
2972 always be either STRONG_L or STRONG_R, depending on the base
2973 embedding level. So we use this fact directly instead of
2974 entering the expensive loop in the "else" clause. */
2975 else if (current_level == 0
2976 && bidi_it->prev_for_neutral.type == STRONG_L
2977 && !bidi_explicit_dir_char (bidi_it->ch)
2978 && !bidi_isolate_fmt_char (type))
2979 type = bidi_resolve_neutral_1 (bidi_it->prev_for_neutral.type,
2980 STRONG_L, current_level);
2981 else if (/* current level is 1 */
2982 current_level == 1
2983 /* base embedding level is also 1 */
2984 && bidi_it->level_stack[0].level == 1
2985 /* previous character is one of those considered R for
2986 the purposes of W5 */
2987 && (bidi_it->prev_for_neutral.type == STRONG_R
2988 || bidi_it->prev_for_neutral.type == WEAK_EN
2989 || bidi_it->prev_for_neutral.type == WEAK_AN)
2990 && !bidi_explicit_dir_char (bidi_it->ch)
2991 && !bidi_isolate_fmt_char (type))
2992 type = bidi_resolve_neutral_1 (bidi_it->prev_for_neutral.type,
2993 STRONG_R, current_level);
2994 else
2996 /* Arrrgh!! The UAX#9 algorithm is too deeply entrenched in
2997 the assumption of batch-style processing; see clauses W4,
2998 W5, and especially N1, which require to look far forward
2999 (as well as back) in the buffer/string. May the fleas of
3000 a thousand camels infest the armpits of those who design
3001 supposedly general-purpose algorithms by looking at their
3002 own implementations, and fail to consider other possible
3003 implementations! */
3004 struct bidi_it saved_it;
3005 bidi_type_t next_type;
3006 bool adjacent_to_neutrals = is_neutral;
3008 bidi_copy_it (&saved_it, bidi_it);
3009 /* Scan the text forward until we find the first non-neutral
3010 character, and then use that to resolve the neutral we
3011 are dealing with now. We also cache the scanned iterator
3012 states, to salvage some of the effort later. */
3013 do {
3014 int old_sidx, new_sidx;
3016 /* Paragraph separators have their levels fully resolved
3017 at this point, so cache them as resolved. */
3018 bidi_cache_iterator_state (bidi_it, type == NEUTRAL_B, 0);
3019 old_sidx = bidi_it->stack_idx;
3020 type = bidi_resolve_brackets (bidi_it);
3021 /* Skip level runs excluded from this isolating run sequence. */
3022 new_sidx = bidi_it->stack_idx;
3023 if (bidi_it->level_stack[new_sidx].level > current_level
3024 && (ISOLATE_STATUS (bidi_it, new_sidx)
3025 /* This is for when we have an isolate initiator
3026 immediately followed by an embedding or
3027 override initiator, in which case we get the
3028 level stack pushed twice by the single call to
3029 bidi_resolve_weak above. */
3030 || (new_sidx > old_sidx + 1
3031 && ISOLATE_STATUS (bidi_it, new_sidx - 1))))
3033 while (bidi_it->level_stack[bidi_it->stack_idx].level
3034 > current_level)
3036 bidi_cache_iterator_state (bidi_it, type == NEUTRAL_B, 0);
3037 type = bidi_resolve_brackets (bidi_it);
3040 if (!adjacent_to_neutrals
3041 && (bidi_get_category (type) == NEUTRAL
3042 || bidi_isolate_fmt_char (type)))
3043 adjacent_to_neutrals = true;
3044 } while (!(type == NEUTRAL_B
3045 || (type != WEAK_BN
3046 && bidi_get_category (type) != NEUTRAL
3047 && !bidi_isolate_fmt_char (type))
3048 /* This is all per level run, so stop when we
3049 reach the end of this level run. */
3050 || (bidi_it->level_stack[bidi_it->stack_idx].level
3051 != current_level)));
3053 /* Record the character we stopped at. */
3054 bidi_remember_char (&saved_it.next_for_neutral, bidi_it, 1);
3056 if ((bidi_it->level_stack[bidi_it->stack_idx].level != current_level)
3057 || type == NEUTRAL_B)
3059 /* Marched all the way to the end of this level run. We
3060 need to use the eos type, whose information is stored
3061 by bidi_set_sos_type in the prev_for_neutral
3062 member. */
3063 if (adjacent_to_neutrals)
3064 next_type = bidi_it->prev_for_neutral.type;
3065 else
3067 /* This is a BN which does not adjoin neutrals.
3068 Leave its type alone. */
3069 bidi_copy_it (bidi_it, &saved_it);
3070 return bidi_it->type;
3073 else
3075 switch (type)
3077 case STRONG_L:
3078 case STRONG_R:
3079 case STRONG_AL:
3080 /* Actually, STRONG_AL cannot happen here, because
3081 bidi_resolve_weak converts it to STRONG_R, per W3. */
3082 eassert (type != STRONG_AL);
3083 next_type = type;
3084 break;
3085 case WEAK_EN:
3086 case WEAK_AN:
3087 /* N1: "European and Arabic numbers act as if they
3088 were R in terms of their influence on NIs." */
3089 next_type = STRONG_R;
3090 break;
3091 default:
3092 emacs_abort ();
3093 break;
3096 /* Resolve the type of all the NIs found during the above loop. */
3097 type = bidi_resolve_neutral_1 (saved_it.prev_for_neutral.type,
3098 next_type, current_level);
3099 /* Update next_for_neutral with the resolved type, so we
3100 could use it for all the other NIs up to the place where
3101 we exited the loop. */
3102 saved_it.next_for_neutral.type = next_type;
3103 bidi_check_type (type);
3104 /* Update the character which caused us to enter the above loop. */
3105 saved_it.type = type;
3106 bidi_check_type (next_type);
3107 bidi_copy_it (bidi_it, &saved_it);
3110 return type;
3113 /* Given an iterator state in BIDI_IT, advance one character position
3114 in the buffer/string to the next character (in the logical order),
3115 resolve the bidi type of that next character, and return that
3116 type. */
3117 static bidi_type_t
3118 bidi_type_of_next_char (struct bidi_it *bidi_it)
3120 bidi_type_t type;
3122 /* This should always be called during a forward scan. */
3123 if (bidi_it->scan_dir != 1)
3124 emacs_abort ();
3126 type = bidi_resolve_neutral (bidi_it);
3128 return type;
3131 /* Given an iterator state BIDI_IT, advance one character position in
3132 the buffer/string to the next character (in the current scan
3133 direction), resolve the embedding and implicit levels of that next
3134 character, and return the resulting level. */
3135 static int
3136 bidi_level_of_next_char (struct bidi_it *bidi_it)
3138 bidi_type_t type = UNKNOWN_BT;
3139 int level;
3140 ptrdiff_t next_char_pos = -2;
3142 if (bidi_it->scan_dir == 1)
3144 ptrdiff_t eob
3145 = ((bidi_it->string.s || STRINGP (bidi_it->string.lstring))
3146 ? bidi_it->string.schars : ZV);
3148 /* There's no sense in trying to advance if we've already hit
3149 the end of text. */
3150 if (bidi_it->charpos >= eob)
3152 eassert (bidi_it->resolved_level >= 0);
3153 return bidi_it->resolved_level;
3157 /* Perhaps the character we want is already cached s fully resolved.
3158 If it is, the call to bidi_cache_find below will return a type
3159 other than UNKNOWN_BT. */
3160 if (bidi_cache_idx > bidi_cache_start && !bidi_it->first_elt)
3162 int bob = ((bidi_it->string.s || STRINGP (bidi_it->string.lstring))
3163 ? 0 : 1);
3165 if (bidi_it->scan_dir > 0)
3167 if (bidi_it->nchars <= 0)
3168 emacs_abort ();
3169 next_char_pos = bidi_it->charpos + bidi_it->nchars;
3171 else if (bidi_it->charpos >= bob)
3172 /* Implementation note: we allow next_char_pos to be as low as
3173 0 for buffers or -1 for strings, and that is okay because
3174 that's the "position" of the sentinel iterator state we
3175 cached at the beginning of the iteration. */
3176 next_char_pos = bidi_it->charpos - 1;
3177 if (next_char_pos >= bob - 1)
3178 type = bidi_cache_find (next_char_pos, 1, bidi_it);
3179 if (type != UNKNOWN_BT)
3181 /* We asked the cache for fully resolved states. */
3182 eassert (bidi_it->resolved_level >= 0);
3183 return bidi_it->resolved_level;
3187 if (bidi_it->scan_dir == -1)
3188 /* If we are going backwards, the iterator state is already cached
3189 from previous scans, and should be fully resolved. */
3190 emacs_abort ();
3192 if (type == UNKNOWN_BT)
3193 type = bidi_type_of_next_char (bidi_it);
3195 if (type == NEUTRAL_B)
3197 eassert (bidi_it->resolved_level >= 0);
3198 return bidi_it->resolved_level;
3201 level = bidi_it->level_stack[bidi_it->stack_idx].level;
3203 eassert ((type == STRONG_R
3204 || type == STRONG_L
3205 || type == WEAK_BN
3206 || type == WEAK_EN
3207 || type == WEAK_AN));
3208 bidi_it->type = type;
3209 bidi_check_type (bidi_it->type);
3211 /* For L1 below, we need to know, for each WS character, whether
3212 it belongs to a sequence of WS characters preceding a newline
3213 or a TAB or a paragraph separator. */
3214 if ((bidi_it->orig_type == NEUTRAL_WS
3215 || bidi_isolate_fmt_char (bidi_it->orig_type))
3216 && bidi_it->next_for_ws.charpos < bidi_it->charpos)
3218 int ch;
3219 ptrdiff_t clen = bidi_it->ch_len;
3220 ptrdiff_t bpos = bidi_it->bytepos;
3221 ptrdiff_t cpos = bidi_it->charpos;
3222 ptrdiff_t disp_pos = bidi_it->disp_pos;
3223 ptrdiff_t nc = bidi_it->nchars;
3224 struct bidi_string_data bs = bidi_it->string;
3225 bidi_type_t chtype;
3226 bool fwp = bidi_it->frame_window_p;
3227 int dpp = bidi_it->disp_prop;
3229 if (bidi_it->nchars <= 0)
3230 emacs_abort ();
3231 do {
3232 ch = bidi_fetch_char (cpos += nc, bpos += clen, &disp_pos, &dpp, &bs,
3233 bidi_it->w, fwp, &clen, &nc);
3234 chtype = bidi_get_type (ch, NEUTRAL_DIR);
3235 } while (chtype == NEUTRAL_WS || chtype == WEAK_BN
3236 || bidi_isolate_fmt_char (chtype)
3237 || bidi_explicit_dir_char (ch)); /* L1/Retaining */
3238 bidi_it->next_for_ws.type = chtype;
3239 bidi_check_type (bidi_it->next_for_ws.type);
3240 bidi_it->next_for_ws.charpos = cpos;
3243 /* Update the cache, but only if this state was already cached. */
3244 bidi_cache_iterator_state (bidi_it, 1, 1);
3246 /* Resolve implicit levels. */
3247 if (bidi_it->orig_type == NEUTRAL_B /* L1 */
3248 || bidi_it->orig_type == NEUTRAL_S
3249 || bidi_it->ch == '\n' || bidi_it->ch == BIDI_EOB
3250 || (bidi_it->orig_type == NEUTRAL_WS
3251 && (bidi_it->next_for_ws.type == NEUTRAL_B
3252 || bidi_it->next_for_ws.type == NEUTRAL_S)))
3253 level = bidi_it->level_stack[0].level;
3254 else if ((level & 1) == 0) /* I1 */
3256 if (type == STRONG_R)
3257 level++;
3258 else if (type == WEAK_EN || type == WEAK_AN)
3259 level += 2;
3261 else /* I2 */
3263 if (type == STRONG_L || type == WEAK_EN || type == WEAK_AN)
3264 level++;
3267 bidi_it->resolved_level = level;
3268 return level;
3271 /* Move to the other edge of a level given by LEVEL. If END_FLAG,
3272 we are at the end of a level, and we need to prepare to
3273 resume the scan of the lower level.
3275 If this level's other edge is cached, we simply jump to it, filling
3276 the iterator structure with the iterator state on the other edge.
3277 Otherwise, we walk the buffer or string until we come back to the
3278 same level as LEVEL.
3280 Note: we are not talking here about a ``level run'' in the UAX#9
3281 sense of the term, but rather about a ``level'' which includes
3282 all the levels higher than it. In other words, given the levels
3283 like this:
3285 11111112222222333333334443343222222111111112223322111
3286 A B C
3288 and assuming we are at point A scanning left to right, this
3289 function moves to point C, whereas the UAX#9 ``level 2 run'' ends
3290 at point B. */
3291 static void
3292 bidi_find_other_level_edge (struct bidi_it *bidi_it, int level, bool end_flag)
3294 int dir = end_flag ? -bidi_it->scan_dir : bidi_it->scan_dir;
3295 ptrdiff_t idx;
3297 /* Try the cache first. */
3298 if ((idx = bidi_cache_find_level_change (level, dir, end_flag))
3299 >= bidi_cache_start)
3300 bidi_cache_fetch_state (idx, bidi_it);
3301 else
3303 int new_level;
3305 /* If we are at end of level, its edges must be cached. */
3306 if (end_flag)
3307 emacs_abort ();
3309 if (!bidi_cache_iterator_state (bidi_it, 1, 0))
3311 /* Can't happen: if the cache needs to grow, it means we
3312 were at base embedding level, so the cache should have
3313 been either empty or already large enough to cover this
3314 character position. */
3315 emacs_abort ();
3317 do {
3318 new_level = bidi_level_of_next_char (bidi_it);
3319 /* If the cache is full, perform an emergency return by
3320 pretending that the level ended. */
3321 if (!bidi_cache_iterator_state (bidi_it, 1, 0))
3323 new_level = level - 1;
3324 /* Since the cache should only grow when we are scanning
3325 forward looking for the edge of the level that is one
3326 above the base embedding level, we can only have this
3327 contingency when LEVEL - 1 is the base embedding
3328 level. */
3329 eassert (new_level == bidi_it->level_stack[0].level);
3330 /* Plan B, for when the cache overflows: Back up to the
3331 previous character by fetching the last cached state,
3332 and force the resolved level of that character be the
3333 base embedding level. */
3334 bidi_cache_fetch_state (bidi_cache_idx - 1, bidi_it);
3335 bidi_it->resolved_level = new_level;
3336 bidi_cache_iterator_state (bidi_it, 1, 1);
3338 } while (new_level >= level);
3342 void
3343 bidi_move_to_visually_next (struct bidi_it *bidi_it)
3345 int old_level, new_level, next_level;
3346 struct bidi_it sentinel;
3347 struct gcpro gcpro1;
3349 if (bidi_it->charpos < 0 || bidi_it->bytepos < 0)
3350 emacs_abort ();
3352 if (bidi_it->scan_dir == 0)
3354 bidi_it->scan_dir = 1; /* default to logical order */
3357 /* The code below can call eval, and thus cause GC. If we are
3358 iterating a Lisp string, make sure it won't be GCed. */
3359 if (STRINGP (bidi_it->string.lstring))
3360 GCPRO1 (bidi_it->string.lstring);
3362 /* If we just passed a newline, initialize for the next line. */
3363 if (!bidi_it->first_elt
3364 && (bidi_it->ch == '\n' || bidi_it->ch == BIDI_EOB))
3365 bidi_line_init (bidi_it);
3367 /* Prepare the sentinel iterator state, and cache it. When we bump
3368 into it, scanning backwards, we'll know that the last non-base
3369 level is exhausted. */
3370 if (bidi_cache_idx == bidi_cache_start)
3372 bidi_copy_it (&sentinel, bidi_it);
3373 if (bidi_it->first_elt)
3375 sentinel.charpos--; /* cached charpos needs to be monotonic */
3376 sentinel.bytepos--;
3377 sentinel.ch = '\n'; /* doesn't matter, but why not? */
3378 sentinel.ch_len = 1;
3379 sentinel.nchars = 1;
3381 bidi_cache_iterator_state (&sentinel, 1, 0);
3384 old_level = bidi_it->resolved_level;
3385 new_level = bidi_level_of_next_char (bidi_it);
3387 /* Reordering of resolved levels (clause L2) is implemented by
3388 jumping to the other edge of the level and flipping direction of
3389 scanning the text whenever we find a level change. */
3390 if (new_level != old_level)
3392 bool ascending = new_level > old_level;
3393 int level_to_search = ascending ? old_level + 1 : old_level;
3394 int incr = ascending ? 1 : -1;
3395 int expected_next_level = old_level + incr;
3397 /* Jump (or walk) to the other edge of this level. */
3398 bidi_find_other_level_edge (bidi_it, level_to_search, !ascending);
3399 /* Switch scan direction and peek at the next character in the
3400 new direction. */
3401 bidi_it->scan_dir = -bidi_it->scan_dir;
3403 /* The following loop handles the case where the resolved level
3404 jumps by more than one. This is typical for numbers inside a
3405 run of text with left-to-right embedding direction, but can
3406 also happen in other situations. In those cases the decision
3407 where to continue after a level change, and in what direction,
3408 is tricky. For example, given a text like below:
3410 abcdefgh
3411 11336622
3413 (where the numbers below the text show the resolved levels),
3414 the result of reordering according to UAX#9 should be this:
3416 efdcghba
3418 This is implemented by the loop below which flips direction
3419 and jumps to the other edge of the level each time it finds
3420 the new level not to be the expected one. The expected level
3421 is always one more or one less than the previous one. */
3422 next_level = bidi_peek_at_next_level (bidi_it);
3423 while (next_level != expected_next_level)
3425 /* If next_level is -1, it means we have an unresolved level
3426 in the cache, which at this point should not happen. If
3427 it does, we will infloop. */
3428 eassert (next_level >= 0);
3429 /* If next_level is not consistent with incr, we might
3430 infloop. */
3431 eassert (incr > 0
3432 ? next_level > expected_next_level
3433 : next_level < expected_next_level);
3434 expected_next_level += incr;
3435 level_to_search += incr;
3436 bidi_find_other_level_edge (bidi_it, level_to_search, !ascending);
3437 bidi_it->scan_dir = -bidi_it->scan_dir;
3438 next_level = bidi_peek_at_next_level (bidi_it);
3441 /* Finally, deliver the next character in the new direction. */
3442 next_level = bidi_level_of_next_char (bidi_it);
3445 /* Take note when we have just processed the newline that precedes
3446 the end of the paragraph. The next time we are about to be
3447 called, set_iterator_to_next will automatically reinit the
3448 paragraph direction, if needed. We do this at the newline before
3449 the paragraph separator, because the next character might not be
3450 the first character of the next paragraph, due to the bidi
3451 reordering, whereas we _must_ know the paragraph base direction
3452 _before_ we process the paragraph's text, since the base
3453 direction affects the reordering. */
3454 if (bidi_it->scan_dir == 1
3455 && (bidi_it->ch == '\n' || bidi_it->ch == BIDI_EOB))
3457 /* The paragraph direction of the entire string, once
3458 determined, is in effect for the entire string. Setting the
3459 separator limit to the end of the string prevents
3460 bidi_paragraph_init from being called automatically on this
3461 string. */
3462 if (bidi_it->string.s || STRINGP (bidi_it->string.lstring))
3463 bidi_it->separator_limit = bidi_it->string.schars;
3464 else if (bidi_it->bytepos < ZV_BYTE)
3466 ptrdiff_t sep_len
3467 = bidi_at_paragraph_end (bidi_it->charpos + bidi_it->nchars,
3468 bidi_it->bytepos + bidi_it->ch_len);
3469 if (bidi_it->nchars <= 0)
3470 emacs_abort ();
3471 if (sep_len >= 0)
3473 bidi_it->new_paragraph = 1;
3474 /* Record the buffer position of the last character of the
3475 paragraph separator. */
3476 bidi_it->separator_limit
3477 = bidi_it->charpos + bidi_it->nchars + sep_len;
3482 if (bidi_it->scan_dir == 1 && bidi_cache_idx > bidi_cache_start)
3484 /* If we are at paragraph's base embedding level and beyond the
3485 last cached position, the cache's job is done and we can
3486 discard it. */
3487 if (bidi_it->resolved_level == bidi_it->level_stack[0].level
3488 && bidi_it->charpos > (bidi_cache[bidi_cache_idx - 1].charpos
3489 + bidi_cache[bidi_cache_idx - 1].nchars - 1))
3490 bidi_cache_reset ();
3491 /* Also reset the cache if it overflowed and we have just
3492 emergency-exited using Plan B. */
3493 else if (bidi_it->resolved_level == bidi_it->level_stack[0].level
3494 && bidi_cache_idx >= bidi_cache_size
3495 && bidi_it->charpos == bidi_cache[bidi_cache_idx - 1].charpos)
3496 bidi_cache_reset ();
3497 /* But as long as we are caching during forward scan, we must
3498 cache each state, or else the cache integrity will be
3499 compromised: it assumes cached states correspond to buffer
3500 positions 1:1. */
3501 else
3502 bidi_cache_iterator_state (bidi_it, 1, 0);
3505 eassert (bidi_it->resolved_level >= 0
3506 && bidi_it->resolved_level <= BIDI_MAXDEPTH + 2);
3508 if (STRINGP (bidi_it->string.lstring))
3509 UNGCPRO;
3512 /* Utility function for looking for strong directional characters
3513 whose bidi type was overridden by a directional override. */
3514 ptrdiff_t
3515 bidi_find_first_overridden (struct bidi_it *bidi_it)
3517 ptrdiff_t found_pos = ZV;
3521 /* Need to call bidi_resolve_weak, not bidi_resolve_explicit,
3522 because the directional overrides are applied by the
3523 former. */
3524 bidi_type_t type = bidi_resolve_weak (bidi_it);
3526 if ((type == STRONG_R && bidi_it->orig_type == STRONG_L)
3527 || (type == STRONG_L
3528 && (bidi_it->orig_type == STRONG_R
3529 || bidi_it->orig_type == STRONG_AL)))
3530 found_pos = bidi_it->charpos;
3531 } while (found_pos == ZV
3532 && bidi_it->charpos < ZV
3533 && bidi_it->ch != BIDI_EOB
3534 && bidi_it->ch != '\n');
3536 return found_pos;
3539 /* This is meant to be called from within the debugger, whenever you
3540 wish to examine the cache contents. */
3541 void bidi_dump_cached_states (void) EXTERNALLY_VISIBLE;
3542 void
3543 bidi_dump_cached_states (void)
3545 ptrdiff_t i;
3546 int ndigits = 1;
3548 if (bidi_cache_idx == 0)
3550 fprintf (stderr, "The cache is empty.\n");
3551 return;
3553 fprintf (stderr, "Total of %"pD"d state%s in cache:\n",
3554 bidi_cache_idx, bidi_cache_idx == 1 ? "" : "s");
3556 for (i = bidi_cache[bidi_cache_idx - 1].charpos; i > 0; i /= 10)
3557 ndigits++;
3558 fputs ("ch ", stderr);
3559 for (i = 0; i < bidi_cache_idx; i++)
3560 fprintf (stderr, "%*c", ndigits, bidi_cache[i].ch);
3561 fputs ("\n", stderr);
3562 fputs ("lvl ", stderr);
3563 for (i = 0; i < bidi_cache_idx; i++)
3564 fprintf (stderr, "%*d", ndigits, bidi_cache[i].resolved_level);
3565 fputs ("\n", stderr);
3566 fputs ("pos ", stderr);
3567 for (i = 0; i < bidi_cache_idx; i++)
3568 fprintf (stderr, "%*"pD"d", ndigits, bidi_cache[i].charpos);
3569 fputs ("\n", stderr);