Fixes for Emacs manual in frames.texi
[emacs.git] / src / bidi.c
blob1f05a1f7d51daf9c27103269c9e1d35e75a8459d
1 /* Low-level bidirectional buffer/string-scanning functions for GNU Emacs.
2 Copyright (C) 2000-2001, 2004-2005, 2009-2018 Free Software
3 Foundation, Inc.
5 This file is part of GNU Emacs.
7 GNU Emacs is free software: you can redistribute it and/or modify
8 it under the terms of the GNU General Public License as published by
9 the Free Software Foundation, either version 3 of the License, or (at
10 your option) any later version.
12 GNU Emacs is distributed in the hope that it will be useful,
13 but WITHOUT ANY WARRANTY; without even the implied warranty of
14 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 GNU General Public License for more details.
17 You should have received a copy of the GNU General Public License
18 along with GNU Emacs. If not, see <https://www.gnu.org/licenses/>. */
20 /* Written by Eli Zaretskii <eliz@gnu.org>.
22 A sequential implementation of the Unicode Bidirectional algorithm,
23 (UBA) as per UAX#9, a part of the Unicode Standard.
25 Unlike the Reference Implementation and most other implementations,
26 this one is designed to be called once for every character in the
27 buffer or string. That way, we can leave intact the design of the
28 Emacs display engine, whereby an iterator object is used to
29 traverse buffer or string text character by character, and generate
30 the necessary data for displaying each character in 'struct glyph'
31 objects. (See xdisp.c for the details of that iteration.) The
32 functions on this file replace the original linear iteration in the
33 logical order of the text with a non-linear iteration in the visual
34 order, i.e. in the order characters should be shown on display.
36 The main entry point is bidi_move_to_visually_next. Each time it
37 is called, it finds the next character in the visual order, and
38 returns its information in a special structure. The caller is then
39 expected to process this character for display or any other
40 purposes, and call bidi_move_to_visually_next for the next
41 character. See the comments in bidi_move_to_visually_next for more
42 details about its algorithm that finds the next visual-order
43 character by resolving their levels on the fly.
45 Two other entry points are bidi_paragraph_init and
46 bidi_mirror_char. The first determines the base direction of a
47 paragraph, while the second returns the mirrored version of its
48 argument character.
50 A few auxiliary entry points are used to initialize the bidi
51 iterator for iterating an object (buffer or string), push and pop
52 the bidi iterator state, and save and restore the state of the bidi
53 cache.
55 If you want to understand the code, you will have to read it
56 together with the relevant portions of UAX#9. The comments include
57 references to UAX#9 rules, for that very reason.
59 A note about references to UAX#9 rules: if the reference says
60 something like "X9/Retaining", it means that you need to refer to
61 rule X9 and to its modifications described in the "Implementation
62 Notes" section of UAX#9, under "Retaining Format Codes".
64 Here's the overview of the design of the reordering engine
65 implemented by this file.
67 Basic implementation structure
68 ------------------------------
70 The sequential processing steps described by UAX#9 are implemented
71 as recursive levels of processing, all of which examine the next
72 character in the logical order. This hierarchy of processing looks
73 as follows, from the innermost (deepest) to the outermost level,
74 omitting some subroutines used by each level:
76 bidi_fetch_char -- fetch next character
77 bidi_resolve_explicit -- resolve explicit levels and directions
78 bidi_resolve_weak -- resolve weak types
79 bidi_resolve_brackets -- resolve "paired brackets" neutral types
80 bidi_resolve_neutral -- resolve neutral types
81 bidi_level_of_next_char -- resolve implicit levels
83 Each level calls the level below it, and works on the result
84 returned by the lower level, including all of its sub-levels.
86 Unlike all the levels below it, bidi_level_of_next_char can return
87 the information about either the next or previous character in the
88 logical order, depending on the current direction of scanning the
89 buffer or string. For the next character, it calls all the levels
90 below it; for the previous character, it uses the cache, described
91 below.
93 Thus, the result of calling bidi_level_of_next_char is the resolved
94 level of the next or the previous character in the logical order.
95 Based on this information, the function bidi_move_to_visually_next
96 finds the next character in the visual order and updates the
97 direction in which the buffer is scanned, either forward or
98 backward, to find the next character to be displayed. (Text is
99 scanned backwards when it needs to be reversed for display, i.e. if
100 the visual order is the inverse of the logical order.) This
101 implements the last, reordering steps of the UBA, by successively
102 calling bidi_level_of_next_char until the character of the required
103 embedding level is found; the scan direction is dynamically updated
104 as a side effect. See the commentary before the 'while' loop in
105 bidi_move_to_visually_next, for the details.
107 Fetching characters
108 -------------------
110 In a nutshell, fetching the next character boils down to calling
111 STRING_CHAR_AND_LENGTH, passing it the address of a buffer or
112 string position. See bidi_fetch_char. However, if the next
113 character is "covered" by a display property of some kind,
114 bidi_fetch_char returns the u+FFFC "object replacement character"
115 that represents the entire run of text covered by the display
116 property. (The ch_len and nchars members of 'struct bidi_it'
117 reflect the length in bytes and characters of that text.) This is
118 so we reorder text on both sides of the display property as
119 appropriate for an image or embedded string. Similarly, text
120 covered by a display spec of the form '(space ...)', is replaced
121 with the u+2029 paragraph separator character, so such display
122 specs produce the same effect as a TAB under UBA. Both these
123 special characters are not actually displayed -- the display
124 property is displayed instead -- but just used to compute the
125 embedding level of the surrounding text so as to produce the
126 required effect.
128 Bidi iterator states
129 --------------------
131 The UBA is highly context dependent in some of its parts,
132 i.e. results of processing a character can generally depend on
133 characters very far away. The UAX#9 description of the UBA
134 prescribes a stateful processing of each character, whereby the
135 results of this processing depend on various state variables, such
136 as the current embedding level, level stack, and directional
137 override status. In addition, the UAX#9 description includes many
138 passages like this (from rule W2 in this case):
140 Search backward from each instance of a European number until the
141 first strong type (R, L, AL, or sos) is found. If an AL is found,
142 change the type of the European number to Arabic number.
144 To support this, we use a bidi iterator object, 'struct bidi_it',
145 which is a sub-structure of 'struct it' used by xdisp.c (see
146 dispextern.h for the definition of both of these structures). The
147 bidi iterator holds the entire state of the iteration required by
148 the UBA, and is updated as the text is traversed. In particular,
149 the embedding level of the current character being resolved is
150 recorded in the iterator state. To avoid costly searches backward
151 in support of rules like W2 above, the necessary character types
152 are also recorded in the iterator state as they are found during
153 the forward scan, and then used when such rules need to be applied.
154 (Forward scans cannot be avoided in this way; they need to be
155 performed at least once, and the results recorded in the iterator
156 state, to be reused until the forward scan oversteps the recorded
157 position.)
159 In this manner, the iterator state acts as a mini-cache of
160 contextual information required for resolving the level of the
161 current character by various UBA rules.
163 Caching of bidi iterator states
164 -------------------------------
166 As described above, the reordering engine uses the information
167 recorded in the bidi iterator state in order to resolve the
168 embedding level of the current character. When the reordering
169 engine needs to process the next character in the logical order, it
170 fetches it and applies to it all the UBA levels, updating the
171 iterator state as it goes. But when the buffer or string is
172 scanned backwards, i.e. in the reverse order of buffer/string
173 positions, the scanned characters were already processed during the
174 preceding forward scan (see bidi_find_other_level_edge). To avoid
175 costly re-processing of characters that were already processed
176 during the forward scan, the iterator states computed while
177 scanning forward are cached.
179 The cache is just a linear array of 'struct bidi_it' objects, which
180 is dynamically allocated and reallocated as needed, since the size
181 of the cache depends on the text being processed. We only need the
182 cache while processing embedded levels higher than the base
183 paragraph embedding level, because these higher levels require
184 changes in scan direction. Therefore, as soon as we are back to
185 the base embedding level, we can free the cache; see the calls to
186 bidi_cache_reset and bidi_cache_shrink, for the conditions to do
187 this.
189 The cache maintains the index of the next unused cache slot -- this
190 is where the next iterator state will be cached. The function
191 bidi_cache_iterator_state saves an instance of the state in the
192 cache and increments the unused slot index. The companion function
193 bidi_cache_find looks up a cached state that corresponds to a given
194 buffer/string position. All of the cached states must correspond
195 1:1 to the buffer or string region whose processing they reflect;
196 bidi.c will abort if it finds cache slots that violate this 1:1
197 correspondence.
199 When the parent iterator 'struct it' is pushed (see push_it in
200 xdisp.c) to pause the current iteration and start iterating over a
201 different object (e.g., a 'display' string that covers some buffer
202 text), the bidi iterator cache needs to be "pushed" as well, so
203 that a new empty cache could be used while iterating over the new
204 object. Later, when the new object is exhausted, and xdisp.c calls
205 pop_it, we need to "pop" the bidi cache as well and return to the
206 original cache. See bidi_push_it and bidi_pop_it for how this is
207 done.
209 Some functions of the display engine save copies of 'struct it' in
210 local variables, and restore them later. For examples, see
211 pos_visible_p and move_it_in_display_line_to in xdisp.c, and
212 window_scroll_pixel_based in window.c. When this happens, we need
213 to save and restore the bidi cache as well, because conceptually
214 the cache is part of the 'struct it' state, and needs to be in
215 perfect sync with the portion of the buffer/string that is being
216 processed. This saving and restoring of the cache state is handled
217 by bidi_shelve_cache and bidi_unshelve_cache, and the helper macros
218 SAVE_IT and RESTORE_IT defined on xdisp.c.
220 Note that, because reordering is implemented below the level in
221 xdisp.c that breaks glyphs into screen lines, we are violating
222 paragraph 3.4 of UAX#9. which mandates that line breaking shall be
223 done before reordering each screen line separately. However,
224 following UAX#9 to the letter in this matter goes against the basic
225 design of the Emacs display engine, and so we choose here this
226 minor deviation from the UBA letter in preference to redesign of
227 the display engine. The effect of this is only seen in continued
228 lines that are broken into screen lines in the middle of a run
229 whose direction is opposite to the paragraph's base direction.
231 Important design and implementation note: when the code needs to
232 scan far ahead, be sure to avoid such scans as much as possible
233 when the buffer/string doesn't contain any RTL characters. Users
234 of left-to-right scripts will never forgive you if you introduce
235 some slow-down due to bidi in situations that don't involve any
236 bidirectional text. See the large comment near the beginning of
237 bidi_resolve_neutral, for one situation where such shortcut was
238 necessary. */
240 #include <config.h>
241 #include <stdio.h>
243 #include "lisp.h"
244 #include "character.h"
245 #include "buffer.h"
246 #include "dispextern.h"
247 #include "region-cache.h"
249 static bool bidi_initialized = 0;
251 static Lisp_Object bidi_type_table, bidi_mirror_table, bidi_brackets_table;
253 #define BIDI_EOB (-1)
255 /* Data type for describing the bidirectional character categories. */
256 typedef enum {
257 UNKNOWN_BC,
258 NEUTRAL,
259 WEAK,
260 STRONG,
261 EXPLICIT_FORMATTING
262 } bidi_category_t;
264 static Lisp_Object paragraph_start_re, paragraph_separate_re;
267 /***********************************************************************
268 Utilities
269 ***********************************************************************/
271 /* Return the bidi type of a character CH, subject to the current
272 directional OVERRIDE. */
273 static bidi_type_t
274 bidi_get_type (int ch, bidi_dir_t override)
276 bidi_type_t default_type;
278 if (ch == BIDI_EOB)
279 return NEUTRAL_B;
280 if (ch < 0 || ch > MAX_CHAR)
281 emacs_abort ();
283 default_type = (bidi_type_t) XINT (CHAR_TABLE_REF (bidi_type_table, ch));
284 /* Every valid character code, even those that are unassigned by the
285 UCD, have some bidi-class property, according to
286 DerivedBidiClass.txt file. Therefore, if we ever get UNKNOWN_BT
287 (= zero) code from CHAR_TABLE_REF, that's a bug. */
288 if (default_type == UNKNOWN_BT)
289 emacs_abort ();
291 switch (default_type)
293 case WEAK_BN:
294 case NEUTRAL_B:
295 case LRE:
296 case LRO:
297 case RLE:
298 case RLO:
299 case PDF:
300 case LRI:
301 case RLI:
302 case FSI:
303 case PDI:
304 return default_type;
305 default:
306 if (override == L2R)
307 return STRONG_L;
308 else if (override == R2L)
309 return STRONG_R;
310 else
311 return default_type;
315 static void
316 bidi_check_type (bidi_type_t type)
318 eassert (UNKNOWN_BT <= type && type <= NEUTRAL_ON);
321 /* Given a bidi TYPE of a character, return its category. */
322 static bidi_category_t
323 bidi_get_category (bidi_type_t type)
325 switch (type)
327 case UNKNOWN_BT:
328 return UNKNOWN_BC;
329 case STRONG_L:
330 case STRONG_R:
331 case STRONG_AL:
332 return STRONG;
333 case WEAK_EN:
334 case WEAK_ES:
335 case WEAK_ET:
336 case WEAK_AN:
337 case WEAK_CS:
338 case WEAK_NSM:
339 case WEAK_BN:
340 return WEAK;
341 case NEUTRAL_B:
342 case NEUTRAL_S:
343 case NEUTRAL_WS:
344 case NEUTRAL_ON:
345 return NEUTRAL;
346 case LRE:
347 case LRO:
348 case RLE:
349 case RLO:
350 case PDF:
351 case LRI:
352 case RLI:
353 case FSI:
354 case PDI:
355 return EXPLICIT_FORMATTING;
356 default:
357 emacs_abort ();
361 static bool
362 bidi_isolate_fmt_char (bidi_type_t ch_type)
364 return (ch_type == LRI || ch_type == RLI || ch_type == PDI || ch_type == FSI);
367 /* Return the mirrored character of C, if it has one. If C has no
368 mirrored counterpart, return C.
369 Note: The conditions in UAX#9 clause L4 regarding the surrounding
370 context must be tested by the caller. */
372 bidi_mirror_char (int c)
374 Lisp_Object val;
376 if (c == BIDI_EOB)
377 return c;
378 if (c < 0 || c > MAX_CHAR)
379 emacs_abort ();
381 val = CHAR_TABLE_REF (bidi_mirror_table, c);
382 if (INTEGERP (val))
384 int v;
386 /* When debugging, check before assigning to V, so that the check
387 isn't broken by undefined behavior due to int overflow. */
388 eassert (CHAR_VALID_P (XINT (val)));
390 v = XINT (val);
392 /* Minimal test we must do in optimized builds, to prevent weird
393 crashes further down the road. */
394 if (v < 0 || v > MAX_CHAR)
395 emacs_abort ();
397 return v;
400 return c;
403 /* Return the Bidi_Paired_Bracket_Type property of the character C. */
404 static bidi_bracket_type_t
405 bidi_paired_bracket_type (int c)
407 if (c == BIDI_EOB)
408 return BIDI_BRACKET_NONE;
409 if (c < 0 || c > MAX_CHAR)
410 emacs_abort ();
412 return (bidi_bracket_type_t) XINT (CHAR_TABLE_REF (bidi_brackets_table, c));
415 /* Determine the start-of-sequence (sos) directional type given the two
416 embedding levels on either side of the run boundary. Also, update
417 the saved info about previously seen characters, since that info is
418 generally valid for a single level run. */
419 static void
420 bidi_set_sos_type (struct bidi_it *bidi_it, int level_before, int level_after)
422 int higher_level = (level_before > level_after ? level_before : level_after);
424 /* FIXME: should the default sos direction be user selectable? */
425 bidi_it->sos = ((higher_level & 1) != 0 ? R2L : L2R); /* X10 */
427 bidi_it->prev.type = UNKNOWN_BT;
428 bidi_it->last_strong.type = bidi_it->last_strong.orig_type = UNKNOWN_BT;
429 bidi_it->prev_for_neutral.type = (bidi_it->sos == R2L ? STRONG_R : STRONG_L);
430 bidi_it->prev_for_neutral.charpos = bidi_it->charpos;
431 bidi_it->next_for_neutral.type
432 = bidi_it->next_for_neutral.orig_type = UNKNOWN_BT;
435 #define ISOLATE_STATUS(BIDI_IT, IDX) ((BIDI_IT)->level_stack[IDX].flags & 1)
436 #define OVERRIDE(BIDI_IT, IDX) (((BIDI_IT)->level_stack[IDX].flags >> 1) & 3)
438 /* Push the current embedding level and override status; reset the
439 current level to LEVEL and the current override status to OVERRIDE. */
440 static void
441 bidi_push_embedding_level (struct bidi_it *bidi_it,
442 int level, bidi_dir_t override, bool isolate_status)
444 struct bidi_stack *st;
445 int prev_level = bidi_it->level_stack[bidi_it->stack_idx].level;
447 bidi_it->stack_idx++;
448 eassert (bidi_it->stack_idx < BIDI_MAXDEPTH+2+1);
449 st = &bidi_it->level_stack[bidi_it->stack_idx];
450 eassert (level <= (1 << 7));
451 st->level = level;
452 st->flags = (((override & 3) << 1) | (isolate_status != 0));
453 if (isolate_status)
455 st->last_strong_type = bidi_it->last_strong.type;
456 st->prev_for_neutral_type = bidi_it->prev_for_neutral.type;
457 st->next_for_neutral_type = bidi_it->next_for_neutral.type;
458 st->next_for_neutral_pos = bidi_it->next_for_neutral.charpos;
459 st->flags |= ((bidi_it->sos == L2R ? 0 : 1) << 3);
461 /* We've got a new isolating sequence, compute the directional type
462 of sos and initialize per-sequence variables (UAX#9, clause X10). */
463 bidi_set_sos_type (bidi_it, prev_level, level);
466 /* Pop from the stack the embedding level, the directional override
467 status, and optionally saved information for the isolating run
468 sequence. Return the new level. */
469 static int
470 bidi_pop_embedding_level (struct bidi_it *bidi_it)
472 int level;
474 /* UAX#9 says to ignore invalid PDFs (X7, last bullet)
475 and PDIs (X6a, 2nd bullet). */
476 if (bidi_it->stack_idx > 0)
478 bool isolate_status = ISOLATE_STATUS (bidi_it, bidi_it->stack_idx);
479 int old_level = bidi_it->level_stack[bidi_it->stack_idx].level;
481 struct bidi_stack st;
483 st = bidi_it->level_stack[bidi_it->stack_idx];
484 if (isolate_status)
486 bidi_dir_t sos = ((st.flags >> 3) & 1);
487 /* PREV is used in W1 for resolving WEAK_NSM. By the time
488 we get to an NSM, we must have gotten past at least one
489 character: the PDI that ends the isolate from which we
490 are popping here. So PREV will have been filled up by
491 the time we first use it. We initialize it here to
492 UNKNOWN_BT to be able to catch any blunders in this
493 logic. */
494 bidi_it->prev.orig_type = bidi_it->prev.type = UNKNOWN_BT;
495 bidi_it->last_strong.type = st.last_strong_type;
496 bidi_it->prev_for_neutral.type = st.prev_for_neutral_type;
497 bidi_it->next_for_neutral.type = st.next_for_neutral_type;
498 bidi_it->next_for_neutral.charpos = st.next_for_neutral_pos;
499 bidi_it->sos = (sos == 0 ? L2R : R2L);
501 else
502 bidi_set_sos_type (bidi_it, old_level,
503 bidi_it->level_stack[bidi_it->stack_idx - 1].level);
505 bidi_it->stack_idx--;
507 level = bidi_it->level_stack[bidi_it->stack_idx].level;
508 eassert (0 <= level && level <= BIDI_MAXDEPTH + 1);
509 return level;
512 /* Record in SAVED_INFO the information about the current character. */
513 static void
514 bidi_remember_char (struct bidi_saved_info *saved_info,
515 struct bidi_it *bidi_it, bool from_type)
517 saved_info->charpos = bidi_it->charpos;
518 if (from_type)
519 saved_info->type = bidi_it->type;
520 else
521 saved_info->type = bidi_it->type_after_wn;
522 bidi_check_type (saved_info->type);
523 saved_info->orig_type = bidi_it->orig_type;
524 bidi_check_type (saved_info->orig_type);
527 /* Copy the bidi iterator from FROM to TO. To save cycles, this only
528 copies the part of the level stack that is actually in use. */
529 static void
530 bidi_copy_it (struct bidi_it *to, struct bidi_it *from)
532 /* Copy everything from the start through the active part of
533 the level stack. */
534 memcpy (to, from,
535 (offsetof (struct bidi_it, level_stack) + sizeof from->level_stack[0]
536 + from->stack_idx * sizeof from->level_stack[0]));
540 /***********************************************************************
541 Caching the bidi iterator states
542 ***********************************************************************/
544 /* We allocate and de-allocate the cache in chunks of this size (in
545 characters). 200 was chosen as an upper limit for reasonably-long
546 lines in a text file/buffer. */
547 #define BIDI_CACHE_CHUNK 200
548 /* Maximum size we allow the cache to become, per iterator stack slot,
549 in units of struct bidi_it size. If we allow unlimited growth, we
550 could run out of memory for pathologically long bracketed text or
551 very long text lines that need to be reordered. This is aggravated
552 when word-wrap is in effect, since then functions display_line and
553 move_it_in_display_line_to need to keep up to 4 copies of the
554 cache.
556 This limitation means there can be no more than that amount of
557 contiguous RTL text on any single physical line in a LTR paragraph,
558 and similarly with contiguous LTR + numeric text in a RTL
559 paragraph. (LTR text in a LTR paragraph and RTL text in a RTL
560 paragraph are not reordered, and so don't need the cache, and
561 cannot hit this limit.) More importantly, no single line can have
562 text longer than this inside paired brackets (because bracket pairs
563 resolution uses the cache). If the limit is exceeded, the fallback
564 code will produce visual order that will be incorrect if there are
565 RTL characters in the offending line of text. */
566 /* Do we need to allow customization of this limit? */
567 #define BIDI_CACHE_MAX_ELTS_PER_SLOT 50000
568 verify (BIDI_CACHE_CHUNK < BIDI_CACHE_MAX_ELTS_PER_SLOT);
569 static ptrdiff_t bidi_cache_max_elts = BIDI_CACHE_MAX_ELTS_PER_SLOT;
570 static struct bidi_it *bidi_cache;
571 static ptrdiff_t bidi_cache_size = 0;
572 enum { elsz = sizeof (struct bidi_it) };
573 static ptrdiff_t bidi_cache_idx; /* next unused cache slot */
574 static ptrdiff_t bidi_cache_last_idx; /* slot of last cache hit */
575 static ptrdiff_t bidi_cache_start = 0; /* start of cache for this
576 "stack" level */
578 /* 5-slot stack for saving the start of the previous level of the
579 cache. xdisp.c maintains a 5-slot stack for its iterator state,
580 and we need the same size of our stack. */
581 static ptrdiff_t bidi_cache_start_stack[IT_STACK_SIZE];
582 static int bidi_cache_sp;
584 /* Size of header used by bidi_shelve_cache. */
585 enum
587 bidi_shelve_header_size
588 = (sizeof (bidi_cache_idx) + sizeof (bidi_cache_start_stack)
589 + sizeof (bidi_cache_sp) + sizeof (bidi_cache_start)
590 + sizeof (bidi_cache_last_idx) + sizeof (bidi_cache_max_elts))
593 /* Effectively remove the cached states beyond the Nth state from the
594 part of the cache relevant to iteration of the current object
595 (buffer or string). */
596 static void
597 bidi_cache_reset_to (int n)
599 bidi_cache_idx = bidi_cache_start + n;
600 bidi_cache_last_idx = -1;
603 /* Reset the cache state to the empty state. We only reset the part
604 of the cache relevant to iteration of the current object. Previous
605 objects, which are pushed on the display iterator's stack, are left
606 intact. This is called when the cached information is no more
607 useful for the current iteration, e.g. when we were reseated to a
608 new position on the same object. */
609 static void
610 bidi_cache_reset (void)
612 bidi_cache_reset_to (0);
615 /* Shrink the cache to its minimal size. Called when we init the bidi
616 iterator for reordering a buffer or a string that does not come
617 from display properties, because that means all the previously
618 cached info is of no further use. */
619 static void
620 bidi_cache_shrink (void)
622 if (bidi_cache_size > BIDI_CACHE_CHUNK)
624 bidi_cache = xrealloc (bidi_cache, BIDI_CACHE_CHUNK * elsz);
625 bidi_cache_size = BIDI_CACHE_CHUNK;
627 bidi_cache_reset ();
628 bidi_cache_max_elts = BIDI_CACHE_MAX_ELTS_PER_SLOT;
631 static void
632 bidi_cache_fetch_state (ptrdiff_t idx, struct bidi_it *bidi_it)
634 int current_scan_dir = bidi_it->scan_dir;
636 if (idx < bidi_cache_start || idx >= bidi_cache_idx)
637 emacs_abort ();
639 bidi_copy_it (bidi_it, &bidi_cache[idx]);
640 bidi_it->scan_dir = current_scan_dir;
641 bidi_cache_last_idx = idx;
644 /* Find a cached state with a given CHARPOS and resolved embedding
645 level less or equal to LEVEL. If LEVEL is -1, disregard the
646 resolved levels in cached states. DIR, if non-zero, means search
647 in that direction from the last cache hit.
649 Value is the index of the cached state, or -1 if not found. */
650 static ptrdiff_t
651 bidi_cache_search (ptrdiff_t charpos, int level, int dir)
653 ptrdiff_t i, i_start;
655 if (bidi_cache_idx > bidi_cache_start)
657 if (bidi_cache_last_idx == -1)
658 bidi_cache_last_idx = bidi_cache_idx - 1;
659 if (charpos < bidi_cache[bidi_cache_last_idx].charpos)
661 dir = -1;
662 i_start = bidi_cache_last_idx - 1;
664 else if (charpos > (bidi_cache[bidi_cache_last_idx].charpos
665 + bidi_cache[bidi_cache_last_idx].nchars - 1))
667 dir = 1;
668 i_start = bidi_cache_last_idx + 1;
670 else if (dir)
671 i_start = bidi_cache_last_idx;
672 else
674 dir = -1;
675 i_start = bidi_cache_idx - 1;
678 if (dir < 0)
680 /* Linear search for now; FIXME! */
681 for (i = i_start; i >= bidi_cache_start; i--)
682 if (bidi_cache[i].charpos <= charpos
683 && charpos < bidi_cache[i].charpos + bidi_cache[i].nchars
684 && (level == -1 || bidi_cache[i].resolved_level <= level))
685 return i;
687 else
689 for (i = i_start; i < bidi_cache_idx; i++)
690 if (bidi_cache[i].charpos <= charpos
691 && charpos < bidi_cache[i].charpos + bidi_cache[i].nchars
692 && (level == -1 || bidi_cache[i].resolved_level <= level))
693 return i;
697 return -1;
700 /* Find a cached state where the resolved level changes to a value
701 that is lower than LEVEL, and return its cache slot index. DIR is
702 the direction to search, starting with the last used cache slot.
703 If DIR is zero, we search backwards from the last occupied cache
704 slot. BEFORE means return the index of the slot that
705 is ``before'' the level change in the search direction. That is,
706 given the cached levels like this:
708 1122333442211
709 AB C
711 and assuming we are at the position cached at the slot marked with
712 C, searching backwards (DIR = -1) for LEVEL = 2 will return the
713 index of slot B or A, depending whether BEFORE is, respectively,
714 true or false. */
715 static ptrdiff_t
716 bidi_cache_find_level_change (int level, int dir, bool before)
718 if (bidi_cache_idx)
720 ptrdiff_t i = dir ? bidi_cache_last_idx : bidi_cache_idx - 1;
721 int incr = before ? 1 : 0;
723 if (i < 0) /* cache overflowed? */
724 i = 0;
726 if (!dir)
727 dir = -1;
728 else if (!incr)
729 i += dir;
731 if (dir < 0)
733 while (i >= bidi_cache_start + incr)
735 if (bidi_cache[i - incr].resolved_level >= 0
736 && bidi_cache[i - incr].resolved_level < level)
737 return i;
738 i--;
741 else
743 while (i < bidi_cache_idx - incr)
745 if (bidi_cache[i + incr].resolved_level >= 0
746 && bidi_cache[i + incr].resolved_level < level)
747 return i;
748 i++;
753 return -1;
756 static void
757 bidi_cache_ensure_space (ptrdiff_t idx)
759 /* Enlarge the cache as needed. */
760 if (idx >= bidi_cache_size)
762 ptrdiff_t chunk_size = BIDI_CACHE_CHUNK;
764 if (bidi_cache_size > bidi_cache_max_elts - chunk_size)
765 chunk_size = bidi_cache_max_elts - bidi_cache_size;
767 if (max (idx + 1,
768 bidi_cache_size + chunk_size) <= bidi_cache_max_elts)
770 /* The bidi cache cannot be larger than the largest Lisp
771 string or buffer. */
772 ptrdiff_t string_or_buffer_bound
773 = max (BUF_BYTES_MAX, STRING_BYTES_BOUND);
775 /* Also, it cannot be larger than what C can represent. */
776 ptrdiff_t c_bound
777 = (min (PTRDIFF_MAX, SIZE_MAX) - bidi_shelve_header_size) / elsz;
778 ptrdiff_t max_elts = bidi_cache_max_elts;
780 max_elts = min (max_elts, min (string_or_buffer_bound, c_bound));
782 /* Force xpalloc not to over-allocate by passing it MAX_ELTS
783 as its 4th argument. */
784 bidi_cache = xpalloc (bidi_cache, &bidi_cache_size,
785 max (chunk_size, idx - bidi_cache_size + 1),
786 max_elts, elsz);
787 eassert (bidi_cache_size > idx);
792 static int
793 bidi_cache_iterator_state (struct bidi_it *bidi_it, bool resolved,
794 bool update_only)
796 ptrdiff_t idx;
798 /* We should never cache on backward scans. */
799 if (bidi_it->scan_dir == -1)
800 emacs_abort ();
801 idx = bidi_cache_search (bidi_it->charpos, -1, 1);
803 if (idx < 0 && update_only)
804 return 0;
806 if (idx < 0)
808 idx = bidi_cache_idx;
809 bidi_cache_ensure_space (idx);
810 /* Character positions should correspond to cache positions 1:1.
811 If we are outside the range of cached positions, the cache is
812 useless and must be reset. */
813 if (bidi_cache_start < idx && idx < bidi_cache_size
814 && (bidi_it->charpos > (bidi_cache[idx - 1].charpos
815 + bidi_cache[idx - 1].nchars)
816 || bidi_it->charpos < bidi_cache[bidi_cache_start].charpos))
818 bidi_cache_reset ();
819 idx = bidi_cache_start;
821 if (bidi_it->nchars <= 0)
822 emacs_abort ();
823 /* Don't cache if no available space in the cache. */
824 if (bidi_cache_size > idx)
826 bidi_copy_it (&bidi_cache[idx], bidi_it);
827 if (!resolved)
828 bidi_cache[idx].resolved_level = -1;
831 else
833 /* Copy only the members which could have changed, to avoid
834 costly copying of the entire struct. */
835 bidi_cache[idx].type = bidi_it->type;
836 bidi_check_type (bidi_it->type);
837 bidi_cache[idx].type_after_wn = bidi_it->type_after_wn;
838 bidi_check_type (bidi_it->type_after_wn);
839 if (resolved)
840 bidi_cache[idx].resolved_level = bidi_it->resolved_level;
841 else
842 bidi_cache[idx].resolved_level = -1;
843 bidi_cache[idx].invalid_levels = bidi_it->invalid_levels;
844 bidi_cache[idx].next_for_neutral = bidi_it->next_for_neutral;
845 bidi_cache[idx].next_for_ws = bidi_it->next_for_ws;
846 bidi_cache[idx].disp_pos = bidi_it->disp_pos;
847 bidi_cache[idx].disp_prop = bidi_it->disp_prop;
848 bidi_cache[idx].bracket_pairing_pos = bidi_it->bracket_pairing_pos;
849 bidi_cache[idx].bracket_enclosed_type = bidi_it->bracket_enclosed_type;
852 if (bidi_cache_size > idx)
854 bidi_cache_last_idx = idx;
855 if (idx >= bidi_cache_idx)
856 bidi_cache_idx = idx + 1;
857 return 1;
859 else
861 /* The cache overflowed. */
862 bidi_cache_last_idx = -1;
863 return 0;
867 /* Look for a cached iterator state that corresponds to CHARPOS. If
868 found, copy the cached state into BIDI_IT and return the type of
869 the cached entry. If not found, return UNKNOWN_BT. RESOLVED_ONLY
870 zero means it is OK to return cached states that were not fully
871 resolved yet. This can happen if the state was cached before it
872 was resolved in bidi_resolve_neutral. */
873 static bidi_type_t
874 bidi_cache_find (ptrdiff_t charpos, bool resolved_only, struct bidi_it *bidi_it)
876 ptrdiff_t i = bidi_cache_search (charpos, -1, bidi_it->scan_dir);
878 if (i >= bidi_cache_start
879 && (!resolved_only
880 /* Callers that want only fully resolved states (and set
881 resolved_only = true) need to be sure that there's enough
882 info in the cached state to return the state as final,
883 and if not, they don't want the cached state. */
884 || bidi_cache[i].resolved_level >= 0))
886 bidi_dir_t current_scan_dir = bidi_it->scan_dir;
888 bidi_copy_it (bidi_it, &bidi_cache[i]);
889 bidi_cache_last_idx = i;
890 /* Don't let scan direction from the cached state override
891 the current scan direction. */
892 bidi_it->scan_dir = current_scan_dir;
893 return bidi_it->type;
896 return UNKNOWN_BT;
899 static int
900 bidi_peek_at_next_level (struct bidi_it *bidi_it)
902 if (bidi_cache_idx == bidi_cache_start)
903 emacs_abort ();
904 /* If the cache overflowed, return the level of the last cached
905 character. */
906 if (bidi_cache_last_idx == -1
907 || (bidi_cache_last_idx >= bidi_cache_idx - 1 && bidi_it->scan_dir > 0))
908 return bidi_cache[bidi_cache_idx - 1].resolved_level;
909 return bidi_cache[bidi_cache_last_idx + bidi_it->scan_dir].resolved_level;
913 /***********************************************************************
914 Pushing and popping the bidi iterator state
915 ***********************************************************************/
917 /* Push the bidi iterator state in preparation for reordering a
918 different object, e.g. display string found at certain buffer
919 position. Pushing the bidi iterator boils down to saving its
920 entire state on the cache and starting a new cache "stacked" on top
921 of the current cache. */
922 void
923 bidi_push_it (struct bidi_it *bidi_it)
925 /* Give this stack slot its cache room. */
926 bidi_cache_max_elts += BIDI_CACHE_MAX_ELTS_PER_SLOT;
927 /* Save the current iterator state in its entirety after the last
928 used cache slot. */
929 bidi_cache_ensure_space (bidi_cache_idx);
930 bidi_cache[bidi_cache_idx++] = *bidi_it;
932 /* Push the current cache start onto the stack. */
933 eassert (bidi_cache_sp < IT_STACK_SIZE);
934 bidi_cache_start_stack[bidi_cache_sp++] = bidi_cache_start;
936 /* Start a new level of cache, and make it empty. */
937 bidi_cache_start = bidi_cache_idx;
938 bidi_cache_last_idx = -1;
941 /* Restore the iterator state saved by bidi_push_it and return the
942 cache to the corresponding state. */
943 void
944 bidi_pop_it (struct bidi_it *bidi_it)
946 if (bidi_cache_start <= 0)
947 emacs_abort ();
949 /* Reset the next free cache slot index to what it was before the
950 call to bidi_push_it. */
951 bidi_cache_idx = bidi_cache_start - 1;
953 /* Restore the bidi iterator state saved in the cache. */
954 *bidi_it = bidi_cache[bidi_cache_idx];
956 /* Pop the previous cache start from the stack. */
957 if (bidi_cache_sp <= 0)
958 emacs_abort ();
959 bidi_cache_start = bidi_cache_start_stack[--bidi_cache_sp];
961 /* Invalidate the last-used cache slot data. */
962 bidi_cache_last_idx = -1;
964 bidi_cache_max_elts -= BIDI_CACHE_MAX_ELTS_PER_SLOT;
965 eassert (bidi_cache_max_elts > 0);
968 static ptrdiff_t bidi_cache_total_alloc;
970 /* Stash away a copy of the cache and its control variables. */
971 void *
972 bidi_shelve_cache (void)
974 unsigned char *databuf;
975 ptrdiff_t alloc;
977 /* Empty cache. */
978 if (bidi_cache_idx == 0)
979 return NULL;
981 alloc = (bidi_shelve_header_size
982 + bidi_cache_idx * sizeof (struct bidi_it));
983 databuf = xmalloc (alloc);
984 bidi_cache_total_alloc += alloc;
986 memcpy (databuf, &bidi_cache_idx, sizeof (bidi_cache_idx));
987 memcpy (databuf + sizeof (bidi_cache_idx),
988 bidi_cache, bidi_cache_idx * sizeof (struct bidi_it));
989 memcpy (databuf + sizeof (bidi_cache_idx)
990 + bidi_cache_idx * sizeof (struct bidi_it),
991 bidi_cache_start_stack, sizeof (bidi_cache_start_stack));
992 memcpy (databuf + sizeof (bidi_cache_idx)
993 + bidi_cache_idx * sizeof (struct bidi_it)
994 + sizeof (bidi_cache_start_stack),
995 &bidi_cache_sp, sizeof (bidi_cache_sp));
996 memcpy (databuf + sizeof (bidi_cache_idx)
997 + bidi_cache_idx * sizeof (struct bidi_it)
998 + sizeof (bidi_cache_start_stack) + sizeof (bidi_cache_sp),
999 &bidi_cache_start, sizeof (bidi_cache_start));
1000 memcpy (databuf + sizeof (bidi_cache_idx)
1001 + bidi_cache_idx * sizeof (struct bidi_it)
1002 + sizeof (bidi_cache_start_stack) + sizeof (bidi_cache_sp)
1003 + sizeof (bidi_cache_start),
1004 &bidi_cache_last_idx, sizeof (bidi_cache_last_idx));
1005 memcpy (databuf + sizeof (bidi_cache_idx)
1006 + bidi_cache_idx * sizeof (struct bidi_it)
1007 + sizeof (bidi_cache_start_stack) + sizeof (bidi_cache_sp)
1008 + sizeof (bidi_cache_start) + sizeof (bidi_cache_last_idx),
1009 &bidi_cache_max_elts, sizeof (bidi_cache_max_elts));
1011 return databuf;
1014 /* Restore the cache state from a copy stashed away by
1015 bidi_shelve_cache, and free the buffer used to stash that copy.
1016 JUST_FREE means free the buffer, but don't restore the
1017 cache; used when the corresponding iterator is discarded instead of
1018 being restored. */
1019 void
1020 bidi_unshelve_cache (void *databuf, bool just_free)
1022 unsigned char *p = databuf;
1024 if (!p)
1026 if (!just_free)
1028 /* A NULL pointer means an empty cache. */
1029 bidi_cache_start = 0;
1030 bidi_cache_sp = 0;
1031 bidi_cache_max_elts = BIDI_CACHE_MAX_ELTS_PER_SLOT;
1032 bidi_cache_reset ();
1035 else
1037 if (just_free)
1039 ptrdiff_t idx;
1041 memcpy (&idx, p, sizeof (bidi_cache_idx));
1042 bidi_cache_total_alloc
1043 -= bidi_shelve_header_size + idx * sizeof (struct bidi_it);
1045 else
1047 memcpy (&bidi_cache_idx, p, sizeof (bidi_cache_idx));
1048 bidi_cache_ensure_space (bidi_cache_idx);
1049 memcpy (bidi_cache, p + sizeof (bidi_cache_idx),
1050 bidi_cache_idx * sizeof (struct bidi_it));
1051 memcpy (bidi_cache_start_stack,
1052 p + sizeof (bidi_cache_idx)
1053 + bidi_cache_idx * sizeof (struct bidi_it),
1054 sizeof (bidi_cache_start_stack));
1055 memcpy (&bidi_cache_sp,
1056 p + sizeof (bidi_cache_idx)
1057 + bidi_cache_idx * sizeof (struct bidi_it)
1058 + sizeof (bidi_cache_start_stack),
1059 sizeof (bidi_cache_sp));
1060 memcpy (&bidi_cache_start,
1061 p + sizeof (bidi_cache_idx)
1062 + bidi_cache_idx * sizeof (struct bidi_it)
1063 + sizeof (bidi_cache_start_stack) + sizeof (bidi_cache_sp),
1064 sizeof (bidi_cache_start));
1065 memcpy (&bidi_cache_last_idx,
1066 p + sizeof (bidi_cache_idx)
1067 + bidi_cache_idx * sizeof (struct bidi_it)
1068 + sizeof (bidi_cache_start_stack) + sizeof (bidi_cache_sp)
1069 + sizeof (bidi_cache_start),
1070 sizeof (bidi_cache_last_idx));
1071 memcpy (&bidi_cache_max_elts,
1072 p + sizeof (bidi_cache_idx)
1073 + bidi_cache_idx * sizeof (struct bidi_it)
1074 + sizeof (bidi_cache_start_stack) + sizeof (bidi_cache_sp)
1075 + sizeof (bidi_cache_start) + sizeof (bidi_cache_last_idx),
1076 sizeof (bidi_cache_max_elts));
1077 bidi_cache_total_alloc
1078 -= (bidi_shelve_header_size
1079 + bidi_cache_idx * sizeof (struct bidi_it));
1082 xfree (p);
1087 /***********************************************************************
1088 Initialization
1089 ***********************************************************************/
1090 static void
1091 bidi_initialize (void)
1093 bidi_type_table = uniprop_table (intern ("bidi-class"));
1094 if (NILP (bidi_type_table))
1095 emacs_abort ();
1096 staticpro (&bidi_type_table);
1098 bidi_mirror_table = uniprop_table (intern ("mirroring"));
1099 if (NILP (bidi_mirror_table))
1100 emacs_abort ();
1101 staticpro (&bidi_mirror_table);
1103 bidi_brackets_table = uniprop_table (intern ("bracket-type"));
1104 if (NILP (bidi_brackets_table))
1105 emacs_abort ();
1106 staticpro (&bidi_brackets_table);
1108 paragraph_start_re = build_string ("^\\(\f\\|[ \t]*\\)$");
1109 staticpro (&paragraph_start_re);
1110 paragraph_separate_re = build_string ("^[ \t\f]*$");
1111 staticpro (&paragraph_separate_re);
1113 bidi_cache_sp = 0;
1114 bidi_cache_total_alloc = 0;
1115 bidi_cache_max_elts = BIDI_CACHE_MAX_ELTS_PER_SLOT;
1117 bidi_initialized = 1;
1120 /* Do whatever UAX#9 clause X8 says should be done at paragraph's
1121 end. */
1122 static void
1123 bidi_set_paragraph_end (struct bidi_it *bidi_it)
1125 bidi_it->invalid_levels = 0;
1126 bidi_it->invalid_isolates = 0;
1127 bidi_it->stack_idx = 0;
1128 bidi_it->resolved_level = bidi_it->level_stack[0].level;
1131 /* Initialize the bidi iterator from buffer/string position CHARPOS. */
1132 void
1133 bidi_init_it (ptrdiff_t charpos, ptrdiff_t bytepos, bool frame_window_p,
1134 struct bidi_it *bidi_it)
1136 if (! bidi_initialized)
1137 bidi_initialize ();
1138 if (charpos >= 0)
1139 bidi_it->charpos = charpos;
1140 if (bytepos >= 0)
1141 bidi_it->bytepos = bytepos;
1142 bidi_it->frame_window_p = frame_window_p;
1143 bidi_it->nchars = -1; /* to be computed in bidi_resolve_explicit */
1144 bidi_it->first_elt = 1;
1145 bidi_set_paragraph_end (bidi_it);
1146 bidi_it->new_paragraph = 1;
1147 bidi_it->separator_limit = -1;
1148 bidi_it->type = NEUTRAL_B;
1149 bidi_it->type_after_wn = NEUTRAL_B;
1150 bidi_it->orig_type = NEUTRAL_B;
1151 /* FIXME: Review this!!! */
1152 bidi_it->prev.type = bidi_it->prev.orig_type = UNKNOWN_BT;
1153 bidi_it->last_strong.type = bidi_it->last_strong.orig_type = UNKNOWN_BT;
1154 bidi_it->next_for_neutral.charpos = -1;
1155 bidi_it->next_for_neutral.type
1156 = bidi_it->next_for_neutral.orig_type = UNKNOWN_BT;
1157 bidi_it->prev_for_neutral.charpos = -1;
1158 bidi_it->prev_for_neutral.type
1159 = bidi_it->prev_for_neutral.orig_type = UNKNOWN_BT;
1160 bidi_it->bracket_pairing_pos = -1;
1161 bidi_it->sos = L2R; /* FIXME: should it be user-selectable? */
1162 bidi_it->disp_pos = -1; /* invalid/unknown */
1163 bidi_it->disp_prop = 0;
1164 /* We can only shrink the cache if we are at the bottom level of its
1165 "stack". */
1166 if (bidi_cache_start == 0)
1167 bidi_cache_shrink ();
1168 else
1169 bidi_cache_reset ();
1172 /* Perform initializations for reordering a new line of bidi text. */
1173 static void
1174 bidi_line_init (struct bidi_it *bidi_it)
1176 bidi_it->scan_dir = 1; /* FIXME: do we need to have control on this? */
1177 bidi_it->stack_idx = 0;
1178 bidi_it->resolved_level = bidi_it->level_stack[0].level;
1179 bidi_it->level_stack[0].flags = 0; /* NEUTRAL_DIR, false per X1 */
1180 bidi_it->invalid_levels = 0;
1181 bidi_it->isolate_level = 0; /* X1 */
1182 bidi_it->invalid_isolates = 0; /* X1 */
1183 /* Setting this to zero will force its recomputation the first time
1184 we need it for W5. */
1185 bidi_it->next_en_pos = 0;
1186 bidi_it->next_en_type = UNKNOWN_BT;
1187 bidi_it->next_for_ws.charpos = -1;
1188 bidi_it->next_for_ws.type = UNKNOWN_BT;
1189 bidi_it->bracket_pairing_pos = -1;
1190 bidi_set_sos_type (bidi_it,
1191 (bidi_it->paragraph_dir == R2L ? 1 : 0),
1192 bidi_it->level_stack[0].level); /* X10 */
1194 bidi_cache_reset ();
1198 /***********************************************************************
1199 Fetching characters
1200 ***********************************************************************/
1202 /* Count bytes in string S between BEG/BEGBYTE and END. BEG and END
1203 are zero-based character positions in S, BEGBYTE is byte position
1204 corresponding to BEG. UNIBYTE means S is a unibyte string. */
1205 static ptrdiff_t
1206 bidi_count_bytes (const unsigned char *s, ptrdiff_t beg,
1207 ptrdiff_t begbyte, ptrdiff_t end, bool unibyte)
1209 ptrdiff_t pos = beg;
1210 const unsigned char *p = s + begbyte, *start = p;
1212 if (unibyte)
1213 p = s + end;
1214 else
1216 if (!CHAR_HEAD_P (*p))
1217 emacs_abort ();
1219 while (pos < end)
1221 p += BYTES_BY_CHAR_HEAD (*p);
1222 pos++;
1226 return p - start;
1229 /* Fetch and return the character at byte position BYTEPOS. If S is
1230 non-NULL, fetch the character from string S; otherwise fetch the
1231 character from the current buffer. UNIBYTE means S is a
1232 unibyte string. */
1233 static int
1234 bidi_char_at_pos (ptrdiff_t bytepos, const unsigned char *s, bool unibyte)
1236 if (s)
1238 s += bytepos;
1239 if (unibyte)
1240 return *s;
1242 else
1243 s = BYTE_POS_ADDR (bytepos);
1244 return STRING_CHAR (s);
1247 /* Fetch and return the character at CHARPOS/BYTEPOS. If that
1248 character is covered by a display string, treat the entire run of
1249 covered characters as a single character, either u+2029 or u+FFFC,
1250 and return their combined length in CH_LEN and NCHARS. DISP_POS
1251 specifies the character position of the next display string, or -1
1252 if not yet computed. When the next character is at or beyond that
1253 position, the function updates DISP_POS with the position of the
1254 next display string. *DISP_PROP non-zero means that there's really
1255 a display string at DISP_POS, as opposed to when we searched till
1256 DISP_POS without finding one. If *DISP_PROP is 2, it means the
1257 display spec is of the form `(space ...)', which is replaced with
1258 u+2029 to handle it as a paragraph separator. STRING->s is the C
1259 string to iterate, or NULL if iterating over a buffer or a Lisp
1260 string; in the latter case, STRING->lstring is the Lisp string. */
1261 static int
1262 bidi_fetch_char (ptrdiff_t charpos, ptrdiff_t bytepos, ptrdiff_t *disp_pos,
1263 int *disp_prop, struct bidi_string_data *string,
1264 struct window *w,
1265 bool frame_window_p, ptrdiff_t *ch_len, ptrdiff_t *nchars)
1267 int ch;
1268 ptrdiff_t endpos
1269 = (string->s || STRINGP (string->lstring)) ? string->schars : ZV;
1270 struct text_pos pos;
1271 int len;
1273 /* If we got past the last known position of display string, compute
1274 the position of the next one. That position could be at CHARPOS. */
1275 if (charpos < endpos && charpos > *disp_pos)
1277 SET_TEXT_POS (pos, charpos, bytepos);
1278 *disp_pos = compute_display_string_pos (&pos, string, w, frame_window_p,
1279 disp_prop);
1282 /* Fetch the character at BYTEPOS. */
1283 if (charpos >= endpos)
1285 ch = BIDI_EOB;
1286 *ch_len = 1;
1287 *nchars = 1;
1288 *disp_pos = endpos;
1289 *disp_prop = 0;
1291 else if (charpos >= *disp_pos && *disp_prop)
1293 ptrdiff_t disp_end_pos;
1295 /* We don't expect to find ourselves in the middle of a display
1296 property. Hopefully, it will never be needed. */
1297 if (charpos > *disp_pos)
1298 emacs_abort ();
1299 /* Text covered by `display' properties and overlays with
1300 display properties or display strings is handled as a single
1301 character that represents the entire run of characters
1302 covered by the display property. */
1303 if (*disp_prop == 2)
1305 /* `(space ...)' display specs are handled as paragraph
1306 separators for the purposes of the reordering; see UAX#9
1307 section 3 and clause HL1 in section 4.3 there. */
1308 ch = PARAGRAPH_SEPARATOR;
1310 else
1312 /* All other display specs are handled as the Unicode Object
1313 Replacement Character. */
1314 ch = OBJECT_REPLACEMENT_CHARACTER;
1316 disp_end_pos = compute_display_string_end (*disp_pos, string);
1317 if (disp_end_pos < 0)
1319 /* Somebody removed the display string from the buffer
1320 behind our back. Recover by processing this buffer
1321 position as if no display property were present there to
1322 begin with. */
1323 *disp_prop = 0;
1324 goto normal_char;
1326 *nchars = disp_end_pos - *disp_pos;
1327 if (*nchars <= 0)
1328 emacs_abort ();
1329 if (string->s)
1330 *ch_len = bidi_count_bytes (string->s, *disp_pos, bytepos,
1331 disp_end_pos, string->unibyte);
1332 else if (STRINGP (string->lstring))
1333 *ch_len = bidi_count_bytes (SDATA (string->lstring), *disp_pos,
1334 bytepos, disp_end_pos, string->unibyte);
1335 else
1336 *ch_len = CHAR_TO_BYTE (disp_end_pos) - bytepos;
1338 else
1340 normal_char:
1341 if (string->s)
1344 if (!string->unibyte)
1346 ch = STRING_CHAR_AND_LENGTH (string->s + bytepos, len);
1347 *ch_len = len;
1349 else
1351 ch = UNIBYTE_TO_CHAR (string->s[bytepos]);
1352 *ch_len = 1;
1355 else if (STRINGP (string->lstring))
1357 if (!string->unibyte)
1359 ch = STRING_CHAR_AND_LENGTH (SDATA (string->lstring) + bytepos,
1360 len);
1361 *ch_len = len;
1363 else
1365 ch = UNIBYTE_TO_CHAR (SREF (string->lstring, bytepos));
1366 *ch_len = 1;
1369 else
1371 ch = STRING_CHAR_AND_LENGTH (BYTE_POS_ADDR (bytepos), len);
1372 *ch_len = len;
1374 *nchars = 1;
1377 /* If we just entered a run of characters covered by a display
1378 string, compute the position of the next display string. */
1379 if (charpos + *nchars <= endpos && charpos + *nchars > *disp_pos
1380 && *disp_prop)
1382 SET_TEXT_POS (pos, charpos + *nchars, bytepos + *ch_len);
1383 *disp_pos = compute_display_string_pos (&pos, string, w, frame_window_p,
1384 disp_prop);
1387 return ch;
1390 /* Like bidi_fetch_char, but ignore any text between an isolate
1391 initiator and its matching PDI or, if it has no matching PDI, the
1392 end of the paragraph. If isolates were skipped, CH_LEN and NCHARS
1393 are set to the number of bytes and characters between BYTEPOS/CHARPOS
1394 and the character that was fetched after skipping the isolates. */
1395 static int
1396 bidi_fetch_char_skip_isolates (ptrdiff_t charpos, ptrdiff_t bytepos,
1397 ptrdiff_t *disp_pos, int *disp_prop,
1398 struct bidi_string_data *string,
1399 struct window *w, bool frame_window_p,
1400 ptrdiff_t *ch_len, ptrdiff_t *nchars)
1402 ptrdiff_t orig_charpos = charpos, orig_bytepos = bytepos;
1403 int ch = bidi_fetch_char (charpos, bytepos, disp_pos, disp_prop, string, w,
1404 frame_window_p, ch_len, nchars);
1405 bidi_type_t ch_type = bidi_get_type (ch, NEUTRAL_DIR);
1406 ptrdiff_t level = 0;
1408 if (ch_type == LRI || ch_type == RLI || ch_type == FSI)
1410 level++;
1411 while (level > 0 && ch_type != NEUTRAL_B)
1413 charpos += *nchars;
1414 bytepos += *ch_len;
1415 ch = bidi_fetch_char (charpos, bytepos, disp_pos, disp_prop, string,
1416 w, frame_window_p, ch_len, nchars);
1417 ch_type = bidi_get_type (ch, NEUTRAL_DIR);
1418 /* A Note to P2 says to ignore max_depth limit. */
1419 if (ch_type == LRI || ch_type == RLI || ch_type == FSI)
1420 level++;
1421 else if (ch_type == PDI)
1422 level--;
1426 /* Communicate to the caller how much did we skip, so it could get
1427 past the last character position we examined. */
1428 *nchars += charpos - orig_charpos;
1429 *ch_len += bytepos - orig_bytepos;
1430 return ch;
1435 /***********************************************************************
1436 Determining paragraph direction
1437 ***********************************************************************/
1439 /* Check if buffer position CHARPOS/BYTEPOS is the end of a paragraph.
1440 Value is the non-negative length of the paragraph separator
1441 following the buffer position, -1 if position is at the beginning
1442 of a new paragraph, or -2 if position is neither at beginning nor
1443 at end of a paragraph. */
1444 static ptrdiff_t
1445 bidi_at_paragraph_end (ptrdiff_t charpos, ptrdiff_t bytepos)
1447 Lisp_Object sep_re;
1448 Lisp_Object start_re;
1449 ptrdiff_t val;
1451 if (STRINGP (BVAR (current_buffer, bidi_paragraph_separate_re)))
1452 sep_re = BVAR (current_buffer, bidi_paragraph_separate_re);
1453 else
1454 sep_re = paragraph_separate_re;
1455 if (STRINGP (BVAR (current_buffer, bidi_paragraph_start_re)))
1456 start_re = BVAR (current_buffer, bidi_paragraph_start_re);
1457 else
1458 start_re = paragraph_start_re;
1460 val = fast_looking_at (sep_re, charpos, bytepos, ZV, ZV_BYTE, Qnil);
1461 if (val < 0)
1463 if (fast_looking_at (start_re, charpos, bytepos, ZV, ZV_BYTE, Qnil) >= 0)
1464 val = -1;
1465 else
1466 val = -2;
1469 return val;
1472 /* If the user has requested the long scans caching, make sure that
1473 BIDI cache is enabled. Otherwise, make sure it's disabled. */
1475 static struct region_cache *
1476 bidi_paragraph_cache_on_off (void)
1478 struct buffer *cache_buffer = current_buffer;
1479 bool indirect_p = false;
1481 /* For indirect buffers, make sure to use the cache of their base
1482 buffer. */
1483 if (cache_buffer->base_buffer)
1485 cache_buffer = cache_buffer->base_buffer;
1486 indirect_p = true;
1489 /* Don't turn on or off the cache in the base buffer, if the value
1490 of cache-long-scans of the base buffer is inconsistent with that.
1491 This is because doing so will just make the cache pure overhead,
1492 since if we turn it on via indirect buffer, it will be
1493 immediately turned off by its base buffer. */
1494 if (NILP (BVAR (current_buffer, cache_long_scans)))
1496 if (!indirect_p
1497 || NILP (BVAR (cache_buffer, cache_long_scans)))
1499 if (cache_buffer->bidi_paragraph_cache)
1501 free_region_cache (cache_buffer->bidi_paragraph_cache);
1502 cache_buffer->bidi_paragraph_cache = 0;
1505 return NULL;
1507 else
1509 if (!indirect_p
1510 || !NILP (BVAR (cache_buffer, cache_long_scans)))
1512 if (!cache_buffer->bidi_paragraph_cache)
1513 cache_buffer->bidi_paragraph_cache = new_region_cache ();
1515 return cache_buffer->bidi_paragraph_cache;
1519 /* On my 2005-vintage machine, searching back for paragraph start
1520 takes ~1 ms per line. And bidi_paragraph_init is called 4 times
1521 when user types C-p. The number below limits each call to
1522 bidi_paragraph_init to about 10 ms. */
1523 #define MAX_PARAGRAPH_SEARCH 7500
1525 /* Find the beginning of this paragraph by looking back in the buffer.
1526 Value is the byte position of the paragraph's beginning, or
1527 BEGV_BYTE if paragraph_start_re is still not found after looking
1528 back MAX_PARAGRAPH_SEARCH lines in the buffer. */
1529 static ptrdiff_t
1530 bidi_find_paragraph_start (ptrdiff_t pos, ptrdiff_t pos_byte)
1532 Lisp_Object re =
1533 STRINGP (BVAR (current_buffer, bidi_paragraph_start_re))
1534 ? BVAR (current_buffer, bidi_paragraph_start_re)
1535 : paragraph_start_re;
1536 ptrdiff_t limit = ZV, limit_byte = ZV_BYTE;
1537 struct region_cache *bpc = bidi_paragraph_cache_on_off ();
1538 ptrdiff_t n = 0, oldpos = pos, next;
1539 struct buffer *cache_buffer = current_buffer;
1541 if (cache_buffer->base_buffer)
1542 cache_buffer = cache_buffer->base_buffer;
1544 while (pos_byte > BEGV_BYTE
1545 && n++ < MAX_PARAGRAPH_SEARCH
1546 && fast_looking_at (re, pos, pos_byte, limit, limit_byte, Qnil) < 0)
1548 /* FIXME: What if the paragraph beginning is covered by a
1549 display string? And what if a display string covering some
1550 of the text over which we scan back includes
1551 paragraph_start_re? */
1552 DEC_BOTH (pos, pos_byte);
1553 if (bpc && region_cache_backward (cache_buffer, bpc, pos, &next))
1555 pos = next, pos_byte = CHAR_TO_BYTE (pos);
1556 break;
1558 else
1559 pos = find_newline_no_quit (pos, pos_byte, -1, &pos_byte);
1561 if (n >= MAX_PARAGRAPH_SEARCH)
1562 pos = BEGV, pos_byte = BEGV_BYTE;
1563 if (bpc)
1564 know_region_cache (cache_buffer, bpc, pos, oldpos);
1565 /* Positions returned by the region cache are not limited to
1566 BEGV..ZV range, so we limit them here. */
1567 pos_byte = clip_to_bounds (BEGV_BYTE, pos_byte, ZV_BYTE);
1568 return pos_byte;
1571 /* On a 3.4 GHz machine, searching forward for a strong directional
1572 character in a long paragraph full of weaks or neutrals takes about
1573 1 ms for each 20K characters. The number below limits each call to
1574 bidi_paragraph_init to less than 10 ms even on slow machines. */
1575 #define MAX_STRONG_CHAR_SEARCH 100000
1577 /* Starting from POS, find the first strong (L, R, or AL) character,
1578 while skipping over any characters between an isolate initiator and
1579 its matching PDI. STOP_AT_PDI non-zero means stop at the PDI that
1580 matches the isolate initiator at POS. Return the bidi type of the
1581 character where the search stopped. Give up if after examining
1582 MAX_STRONG_CHAR_SEARCH buffer or string positions no strong
1583 character was found. */
1584 static bidi_type_t
1585 find_first_strong_char (ptrdiff_t pos, ptrdiff_t bytepos, ptrdiff_t end,
1586 ptrdiff_t *disp_pos, int *disp_prop,
1587 struct bidi_string_data *string, struct window *w,
1588 bool string_p, bool frame_window_p,
1589 ptrdiff_t *ch_len, ptrdiff_t *nchars, bool stop_at_pdi)
1591 ptrdiff_t pos1;
1592 bidi_type_t type;
1593 int ch;
1595 if (stop_at_pdi)
1597 /* If STOP_AT_PDI is non-zero, we must have been called with FSI
1598 at POS. Get past it. */
1599 #ifdef ENABLE_CHECKING
1600 ch = bidi_fetch_char (pos, bytepos, disp_pos, disp_prop, string, w,
1601 frame_window_p, ch_len, nchars);
1602 type = bidi_get_type (ch, NEUTRAL_DIR);
1603 eassert (type == FSI /* || type == LRI || type == RLI */);
1604 #endif
1605 pos += *nchars;
1606 bytepos += *ch_len;
1608 ch = bidi_fetch_char_skip_isolates (pos, bytepos, disp_pos, disp_prop, string,
1609 w, frame_window_p, ch_len, nchars);
1610 type = bidi_get_type (ch, NEUTRAL_DIR);
1612 pos1 = pos;
1613 for (pos += *nchars, bytepos += *ch_len;
1614 bidi_get_category (type) != STRONG
1615 /* If requested to stop at first PDI, stop there. */
1616 && !(stop_at_pdi && type == PDI)
1617 /* Stop when searched too far into an abnormally large
1618 paragraph full of weak or neutral characters. */
1619 && pos - pos1 < MAX_STRONG_CHAR_SEARCH;
1620 type = bidi_get_type (ch, NEUTRAL_DIR))
1622 if (pos >= end)
1624 /* Pretend there's a paragraph separator at end of
1625 buffer/string. */
1626 type = NEUTRAL_B;
1627 break;
1629 if (!string_p
1630 && type == NEUTRAL_B
1631 && bidi_at_paragraph_end (pos, bytepos) >= -1)
1632 break;
1633 /* Fetch next character and advance to get past it. */
1634 ch = bidi_fetch_char_skip_isolates (pos, bytepos, disp_pos, disp_prop,
1635 string, w, frame_window_p,
1636 ch_len, nchars);
1637 pos += *nchars;
1638 bytepos += *ch_len;
1640 return type;
1643 /* Determine the base direction, a.k.a. base embedding level, of the
1644 paragraph we are about to iterate through. If DIR is either L2R or
1645 R2L, just use that. Otherwise, determine the paragraph direction
1646 from the first strong directional character of the paragraph.
1648 NO_DEFAULT_P means don't default to L2R if the paragraph
1649 has no strong directional characters and both DIR and
1650 bidi_it->paragraph_dir are NEUTRAL_DIR. In that case, search back
1651 in the buffer until a paragraph is found with a strong character,
1652 or until hitting BEGV. In the latter case, fall back to L2R. This
1653 flag is used in current-bidi-paragraph-direction.
1655 Note that this function gives the paragraph separator the same
1656 direction as the preceding paragraph, even though Emacs generally
1657 views the separator as not belonging to any paragraph. */
1658 void
1659 bidi_paragraph_init (bidi_dir_t dir, struct bidi_it *bidi_it, bool no_default_p)
1661 ptrdiff_t bytepos = bidi_it->bytepos;
1662 bool string_p = bidi_it->string.s || STRINGP (bidi_it->string.lstring);
1663 ptrdiff_t pstartbyte;
1664 /* Note that begbyte is a byte position, while end is a character
1665 position. Yes, this is ugly, but we are trying to avoid costly
1666 calls to BYTE_TO_CHAR and its ilk. */
1667 ptrdiff_t begbyte = string_p ? 0 : BEGV_BYTE;
1668 ptrdiff_t end = string_p ? bidi_it->string.schars : ZV;
1670 /* Special case for an empty buffer. */
1671 if (bytepos == begbyte && bidi_it->charpos == end)
1672 dir = L2R;
1673 /* We should never be called at EOB or before BEGV. */
1674 else if (bidi_it->charpos >= end || bytepos < begbyte)
1675 emacs_abort ();
1677 if (dir == L2R)
1679 bidi_it->paragraph_dir = L2R;
1680 bidi_it->new_paragraph = 0;
1682 else if (dir == R2L)
1684 bidi_it->paragraph_dir = R2L;
1685 bidi_it->new_paragraph = 0;
1687 else if (dir == NEUTRAL_DIR) /* P2 */
1689 ptrdiff_t ch_len, nchars;
1690 ptrdiff_t pos, disp_pos = -1;
1691 int disp_prop = 0;
1692 bidi_type_t type;
1693 const unsigned char *s;
1695 if (!bidi_initialized)
1696 bidi_initialize ();
1698 /* If we are inside a paragraph separator, we are just waiting
1699 for the separator to be exhausted; use the previous paragraph
1700 direction. But don't do that if we have been just reseated,
1701 because we need to reinitialize below in that case. */
1702 if (!bidi_it->first_elt
1703 && bidi_it->charpos < bidi_it->separator_limit)
1704 return;
1706 /* If we are on a newline, get past it to where the next
1707 paragraph might start. But don't do that at BEGV since then
1708 we are potentially in a new paragraph that doesn't yet
1709 exist. */
1710 pos = bidi_it->charpos;
1711 s = (STRINGP (bidi_it->string.lstring)
1712 ? SDATA (bidi_it->string.lstring)
1713 : bidi_it->string.s);
1714 if (bytepos > begbyte
1715 && bidi_char_at_pos (bytepos, s, bidi_it->string.unibyte) == '\n')
1717 bytepos++;
1718 pos++;
1721 /* We are either at the beginning of a paragraph or in the
1722 middle of it. Find where this paragraph starts. */
1723 if (string_p)
1725 /* We don't support changes of paragraph direction inside a
1726 string. It is treated as a single paragraph. */
1727 pstartbyte = 0;
1729 else
1730 pstartbyte = bidi_find_paragraph_start (pos, bytepos);
1731 bidi_it->separator_limit = -1;
1732 bidi_it->new_paragraph = 0;
1734 /* The following loop is run more than once only if NO_DEFAULT_P,
1735 and only if we are iterating on a buffer. */
1736 do {
1737 bytepos = pstartbyte;
1738 if (!string_p)
1739 pos = BYTE_TO_CHAR (bytepos);
1740 type = find_first_strong_char (pos, bytepos, end, &disp_pos, &disp_prop,
1741 &bidi_it->string, bidi_it->w,
1742 string_p, bidi_it->frame_window_p,
1743 &ch_len, &nchars, false);
1744 if (type == STRONG_R || type == STRONG_AL) /* P3 */
1745 bidi_it->paragraph_dir = R2L;
1746 else if (type == STRONG_L)
1747 bidi_it->paragraph_dir = L2R;
1748 if (!string_p
1749 && no_default_p && bidi_it->paragraph_dir == NEUTRAL_DIR)
1751 /* If this paragraph is at BEGV, default to L2R. */
1752 if (pstartbyte == BEGV_BYTE)
1753 bidi_it->paragraph_dir = L2R; /* P3 and HL1 */
1754 else
1756 ptrdiff_t prevpbyte = pstartbyte;
1757 ptrdiff_t p = BYTE_TO_CHAR (pstartbyte), pbyte = pstartbyte;
1759 /* Find the beginning of the previous paragraph, if any. */
1760 while (pbyte > BEGV_BYTE && prevpbyte >= pstartbyte)
1762 /* FXIME: What if p is covered by a display
1763 string? See also a FIXME inside
1764 bidi_find_paragraph_start. */
1765 DEC_BOTH (p, pbyte);
1766 prevpbyte = bidi_find_paragraph_start (p, pbyte);
1768 pstartbyte = prevpbyte;
1771 } while (!string_p
1772 && no_default_p && bidi_it->paragraph_dir == NEUTRAL_DIR);
1774 else
1775 emacs_abort ();
1777 /* Contrary to UAX#9 clause P3, we only default the paragraph
1778 direction to L2R if we have no previous usable paragraph
1779 direction. This is allowed by the HL1 clause. */
1780 if (bidi_it->paragraph_dir != L2R && bidi_it->paragraph_dir != R2L)
1781 bidi_it->paragraph_dir = L2R; /* P3 and HL1 ``higher-level protocols'' */
1782 if (bidi_it->paragraph_dir == R2L)
1783 bidi_it->level_stack[0].level = 1;
1784 else
1785 bidi_it->level_stack[0].level = 0;
1787 bidi_line_init (bidi_it);
1791 /***********************************************************************
1792 Resolving explicit and implicit levels.
1793 The rest of this file constitutes the core of the UBA implementation.
1794 ***********************************************************************/
1796 static bool
1797 bidi_explicit_dir_char (int ch)
1799 bidi_type_t ch_type;
1801 if (!bidi_initialized)
1802 emacs_abort ();
1803 if (ch < 0)
1805 eassert (ch == BIDI_EOB);
1806 return false;
1808 ch_type = (bidi_type_t) XINT (CHAR_TABLE_REF (bidi_type_table, ch));
1809 return (ch_type == LRE || ch_type == LRO
1810 || ch_type == RLE || ch_type == RLO
1811 || ch_type == PDF);
1814 /* Given an iterator state in BIDI_IT, advance one character position
1815 in the buffer/string to the next character (in the logical order),
1816 resolve any explicit embeddings, directional overrides, and isolate
1817 initiators and terminators, and return the embedding level of the
1818 character after resolving these explicit directives. */
1819 static int
1820 bidi_resolve_explicit (struct bidi_it *bidi_it)
1822 int curchar;
1823 bidi_type_t type, typ1, prev_type = UNKNOWN_BT;
1824 int current_level;
1825 int new_level;
1826 bidi_dir_t override;
1827 bool isolate_status;
1828 bool string_p = bidi_it->string.s || STRINGP (bidi_it->string.lstring);
1829 ptrdiff_t ch_len, nchars, disp_pos, end;
1830 int disp_prop;
1831 ptrdiff_t eob
1832 = ((bidi_it->string.s || STRINGP (bidi_it->string.lstring))
1833 ? bidi_it->string.schars : ZV);
1835 /* Record the info about the previous character. */
1836 if (bidi_it->type_after_wn != WEAK_BN /* W1/Retaining */
1837 && bidi_it->type != WEAK_BN)
1839 /* This special case is needed in support of Unicode 8.0
1840 correction to N0, as implemented in bidi_resolve_weak/W1
1841 below. */
1842 if (bidi_it->type_after_wn == NEUTRAL_ON
1843 && bidi_get_category (bidi_it->type) == STRONG
1844 && bidi_paired_bracket_type (bidi_it->ch) == BIDI_BRACKET_CLOSE)
1845 bidi_remember_char (&bidi_it->prev, bidi_it, 1);
1846 else
1847 bidi_remember_char (&bidi_it->prev, bidi_it, 0);
1849 if (bidi_it->type_after_wn == STRONG_R
1850 || bidi_it->type_after_wn == STRONG_L
1851 || bidi_it->type_after_wn == STRONG_AL)
1852 bidi_remember_char (&bidi_it->last_strong, bidi_it, 0);
1853 if (bidi_it->type == STRONG_R || bidi_it->type == STRONG_L
1854 || bidi_it->type == WEAK_EN || bidi_it->type == WEAK_AN)
1855 bidi_remember_char (&bidi_it->prev_for_neutral, bidi_it, 1);
1857 /* If we overstepped the characters used for resolving neutrals
1858 and whitespace, invalidate their info in the iterator. */
1859 if (bidi_it->charpos >= bidi_it->next_for_neutral.charpos)
1861 bidi_it->next_for_neutral.type = UNKNOWN_BT;
1862 /* If needed, reset the "magical" value of pairing bracket
1863 position, so that bidi_resolve_brackets will resume
1864 resolution of brackets according to BPA. */
1865 if (bidi_it->bracket_pairing_pos == eob)
1866 bidi_it->bracket_pairing_pos = -1;
1868 if (bidi_it->next_en_pos >= 0
1869 && bidi_it->charpos >= bidi_it->next_en_pos)
1871 bidi_it->next_en_pos = 0;
1872 bidi_it->next_en_type = UNKNOWN_BT;
1875 /* Reset the bracket resolution info, unless we previously decided
1876 (in bidi_find_bracket_pairs) that brackets in this level run
1877 should be resolved as neutrals. */
1878 if (bidi_it->bracket_pairing_pos != eob)
1880 bidi_it->bracket_pairing_pos = -1;
1881 bidi_it->bracket_enclosed_type = UNKNOWN_BT;
1884 /* If reseat()'ed, don't advance, so as to start iteration from the
1885 position where we were reseated. bidi_it->bytepos can be less
1886 than BEGV_BYTE after reseat to BEGV. */
1887 if (bidi_it->bytepos < (string_p ? 0 : BEGV_BYTE)
1888 || bidi_it->first_elt)
1890 bidi_it->first_elt = 0;
1891 if (string_p)
1893 const unsigned char *p
1894 = (STRINGP (bidi_it->string.lstring)
1895 ? SDATA (bidi_it->string.lstring)
1896 : bidi_it->string.s);
1898 if (bidi_it->charpos < 0)
1899 bidi_it->charpos = bidi_it->bytepos = 0;
1900 eassert (bidi_it->bytepos == bidi_count_bytes (p, 0, 0,
1901 bidi_it->charpos,
1902 bidi_it->string.unibyte));
1904 else
1906 if (bidi_it->charpos < BEGV)
1908 bidi_it->charpos = BEGV;
1909 bidi_it->bytepos = BEGV_BYTE;
1911 eassert (bidi_it->bytepos == CHAR_TO_BYTE (bidi_it->charpos));
1913 /* Determine the original bidi type of the previous character,
1914 which is needed for handling isolate initiators and PDF. The
1915 type of the previous character will be non-trivial only if
1916 our caller moved through some previous text in
1917 get_visually_first_element, in which case bidi_it->prev holds
1918 the information we want. */
1919 if (bidi_it->first_elt && bidi_it->prev.type != UNKNOWN_BT)
1921 eassert (bidi_it->prev.charpos == bidi_it->charpos - 1);
1922 prev_type = bidi_it->prev.orig_type;
1925 /* Don't move at end of buffer/string. */
1926 else if (bidi_it->charpos < (string_p ? bidi_it->string.schars : ZV))
1928 /* Advance to the next character, skipping characters covered by
1929 display strings (nchars > 1). */
1930 if (bidi_it->nchars <= 0)
1931 emacs_abort ();
1932 bidi_it->charpos += bidi_it->nchars;
1933 if (bidi_it->ch_len == 0)
1934 emacs_abort ();
1935 bidi_it->bytepos += bidi_it->ch_len;
1936 prev_type = bidi_it->orig_type;
1938 else /* EOB or end of string */
1939 prev_type = NEUTRAL_B;
1941 current_level = bidi_it->level_stack[bidi_it->stack_idx].level; /* X1 */
1942 isolate_status = ISOLATE_STATUS (bidi_it, bidi_it->stack_idx);
1943 override = OVERRIDE (bidi_it, bidi_it->stack_idx);
1944 new_level = current_level;
1946 if (bidi_it->charpos >= (string_p ? bidi_it->string.schars : ZV))
1948 curchar = BIDI_EOB;
1949 bidi_it->ch_len = 1;
1950 bidi_it->nchars = 1;
1951 bidi_it->disp_pos = (string_p ? bidi_it->string.schars : ZV);
1952 bidi_it->disp_prop = 0;
1954 else
1956 /* LRI, RLI, and FSI increment, and PDF decrements, the
1957 embedding level of the _following_ characters, so we must
1958 first look at the type of the previous character to support
1959 that. */
1960 switch (prev_type)
1962 case RLI: /* X5a */
1963 if (current_level < BIDI_MAXDEPTH
1964 && bidi_it->invalid_levels == 0
1965 && bidi_it->invalid_isolates == 0)
1967 new_level = ((current_level + 1) & ~1) + 1;
1968 bidi_it->isolate_level++;
1969 bidi_push_embedding_level (bidi_it, new_level,
1970 NEUTRAL_DIR, true);
1972 else
1973 bidi_it->invalid_isolates++;
1974 break;
1975 case LRI: /* X5b */
1976 if (current_level < BIDI_MAXDEPTH - 1
1977 && bidi_it->invalid_levels == 0
1978 && bidi_it->invalid_isolates == 0)
1980 new_level = ((current_level + 2) & ~1);
1981 bidi_it->isolate_level++;
1982 bidi_push_embedding_level (bidi_it, new_level,
1983 NEUTRAL_DIR, true);
1985 else
1986 bidi_it->invalid_isolates++;
1987 break;
1988 case PDF: /* X7 */
1989 if (!bidi_it->invalid_isolates)
1991 if (bidi_it->invalid_levels)
1992 bidi_it->invalid_levels--;
1993 else if (!isolate_status && bidi_it->stack_idx >= 1)
1994 new_level = bidi_pop_embedding_level (bidi_it);
1996 break;
1997 default:
1998 eassert (prev_type != FSI);
1999 /* Nothing. */
2000 break;
2002 /* Fetch the character at BYTEPOS. If it is covered by a
2003 display string, treat the entire run of covered characters as
2004 a single character u+FFFC. */
2005 curchar = bidi_fetch_char (bidi_it->charpos, bidi_it->bytepos,
2006 &bidi_it->disp_pos, &bidi_it->disp_prop,
2007 &bidi_it->string, bidi_it->w,
2008 bidi_it->frame_window_p,
2009 &bidi_it->ch_len, &bidi_it->nchars);
2011 bidi_it->ch = curchar;
2012 bidi_it->resolved_level = new_level;
2014 /* Don't apply directional override here, as all the types we handle
2015 below will not be affected by the override anyway, and we need
2016 the original type unaltered. The override will be applied in
2017 bidi_resolve_weak. */
2018 type = bidi_get_type (curchar, NEUTRAL_DIR);
2019 bidi_it->orig_type = type;
2020 bidi_check_type (bidi_it->orig_type);
2022 bidi_it->type_after_wn = UNKNOWN_BT;
2024 switch (type)
2026 case RLE: /* X2 */
2027 case RLO: /* X4 */
2028 bidi_it->type_after_wn = type;
2029 bidi_check_type (bidi_it->type_after_wn);
2030 type = WEAK_BN; /* X9/Retaining */
2031 if (new_level < BIDI_MAXDEPTH
2032 && bidi_it->invalid_levels == 0
2033 && bidi_it->invalid_isolates == 0)
2035 /* Compute the least odd embedding level greater than
2036 the current level. */
2037 new_level = ((new_level + 1) & ~1) + 1;
2038 if (bidi_it->type_after_wn == RLE)
2039 override = NEUTRAL_DIR;
2040 else
2041 override = R2L;
2042 bidi_push_embedding_level (bidi_it, new_level, override, false);
2043 bidi_it->resolved_level = new_level;
2045 else
2047 if (bidi_it->invalid_isolates == 0)
2048 bidi_it->invalid_levels++;
2050 break;
2051 case LRE: /* X3 */
2052 case LRO: /* X5 */
2053 bidi_it->type_after_wn = type;
2054 bidi_check_type (bidi_it->type_after_wn);
2055 type = WEAK_BN; /* X9/Retaining */
2056 if (new_level < BIDI_MAXDEPTH - 1
2057 && bidi_it->invalid_levels == 0
2058 && bidi_it->invalid_isolates == 0)
2060 /* Compute the least even embedding level greater than
2061 the current level. */
2062 new_level = ((new_level + 2) & ~1);
2063 if (bidi_it->type_after_wn == LRE)
2064 override = NEUTRAL_DIR;
2065 else
2066 override = L2R;
2067 bidi_push_embedding_level (bidi_it, new_level, override, false);
2068 bidi_it->resolved_level = new_level;
2070 else
2072 if (bidi_it->invalid_isolates == 0)
2073 bidi_it->invalid_levels++;
2075 break;
2076 case FSI: /* X5c */
2077 end = string_p ? bidi_it->string.schars : ZV;
2078 disp_pos = bidi_it->disp_pos;
2079 disp_prop = bidi_it->disp_prop;
2080 nchars = bidi_it->nchars;
2081 ch_len = bidi_it->ch_len;
2082 typ1 = find_first_strong_char (bidi_it->charpos,
2083 bidi_it->bytepos, end,
2084 &disp_pos, &disp_prop,
2085 &bidi_it->string, bidi_it->w,
2086 string_p, bidi_it->frame_window_p,
2087 &ch_len, &nchars, true);
2088 if (typ1 != STRONG_R && typ1 != STRONG_AL)
2090 type = LRI;
2091 /* Override orig_type, which will be needed when we come to
2092 examine the next character, which is the first character
2093 inside the isolate. */
2094 bidi_it->orig_type = type;
2095 goto fsi_as_lri;
2097 else
2099 type = RLI;
2100 bidi_it->orig_type = type;
2102 FALLTHROUGH;
2103 case RLI: /* X5a */
2104 if (override == NEUTRAL_DIR)
2105 bidi_it->type_after_wn = type;
2106 else /* Unicode 8.0 correction. */
2107 bidi_it->type_after_wn = (override == L2R ? STRONG_L : STRONG_R);
2108 bidi_check_type (bidi_it->type_after_wn);
2109 break;
2110 case LRI: /* X5b */
2111 fsi_as_lri:
2112 if (override == NEUTRAL_DIR)
2113 bidi_it->type_after_wn = type;
2114 else /* Unicode 8.0 correction. */
2115 bidi_it->type_after_wn = (override == L2R ? STRONG_L : STRONG_R);
2116 bidi_check_type (bidi_it->type_after_wn);
2117 break;
2118 case PDI: /* X6a */
2119 if (bidi_it->invalid_isolates)
2120 bidi_it->invalid_isolates--;
2121 else if (bidi_it->isolate_level > 0)
2123 bidi_it->invalid_levels = 0;
2124 while (!ISOLATE_STATUS (bidi_it, bidi_it->stack_idx))
2125 bidi_pop_embedding_level (bidi_it);
2126 eassert (bidi_it->stack_idx > 0);
2127 new_level = bidi_pop_embedding_level (bidi_it);
2128 bidi_it->isolate_level--;
2130 bidi_it->resolved_level = new_level;
2131 /* Unicode 8.0 correction. */
2133 bidi_dir_t stack_override = OVERRIDE (bidi_it, bidi_it->stack_idx);
2134 if (stack_override == L2R)
2135 bidi_it->type_after_wn = STRONG_L;
2136 else if (stack_override == R2L)
2137 bidi_it->type_after_wn = STRONG_R;
2138 else
2139 bidi_it->type_after_wn = type;
2141 break;
2142 case PDF: /* X7 */
2143 bidi_it->type_after_wn = type;
2144 bidi_check_type (bidi_it->type_after_wn);
2145 type = WEAK_BN; /* X9/Retaining */
2146 break;
2147 default:
2148 /* Nothing. */
2149 break;
2152 bidi_it->type = type;
2153 bidi_check_type (bidi_it->type);
2155 if (bidi_it->type == NEUTRAL_B) /* X8 */
2157 bidi_set_paragraph_end (bidi_it);
2158 /* This is needed by bidi_resolve_weak below, and in L1. */
2159 bidi_it->type_after_wn = bidi_it->type;
2162 eassert (bidi_it->resolved_level >= 0);
2163 return bidi_it->resolved_level;
2166 /* Advance in the buffer/string, resolve weak types and return the
2167 type of the next character after weak type resolution. */
2168 static bidi_type_t
2169 bidi_resolve_weak (struct bidi_it *bidi_it)
2171 bidi_type_t type;
2172 bidi_dir_t override;
2173 int prev_level = bidi_it->level_stack[bidi_it->stack_idx].level;
2174 int new_level = bidi_resolve_explicit (bidi_it);
2175 int next_char;
2176 bidi_type_t type_of_next;
2177 struct bidi_it saved_it;
2178 ptrdiff_t eob
2179 = ((STRINGP (bidi_it->string.lstring) || bidi_it->string.s)
2180 ? bidi_it->string.schars : ZV);
2182 type = bidi_it->type;
2183 override = OVERRIDE (bidi_it, bidi_it->stack_idx);
2185 eassert (!(type == UNKNOWN_BT
2186 || type == LRE
2187 || type == LRO
2188 || type == RLE
2189 || type == RLO
2190 || type == PDF));
2192 eassert (prev_level >= 0);
2193 if (bidi_it->type == NEUTRAL_B)
2195 /* We've got a new isolating sequence, compute the directional
2196 type of sos and initialize per-run variables (UAX#9, clause
2197 X10). */
2198 bidi_set_sos_type (bidi_it, prev_level, new_level);
2200 if (type == NEUTRAL_S || type == NEUTRAL_WS
2201 || type == WEAK_BN || type == STRONG_AL)
2202 bidi_it->type_after_wn = type; /* needed in L1 */
2203 bidi_check_type (bidi_it->type_after_wn);
2205 /* Level and directional override status are already recorded in
2206 bidi_it, and do not need any change; see X6. */
2207 if (override == R2L) /* X6 */
2208 type = STRONG_R;
2209 else if (override == L2R)
2210 type = STRONG_L;
2211 else
2213 if (type == WEAK_NSM) /* W1 */
2215 /* Note that we don't need to consider the case where the
2216 prev character has its type overridden by an RLO or LRO,
2217 because then either the type of this NSM would have been
2218 also overridden, or the previous character is outside the
2219 current level run, and thus not relevant to this NSM.
2220 This is why NSM gets the type_after_wn of the previous
2221 character. */
2222 /* bidi_set_sos_type sets type_after_wn to UNKNOWN_BT. */
2223 if (bidi_it->prev.type != UNKNOWN_BT
2224 /* If type_after_wn is NEUTRAL_B, this NSM is at sos. */
2225 && bidi_it->prev.type != NEUTRAL_B)
2227 if (bidi_isolate_fmt_char (bidi_it->prev.type))
2229 /* From W1: "Note that in an isolating run sequence,
2230 an isolate initiator followed by an NSM or any
2231 type other than PDI must be an overflow isolate
2232 initiator." */
2233 eassert (bidi_it->invalid_isolates > 0);
2234 type = NEUTRAL_ON;
2236 else
2238 /* This includes the Unicode 8.0 correction for N0,
2239 due to how we set prev.type in bidi_resolve_explicit,
2240 which see. */
2241 type = bidi_it->prev.type;
2244 else if (bidi_it->sos == R2L)
2245 type = STRONG_R;
2246 else if (bidi_it->sos == L2R)
2247 type = STRONG_L;
2248 else /* shouldn't happen! */
2249 emacs_abort ();
2251 if (type == WEAK_EN /* W2 */
2252 && bidi_it->last_strong.type == STRONG_AL)
2253 type = WEAK_AN;
2254 else if (type == STRONG_AL) /* W3 */
2255 type = STRONG_R;
2256 else if ((type == WEAK_ES /* W4 */
2257 && bidi_it->prev.type == WEAK_EN
2258 && bidi_it->prev.orig_type == WEAK_EN)
2259 || (type == WEAK_CS
2260 && ((bidi_it->prev.type == WEAK_EN
2261 && bidi_it->prev.orig_type == WEAK_EN)
2262 || bidi_it->prev.type == WEAK_AN)))
2264 const unsigned char *s
2265 = (STRINGP (bidi_it->string.lstring)
2266 ? SDATA (bidi_it->string.lstring)
2267 : bidi_it->string.s);
2269 next_char = (bidi_it->charpos + bidi_it->nchars >= eob
2270 ? BIDI_EOB
2271 : bidi_char_at_pos (bidi_it->bytepos + bidi_it->ch_len,
2272 s, bidi_it->string.unibyte));
2273 type_of_next = bidi_get_type (next_char, override);
2275 if (type_of_next == WEAK_BN
2276 || bidi_explicit_dir_char (next_char))
2278 bidi_copy_it (&saved_it, bidi_it);
2279 while (bidi_resolve_explicit (bidi_it) == new_level
2280 && bidi_it->type == WEAK_BN)
2281 type_of_next = bidi_it->type;
2282 bidi_copy_it (bidi_it, &saved_it);
2285 /* If the next character is EN, but the last strong-type
2286 character is AL, that next EN will be changed to AN when
2287 we process it in W2 above. So in that case, this ES
2288 should not be changed into EN. */
2289 if (type == WEAK_ES
2290 && type_of_next == WEAK_EN
2291 && bidi_it->last_strong.type != STRONG_AL)
2292 type = WEAK_EN;
2293 else if (type == WEAK_CS)
2295 if (bidi_it->prev.type == WEAK_AN
2296 && (type_of_next == WEAK_AN
2297 /* If the next character is EN, but the last
2298 strong-type character is AL, EN will be later
2299 changed to AN when we process it in W2 above.
2300 So in that case, this ES should not be
2301 changed into EN. */
2302 || (type_of_next == WEAK_EN
2303 && bidi_it->last_strong.type == STRONG_AL)))
2304 type = WEAK_AN;
2305 else if (bidi_it->prev.type == WEAK_EN
2306 && type_of_next == WEAK_EN
2307 && bidi_it->last_strong.type != STRONG_AL)
2308 type = WEAK_EN;
2311 else if (type == WEAK_ET /* W5: ET with EN before or after it */
2312 || type == WEAK_BN) /* W5/Retaining */
2314 if (bidi_it->prev.type == WEAK_EN) /* ET/BN w/EN before it */
2315 type = WEAK_EN;
2316 else if (bidi_it->next_en_pos > bidi_it->charpos
2317 && bidi_it->next_en_type != WEAK_BN)
2319 if (bidi_it->next_en_type == WEAK_EN) /* ET/BN with EN after it */
2320 type = WEAK_EN;
2322 else if (type == WEAK_BN
2323 /* This condition is for the following important case:
2325 . we are at level zero
2326 . either previous strong character was L,
2327 or we've seen no strong characters since sos
2328 and the base paragraph direction is L2R
2329 . this BN is NOT a bidi directional control
2331 For such a situation, either this BN will be
2332 converted to EN per W5, and then to L by virtue
2333 of W7; or it will become ON per W6, and then L
2334 because of N1/N2. So we take a shortcut here
2335 and make it L right away, to avoid the
2336 potentially costly loop below. This is
2337 important when the buffer has a long series of
2338 control characters, like binary nulls, and no
2339 R2L characters at all. */
2340 && new_level == 0
2341 && !bidi_explicit_dir_char (bidi_it->ch)
2342 && ((bidi_it->last_strong.type == STRONG_L)
2343 || (bidi_it->last_strong.type == UNKNOWN_BT
2344 && bidi_it->sos == L2R)))
2345 type = STRONG_L;
2346 else if (bidi_it->next_en_pos >= 0)
2348 /* We overstepped the last known position for ET
2349 resolution but there could be other such characters
2350 in this paragraph (when we are sure there are no more
2351 such positions, we set next_en_pos to a negative
2352 value). Try to find the next position for ET
2353 resolution. */
2354 ptrdiff_t en_pos = bidi_it->charpos + bidi_it->nchars;
2355 const unsigned char *s = (STRINGP (bidi_it->string.lstring)
2356 ? SDATA (bidi_it->string.lstring)
2357 : bidi_it->string.s);
2359 if (bidi_it->nchars <= 0)
2360 emacs_abort ();
2361 next_char
2362 = (bidi_it->charpos + bidi_it->nchars >= eob
2363 ? BIDI_EOB
2364 : bidi_char_at_pos (bidi_it->bytepos + bidi_it->ch_len, s,
2365 bidi_it->string.unibyte));
2366 type_of_next = bidi_get_type (next_char, override);
2368 if (type_of_next == WEAK_ET
2369 || type_of_next == WEAK_BN
2370 || bidi_explicit_dir_char (next_char))
2372 bidi_copy_it (&saved_it, bidi_it);
2373 while (bidi_resolve_explicit (bidi_it) == new_level
2374 && (bidi_it->type == WEAK_BN
2375 || bidi_it->type == WEAK_ET))
2376 type_of_next = bidi_it->type;
2377 if (type == WEAK_BN
2378 && bidi_it->charpos == saved_it.charpos + saved_it.nchars)
2380 /* If we entered the above loop with a BN that
2381 changes the level, the type of next
2382 character, which is in a different level, is
2383 not relevant to resolving this series of ET
2384 and BN. */
2385 en_pos = saved_it.charpos;
2386 type_of_next = type;
2388 else
2389 en_pos = bidi_it->charpos;
2390 bidi_copy_it (bidi_it, &saved_it);
2392 /* Remember this position, to speed up processing of the
2393 next ETs. */
2394 bidi_it->next_en_pos = en_pos;
2395 if (type_of_next == WEAK_EN)
2397 /* If the last strong character is AL, the EN we've
2398 found will become AN when we get to it (W2). */
2399 if (bidi_it->last_strong.type == STRONG_AL)
2400 type_of_next = WEAK_AN;
2401 else if (type == WEAK_BN)
2402 type = NEUTRAL_ON; /* W6/Retaining */
2403 else
2404 type = WEAK_EN;
2406 else if (type_of_next == NEUTRAL_B)
2407 /* Record the fact that there are no more ENs from
2408 here to the end of paragraph, to avoid entering the
2409 loop above ever again in this paragraph. */
2410 bidi_it->next_en_pos = -1;
2411 /* Record the type of the character where we ended our search. */
2412 bidi_it->next_en_type = type_of_next;
2417 if (type == WEAK_ES || type == WEAK_ET || type == WEAK_CS /* W6 */
2418 || (type == WEAK_BN
2419 && (bidi_it->prev.type == WEAK_CS /* W6/Retaining */
2420 || bidi_it->prev.type == WEAK_ES
2421 || bidi_it->prev.type == WEAK_ET)))
2422 type = NEUTRAL_ON;
2424 /* Store the type we've got so far, before we clobber it with strong
2425 types in W7 and while resolving neutral types. But leave alone
2426 the original types that were recorded above, because we will need
2427 them for the L1 clause. */
2428 if (bidi_it->type_after_wn == UNKNOWN_BT)
2429 bidi_it->type_after_wn = type;
2430 bidi_check_type (bidi_it->type_after_wn);
2432 if (type == WEAK_EN) /* W7 */
2434 if ((bidi_it->last_strong.type == STRONG_L)
2435 || (bidi_it->last_strong.type == UNKNOWN_BT && bidi_it->sos == L2R))
2436 type = STRONG_L;
2439 bidi_it->type = type;
2440 bidi_check_type (bidi_it->type);
2441 return type;
2444 /* Resolve the type of a neutral character according to the type of
2445 surrounding strong text and the current embedding level. */
2446 static bidi_type_t
2447 bidi_resolve_neutral_1 (bidi_type_t prev_type, bidi_type_t next_type, int lev)
2449 /* N1: "European and Arabic numbers act as if they were R in terms
2450 of their influence on NIs." */
2451 if (next_type == WEAK_EN || next_type == WEAK_AN)
2452 next_type = STRONG_R;
2453 if (prev_type == WEAK_EN || prev_type == WEAK_AN)
2454 prev_type = STRONG_R;
2456 if (next_type == prev_type) /* N1 */
2457 return next_type;
2458 else if ((lev & 1) == 0) /* N2 */
2459 return STRONG_L;
2460 else
2461 return STRONG_R;
2464 #define FLAG_EMBEDDING_INSIDE 1
2465 #define FLAG_OPPOSITE_INSIDE 2
2467 /* A data type used in the stack maintained by
2468 bidi_find_bracket_pairs below. */
2469 typedef struct bpa_stack_entry {
2470 int close_bracket_char;
2471 int open_bracket_idx;
2472 #ifdef ENABLE_CHECKING
2473 ptrdiff_t open_bracket_pos;
2474 #endif
2475 unsigned flags : 2;
2476 } bpa_stack_entry;
2478 /* Allow for the two struct bidi_it objects too, since they can be big.
2479 With MAX_ALLOCA of 16 KiB, this should allow at least 900 slots in the
2480 BPA stack, which should be more than enough for actual bidi text. */
2481 enum { MAX_BPA_STACK = max (1, ((MAX_ALLOCA - 2 * sizeof (struct bidi_it))
2482 / sizeof (bpa_stack_entry))) };
2484 /* UAX#9 says to match opening brackets with the matching closing
2485 brackets or their canonical equivalents. As of Unicode 8.0, there
2486 are only 2 bracket characters that have canonical equivalence
2487 decompositions: u+2329 and u+232A. So instead of accessing the
2488 table in uni-decomposition.el, we just handle these 2 characters
2489 with this simple macro. Note that ASCII characters don't have
2490 canonical equivalents by definition. */
2492 /* To find all the characters that need to be processed by
2493 CANONICAL_EQU, first find all the characters which have
2494 decompositions in UnicodeData.txt, with this Awk script:
2496 awk -F ";" " {if ($6 != \"\") print $1, $6}" UnicodeData.txt
2498 Then produce a list of all the bracket characters in BidiBrackets.txt:
2500 awk -F "[ ;]" " {if ($1 != \"#\" && $1 != \"\") print $1}" BidiBrackets.txt
2502 And finally, cross-reference these two:
2504 grep -Fw -f brackets.txt decompositions.txt
2506 where "decompositions.txt" was produced by the 1st script, and
2507 "brackets.txt" by the 2nd script. In the output of grep, look
2508 only for decompositions that don't begin with some compatibility
2509 formatting tag, such as "<compat>". Only decompositions that
2510 consist solely of character codepoints are relevant to bidi
2511 brackets processing. */
2513 #define CANONICAL_EQU(c) \
2514 ( ASCII_CHAR_P (c) ? c \
2515 : (c) == LEFT_POINTING_ANGLE_BRACKET ? LEFT_ANGLE_BRACKET \
2516 : (c) == RIGHT_POINTING_ANGLE_BRACKET ? RIGHT_ANGLE_BRACKET \
2517 : c )
2519 #ifdef ENABLE_CHECKING
2520 # define STORE_BRACKET_CHARPOS \
2521 bpa_stack[bpa_sp].open_bracket_pos = bidi_it->charpos
2522 #else
2523 # define STORE_BRACKET_CHARPOS /* nothing */
2524 #endif
2526 #define PUSH_BPA_STACK \
2527 do { \
2528 int ch; \
2529 if (bpa_sp < MAX_BPA_STACK - 1 && bidi_cache_last_idx <= INT_MAX) \
2531 bpa_sp++; \
2532 ch = CANONICAL_EQU (bidi_it->ch); \
2533 bpa_stack[bpa_sp].close_bracket_char = bidi_mirror_char (ch); \
2534 bpa_stack[bpa_sp].open_bracket_idx = bidi_cache_last_idx; \
2535 bpa_stack[bpa_sp].flags = 0; \
2536 STORE_BRACKET_CHARPOS; \
2538 } while (0)
2541 /* This function implements BPA, the Bidi Parenthesis Algorithm,
2542 described in BD16 and N0 of UAX#9. It finds all the bracket pairs
2543 in the current isolating sequence, and records the enclosed type
2544 and the position of the matching bracket in the cache. It returns
2545 non-zero if called with the iterator on the opening bracket which
2546 has a matching closing bracket in the current isolating sequence,
2547 zero otherwise. */
2548 static bool
2549 bidi_find_bracket_pairs (struct bidi_it *bidi_it)
2551 bidi_bracket_type_t btype;
2552 bidi_type_t type = bidi_it->type;
2553 bool retval = false;
2555 /* When scanning backwards, we don't expect any unresolved bidi
2556 bracket characters. */
2557 if (bidi_it->scan_dir != 1)
2558 emacs_abort ();
2560 btype = bidi_paired_bracket_type (bidi_it->ch);
2561 if (btype == BIDI_BRACKET_OPEN)
2563 bpa_stack_entry bpa_stack[MAX_BPA_STACK];
2564 int bpa_sp = -1;
2565 struct bidi_it saved_it;
2566 int base_level = bidi_it->level_stack[0].level;
2567 int embedding_level = bidi_it->level_stack[bidi_it->stack_idx].level;
2568 int maxlevel = embedding_level;
2569 bidi_type_t embedding_type = (embedding_level & 1) ? STRONG_R : STRONG_L;
2570 struct bidi_it tem_it;
2571 bool l2r_seen = false, r2l_seen = false;
2572 ptrdiff_t pairing_pos;
2573 int idx_at_entry = bidi_cache_idx;
2575 verify (MAX_BPA_STACK >= 100);
2576 bidi_copy_it (&saved_it, bidi_it);
2577 /* bidi_cache_iterator_state refuses to cache on backward scans,
2578 and bidi_cache_fetch_state doesn't bring scan_dir from the
2579 cache, so we must initialize this explicitly. */
2580 tem_it.scan_dir = 1;
2582 while (1)
2584 int old_sidx, new_sidx;
2585 int current_level = bidi_it->level_stack[bidi_it->stack_idx].level;
2587 if (maxlevel < current_level)
2588 maxlevel = current_level;
2589 /* Mark every opening bracket character we've traversed by
2590 putting its own position into bracket_pairing_pos. This
2591 is examined in bidi_resolve_brackets to distinguish
2592 brackets that were already resolved to stay NEUTRAL_ON,
2593 and those that were not yet processed by this function
2594 (because they were skipped when we skip higher embedding
2595 levels below). */
2596 if (btype == BIDI_BRACKET_OPEN && bidi_it->bracket_pairing_pos == -1)
2597 bidi_it->bracket_pairing_pos = bidi_it->charpos;
2598 if (!bidi_cache_iterator_state (bidi_it, type == NEUTRAL_B, 0))
2600 /* No more space in cache -- give up and let the opening
2601 bracket that started this be processed as a
2602 NEUTRAL_ON. */
2603 bidi_cache_reset_to (idx_at_entry - bidi_cache_start);
2604 bidi_copy_it (bidi_it, &saved_it);
2605 goto give_up;
2607 if (btype == BIDI_BRACKET_OPEN)
2608 PUSH_BPA_STACK;
2609 else if (btype == BIDI_BRACKET_CLOSE)
2611 int sp = bpa_sp;
2612 int curchar = CANONICAL_EQU (bidi_it->ch);
2614 eassert (sp >= 0);
2615 while (sp >= 0 && bpa_stack[sp].close_bracket_char != curchar)
2616 sp--;
2617 if (sp >= 0)
2619 /* Update and cache the corresponding opening bracket. */
2620 bidi_cache_fetch_state (bpa_stack[sp].open_bracket_idx,
2621 &tem_it);
2622 #ifdef ENABLE_CHECKING
2623 eassert (bpa_stack[sp].open_bracket_pos == tem_it.charpos);
2624 #endif
2625 /* Determine the enclosed type for this bracket
2626 pair's type resolution according to N0. */
2627 if (bpa_stack[sp].flags & FLAG_EMBEDDING_INSIDE)
2628 tem_it.bracket_enclosed_type = embedding_type; /* N0b */
2629 else if (bpa_stack[sp].flags & FLAG_OPPOSITE_INSIDE)
2630 tem_it.bracket_enclosed_type /* N0c */
2631 = (embedding_type == STRONG_L ? STRONG_R : STRONG_L);
2632 else /* N0d */
2633 tem_it.bracket_enclosed_type = UNKNOWN_BT;
2635 /* Record the position of the matching closing
2636 bracket, and update the cache. */
2637 tem_it.bracket_pairing_pos = bidi_it->charpos;
2638 bidi_cache_iterator_state (&tem_it, 0, 1);
2640 /* Pop the BPA stack. */
2641 bpa_sp = sp - 1;
2643 if (bpa_sp < 0)
2645 retval = true;
2646 break;
2649 else if (bidi_get_category (bidi_it->type_after_wn) != NEUTRAL)
2651 unsigned flag = 0;
2652 int sp;
2654 /* Whenever we see a strong type, update the flags of
2655 all the slots on the stack. */
2656 switch (bidi_it->type)
2658 case STRONG_L:
2659 flag = ((embedding_level & 1) == 0
2660 ? FLAG_EMBEDDING_INSIDE
2661 : FLAG_OPPOSITE_INSIDE);
2662 l2r_seen = true;
2663 break;
2664 case STRONG_R:
2665 case WEAK_EN:
2666 case WEAK_AN:
2667 flag = ((embedding_level & 1) == 1
2668 ? FLAG_EMBEDDING_INSIDE
2669 : FLAG_OPPOSITE_INSIDE);
2670 r2l_seen = true;
2671 break;
2672 default:
2673 break;
2675 if (flag)
2677 for (sp = bpa_sp; sp >= 0; sp--)
2678 bpa_stack[sp].flags |= flag;
2681 old_sidx = bidi_it->stack_idx;
2682 type = bidi_resolve_weak (bidi_it);
2683 /* Skip level runs excluded from this isolating run sequence. */
2684 new_sidx = bidi_it->stack_idx;
2685 if (bidi_it->level_stack[new_sidx].level > current_level
2686 && (ISOLATE_STATUS (bidi_it, new_sidx)
2687 || (new_sidx > old_sidx + 1
2688 && ISOLATE_STATUS (bidi_it, new_sidx - 1))))
2690 while (bidi_it->level_stack[bidi_it->stack_idx].level
2691 > current_level)
2693 if (maxlevel < bidi_it->level_stack[bidi_it->stack_idx].level)
2694 maxlevel = bidi_it->level_stack[bidi_it->stack_idx].level;
2695 if (!bidi_cache_iterator_state (bidi_it,
2696 type == NEUTRAL_B, 0))
2698 /* No more space in cache -- give up and let the
2699 opening bracket that started this be
2700 processed as any other NEUTRAL_ON. */
2701 bidi_cache_reset_to (idx_at_entry - bidi_cache_start);
2702 bidi_copy_it (bidi_it, &saved_it);
2703 goto give_up;
2705 type = bidi_resolve_weak (bidi_it);
2708 if (type == NEUTRAL_B
2709 || (bidi_it->level_stack[bidi_it->stack_idx].level
2710 != current_level))
2712 /* We've marched all the way to the end of this
2713 isolating run sequence, and didn't find matching
2714 closing brackets for some opening brackets. Leave
2715 their type unchanged. */
2716 pairing_pos = bidi_it->charpos;
2717 break;
2719 if (bidi_it->type_after_wn == NEUTRAL_ON) /* Unicode 8.0 correction */
2720 btype = bidi_paired_bracket_type (bidi_it->ch);
2721 else
2722 btype = BIDI_BRACKET_NONE;
2725 /* Restore bidi_it from the cache, which should have the bracket
2726 resolution members set as determined by the above loop. */
2727 type = bidi_cache_find (saved_it.charpos, 0, bidi_it);
2728 eassert (type == NEUTRAL_ON);
2730 /* The following is an optimization for bracketed text that has
2731 only one level which is equal to the paragraph's base
2732 embedding level. That is, only L2R and weak/neutral
2733 characters in a L2R paragraph, or only R2L and weak/neutral
2734 characters in a R2L paragraph. Such brackets can be resolved
2735 by bidi_resolve_neutral, which has a further shortcut for
2736 this case. So we pretend we did not resolve the brackets in
2737 this case, set up next_for_neutral for the entire bracketed
2738 text, and reset the cache to the character before the opening
2739 bracket. The upshot is to allow bidi_move_to_visually_next
2740 reset the cache when it returns this opening bracket, thus
2741 cutting significantly on the size of the cache, which is
2742 important with long lines, especially if word-wrap is non-nil
2743 (which requires the display engine to copy the cache back and
2744 forth many times). */
2745 if (maxlevel == base_level
2746 && ((base_level == 0 && !r2l_seen)
2747 || (base_level == 1 && !l2r_seen)))
2749 ptrdiff_t eob
2750 = ((bidi_it->string.s || STRINGP (bidi_it->string.lstring))
2751 ? bidi_it->string.schars : ZV);
2753 if (retval)
2754 pairing_pos = bidi_it->bracket_pairing_pos;
2756 /* This special value (which cannot possibly happen when
2757 brackets are resolved, since there's no character at ZV)
2758 will be noticed by bidi_resolve_explicit, and will be
2759 copied to the following iterator states, instead of being
2760 reset to -1. */
2761 bidi_it->bracket_pairing_pos = eob;
2762 /* This type value will be used for resolving the outermost
2763 closing bracket in bidi_resolve_brackets. */
2764 bidi_it->bracket_enclosed_type = embedding_type;
2765 /* bidi_cache_last_idx is set to the index of the current
2766 state, because we just called bidi_cache_find above.
2767 That state describes the outermost opening bracket, the
2768 one with which we entered this function. Force the cache
2769 to "forget" all the cached states starting from that state. */
2770 bidi_cache_reset_to (bidi_cache_last_idx - bidi_cache_start);
2771 /* Set up the next_for_neutral member, to help
2772 bidi_resolve_neutral. */
2773 bidi_it->next_for_neutral.type = embedding_type;
2774 bidi_it->next_for_neutral.charpos = pairing_pos;
2775 /* Pretend we didn't resolve this bracket. */
2776 retval = false;
2780 give_up:
2781 return retval;
2784 static void
2785 bidi_record_type_for_neutral (struct bidi_saved_info *info, int level,
2786 bool nextp)
2788 int idx;
2790 for (idx = bidi_cache_last_idx + 1; idx < bidi_cache_idx; idx++)
2792 int lev = bidi_cache[idx].level_stack[bidi_cache[idx].stack_idx].level;
2794 if (lev <= level)
2796 eassert (lev == level);
2797 if (nextp)
2798 bidi_cache[idx].next_for_neutral = *info;
2799 else
2800 bidi_cache[idx].prev_for_neutral = *info;
2801 break;
2806 static bidi_type_t
2807 bidi_resolve_brackets (struct bidi_it *bidi_it)
2809 int prev_level = bidi_it->level_stack[bidi_it->stack_idx].level;
2810 bool resolve_bracket = false;
2811 bidi_type_t type = UNKNOWN_BT;
2812 int ch;
2813 struct bidi_saved_info prev_for_neutral, next_for_neutral;
2814 ptrdiff_t eob
2815 = ((bidi_it->string.s || STRINGP (bidi_it->string.lstring))
2816 ? bidi_it->string.schars : ZV);
2818 /* Record the prev_for_neutral type either from the previous
2819 character, if it was a strong or AN/EN, or from the
2820 prev_for_neutral information recorded previously. */
2821 if (bidi_it->type == STRONG_L || bidi_it->type == STRONG_R
2822 || bidi_it->type == WEAK_AN || bidi_it->type == WEAK_EN)
2823 bidi_remember_char (&prev_for_neutral, bidi_it, 1);
2824 else
2825 prev_for_neutral = bidi_it->prev_for_neutral;
2826 /* Record the next_for_neutral type information. */
2827 if (bidi_it->next_for_neutral.charpos > bidi_it->charpos)
2828 next_for_neutral = bidi_it->next_for_neutral;
2829 else
2830 next_for_neutral.charpos = -1;
2831 if (!bidi_it->first_elt)
2833 type = bidi_cache_find (bidi_it->charpos + bidi_it->nchars, 0, bidi_it);
2834 ch = bidi_it->ch;
2836 if (type == UNKNOWN_BT)
2838 type = bidi_resolve_weak (bidi_it);
2839 if (type == NEUTRAL_ON)
2841 /* bracket_pairing_pos == eob means this bracket does not
2842 need to be resolved as a bracket, but as a neutral, see
2843 the optimization trick we play near the end of
2844 bidi_find_bracket_pairs. */
2845 if (bidi_it->bracket_pairing_pos == eob)
2847 /* If this is the outermost closing bracket of a run of
2848 characters in which we decided to resolve brackets as
2849 neutrals, use the embedding level's type, recorded in
2850 bracket_enclosed_type, to resolve the bracket. */
2851 if (bidi_it->next_for_neutral.charpos == bidi_it->charpos
2852 && bidi_paired_bracket_type (bidi_it->ch) == BIDI_BRACKET_CLOSE)
2853 type = bidi_it->bracket_enclosed_type;
2855 else if (bidi_find_bracket_pairs (bidi_it))
2856 resolve_bracket = true;
2859 else if (bidi_it->bracket_pairing_pos != eob)
2861 eassert (bidi_it->resolved_level == -1);
2862 /* If the cached state shows an increase of embedding level due
2863 to an isolate initiator, we need to update the 1st cached
2864 state of the next run of the current isolating sequence with
2865 the prev_for_neutral and next_for_neutral information, so
2866 that it will be picked up when we advance to that next run. */
2867 if (bidi_it->level_stack[bidi_it->stack_idx].level > prev_level
2868 && ISOLATE_STATUS (bidi_it, bidi_it->stack_idx))
2870 bidi_record_type_for_neutral (&prev_for_neutral, prev_level, 0);
2871 bidi_record_type_for_neutral (&next_for_neutral, prev_level, 1);
2873 if (type == NEUTRAL_ON
2874 && bidi_paired_bracket_type (ch) == BIDI_BRACKET_OPEN)
2876 if (bidi_it->bracket_pairing_pos > bidi_it->charpos)
2878 /* A cached opening bracket that wasn't completely
2879 resolved yet. */
2880 resolve_bracket = true;
2882 else if (bidi_it->bracket_pairing_pos == -1)
2884 /* Higher levels were not BPA-resolved yet, even if
2885 cached by bidi_find_bracket_pairs. Force application
2886 of BPA to the new level now. */
2887 if (bidi_find_bracket_pairs (bidi_it))
2888 resolve_bracket = true;
2891 /* Keep track of the prev_for_neutral and next_for_neutral
2892 types, needed for resolving brackets below and for resolving
2893 neutrals in bidi_resolve_neutral. */
2894 if (bidi_it->level_stack[bidi_it->stack_idx].level == prev_level)
2896 bidi_it->prev_for_neutral = prev_for_neutral;
2897 if (next_for_neutral.charpos > 0)
2898 bidi_it->next_for_neutral = next_for_neutral;
2902 /* If needed, resolve the bracket type according to N0. */
2903 if (resolve_bracket)
2905 int embedding_level = bidi_it->level_stack[bidi_it->stack_idx].level;
2906 bidi_type_t embedding_type = (embedding_level & 1) ? STRONG_R : STRONG_L;
2908 eassert (bidi_it->prev_for_neutral.type != UNKNOWN_BT);
2909 eassert (bidi_it->bracket_pairing_pos > bidi_it->charpos);
2910 if (bidi_it->bracket_enclosed_type == embedding_type) /* N0b */
2911 type = embedding_type;
2912 else
2914 switch (bidi_it->prev_for_neutral.type)
2916 case STRONG_R:
2917 case WEAK_EN:
2918 case WEAK_AN:
2919 type =
2920 (bidi_it->bracket_enclosed_type == STRONG_R) /* N0c */
2921 ? STRONG_R /* N0c1 */
2922 : embedding_type; /* N0c2 */
2923 break;
2924 case STRONG_L:
2925 type =
2926 (bidi_it->bracket_enclosed_type == STRONG_L) /* N0c */
2927 ? STRONG_L /* N0c1 */
2928 : embedding_type; /* N0c2 */
2929 break;
2930 default:
2931 /* N0d: Do not set the type for that bracket pair. */
2932 break;
2935 eassert (type == STRONG_L || type == STRONG_R || type == NEUTRAL_ON);
2937 /* Update the type of the paired closing bracket to the same
2938 type as for the resolved opening bracket. */
2939 if (type != NEUTRAL_ON)
2941 ptrdiff_t idx = bidi_cache_search (bidi_it->bracket_pairing_pos,
2942 -1, 1);
2944 if (idx < bidi_cache_start)
2945 emacs_abort ();
2946 bidi_cache[idx].type = type;
2950 return type;
2953 static bidi_type_t
2954 bidi_resolve_neutral (struct bidi_it *bidi_it)
2956 bidi_type_t type = bidi_resolve_brackets (bidi_it);
2957 int current_level;
2958 bool is_neutral;
2960 eassert (type == STRONG_R
2961 || type == STRONG_L
2962 || type == WEAK_BN
2963 || type == WEAK_EN
2964 || type == WEAK_AN
2965 || type == NEUTRAL_B
2966 || type == NEUTRAL_S
2967 || type == NEUTRAL_WS
2968 || type == NEUTRAL_ON
2969 || type == LRI
2970 || type == RLI
2971 || type == PDI);
2973 current_level = bidi_it->level_stack[bidi_it->stack_idx].level;
2974 eassert (current_level >= 0);
2975 is_neutral = bidi_get_category (type) == NEUTRAL;
2977 if ((type != NEUTRAL_B /* Don't risk entering the long loop below if
2978 we are already at paragraph end. */
2979 && (is_neutral || bidi_isolate_fmt_char (type)))
2980 /* N1-N2/Retaining */
2981 || type == WEAK_BN)
2983 if (bidi_it->next_for_neutral.type != UNKNOWN_BT
2984 && (bidi_it->next_for_neutral.charpos > bidi_it->charpos
2985 /* PDI defines an eos, so it's OK for it to serve as its
2986 own next_for_neutral. */
2987 || (bidi_it->next_for_neutral.charpos == bidi_it->charpos
2988 && bidi_it->type == PDI)))
2990 type = bidi_resolve_neutral_1 (bidi_it->prev_for_neutral.type,
2991 bidi_it->next_for_neutral.type,
2992 current_level);
2994 /* The next two "else if" clauses are shortcuts for the
2995 important special case when we have a long sequence of
2996 neutral or WEAK_BN characters, such as whitespace or nulls or
2997 other control characters, on the base embedding level of the
2998 paragraph, and that sequence goes all the way to the end of
2999 the paragraph and follows a character whose resolved
3000 directionality is identical to the base embedding level.
3001 (This is what happens in a buffer with plain L2R text that
3002 happens to include long sequences of control characters.) By
3003 virtue of N1, the result of examining this long sequence will
3004 always be either STRONG_L or STRONG_R, depending on the base
3005 embedding level. So we use this fact directly instead of
3006 entering the expensive loop in the "else" clause. */
3007 else if (current_level == 0
3008 && bidi_it->prev_for_neutral.type == STRONG_L
3009 && (ASCII_CHAR_P (bidi_it->ch)
3010 || (type != WEAK_BN
3011 && !bidi_explicit_dir_char (bidi_it->ch)
3012 && !bidi_isolate_fmt_char (type))))
3013 type = bidi_resolve_neutral_1 (bidi_it->prev_for_neutral.type,
3014 STRONG_L, current_level);
3015 else if (/* current level is 1 */
3016 current_level == 1
3017 /* base embedding level is also 1 */
3018 && bidi_it->level_stack[0].level == 1
3019 /* previous character is one of those considered R for
3020 the purposes of W5 */
3021 && (bidi_it->prev_for_neutral.type == STRONG_R
3022 || bidi_it->prev_for_neutral.type == WEAK_EN
3023 || bidi_it->prev_for_neutral.type == WEAK_AN)
3024 && type != WEAK_BN
3025 && !bidi_explicit_dir_char (bidi_it->ch)
3026 && !bidi_isolate_fmt_char (type))
3027 type = bidi_resolve_neutral_1 (bidi_it->prev_for_neutral.type,
3028 STRONG_R, current_level);
3029 else
3031 /* Arrrgh!! The UAX#9 algorithm is too deeply entrenched in
3032 the assumption of batch-style processing; see clauses W4,
3033 W5, and especially N1, which require looking far forward
3034 (as well as back) in the buffer/string. May the fleas of
3035 a thousand camels infest the armpits of those who design
3036 supposedly general-purpose algorithms by looking at their
3037 own implementations, and fail to consider other possible
3038 implementations! */
3039 struct bidi_it saved_it;
3040 bidi_type_t next_type;
3041 bool adjacent_to_neutrals = is_neutral;
3043 bidi_copy_it (&saved_it, bidi_it);
3044 /* Scan the text forward until we find the first non-neutral
3045 character, and then use that to resolve the neutral we
3046 are dealing with now. We also cache the scanned iterator
3047 states, to salvage some of the effort later. */
3048 do {
3049 int old_sidx, new_sidx;
3051 /* Paragraph separators have their levels fully resolved
3052 at this point, so cache them as resolved. */
3053 bidi_cache_iterator_state (bidi_it, type == NEUTRAL_B, 0);
3054 old_sidx = bidi_it->stack_idx;
3055 type = bidi_resolve_brackets (bidi_it);
3056 /* Skip level runs excluded from this isolating run sequence. */
3057 new_sidx = bidi_it->stack_idx;
3058 if (bidi_it->level_stack[new_sidx].level > current_level
3059 && (ISOLATE_STATUS (bidi_it, new_sidx)
3060 /* This is for when we have an isolate initiator
3061 immediately followed by an embedding or
3062 override initiator, in which case we get the
3063 level stack pushed twice by the single call to
3064 bidi_resolve_weak above. */
3065 || (new_sidx > old_sidx + 1
3066 && ISOLATE_STATUS (bidi_it, new_sidx - 1))))
3068 while (bidi_it->level_stack[bidi_it->stack_idx].level
3069 > current_level)
3071 bidi_cache_iterator_state (bidi_it, type == NEUTRAL_B, 0);
3072 type = bidi_resolve_brackets (bidi_it);
3075 if (!adjacent_to_neutrals
3076 && (bidi_get_category (type) == NEUTRAL
3077 || bidi_isolate_fmt_char (type)))
3078 adjacent_to_neutrals = true;
3079 } while (!(type == NEUTRAL_B
3080 || (type != WEAK_BN
3081 && bidi_get_category (type) != NEUTRAL
3082 && !bidi_isolate_fmt_char (type))
3083 /* This is all per level run, so stop when we
3084 reach the end of this level run. */
3085 || (bidi_it->level_stack[bidi_it->stack_idx].level
3086 != current_level)));
3088 /* Record the character we stopped at. */
3089 bidi_remember_char (&saved_it.next_for_neutral, bidi_it, 1);
3091 if ((bidi_it->level_stack[bidi_it->stack_idx].level != current_level)
3092 || type == NEUTRAL_B)
3094 /* Marched all the way to the end of this level run. We
3095 need to use the eos type, whose information is stored
3096 by bidi_set_sos_type in the prev_for_neutral
3097 member. */
3098 if (adjacent_to_neutrals)
3099 next_type = bidi_it->prev_for_neutral.type;
3100 else
3102 /* This is a BN which does not adjoin neutrals.
3103 Leave its type alone. */
3104 bidi_copy_it (bidi_it, &saved_it);
3105 return bidi_it->type;
3108 else
3110 switch (type)
3112 case STRONG_L:
3113 case STRONG_R:
3114 case STRONG_AL:
3115 /* Actually, STRONG_AL cannot happen here, because
3116 bidi_resolve_weak converts it to STRONG_R, per W3. */
3117 eassert (type != STRONG_AL);
3118 next_type = type;
3119 break;
3120 case WEAK_EN:
3121 case WEAK_AN:
3122 /* N1: "European and Arabic numbers act as if they
3123 were R in terms of their influence on NIs." */
3124 next_type = STRONG_R;
3125 break;
3126 default:
3127 emacs_abort ();
3128 break;
3131 /* Resolve the type of all the NIs found during the above loop. */
3132 type = bidi_resolve_neutral_1 (saved_it.prev_for_neutral.type,
3133 next_type, current_level);
3134 /* Update next_for_neutral with the resolved type, so we
3135 could use it for all the other NIs up to the place where
3136 we exited the loop. */
3137 saved_it.next_for_neutral.type = next_type;
3138 bidi_check_type (type);
3139 /* Update the character which caused us to enter the above loop. */
3140 saved_it.type = type;
3141 bidi_check_type (next_type);
3142 bidi_copy_it (bidi_it, &saved_it);
3145 return type;
3148 /* Given an iterator state in BIDI_IT, advance one character position
3149 in the buffer/string to the next character (in the logical order),
3150 resolve the bidi type of that next character, and return that
3151 type. */
3152 static bidi_type_t
3153 bidi_type_of_next_char (struct bidi_it *bidi_it)
3155 bidi_type_t type;
3157 /* This should always be called during a forward scan. */
3158 if (bidi_it->scan_dir != 1)
3159 emacs_abort ();
3161 type = bidi_resolve_neutral (bidi_it);
3163 return type;
3166 /* Given an iterator state BIDI_IT, advance one character position in
3167 the buffer/string to the next character (in the current scan
3168 direction), resolve the embedding and implicit levels of that next
3169 character, and return the resulting level. */
3170 static int
3171 bidi_level_of_next_char (struct bidi_it *bidi_it)
3173 bidi_type_t type = UNKNOWN_BT;
3174 int level;
3175 ptrdiff_t next_char_pos = -2;
3177 if (bidi_it->scan_dir == 1)
3179 ptrdiff_t eob
3180 = ((bidi_it->string.s || STRINGP (bidi_it->string.lstring))
3181 ? bidi_it->string.schars : ZV);
3183 /* There's no sense in trying to advance if we've already hit
3184 the end of text. */
3185 if (bidi_it->charpos >= eob)
3187 eassert (bidi_it->resolved_level >= 0);
3188 return bidi_it->resolved_level;
3192 /* Perhaps the character we want is already cached as fully resolved.
3193 If it is, the call to bidi_cache_find below will return a type
3194 other than UNKNOWN_BT. */
3195 if (bidi_cache_idx > bidi_cache_start && !bidi_it->first_elt)
3197 int bob = ((bidi_it->string.s || STRINGP (bidi_it->string.lstring))
3198 ? 0 : 1);
3200 if (bidi_it->scan_dir > 0)
3202 if (bidi_it->nchars <= 0)
3203 emacs_abort ();
3204 next_char_pos = bidi_it->charpos + bidi_it->nchars;
3206 else if (bidi_it->charpos >= bob)
3207 /* Implementation note: we allow next_char_pos to be as low as
3208 0 for buffers or -1 for strings, and that is okay because
3209 that's the "position" of the sentinel iterator state we
3210 cached at the beginning of the iteration. */
3211 next_char_pos = bidi_it->charpos - 1;
3212 if (next_char_pos >= bob - 1)
3213 type = bidi_cache_find (next_char_pos, 1, bidi_it);
3214 if (type != UNKNOWN_BT)
3216 /* We asked the cache for fully resolved states. */
3217 eassert (bidi_it->resolved_level >= 0);
3218 return bidi_it->resolved_level;
3222 if (bidi_it->scan_dir == -1)
3223 /* If we are going backwards, the iterator state is already cached
3224 from previous scans, and should be fully resolved. */
3225 emacs_abort ();
3227 if (type == UNKNOWN_BT)
3228 type = bidi_type_of_next_char (bidi_it);
3230 if (type == NEUTRAL_B)
3232 eassert (bidi_it->resolved_level >= 0);
3233 return bidi_it->resolved_level;
3236 level = bidi_it->level_stack[bidi_it->stack_idx].level;
3238 eassert ((type == STRONG_R
3239 || type == STRONG_L
3240 || type == WEAK_BN
3241 || type == WEAK_EN
3242 || type == WEAK_AN));
3243 bidi_it->type = type;
3244 bidi_check_type (bidi_it->type);
3246 /* For L1 below, we need to know, for each WS character, whether
3247 it belongs to a sequence of WS characters preceding a newline
3248 or a TAB or a paragraph separator. */
3249 if ((bidi_it->orig_type == NEUTRAL_WS
3250 || bidi_it->orig_type == WEAK_BN
3251 || bidi_isolate_fmt_char (bidi_it->orig_type))
3252 && bidi_it->next_for_ws.charpos < bidi_it->charpos
3253 /* If this character is already at base level, we don't need to
3254 reset it, so avoid the potentially costly loop below. */
3255 && level != bidi_it->level_stack[0].level)
3257 int ch;
3258 ptrdiff_t clen = bidi_it->ch_len;
3259 ptrdiff_t bpos = bidi_it->bytepos;
3260 ptrdiff_t cpos = bidi_it->charpos;
3261 ptrdiff_t disp_pos = bidi_it->disp_pos;
3262 ptrdiff_t nc = bidi_it->nchars;
3263 struct bidi_string_data bs = bidi_it->string;
3264 bidi_type_t chtype;
3265 bool fwp = bidi_it->frame_window_p;
3266 int dpp = bidi_it->disp_prop;
3268 if (bidi_it->nchars <= 0)
3269 emacs_abort ();
3270 do {
3271 ch = bidi_fetch_char (cpos += nc, bpos += clen, &disp_pos, &dpp, &bs,
3272 bidi_it->w, fwp, &clen, &nc);
3273 chtype = bidi_get_type (ch, NEUTRAL_DIR);
3274 } while (chtype == NEUTRAL_WS || chtype == WEAK_BN
3275 || bidi_isolate_fmt_char (chtype)
3276 || bidi_explicit_dir_char (ch)); /* L1/Retaining */
3277 bidi_it->next_for_ws.type = chtype;
3278 bidi_check_type (bidi_it->next_for_ws.type);
3279 bidi_it->next_for_ws.charpos = cpos;
3282 /* Update the cache, but only if this state was already cached. */
3283 bidi_cache_iterator_state (bidi_it, 1, 1);
3285 /* Resolve implicit levels. */
3286 if (bidi_it->orig_type == NEUTRAL_B /* L1 */
3287 || bidi_it->orig_type == NEUTRAL_S
3288 || bidi_it->ch == '\n' || bidi_it->ch == BIDI_EOB
3289 || ((bidi_it->orig_type == NEUTRAL_WS
3290 || bidi_it->orig_type == WEAK_BN
3291 || bidi_isolate_fmt_char (bidi_it->orig_type)
3292 || bidi_explicit_dir_char (bidi_it->ch))
3293 && (bidi_it->next_for_ws.type == NEUTRAL_B
3294 || bidi_it->next_for_ws.type == NEUTRAL_S)))
3295 level = bidi_it->level_stack[0].level;
3296 else if ((level & 1) == 0) /* I1 */
3298 if (type == STRONG_R)
3299 level++;
3300 else if (type == WEAK_EN || type == WEAK_AN)
3301 level += 2;
3303 else /* I2 */
3305 if (type == STRONG_L || type == WEAK_EN || type == WEAK_AN)
3306 level++;
3309 bidi_it->resolved_level = level;
3310 return level;
3313 /* Move to the other edge of a level given by LEVEL. If END_FLAG,
3314 we are at the end of a level, and we need to prepare to
3315 resume the scan of the lower level.
3317 If this level's other edge is cached, we simply jump to it, filling
3318 the iterator structure with the iterator state on the other edge.
3319 Otherwise, we walk the buffer or string until we come back to the
3320 same level as LEVEL.
3322 Note: we are not talking here about a ``level run'' in the UAX#9
3323 sense of the term, but rather about a ``level'' which includes
3324 all the levels higher than it. In other words, given the levels
3325 like this:
3327 11111112222222333333334443343222222111111112223322111
3328 A B C
3330 and assuming we are at point A scanning left to right, this
3331 function moves to point C, whereas the UAX#9 ``level 2 run'' ends
3332 at point B. */
3333 static void
3334 bidi_find_other_level_edge (struct bidi_it *bidi_it, int level, bool end_flag)
3336 int dir = end_flag ? -bidi_it->scan_dir : bidi_it->scan_dir;
3337 ptrdiff_t idx;
3339 /* Try the cache first. */
3340 if ((idx = bidi_cache_find_level_change (level, dir, end_flag))
3341 >= bidi_cache_start)
3342 bidi_cache_fetch_state (idx, bidi_it);
3343 else
3345 int new_level;
3347 /* If we are at end of level, its edges must be cached. */
3348 if (end_flag)
3349 emacs_abort ();
3351 if (!bidi_cache_iterator_state (bidi_it, 1, 0))
3353 /* Can't happen: if the cache needs to grow, it means we
3354 were at base embedding level, so the cache should have
3355 been either empty or already large enough to cover this
3356 character position. */
3357 emacs_abort ();
3359 do {
3360 new_level = bidi_level_of_next_char (bidi_it);
3361 /* If the cache is full, perform an emergency return by
3362 pretending that the level ended. */
3363 if (!bidi_cache_iterator_state (bidi_it, 1, 0))
3365 new_level = level - 1;
3366 /* Since the cache should only grow when we are scanning
3367 forward looking for the edge of the level that is one
3368 above the base embedding level, we can only have this
3369 contingency when LEVEL - 1 is the base embedding
3370 level. */
3371 eassert (new_level == bidi_it->level_stack[0].level);
3372 /* Plan B, for when the cache overflows: Back up to the
3373 previous character by fetching the last cached state,
3374 and force the resolved level of that character be the
3375 base embedding level. */
3376 bidi_cache_fetch_state (bidi_cache_idx - 1, bidi_it);
3377 bidi_it->resolved_level = new_level;
3378 bidi_cache_iterator_state (bidi_it, 1, 1);
3380 } while (new_level >= level);
3384 void
3385 bidi_move_to_visually_next (struct bidi_it *bidi_it)
3387 int old_level, new_level, next_level;
3388 struct bidi_it sentinel;
3390 if (bidi_it->charpos < 0 || bidi_it->bytepos < 0)
3391 emacs_abort ();
3393 if (bidi_it->scan_dir == 0)
3395 bidi_it->scan_dir = 1; /* default to logical order */
3398 /* If we just passed a newline, initialize for the next line. */
3399 if (!bidi_it->first_elt
3400 && (bidi_it->ch == '\n' || bidi_it->ch == BIDI_EOB))
3401 bidi_line_init (bidi_it);
3403 /* Prepare the sentinel iterator state, and cache it. When we bump
3404 into it, scanning backwards, we'll know that the last non-base
3405 level is exhausted. */
3406 if (bidi_cache_idx == bidi_cache_start)
3408 bidi_copy_it (&sentinel, bidi_it);
3409 if (bidi_it->first_elt)
3411 sentinel.charpos--; /* cached charpos needs to be monotonic */
3412 sentinel.bytepos--;
3413 sentinel.ch = '\n'; /* doesn't matter, but why not? */
3414 sentinel.ch_len = 1;
3415 sentinel.nchars = 1;
3417 bidi_cache_iterator_state (&sentinel, 1, 0);
3420 old_level = bidi_it->resolved_level;
3421 new_level = bidi_level_of_next_char (bidi_it);
3423 /* Reordering of resolved levels (clause L2) is implemented by
3424 jumping to the other edge of the level and flipping direction of
3425 scanning the text whenever we find a level change. */
3426 if (new_level != old_level)
3428 bool ascending = new_level > old_level;
3429 int level_to_search = ascending ? old_level + 1 : old_level;
3430 int incr = ascending ? 1 : -1;
3431 int expected_next_level = old_level + incr;
3433 /* Jump (or walk) to the other edge of this level. */
3434 bidi_find_other_level_edge (bidi_it, level_to_search, !ascending);
3435 /* Switch scan direction and peek at the next character in the
3436 new direction. */
3437 bidi_it->scan_dir = -bidi_it->scan_dir;
3439 /* The following loop handles the case where the resolved level
3440 jumps by more than one. This is typical for numbers inside a
3441 run of text with left-to-right embedding direction, but can
3442 also happen in other situations. In those cases the decision
3443 where to continue after a level change, and in what direction,
3444 is tricky. For example, given a text like below:
3446 abcdefgh
3447 11336622
3449 (where the numbers below the text show the resolved levels),
3450 the result of reordering according to UAX#9 should be this:
3452 efdcghba
3454 This is implemented by the loop below which flips direction
3455 and jumps to the other edge of the level each time it finds
3456 the new level not to be the expected one. The expected level
3457 is always one more or one less than the previous one. */
3458 next_level = bidi_peek_at_next_level (bidi_it);
3459 while (next_level != expected_next_level)
3461 /* If next_level is -1, it means we have an unresolved level
3462 in the cache, which at this point should not happen. If
3463 it does, we will infloop. */
3464 eassert (next_level >= 0);
3465 /* If next_level is not consistent with incr, we might
3466 infloop. */
3467 eassert (incr > 0
3468 ? next_level > expected_next_level
3469 : next_level < expected_next_level);
3470 expected_next_level += incr;
3471 level_to_search += incr;
3472 bidi_find_other_level_edge (bidi_it, level_to_search, !ascending);
3473 bidi_it->scan_dir = -bidi_it->scan_dir;
3474 next_level = bidi_peek_at_next_level (bidi_it);
3477 /* Finally, deliver the next character in the new direction. */
3478 next_level = bidi_level_of_next_char (bidi_it);
3481 /* Take note when we have just processed the newline that precedes
3482 the end of the paragraph. The next time we are about to be
3483 called, set_iterator_to_next will automatically reinit the
3484 paragraph direction, if needed. We do this at the newline before
3485 the paragraph separator, because the next character might not be
3486 the first character of the next paragraph, due to the bidi
3487 reordering, whereas we _must_ know the paragraph base direction
3488 _before_ we process the paragraph's text, since the base
3489 direction affects the reordering. */
3490 if (bidi_it->scan_dir == 1
3491 && (bidi_it->ch == '\n' || bidi_it->ch == BIDI_EOB))
3493 /* The paragraph direction of the entire string, once
3494 determined, is in effect for the entire string. Setting the
3495 separator limit to the end of the string prevents
3496 bidi_paragraph_init from being called automatically on this
3497 string. */
3498 if (bidi_it->string.s || STRINGP (bidi_it->string.lstring))
3499 bidi_it->separator_limit = bidi_it->string.schars;
3500 else if (bidi_it->bytepos < ZV_BYTE)
3502 ptrdiff_t sep_len
3503 = bidi_at_paragraph_end (bidi_it->charpos + bidi_it->nchars,
3504 bidi_it->bytepos + bidi_it->ch_len);
3505 if (bidi_it->nchars <= 0)
3506 emacs_abort ();
3507 if (sep_len >= 0)
3509 bidi_it->new_paragraph = 1;
3510 /* Record the buffer position of the last character of
3511 the paragraph separator. If the paragraph separator
3512 is an empty string (e.g., the regex is "^"), the
3513 newline that precedes the end of the paragraph is
3514 that last character. */
3515 if (sep_len > 0)
3516 bidi_it->separator_limit
3517 = bidi_it->charpos + bidi_it->nchars + sep_len;
3518 else
3519 bidi_it->separator_limit = bidi_it->charpos;
3524 if (bidi_it->scan_dir == 1 && bidi_cache_idx > bidi_cache_start)
3526 /* If we are at paragraph's base embedding level and beyond the
3527 last cached position, the cache's job is done and we can
3528 discard it. */
3529 if (bidi_it->resolved_level == bidi_it->level_stack[0].level
3530 && bidi_it->charpos > (bidi_cache[bidi_cache_idx - 1].charpos
3531 + bidi_cache[bidi_cache_idx - 1].nchars - 1))
3532 bidi_cache_reset ();
3533 /* Also reset the cache if it overflowed and we have just
3534 emergency-exited using Plan B. */
3535 else if (bidi_it->resolved_level == bidi_it->level_stack[0].level
3536 && bidi_cache_idx >= bidi_cache_size
3537 && bidi_it->charpos == bidi_cache[bidi_cache_idx - 1].charpos)
3538 bidi_cache_reset ();
3539 /* But as long as we are caching during forward scan, we must
3540 cache each state, or else the cache integrity will be
3541 compromised: it assumes cached states correspond to buffer
3542 positions 1:1. */
3543 else
3544 bidi_cache_iterator_state (bidi_it, 1, 0);
3547 eassert (bidi_it->resolved_level >= 0
3548 && bidi_it->resolved_level <= BIDI_MAXDEPTH + 2);
3551 /* Utility function for looking for strong directional characters
3552 whose bidi type was overridden by a directional override. */
3553 ptrdiff_t
3554 bidi_find_first_overridden (struct bidi_it *bidi_it)
3556 ptrdiff_t found_pos = ZV;
3560 /* Need to call bidi_resolve_weak, not bidi_resolve_explicit,
3561 because the directional overrides are applied by the
3562 former. */
3563 bidi_type_t type = bidi_resolve_weak (bidi_it);
3565 if ((type == STRONG_R && bidi_it->orig_type == STRONG_L)
3566 || (type == STRONG_L
3567 && (bidi_it->orig_type == STRONG_R
3568 || bidi_it->orig_type == STRONG_AL)))
3569 found_pos = bidi_it->charpos;
3570 } while (found_pos == ZV
3571 && bidi_it->charpos < ZV
3572 && bidi_it->ch != BIDI_EOB
3573 && bidi_it->ch != '\n');
3575 return found_pos;
3578 /* This is meant to be called from within the debugger, whenever you
3579 wish to examine the cache contents. */
3580 void bidi_dump_cached_states (void) EXTERNALLY_VISIBLE;
3581 void
3582 bidi_dump_cached_states (void)
3584 ptrdiff_t i;
3585 int ndigits = 1;
3587 if (bidi_cache_idx == 0)
3589 fprintf (stderr, "The cache is empty.\n");
3590 return;
3592 fprintf (stderr, "Total of %"pD"d state%s in cache:\n",
3593 bidi_cache_idx, bidi_cache_idx == 1 ? "" : "s");
3595 for (i = bidi_cache[bidi_cache_idx - 1].charpos; i > 0; i /= 10)
3596 ndigits++;
3597 fputs ("ch ", stderr);
3598 for (i = 0; i < bidi_cache_idx; i++)
3599 fprintf (stderr, "%*c", ndigits, bidi_cache[i].ch);
3600 fputs ("\n", stderr);
3601 fputs ("lvl ", stderr);
3602 for (i = 0; i < bidi_cache_idx; i++)
3603 fprintf (stderr, "%*d", ndigits, bidi_cache[i].resolved_level);
3604 fputs ("\n", stderr);
3605 fputs ("pos ", stderr);
3606 for (i = 0; i < bidi_cache_idx; i++)
3607 fprintf (stderr, "%*"pD"d", ndigits, bidi_cache[i].charpos);
3608 fputs ("\n", stderr);