mirror of
https://git.savannah.gnu.org/git/emacs.git
synced 2025-01-30 19:53:09 +00:00
src/bidi.c: Describe the design of reordering engine in the commentary.
This commit is contained in:
parent
c8e7f832ea
commit
aa5ccb01a5
190
src/bidi.c
190
src/bidi.c
@ -22,9 +22,16 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */
|
||||
A sequential implementation of the Unicode Bidirectional algorithm,
|
||||
(UBA) as per UAX#9, a part of the Unicode Standard.
|
||||
|
||||
Unlike the reference and most other implementations, this one is
|
||||
designed to be called once for every character in the buffer or
|
||||
string.
|
||||
Unlike the Reference Implementation and most other implementations,
|
||||
this one is designed to be called once for every character in the
|
||||
buffer or string. That way, we can leave intact the design of the
|
||||
Emacs display engine, whereby an iterator object is used to
|
||||
traverse buffer or string text character by character, and generate
|
||||
the necessary data for displaying each character in 'struct glyph'
|
||||
objects. (See xdisp.c for the details of that iteration.) The
|
||||
functions on this file replace the original linear iteration in the
|
||||
logical order of the text with a non-linear iteration in the visual
|
||||
order, i.e. in the order characters should be shown on display.
|
||||
|
||||
The main entry point is bidi_move_to_visually_next. Each time it
|
||||
is called, it finds the next character in the visual order, and
|
||||
@ -52,7 +59,182 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */
|
||||
A note about references to UAX#9 rules: if the reference says
|
||||
something like "X9/Retaining", it means that you need to refer to
|
||||
rule X9 and to its modifications described in the "Implementation
|
||||
Notes" section of UAX#9, under "Retaining Format Codes". */
|
||||
Notes" section of UAX#9, under "Retaining Format Codes".
|
||||
|
||||
Here's the overview of the design of the reordering engine
|
||||
implemented by this file.
|
||||
|
||||
Basic implementation structure
|
||||
------------------------------
|
||||
|
||||
The sequential processing steps described by UAX#9 are implemented
|
||||
as recursive levels of processing, all of which examine the next
|
||||
character in the logical order. This hierarchy of processing looks
|
||||
as follows, from the innermost (deepest) to the outermost level,
|
||||
omitting some subroutines used by each level:
|
||||
|
||||
bidi_fetch_char -- fetch next character
|
||||
bidi_resolve_explicit -- resolve explicit levels and directions
|
||||
bidi_resolve_weak -- resolve weak types
|
||||
bidi_resolve_neutral -- resolve neutral types
|
||||
bidi_level_of_next_char -- resolve implicit levels
|
||||
|
||||
Each level calls the level below it, and works on the result
|
||||
returned by the lower level, including all of its sub-levels.
|
||||
|
||||
Unlike all the levels below it, bidi_level_of_next_char can return
|
||||
the information about either the next or previous character in the
|
||||
logical order, depending on the current direction of scanning the
|
||||
buffer or string. For the next character, it calls all the levels
|
||||
below it; for the previous character, it uses the cache, described
|
||||
below.
|
||||
|
||||
Thus, the result of calling bidi_level_of_next_char is the resolved
|
||||
level of the next or the previous character in the logical order.
|
||||
Based on this information, the function bidi_move_to_visually_next
|
||||
finds the next character in the visual order and updates the
|
||||
direction in which the buffer is scanned, either forward or
|
||||
backward, to find the next character to be displayed. (Text is
|
||||
scanned backwards when it needs to be reversed for display, i.e. if
|
||||
the visual order is the inverse of the logical order.) This
|
||||
implements the last, reordering steps of the UBA, by successively
|
||||
calling bidi_level_of_next_char until the character of the required
|
||||
embedding level is found; the scan direction is dynamically updated
|
||||
as a side effect. See the commentary before the 'while' loop in
|
||||
bidi_move_to_visually_next, for the details.
|
||||
|
||||
Fetching characters
|
||||
-------------------
|
||||
|
||||
In a nutshell, fetching the next character boils down to calling
|
||||
STRING_CHAR_AND_LENGTH, passing it the address of a buffer or
|
||||
string position. See bidi_fetch_char. However, if the next
|
||||
character is "covered" by a display property of some kind,
|
||||
bidi_fetch_char returns the u+FFFC "object replacement character"
|
||||
that represents the entire run of text covered by the display
|
||||
property. (The ch_len and nchars members of 'struct bidi_it'
|
||||
reflect the length in bytes and characters of that text.) This is
|
||||
so we reorder text on both sides of the display property as
|
||||
appropriate for an image or embedded string. Similarly, text
|
||||
covered by a display spec of the form '(space ...)', is replaced
|
||||
with the u+2029 paragraph separator character, so such display
|
||||
specs produce the same effect as a TAB under UBA. Both these
|
||||
special characters are not actually displayed -- the display
|
||||
property is displayed instead -- but just used to compute the
|
||||
embedding level of the surrounding text so as to produce the
|
||||
required effect.
|
||||
|
||||
Bidi iterator states
|
||||
--------------------
|
||||
|
||||
The UBA is highly context dependent in some of its parts,
|
||||
i.e. results of processing a character can generally depend on
|
||||
characters very far away. The UAX#9 description of the UBA
|
||||
prescribes a stateful processing of each character, whereby the
|
||||
results of this processing depend on various state variables, such
|
||||
as the current embedding level, level stack, and directional
|
||||
override status. In addition, the UAX#9 description includes many
|
||||
passages like this (from rule W2 in this case):
|
||||
|
||||
Search backward from each instance of a European number until the
|
||||
first strong type (R, L, AL, or sos) is found. If an AL is found,
|
||||
change the type of the European number to Arabic number.
|
||||
|
||||
To support this, we use a bidi iterator object, 'struct bidi_it',
|
||||
which is a sub-structure of 'struct it' used by xdisp.c (see
|
||||
dispextern.h for the definition of both of these structures). The
|
||||
bidi iterator holds the entire state of the iteration required by
|
||||
the UBA, and is updated as the text is traversed. In particular,
|
||||
the embedding level of the current character being resolved is
|
||||
recorded in the iterator state. To avoid costly searches backward
|
||||
in support of rules like W2 above, the necessary character types
|
||||
are also recorded in the iterator state as they are found during
|
||||
the forward scan, and then used when such rules need to be applied.
|
||||
(Forward scans cannot be avoided in this way; they need to be
|
||||
performed at least once, and the results recorded in the iterator
|
||||
state, to be reused until the forward scan oversteps the recorded
|
||||
position.)
|
||||
|
||||
In this manner, the iterator state acts as a mini-cache of
|
||||
contextual information required for resolving the level of the
|
||||
current character by various UBA rules.
|
||||
|
||||
Caching of bidi iterator states
|
||||
-------------------------------
|
||||
|
||||
As described above, the reordering engine uses the information
|
||||
recorded in the bidi iterator state in order to resolve the
|
||||
embedding level of the current character. When the reordering
|
||||
engine needs to process the next character in the logical order, it
|
||||
fetches it and applies to it all the UBA levels, updating the
|
||||
iterator state as it goes. But when the buffer or string is
|
||||
scanned backwards, i.e. in the reverse order of buffer/string
|
||||
positions, the scanned characters were already processed during the
|
||||
preceding forward scan (see bidi_find_other_level_edge). To avoid
|
||||
costly re-processing of characters that were already processed
|
||||
during the forward scan, the iterator states computed while
|
||||
scanning forward are cached.
|
||||
|
||||
The cache is just a linear array of 'struct bidi_it' objects, which
|
||||
is dynamically allocated and reallocated as needed, since the size
|
||||
of the cache depends on the text being processed. We only need the
|
||||
cache while processing embedded levels higher than the base
|
||||
paragraph embedding level, because these higher levels require
|
||||
changes in scan direction. Therefore, as soon as we are back to
|
||||
the base embedding level, we can free the cache; see the calls to
|
||||
bidi_cache_reset and bidi_cache_shrink, for the conditions to do
|
||||
this.
|
||||
|
||||
The cache maintains the index of the next unused cache slot -- this
|
||||
is where the next iterator state will be cached. The function
|
||||
bidi_cache_iterator_state saves an instance of the state in the
|
||||
cache and increments the unused slot index. The companion function
|
||||
bidi_cache_find looks up a cached state that corresponds to a given
|
||||
buffer/string position. All of the cached states must correspond
|
||||
1:1 to the buffer or string region whose processing they reflect;
|
||||
bidi.c will abort if it finds cache slots that violate this 1:1
|
||||
correspondence.
|
||||
|
||||
When the parent iterator 'struct it' is pushed (see push_it in
|
||||
xdisp.c) to pause the current iteration and start iterating over a
|
||||
different object (e.g., a 'display' string that covers some buffer
|
||||
text), the bidi iterator cache needs to be "pushed" as well, so
|
||||
that a new empty cache could be used while iterating over the new
|
||||
object. Later, when the new object is exhausted, and xdisp.c calls
|
||||
pop_it, we need to "pop" the bidi cache as well and return to the
|
||||
original cache. See bidi_push_it and bidi_pop_it for how this is
|
||||
done.
|
||||
|
||||
Some functions of the display engine save copies of 'struct it' in
|
||||
local variables, and restore them later. For examples, see
|
||||
pos_visible_p and move_it_in_display_line_to in xdisp.c, and
|
||||
window_scroll_pixel_based in window.c. When this happens, we need
|
||||
to save and restore the bidi cache as well, because conceptually
|
||||
the cache is part of the 'struct it' state, and needs to be in
|
||||
perfect sync with the portion of the buffer/string that is being
|
||||
processed. This saving and restoring of the cache state is handled
|
||||
by bidi_shelve_cache and bidi_unshelve_cache, and the helper macros
|
||||
SAVE_IT and RESTORE_IT defined on xdisp.c.
|
||||
|
||||
Note that, because reordering is implemented below the level in
|
||||
xdisp.c that breaks glyphs into screen lines, we are violating
|
||||
paragraph 3.4 of UAX#9. which mandates that line breaking shall be
|
||||
done before reordering each screen line separately. However,
|
||||
following UAX#9 to the letter in this matter goes against the basic
|
||||
design of the Emacs display engine, and so we choose here this
|
||||
minor deviation from the UBA letter in preference to redesign of
|
||||
the display engine. The effect of this is only seen in continued
|
||||
lines that are broken into screen lines in the middle of a run
|
||||
whose direction is opposite to the paragraph's base direction.
|
||||
|
||||
Important design and implementation note: when the code needs to
|
||||
scan far ahead, be sure to avoid such scans as much as possible
|
||||
when the buffer/string doesn't contain any RTL characters. Users
|
||||
of left-to-right scripts will never forgive you if you introduce
|
||||
some slow-down due to bidi in situations that don't involve any
|
||||
bidirectional text. See the large comment near the beginning of
|
||||
bidi_resolve_neutral, for one situation where such shortcut was
|
||||
necessary. */
|
||||
|
||||
#include <config.h>
|
||||
#include <stdio.h>
|
||||
|
Loading…
Reference in New Issue
Block a user