This question is related to text editing. Say you have a piece of text in normalization form NFC, and a cursor that points to an extended grapheme cluster boundary within this text. You want to insert another piece of text at the cursor location, and make sure that the resulting text is also in NFC. You also want to move the cursor to the first grapheme boundary that immediately follows the inserted text.
Now, since concatenating two strings that are both in NFC doesn't necessarily produce a string that is also in NFC, you might have to emend the text around the insertion point. For instance, if you have a string that contains 4 code points like so:
[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E
[2] COMBINING MACRON BELOW
--- Cursor location
[3] LATIN SMALL LETTER A
And you want to insert the two-code-point string {COMBINING ACUTE ACCENT, COMBINING DOT ABOVE} at the cursor location. The result will then be:
[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E WITH ACUTE
[2] COMBINING MACRON BELOW
[3] COMBINING DOT ABOVE
--- Cursor location
[4] LATIN SMALL LETTER A
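This is easy to check with unicodedata: the snippet below normalizes the concatenation of the prefix and the inserted marks, then prints the code points of the final text, which matches the listing above.

import unicodedata

text = "be\u0331"             # b, e, COMBINING MACRON BELOW
to_insert = "\u0301\u0307"    # COMBINING ACUTE ACCENT, COMBINING DOT ABOVE
result = unicodedata.normalize("NFC", text + to_insert) + "a"

for offset, char in enumerate(result):
    print("[%d] %s" % (offset, unicodedata.name(char)))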
Now my question is: how do you figure out at which offset you should place the cursor after inserting the string, in such a way that the cursor ends up after the inserted string and also on a grapheme boundary? In this particular case, the text that follows the cursor location cannot possibly interact, during normalization, with what precedes it. So the following sample Python code would work:
import unicodedata

def insert(text, cursor_pos, text_to_insert):
    new_text = text[:cursor_pos] + text_to_insert
    new_text = unicodedata.normalize("NFC", new_text)
    new_cursor_pos = len(new_text)
    new_text += text[cursor_pos:]
    if new_cursor_pos == 0:
        # grapheme_break_after is a function that returns the offset
        # of the first grapheme boundary after the given index.
        new_cursor_pos = grapheme_break_after(new_text, 0)
    return new_text, new_cursor_pos
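For instance, on the example above, this returns the expected offset 4 (note that grapheme_break_after is left undefined here; the branch that calls it isn't reached, since the normalized prefix is non-empty):

new_text, new_cursor_pos = insert("be\u0331a", 3, "\u0301\u0307")
print(new_text == "b\u00e9\u0331\u0307a")   # True
print(new_cursor_pos)                       # 4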
But does this approach necessarily work? To be more explicit: is it necessarily the case that the text that follows a grapheme boundary doesn't interact, during normalization, with what precedes it, such that

NFC(text[:grapheme_break]) + NFC(text[grapheme_break:]) == NFC(text)

is always true?
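For what it's worth, the equality is easy to test on individual strings. Here's a small sketch that uses PyICU to enumerate the grapheme boundaries of a string and checks the equality at each of them (it can only ever exhibit counterexamples, of course, not prove the general case):

import unicodedata, icu

def nfc(s):
    return unicodedata.normalize("NFC", s)

def split_equality_holds(text):
    # Collect every extended grapheme cluster boundary of `text`.
    break_iter = icu.BreakIterator.createCharacterInstance(icu.Locale())
    break_iter.setText(text)
    boundaries = [0]
    while True:
        b = break_iter.following(boundaries[-1])
        if b == icu.BreakIterator.DONE:
            break
        boundaries.append(b)
    # Check the equality at each boundary.
    return all(nfc(text[:b]) + nfc(text[b:]) == nfc(text) for b in boundaries)

print(split_equality_holds("be\u0331\u0301\u0307a"))   # True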
Update
@nwellnhof's excellent analysis below motivated me to investigate things further. So I followed the "When in doubt, use brute force" mantra and wrote a small script that parses grapheme break properties and examines each code point that can appear at the beginning of a grapheme, to test whether it can possibly interact with preceding code points during normalization. Here's the script:
from urllib.request import urlopen
import icu, unicodedata

URL = "http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt"

break_props = {}
with urlopen(URL) as f:
    for line in f:
        line = line.decode()
        p = line.find("#")
        if p >= 0:
            line = line[:p]
        line = line.strip()
        if not line:
            continue
        fields = [x.strip() for x in line.split(";")]
        codes = [int(x, 16) for x in fields[0].split("..")]
        if len(codes) == 2:
            start, end = codes
        else:
            assert(len(codes) == 1)
            start, end = codes[0], codes[0]
        category = fields[1]
        break_props.setdefault(category, []).extend(range(start, end + 1))

# The only code points that can't appear at the beginning of a grapheme
# are those in the following categories. See the regexps in UAX #29,
# Tables 1b and 1c.
to_ignore = set(c for name in ("Extend", "ZWJ", "SpacingMark") for c in break_props[name])

nfc = icu.Normalizer2.getNFCInstance()
for c in range(0x10FFFF + 1):
    if c in to_ignore:
        continue
    if not nfc.hasBoundaryBefore(chr(c)):
        print("U+%04X %s" % (c, unicodedata.name(chr(c))))
Looking at the output, it appears that there are about 40 code points that are grapheme starters but can still compose with preceding code points under NFC. Basically, they are the non-precomposed Hangul jamo of type V (U+1161..U+1175) and type T (U+11A8..U+11C2). Things make sense when you examine the regular expressions in UAX #29, Table 1c, together with what the standard says about Jamo composition (section 3.12, p. 147 of version 13 of the standard). The gist of it is that Hangul sequences of the form {L, V} can compose into a Hangul syllable of type LV, and, similarly, sequences of the form {LV, T} can compose into a syllable of type LVT.
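You can watch both compositions happen with unicodedata (U+1100 is an L jamo, U+1161 a V jamo, U+11A8 a T jamo; U+AC00 and U+AC01 are the precomposed syllables GA and GAG):

import unicodedata

# L + V composes to an LV syllable.
print("U+%04X" % ord(unicodedata.normalize("NFC", "\u1100\u1161")))   # U+AC00
# LV + T composes to an LVT syllable.
print("U+%04X" % ord(unicodedata.normalize("NFC", "\uAC00\u11A8")))   # U+AC01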
To sum up, and assuming I'm not mistaken, the above Python code could be corrected as follows:
import unicodedata
import icu   # pip3 install PyICU

def insert(text, cursor_pos, text_to_insert):
    new_text = text[:cursor_pos] + text_to_insert
    new_text = unicodedata.normalize("NFC", new_text)
    new_cursor_pos = len(new_text)
    new_text += text[cursor_pos:]
    new_text = unicodedata.normalize("NFC", new_text)
    break_iter = icu.BreakIterator.createCharacterInstance(icu.Locale())
    break_iter.setText(new_text)
    if new_cursor_pos == 0:
        # Move the cursor to the first grapheme boundary > 0.
        new_cursor_pos = break_iter.following(0)
    elif new_cursor_pos > len(new_text):
        new_cursor_pos = len(new_text)
    elif not break_iter.isBoundary(new_cursor_pos):
        # isBoundary() moves the iterator to the first boundary >= the
        # given position.
        new_cursor_pos = break_iter.current()
    return new_text, new_cursor_pos
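As a quick sanity check, here is a case that exercises the second normalization pass: inserting an LV syllable right before a lone T jamo. (The first version of the code would have returned the non-NFC string {U+AC00, U+11A8} here.)

# U+11A8 is a lone T jamo; inserting U+AC00 (an LV syllable) right
# before it makes the two compose into U+AC01 (an LVT syllable).
new_text, new_cursor_pos = insert("\u11A8", 0, "\uAC00")
print("U+%04X" % ord(new_text))   # U+AC01
print(new_cursor_pos)             # 1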
The (possibly) pointless test new_cursor_pos > len(new_text) is there to catch the case len(NFC(x)) > len(NFC(x + y)). I'm not sure whether this can actually happen with the current Unicode database (more tests would be needed to prove it), but it is theoretically quite possible. If, say, you have a set of three code points A, B and C, and two precomposed forms A+B and A+B+C (but not A+C), then you could very well have NFC({A, C} + {B}) = {A+B+C}.
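For the record, here is a brute-force sketch that searches for such triples using unicodedata alone. It enumerates candidate chains A+B and (A+B)+C from the canonical decomposition mappings and then lets normalize() arbitrate, so composition exclusions and blocking don't have to be modeled by hand:

import unicodedata

# Map: first code point of a canonical pair decomposition ->
# list of (second code point, precomposed code point).
pairs = {}
for cp in range(0x110000):
    d = unicodedata.decomposition(chr(cp))
    if d and not d.startswith("<"):   # canonical, not compatibility
        parts = [int(p, 16) for p in d.split()]
        if len(parts) == 2:
            pairs.setdefault(parts[0], []).append((parts[1], cp))

def nfc(s):
    return unicodedata.normalize("NFC", s)

# Look for code points A, B, C with precomposed A+B and (A+B)+C such
# that len(NFC(x)) > len(NFC(x + y)) for x = {A, C} and y = {B}.
for a, seconds in pairs.items():
    for b, ab in seconds:
        for c, abc in pairs.get(ab, []):
            x, y = chr(a) + chr(c), chr(b)
            if len(nfc(x)) > len(nfc(x + y)):
                print("A=U+%04X B=U+%04X C=U+%04X ABC=U+%04X" % (a, b, c, abc))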
If this case doesn't occur in practice (which is very likely, especially with "real" texts), then the above Python code will necessarily locate the first grapheme boundary after the end of the inserted text. Otherwise, it will merely locate some grapheme boundary after the inserted text, but not necessarily the first one. I don't yet see how it could be possible to improve the second case (assuming it isn't merely theoretical), so I think I'll leave my investigation at that for now.
text = 'be\u0331\u0301\u0307a'; [[g_b, NFC(text[:g_b]) + NFC(text[g_b:]) == NFC(text)] for g_b in range(0, len(text))] returns [[0, True], [1, True], [2, False], [3, False], [4, True], [5, True]] (where def NFC(_text): return unicodedata.normalize("NFC", _text)). – Intendment