Combining Devanagari characters
P

7

29

I have something like

a = "बिक्रम मेरो नाम हो"

I want to achieve something like

a[0] = बि
a[1] = क्र
a[2] = म

But since म takes 4 bytes while बि takes 8 bytes, I can't get there with plain indexing. What can I do to achieve that in Python?

Proton answered 24/7, 2011 at 6:26 Comment(2)
I never really played with Devanagari, which I am definitely going to try now :P, but knowing the script, I have a feeling that the difference between 'ma' and 'be' that you mention might be because in Devanagari "ma" is one character, but "be" = "ba" + "e" (ba mai e ki maatra! :P is what I mean). If the difference in representation is because of that, then you should be able to separate the 'matras' or one-and-a-half letters like 'kra' by doing some simple bit operations to check and separate them, then put them in a list-like data structure. Do post a solution if you find one. I am curious!Roundy
#something like this may help # gist.github.com/950405Sublease
H
29

The algorithm for splitting text into grapheme clusters is given in Unicode Annex 29, section 3.1. I'm not going to implement the full algorithm for you here, but I'll show you roughly how to handle the case of Devanagari, and then you can read the Annex for yourself and see what else you need to implement.

The unicodedata module contains the information you need to detect the grapheme clusters.

>>> import unicodedata
>>> a = "बिक्रम मेरो नाम हो"
>>> [unicodedata.name(c) for c in a]
['DEVANAGARI LETTER BA', 'DEVANAGARI VOWEL SIGN I', 'DEVANAGARI LETTER KA', 
 'DEVANAGARI SIGN VIRAMA', 'DEVANAGARI LETTER RA', 'DEVANAGARI LETTER MA',
 'SPACE', 'DEVANAGARI LETTER MA', 'DEVANAGARI VOWEL SIGN E',
 'DEVANAGARI LETTER RA', 'DEVANAGARI VOWEL SIGN O', 'SPACE',
 'DEVANAGARI LETTER NA', 'DEVANAGARI VOWEL SIGN AA', 'DEVANAGARI LETTER MA',
 'SPACE', 'DEVANAGARI LETTER HA', 'DEVANAGARI VOWEL SIGN O']

In Devanagari, each grapheme cluster consists of an initial letter, optional pairs of virama (vowel killer) and letter, and an optional vowel sign. In regular expression notation that would be LETTER (VIRAMA LETTER)* VOWEL?. You can tell which is which by looking up the Unicode category for each code point:

>>> [unicodedata.category(c) for c in a]
['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Lo', 'Zs', 'Lo', 'Mn', 'Lo', 'Mc', 'Zs',
 'Lo', 'Mc', 'Lo', 'Zs', 'Lo', 'Mc']

Letters are category Lo (Letter, Other), vowel signs are category Mc (Mark, Spacing Combining) or Mn (Mark, Nonspacing), virama is category Mn (Mark, Nonspacing) and spaces are category Zs (Separator, Space).

So here's a rough approach to split out the grapheme clusters:

def splitclusters(s):
    """Generate the grapheme clusters for the string s. (Not the full
    Unicode text segmentation algorithm, but probably good enough for
    Devanagari.)

    """
    virama = u'\N{DEVANAGARI SIGN VIRAMA}'
    cluster = u''
    last = None
    for c in s:
        cat = unicodedata.category(c)[0]
        if cat == 'M' or (cat == 'L' and last == virama):
            cluster += c
        else:
            if cluster:
                yield cluster
            cluster = c
        last = c
    if cluster:
        yield cluster

>>> list(splitclusters(a))
['बि', 'क्र', 'म', ' ', 'मे', 'रो', ' ', 'ना', 'म', ' ', 'हो']
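
The LETTER (VIRAMA LETTER)* VOWEL? notation above can also be sketched directly with the standard re module. The code point ranges below are assumptions picked by hand from the Devanagari block and are not exhaustive (for example, a nukta between a letter and a virama is not handled). The names CLUSTER and split_devanagari are made up for this illustration.

```python
import re

# Hand-picked ranges from the Devanagari block (U+0900-U+097F).
# These are assumptions for illustration, not a complete classification.
LETTER = "[\u0904-\u0939\u0958-\u0961\u0972-\u097F]"
VIRAMA = "\u094D"
# Dependent vowel signs and other combining marks (candrabindu, anusvara, nukta, ...).
MARK = "[\u0900-\u0903\u093A-\u093C\u093E-\u094C\u094E\u094F\u0955-\u0957\u0962\u0963]"

# LETTER (VIRAMA LETTER)* MARK* with an optional trailing virama;
# anything else (spaces, digits, Latin text) is returned one code point at a time.
CLUSTER = re.compile(f"{LETTER}(?:{VIRAMA}{LETTER})*{MARK}*{VIRAMA}?|.", re.DOTALL)


def split_devanagari(s):
    """Split s into rough Devanagari clusters (not the full UAX #29 algorithm)."""
    return CLUSTER.findall(s)


print(split_devanagari("बिक्रम मेरो नाम हो"))
```

Like splitclusters, this is only a rough approximation; the category-based version has the advantage of not depending on a hand-maintained list of ranges.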
Hypognathous answered 24/7, 2011 at 10:12 Comment(1)
Hello Gareth, you are a genius!!! Your answer works with various VIRAM signs, like Nukta (dot below the letter), Anuswar (dot above the letter), Ardha-Chandra-Bindi (a dot with a crescent moon above the letter) and Kra (leftward oblique stroke on the character, as shown in the example: क्र)Positron
H
15

So, you want to achieve something like this

a[0] = बि a[1] = क्र a[2] = म

My advice is to ditch the idea that string indexing corresponds to the characters you see on the screen. Devanagari, like several other scripts, does not play well with programmers who grew up with Latin characters. I suggest reading the Unicode standard chapter 9 (available here).

It looks like what you are trying to do is break a string into grapheme clusters. String indexing by itself will not let you do this. Hangul is another script which plays poorly with string indexing, although with combining characters, even something as familiar as Spanish will cause problems.

You will need an external library such as ICU to achieve this (unless you have lots of free time). ICU has Python bindings.

>>> a = u"बिक्रम मेरो नाम हो"
>>> import icu
    # Note: This next line took a lot of guesswork.  The C, C++, and Java
    # interfaces have better documentation.
>>> b = icu.BreakIterator.createCharacterInstance(icu.Locale())
>>> b.setText(a)
>>> i = 0
>>> for j in b:
...     s = a[i:j]
...     print('|', s, len(s))
...     i = j
... 
| बि 2
| क् 2
| र 1
| म 1
|   1
| मे 2
| रो 2
|   1
| ना 2
| म 1
|   1
| हो 2

Note how some of these "characters" (grapheme clusters) have length 2, and some have length 1. This is why string indexing is problematic: if I want to get grapheme cluster #69450 from a text file, then I have to linearly scan through the entire file and count. So your options are:

  • Build an index (kind of crazy...)
  • Just realize that you can't break on every character boundary. The break iterator object is capable of going both forwards AND backwards, so if you need to extract the first 140 characters of a string, then you look at index 140 and iterate backwards to the previous grapheme cluster break, that way you don't end up with funny text. (Better yet, you can use a word break iterator for the appropriate locale.) The benefit of using this level of abstraction (character iterators and the like) is that it no longer matters which encoding you use: you can use UTF-8, UTF-16, UTF-32 and it all just works. Well, mostly works.
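
Both options can be sketched in a few lines. This is a minimal illustration, assuming a rough category rule from unicodedata (marks, and letters that follow a virama, join the previous cluster) instead of an ICU break iterator. The function names cluster_starts, cluster_at and truncate are made up for this example.

```python
import unicodedata
from bisect import bisect_right

VIRAMA = "\N{DEVANAGARI SIGN VIRAMA}"


def cluster_starts(s):
    """Return the code point offsets where each (rough) cluster starts."""
    starts = []
    prev = None
    for i, c in enumerate(s):
        cat = unicodedata.category(c)[0]
        # Assumed rule: marks, and letters right after a virama, join the previous cluster.
        joins = cat == "M" or (cat == "L" and prev == VIRAMA)
        if i == 0 or not joins:
            starts.append(i)
        prev = c
    return starts


def cluster_at(s, starts, n):
    """Return cluster number n using the prebuilt index (no rescan needed)."""
    end = starts[n + 1] if n + 1 < len(starts) else len(s)
    return s[starts[n]:end]


def truncate(s, starts, limit):
    """Cut s to at most `limit` code points without splitting a cluster."""
    if limit >= len(s):
        return s
    k = bisect_right(starts, limit)  # number of cluster starts <= limit
    return s[:starts[k - 1]]


a = "बिक्रम मेरो नाम हो"
index = cluster_starts(a)
print(cluster_at(a, index, 1))  # क्र
print(truncate(a, index, 3))    # बि
```

The index costs one linear scan up front; after that, looking up a cluster or finding a safe cut point no longer requires iterating from the start of the text.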
Hairtail answered 24/7, 2011 at 7:2 Comment(3)
Is that right? You've output क् (ka + virama) and र (ra) as separate clusters, but according to the Unicode Text Segmentation algorithm these should form the single cluster क्र (kra).Hypognathous
@Gareth: I suspect that is a "tailored grapheme cluster" -- which means that it will only be separated in that manner in certain locales. Since I supply the default locale, no "tailoring" will be done.Hairtail
@Gareth: On further research, it appears that not only are such rules not implemented by ICU, but they do not appear in the Unicode locale database. The tailored grapheme cluster examples in the Unicode text segmentation algorithm page appear to be non-normative, as I cannot find rules for the other two examples either.Hairtail
C
4

You can achieve this with a simple regex in any engine that supports \X.

Demo

Unfortunately, Python's re does not support the \X grapheme match.

Fortunately, the proposed replacement, regex, does support \X:

>>> import regex
>>> a = "बिक्रम मेरो नाम हो"
>>> regex.findall(r'\X', a)
['बि', 'क्', 'र', 'म', ' ', 'मे', 'रो', ' ', 'ना', 'म', ' ', 'हो']
Casiano answered 7/5, 2015 at 17:50 Comment(0)
I
2

This is an old question, but there are gaps in the discussion: the solutions are dated, and changes in icu4c make the details of some answers invalid.

Unicode defines what it refers to as extended grapheme clusters (the default) and tailored grapheme clusters. For many scripts, there is no difference in the results between the two approaches.

For the Devanagari script, the differences are significant, and lie at the heart of the prior answers and various comments on those answers.

Take for instance, the endonym for the Hindi language: हिन्दी. There would be three extended grapheme clusters: हि | न् | दी. But a tailored solution for Hindi and the Devanagari script would give two clusters: हि | न्दी.

There are three modules that are currently kept up to date with the Unicode standard: regex, pyuegc, and pyicu.

First with regex, which has been used in prior answers:

term = 'हिन्दी'

import regex
print(regex.findall(r'\X', term))
# ['हि', 'न्', 'दी']

The regex module using the \X metacharacter returns 3 extended grapheme clusters.

For pyuegc:

term = 'हिन्दी'
from pyuegc import EGC
print(EGC(term))
# ['हि', 'न्दी']

This gives two tailored grapheme clusters. And with pyicu:

term = 'हिन्दी'

import icu
def get_boundaries(text, brkiter):
    brkiter.setText(text)
    boundaries = [*brkiter]
    boundaries.insert(0, 0)
    return boundaries
def get_graphemes(text, locale=icu.Locale.getRoot()):
    bi = icu.BreakIterator.createCharacterInstance(locale)
    boundary_indices = get_boundaries(text, bi)
    return [text[boundary_indices[i]:boundary_indices[i+1]] for i in range(len(boundary_indices)-1)]

get_graphemes(term, icu.Locale('hi'))
# ['हि', 'न्दी']

Likewise, pyicu gives two tailored grapheme clusters. The key difference is how consonant clusters are treated. There is no support for consonant clusters in extended grapheme clusters. A virama does not extend a cluster, so tailoring is necessary.

The pyicu solution is lower level and requires more code, but it is more powerful. It is possible to further customise the default break iteration for graphemes.
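
To illustrate what the tailoring changes, here is a minimal sketch that post-processes a list of extended grapheme clusters and joins a cluster ending in a virama with the following cluster when that one starts with a letter. It is a hand-rolled approximation of the idea, not what pyuegc or ICU do internally, and merge_conjuncts is a name made up for this example.

```python
import unicodedata

VIRAMA = "\N{DEVANAGARI SIGN VIRAMA}"


def merge_conjuncts(clusters):
    """Join 'consonant + virama' clusters with the letter-initial cluster that follows."""
    merged = []
    for cl in clusters:
        starts_with_letter = bool(cl) and unicodedata.category(cl[0]).startswith("L")
        if merged and merged[-1].endswith(VIRAMA) and starts_with_letter:
            merged[-1] += cl  # the virama links the two consonants into one cluster
        else:
            merged.append(cl)
    return merged


# Input: the three extended grapheme clusters shown above for हिन्दी.
print(merge_conjuncts(['हि', 'न्', 'दी']))
```

A cluster that ends in a virama and is followed by a space or the end of the string is left alone, since there is no consonant for it to join.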

Inhaler answered 14/4 at 14:4 Comment(0)
L
1

Indic and other non-Latin scripts like Hangul do not generally follow the idea of matching string indices to the characters you see. It's generally a pain working with Indic scripts. Most characters are two bytes with some rare ones extending into three. With Dravidian scripts, there is no defined order. See the Unicode specification for more details.
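
How many bytes a character takes depends on the encoding, which is easy to check for yourself. The sketch below is only an illustration; encoded_sizes is a helper name made up for this example.

```python
def encoded_sizes(text):
    """Return the length of text in code points and in bytes for three encodings."""
    return {
        "code points": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le avoids counting a byte order mark
        "utf-32": len(text.encode("utf-32-le")),
    }


for text in ("म", "बि"):
    print(text, encoded_sizes(text))
```

For म this reports 1 code point (3, 2 and 4 bytes) and for बि 2 code points (6, 4 and 8 bytes), so the byte count alone never tells you where a visible character ends.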

That said, check here for some ideas about Unicode and Python with C++.

Finally, as Dietrich said, you might want to check out ICU too. It has bindings available for C/C++ and Java via icu4c and icu4j respectively. There's a learning curve involved, so I suggest you set aside plenty of time for it. :)

Legislation answered 24/7, 2011 at 7:36 Comment(0)
P
1

The Grammar

Let's cover the grammar very quickly: The Devanagari Block. As a developer, there are two character classes you'll want to concern yourself with:

  • Sign: This is a character that affects a previously-occurring character. Example, this character: ◌्. The light-colored circle indicates the location of the center of the character it is to be placed upon.
  • Letter / Vowel / Other: This is a character that may be affected by signs. Example, this character: क.

Combination result of क and ◌्: क्. But combinations can extend, so क् and षति will actually become क्षति (in this case, we right-rotate the first character by 90 degrees, modify some of the stylish elements, and attach it at the left side of the second character).

My answer here does not try to solve these infinite (and tremendously beautiful) combinations. It simply finds single letters, or single letters together with the signs that affect them. If we are asking "what are the characters of this Devanagari string?", then this is the right way to go. Otherwise any combination of letters would form a unique character of a unique length, and most of the concepts and algorithms associated with letter systems would fail.

So, for instance, a symbol word would be...

(letter) (letter) (sign) (sign) (letter) (sign)

In this case, you'll want the result...

[
    0=>(letter),
    1=>(letter) (sign) (sign),
    2=>(letter) (sign),
]

The Code

The logic then isn't too bad, just make a foreach loop that goes in reverse.

I understand this is JavaScript code below, but the same principles will apply. Set the sign-types...

function getEndWordGroupings() {return {'2304':true,'2305':true,'2306':true,'2307':true,'2362':true,'2363':true,'2364':true,'2365':true,'2366':true,'2367':true,'2368':true,'2369':true,'2370':true,'2371':true,'2372':true,'2373':true,'2374':true,'2375':true,'2376':true,'2377':true,'2378':true,'2379':true,'2380':true,'2381':true,'2382':true,'2383':true,'2385':true,'2386':true,'2389':true,'2390':true,'2391':true,'2402':true,'2403':true,'2416':true,'2417':true,};}

And convert string to chars...

function stringToChars(args) {
    var word = args.word;
    var chars = [];
    
    var endings = getEndWordGroupings();
    
    var incluster = false;
    var cluster = '';
    
    var whitespace = new RegExp("\\s+");
    
    for(var i = word.length - 1; i >= 0; i--) {
        var character = word.charAt(i);
        var charactercode = word.charCodeAt(i);
        
        if(incluster) {
            if(whitespace.test(character)) {
                incluster = false;
                chars.push(cluster);
                cluster = '';
            } else if(endings[charactercode]) {
                chars.push(cluster);
                cluster = character;
            } else {
                incluster = false;
                cluster = character + cluster;
                chars.push(cluster);
                cluster = '';
            }
        } else if(endings[charactercode]) {
            incluster = true;
            cluster = character;
        } else if(whitespace.test(character)) {
            incluster = false;
            chars.push(cluster);
            cluster = '';
        } else {
            chars.push(character);
        }
    }
    
    if(cluster.length > 0) {
        chars.push(cluster);
    }
    
    return chars.reverse();
}

console.log(stringToChars({'word':'क्षऀति'}));

The Results

Output:

["क्", "षऀ", "ति"]

If I had used plain parsing, the output would have been

["क", "्", "ष", "ऀ", "त", "ि"]

Hint: See the two signs up above with a light circle in them? That light circle indicates the location of the character that the sign affects. Looking back at the converted translation, it's very easy to see how the letters were combined into new characters. Neat!
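
Since the question asked for Python, here is a rough Python sketch of the same idea. It is not a line-by-line port: it scans forwards instead of in reverse and simply drops whitespace, which is an assumption made to keep it short. The SIGNS set reuses the code points from the JavaScript table above, and string_to_chars is a name chosen for this example.

```python
# Code points treated as signs, taken from the JavaScript table above.
SIGNS = (
    set(range(2304, 2308))
    | set(range(2362, 2384))
    | {2385, 2386, 2389, 2390, 2391, 2402, 2403, 2416, 2417}
)


def string_to_chars(word):
    """Group each letter with the sign characters that follow it."""
    chars = []
    attach = False  # True when the previous code point can carry a sign
    for ch in word:
        if ch.isspace():
            attach = False  # assumption: whitespace is dropped, not returned
            continue
        if ord(ch) in SIGNS and attach:
            chars[-1] += ch
        else:
            chars.append(ch)
            attach = True
    return chars


print(string_to_chars('क्षऀति'))
```

For the sample word this produces the same three groups as the JavaScript version.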

Pediatrician answered 19/7, 2020 at 18:43 Comment(0)
G
0

There's a pure-Python library called uniseg which provides a number of utilities including a grapheme cluster iterator which provides the behaviour you described:

>>> a = u"बिक्रम मेरो नाम हो"
>>> from uniseg.graphemecluster import grapheme_clusters
>>> for i in grapheme_clusters(a): print(i)
... 
बि
क्
र
म

मे
रो

ना
म

हो

It claims to implement the full Unicode text segmentation algorithm described in http://www.unicode.org/reports/tr29/tr29-21.html.

Gastro answered 13/7, 2016 at 18:47 Comment(2)
Output is not correct. 'क् र' should be 'क्र'. Looks like there is an issue with the library.Tinker
@Tinker That's certainly possible – I would report that to the author: bitbucket.org/emptypage/uniseg-pythonGastro
