Splitting Thai text by characters

3

8

I want to split Thai text into individual characters, not at word boundaries (that part is solvable).

Example:

#!/usr/bin/env python3  
text = 'เมื่อแรกเริ่ม'  
for char in text:  
    print(char)  

This produces:






Which obviously is not the desired output. Any ideas?

A portable representation of text is:

text = u'\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'
Hawse answered 7/5, 2015 at 14:26 Comment(6)
While the "obviously wrong" nature of the output is apparent to you, it will not be to most of us. What makes it wrong?Lucifer
It seems perfectly fine for me. What is the desired output?Dermatitis
Thai text is difficult for Latin-oriented users. A character that carries marks gets split into several pieces (up to 3 UTF-8 characters), for example around the 3rd character in the text.Hawse
Take a look at this: #13826831Dermatitis
I can't reproduce desired output since stackoverflow copy/paste is not representing well those characters (it acts similar to python split)Hawse
I would find it helpful if you could: 1) provide what you would like the desired output to be; 2) provide an ascii string of unicode character identifiers for your sample ( u'\u0e40' , etc)Paulapauldron
11

tl;dr: Use \X regular expression to extract user-perceived characters:

>>> import regex # $ pip install regex
>>> regex.findall(u'\\X', u'เมื่อแรกเริ่ม')
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']

While I do not know Thai, I know a little French.

Consider the letter è. Let s and s2 equal è in the Python shell:

>>> s
'è'
>>> s2
'è'

Same letter? To a French speaker visually, oui. To a computer, no:

>>> s==s2
False

You can create the same letter either using the actual code point for è or by taking the letter e and adding a combining code point that adds that accent character. They have different encodings:

>>> s.encode('utf-8')
b'\xc3\xa8'
>>> s2.encode('utf-8')
b'e\xcc\x80'

And different lengths:

>>> len(s)
1
>>> len(s2)
2

But visually both encodings result in the 'letter' è. This is called a grapheme cluster, or what the end user perceives as one character.
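For anyone who wants to reproduce the session above, here is a minimal sketch that builds the two strings from explicit code points. It assumes s is the precomposed U+00E8 and s2 is U+0065 followed by the combining grave accent U+0300, which matches the encodings and lengths shown:

```python
import unicodedata

s = '\u00e8'    # precomposed LATIN SMALL LETTER E WITH GRAVE
s2 = 'e\u0300'  # plain 'e' followed by COMBINING GRAVE ACCENT

print(s == s2)          # False: the code point sequences differ
print(len(s), len(s2))  # 1 2
print([unicodedata.name(c) for c in s2])
```

Writing the escapes out avoids the copy/paste ambiguity, since both strings render identically on screen.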

You can demonstrate the same looping behavior you are seeing:

>>> [c for c in s]
['è']
>>> [c for c in s2]
['e', '̀']

Your string has several combining characters in it. Hence a Thai string that looks like 9 characters to your eyes is a 13 code point string to Python.

The solution in French is to normalize the string based on Unicode equivalence:

>>> from unicodedata import normalize
>>> normalize('NFC', s2) == s
True

That does not work for many non-Latin languages though. An easy way to deal with Unicode strings in which several code points may compose a single grapheme is to use a regex engine that handles this correctly by supporting \X. Unfortunately, Python's built-in re module does not support \X yet.

The proposed replacement, regex, does support \X though:

>>> import regex
>>> text = 'เมื่อแรกเริ่ม'
>>> regex.findall(r'\X', text)
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']
>>> len(_)
9
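As a check on the earlier point that normalization alone does not help with Thai, here is a small sketch using only the standard library. The escaped string is the portable form of the text given in the question. NFC has nothing to compose here because these Thai marks have no precomposed equivalents, so the length stays at 13:

```python
from unicodedata import normalize

# Portable form of the Thai sample text from the question.
text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'

composed = normalize('NFC', text)
print(len(text), len(composed))                 # 13 13
print(composed == text)                         # True: nothing was composed
print(normalize('NFC', 'e\u0300') == '\u00e8')  # True: the French case does compose
```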
Businesslike answered 7/5, 2015 at 15:41 Comment(10)
Thanks for your effort (the upvote is from me). There might be something in this direction; however, UTF-8 and Thai are not best friends @BusinesslikeHawse
I had also looked at normalize, and it did not work for the Thai characters. But regex seems to be a really nice tool :-)Quotha
cool solution with newest regex @Businesslike I can't make two accepted answersHawse
Using regex with \X is more robust. Serge Ballesta's solution is only combining the characters for console output -- not in a logical fashion.Businesslike
Hmm @Businesslike, does the pattern r'\X' match single characters in all languages (not only Thai)? If yes, then the solution is robust!Hawse
Yes -- any language. It takes a regular letter and combines it with all following combination marks.Businesslike
you win @Businesslike . Congratulation :)Hawse
s2 is not a grapheme: it is a grapheme cluster, the so-called "user-perceived character". For clarity, you could use explicit Unicode code point numbers such as è (U+00E8) or è (U+0065 U+0300).Adrieneadrienne
note: \X regex handles eXtended grapheme clusters such as กำ (U+0E01 U+0E33). It doesn't work for Tailored grapheme clusters such as Slovak ch digraph (U+0063 U+0068).Adrieneadrienne
I've added summary. Feel free to rollback.Adrieneadrienne
3

I cannot reproduce this exactly, but here is a slightly modified version of your script, with its output in IDLE 3.4 on a 64-bit Windows 7 system:

>>> import unicodedata
>>> for char in text:
    print(char, hex(ord(char)), unicodedata.name(char),'-',
          unicodedata.category(char), '-', unicodedata.combining(char), '-',
          unicodedata.east_asian_width(char))


เ 0xe40 THAI CHARACTER SARA E - Lo - 0 - N
ม 0xe21 THAI CHARACTER MO MA - Lo - 0 - N
ื 0xe37 THAI CHARACTER SARA UEE - Mn - 0 - N
่ 0xe48 THAI CHARACTER MAI EK - Mn - 107 - N
อ 0xe2d THAI CHARACTER O ANG - Lo - 0 - N
แ 0xe41 THAI CHARACTER SARA AE - Lo - 0 - N
ร 0xe23 THAI CHARACTER RO RUA - Lo - 0 - N
ก 0xe01 THAI CHARACTER KO KAI - Lo - 0 - N
เ 0xe40 THAI CHARACTER SARA E - Lo - 0 - N
ร 0xe23 THAI CHARACTER RO RUA - Lo - 0 - N
ิ 0xe34 THAI CHARACTER SARA I - Mn - 0 - N
่ 0xe48 THAI CHARACTER MAI EK - Mn - 107 - N
ม 0xe21 THAI CHARACTER MO MA - Lo - 0 - N
>>>

I really do not know what those characters can be - my Thai is very poor :-) - but it shows that :

  • text is acknowledged to be Thai ...
  • output is coherent with len(text) (13)
  • category (Mn instead of Lo) and, for some marks, combining differ for the characters that combine with the preceding one

If this is the expected output, it shows that your problem is not in Python but in the console where you display it. Try redirecting the output to a file, and then open the file in a Unicode-aware editor that supports Thai characters.
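A minimal sketch of that file-based check, assuming a UTF-8 output file is acceptable. The helper name dump_code_points and the file name thai_dump.txt are made up for illustration:

```python
import os
import tempfile
import unicodedata

# Portable form of the Thai sample text from the question.
text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'

def dump_code_points(s, path):
    """Write one line per code point: the character, its hex value and its name."""
    with open(path, 'w', encoding='utf-8') as f:
        for ch in s:
            f.write('{}\tU+{:04X}\t{}\n'.format(ch, ord(ch), unicodedata.name(ch)))

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), 'thai_dump.txt')
dump_code_points(text, out_path)
```

Opening the resulting file in an editor with Thai support separates what Python produced from what the console was able to render.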

If the expected output is only 9 characters, that is, if you do not want to decompose composed characters, and provided there are no other composition rules to consider, you could use something like:

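import unicodedata

def Thaidump(t):
    old = None  # cluster being built: a base character plus its marks
    for i in t:
        if unicodedata.category(i) == 'Mn':
            # Nonspacing mark: attach it to the pending base character.
            # A leading mark with no base starts its own cluster.
            old = i if old is None else old + i
        else:
            # New base character: print the finished cluster first.
            if old is not None:
                print(old)
            old = i
    if old is not None:  # nothing to print for an empty string
        print(old)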

That way :

>>> Thaidump(text)
เ
มื่
อ
แ
ร
ก
เ
ริ่
ม
>>> 
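If a list is more convenient than printed output, the same idea can be sketched as a function that returns the clusters. The name split_clusters is made up for illustration, and the rule is the same simplification as above: only category Mn is treated as combining.

```python
import unicodedata

def split_clusters(s):
    """Group each base character with the nonspacing marks (Mn) that follow it."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch) == 'Mn':
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

# Portable form of the Thai sample text from the question.
text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'
parts = split_clusters(text)
print(len(parts))  # 9
```

This is not a full grapheme cluster algorithm. For instance, SARA AM (U+0E33) has category Lo, so this rule keeps it separate from its consonant, while \X in the regex module groups the two.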
Quotha answered 7/5, 2015 at 14:59 Comment(4)
Thanks @serge-ballesta, I'm reading your answer carefully. The problem is that len(text) should be 9, not 13. It seems the strategy of using UTF-8 had better change. ReadingHawse
@Hawse : this goes beyond my knowledge of Thai, which is why I added unicodedata.category and combining. By putting all that together, it is possible to display only 9 characters by combining the decomposed characters, provided there are no other special rules to considerQuotha
let me check @serge-ballesta your newest function in python3Hawse
Also worth mentioning: here on Stack Overflow the split characters are not rendered well, but they look fine when the Python script runs in a terminal. I still have to check whether they are single UTF-8 characters or whether the problematic ones have len > 1Hawse
2

For clarification of the previous answers, the issue you have is that the missing characters are "combining characters" - vowels and diacritics that must be combined with other characters in order to be displayed properly. There is no standard way to display these characters by themselves, although the most common convention is to use a dotted circle as a null consonant as shown in the answer by Serge Ballesta.

The question is then: for your application, is each vowel and diacritic considered a separate character, or do you wish to separate by "print cell" as shown in Serge's answer?

By the way, in normal usage the lead vowels SARA E and SARA AE should not be displayed without a following consonant except in the process of typing a longer word.

For more information, see the WTT 2.0 standard published by the Thai API Consortium (TAPIC) which defines how characters can be combined, displayed and how to cope with errors.

Playboy answered 20/7, 2017 at 13:34 Comment(0)
