tl;dr: Use the \X regular expression to extract user-perceived characters:
>>> import regex # $ pip install regex
>>> regex.findall(u'\\X', u'เมื่อแรกเริ่ม')
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']
While I do not know Thai, I know a little French. Consider the letter è. Let s and s2 both equal è in the Python shell, constructed in two different ways:
>>> s = '\u00e8'
>>> s2 = 'e\u0300'
>>> s
'è'
>>> s2
'è'
Same letter? To a French speaker visually, oui. To a computer, no:
>>> s==s2
False
You can create the same letter either by using the precomposed code point for è or by taking the letter e and appending a combining code point that adds the accent. The two forms have different encodings:
>>> s.encode('utf-8')
b'\xc3\xa8'
>>> s2.encode('utf-8')
b'e\xcc\x80'
And different lengths:
>>> len(s)
1
>>> len(s2)
2
But visually, both encodings result in the 'letter' è. This is called a grapheme: what the end user considers one character.
You can demonstrate the same looping behavior you are seeing:
>>> [c for c in s]
['è']
>>> [c for c in s2]
['e', '̀']
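You can inspect what each of those code points actually is with the stdlib unicodedata module (a quick sketch using the decomposed string from above):

```python
import unicodedata

s2 = 'e\u0300'  # 'e' followed by a combining grave accent
for ch in s2:
    # combining() returns 0 for base characters and a non-zero
    # combining class for combining marks
    print(f'U+{ord(ch):04X} {unicodedata.name(ch)} '
          f'(combining class {unicodedata.combining(ch)})')
```

This prints LATIN SMALL LETTER E with combining class 0, then COMBINING GRAVE ACCENT with combining class 230, which is why the list comprehension above yields two items.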
Your string has several combining characters in it. Hence a 9-grapheme Thai string to your eyes becomes a 13-character string to Python.
The solution in French is to normalize the string based on Unicode equivalence:
>>> from unicodedata import normalize
>>> normalize('NFC', s2) == s
True
That does not work for many non-Latin languages, though. An easy way to handle Unicode strings in which multiple code points may compose a single grapheme is a regex engine that correctly deals with this by supporting \X. Unfortunately, Python's included re module doesn't yet.
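You can see the limitation on the Thai string itself (a small sketch reusing the string from the question): NFC normalization leaves its code-point count unchanged, because Unicode defines no precomposed forms for these Thai combinations.

```python
from unicodedata import normalize

text = 'เมื่อแรกเริ่ม'
# NFC cannot compose the Thai base characters with their combining
# marks, so normalization does not change the code-point count.
print(len(text))                    # 13 code points
print(len(normalize('NFC', text)))  # still 13
```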
The proposed replacement, regex, does support \X though:
>>> import regex
>>> text = 'เมื่อแรกเริ่ม'
>>> regex.findall(r'\X', text)
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']
>>> len(_)
9
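If you need grapheme counts in several places, you could wrap the pattern in a tiny helper (grapheme_len is a hypothetical name for illustration, not part of the regex module):

```python
import regex  # third-party: pip install regex

def grapheme_len(s):
    """Count user-perceived characters (grapheme clusters), not code points."""
    return len(regex.findall(r'\X', s))

print(grapheme_len('e\u0300'))       # 1, even though len() reports 2
print(grapheme_len('เมื่อแรกเริ่ม'))  # 9
```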