A Viable Solution for Word Splitting Khmer?
I am working on a solution to split long lines of Khmer (the Cambodian language) into individual words (in UTF-8). Khmer does not use spaces between words. There are a few solutions out there, but they are far from adequate (here and here), and those projects have fallen by the wayside.

Here is a sample line of Khmer that needs to be split (they can be longer than this):

ចូរសរសើរដល់ទ្រង់ដែលទ្រង់បានប្រទានការទាំងអស់នោះមកដល់រូបអ្នកដោយព្រោះអង្គព្រះយេស៊ូវ ហើយដែលអ្នកមិនអាចរកការទាំងអស់នោះដោយសារការប្រព្រឹត្តរបស់អ្នកឡើយ។

The goal of creating a viable solution that splits Khmer words is twofold: it will encourage those who use legacy (non-Unicode) Khmer fonts to convert to Unicode (which has many benefits), and it will make it possible to bring legacy Khmer text into Unicode and run it through a spelling checker quickly (rather than manually going through and splitting words, which, with a large document, can take a very long time).

I don't need 100% accuracy, but speed is important (especially since the line that needs to be split into Khmer words can be quite long). I am open to suggestions, but currently I have a large corpus of Khmer words that are correctly split (with a non-breaking space), and I have created a word probability dictionary file (frequency.csv) to use as a dictionary for the word splitter.

I found this Python code here that uses the Viterbi algorithm, and it supposedly runs fast.

import re
from itertools import groupby

def viterbi_segment(text):
    """Return (words, prob) for the most probable segmentation of text."""
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        # Best break point j for a word ending at position i.
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary.get(word, 0) / total
def words(text): return re.findall('[a-z]+', text.lower())
dictionary = dict((w, len(list(ws)))
                  for w, ws in groupby(sorted(words(open('big.txt').read()))))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))
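To connect that snippet to a frequency list like mine, something like the following could work. This is a minimal sketch, assuming frequency.csv holds one word,count pair per line in UTF-8 (the exact layout of your file may differ); the toy demo dictionary at the bottom is made up for illustration:

```python
# Sketch: wire the Viterbi snippet above to a word-frequency file.
# Assumes frequency.csv holds one "word,count" pair per line (UTF-8).
import csv

def load_frequencies(path):
    """Read word,count pairs into a dict of counts."""
    with open(path, encoding='utf-8') as f:
        return {word: int(count) for word, count in csv.reader(f)}

def make_segmenter(dictionary):
    """Build a segmenter closed over one frequency dictionary."""
    total = float(sum(dictionary.values()))
    max_len = max(map(len, dictionary))

    def word_prob(word):
        return dictionary.get(word, 0) / total

    def segment(text):
        probs, lasts = [1.0], [0]
        for i in range(1, len(text) + 1):
            prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                            for j in range(max(0, i - max_len), i))
            probs.append(prob_k)
            lasts.append(k)
        words, i = [], len(text)
        while i > 0:
            words.append(text[lasts[i]:i])
            i = lasts[i]
        words.reverse()
        return words

    return segment

# Demo with a toy English dictionary; in practice use:
#   segment = make_segmenter(load_frequencies('frequency.csv'))
segment = make_segmenter({'file': 123, 'save': 200, 'as': 555, 'ave': 1})
print(segment('filesaveas'))  # ['file', 'save', 'as']
```

Using a closure instead of module-level globals lets several dictionaries (say, per-domain frequency lists) coexist.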

I also tried using the source java code from the author of this page: Text segmentation: dictionary-based word splitting but it ran too slow to be of any use (because my word probability dictionary has over 100k terms...).

And here is another option in python from Detect most likely words from text without spaces / combined words:

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    # Memoized search for the highest-probability split of text.
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in range(1, len(text) + 1):  # xrange in the original Python 2 code
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word, None)
        if freq:
            remainder_freq, remainder = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder
    cache[text] = (best_freq, best_split)
    return cache[text]

print(split_text('filesaveas', WORD_FREQUENCIES, {}))

--> (1.3653e-08, ['file', 'save', 'as'])

I am a newbie when it comes to Python, and I am really new to all real programming (outside of websites), so please bear with me. Does anyone have any options that they feel would work well?

Monafo answered 1/2, 2011 at 10:48 Comment(1)
Additional implementations in Python and Ruby at #11448359Arly
The ICU library (that has Python and Java bindings) has a DictionaryBasedBreakIterator class that can be used for this.

Austriahungary answered 1/2, 2011 at 11:25 Comment(9)
@Lennart Thanks - Yes, I've seen the ICU library DictionaryBasedBreakIterator class - but because I am so limited in programming experience I was unable to do anything with it. I see there are some examples here: source.icu-project.org/repos/icu/icu/trunk/source/samples/break but do you know of any Python and Java examples that would get me started (Sometimes I can edit a script if enough is done)? Or are there some examples there that I am missing...Monafo
@Nathan: Yeah, the ICU Python bindings don't have any real docs, which is a shame. No, I don't know of any examples, sorry. If you have a dictionary I could try and see if I can figure something out.Austriahungary
Here is the frequency dictionary I have so far. It's not huge, but it's a start: sbbic.org/Khmer-Corpus-Work.zip (I also included a sample txt file in Khmer - all in UTF-8) Any way you would be willing to help would be awesome. Thanks for taking the time to look into it.Monafo
Thai and Khmer are closely related and since the DictionaryBasedBreakIterator class was made for Thai perhaps there is a way to take from the Thai source and make a few changes for Khmer?Monafo
@Nathan: I'm sure what language it is doesn't change the way you use it so, yes, of course. But I can't find any single example of how to use that class in any language, and the Python API doesn't seem to correspond with the C API. That may be a bug. I think you should contact the author(s) of PyICU.Austriahungary
Ok, thanks Lennart - I will look into this more. Is ICU known for putting out the best solutions for an issue (for instance their Thai word breaker)?Monafo
@Monafo we try to provide the best solutions. I think the users of ICU are the best way to judge it. I saw your post to the icu-support list, also ICU's break data comes from CLDR, so you should probably file a ticket there cldr.unicode.org - also the python bindings are (as far as I understand) meant to be used together with ICU's own documentation.Philippine
@Steven Thanks for your response, I will look into CLDR - I realize word breaking is a very complex problem, but hopefully there will eventually be a solution that is practical; and I will do my best to contribute.Monafo
We have the beginnings of a word splitter for Khmer with ICU - bugs.icu-project.org/trac/ticket/8329 Hopefully it will get added soon!Monafo
The python with example filesaveas appears to recurse through the entire input string (for i in xrange(1, len(text) + 1)), stuffing the best results into the cache along the way; at each potential word, it then starts looking at the next word (which will in turn look at the word after that, and so on), and if that second word doesn't look very good, it won't save that particular one. It feels like O(N!) runtime, where N is the length of the input string.

Super clever, but probably horrible for anything but simple tasks. What's the longest Khmer word you've got? I'm hoping < 20 characters.

Maybe if you feed input into that example 20 characters at a time you can keep the runtime down to something approaching reasonable. Feed in the first 20 chars, suck off the first word, and then feed in the remaining input. If you re-use the cache it might do something silly like store partial words along the way.
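That windowing idea could be sketched like this, reusing a Python 3 version of the split_text() function from the question (toy English frequencies stand in for real Khmer data, and the 20-character window is the guess from above):

```python
# Sketch of the sliding-window idea: split only a short window, keep the
# first word, advance, repeat. Toy frequencies; real use needs Khmer data.
WORD_FREQUENCIES = {'file': 0.00123, 'save': 0.002, 'as': 0.00555}

def split_text(text, freqs, cache):
    # Memoized best-probability split (Python 3 version of the question's code).
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best = (0, [])
    for i in range(1, len(text) + 1):
        word, rest = text[:i], text[i:]
        f = freqs.get(word)
        if f:
            rest_freq, rest_words = split_text(rest, freqs, cache)
            if f * rest_freq > best[0]:
                best = (f * rest_freq, [word] + rest_words)
    cache[text] = best
    return best

def split_windowed(text, freqs, window=20):
    words = []
    while text:
        _, split = split_text(text[:window], freqs, {})  # fresh cache per window
        if not split:              # window has no full split: pass one char through
            words.append(text[0])
            text = text[1:]
            continue
        words.append(split[0])     # keep only the first word, then slide
        text = text[len(split[0]):]
    return words

print(split_windowed('filesaveas', WORD_FREQUENCIES))  # ['file', 'save', 'as']
```

As warned above, a window that cuts through its last word can make the whole window unsplittable; the one-character fallback here is only a crude way to keep going.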

On a completely different tack, how many Khmer words are formed by concatenating two or more legal Khmer words (similar to 'penknife' or 'basketball')? If not too many, it might make sense to create a set of dictionaries, segregated by length of word, mapping from word to probability of use.

Say, the longest Khmer word is 14 chars long; feed in 14 characters of input into the len14 dictionary, store the probability. Feed in 13 characters into len13, store the probability. Feed in 12 characters ... all the way down to 1 into len1. Then pick the interpretation with the highest probability, save the word, strip off that many characters, and try again.

So that it won't fail badly for inputs like "I" vs. "Image", maybe longer words should have automatically inflated probabilities?
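A sketch of that per-length greedy scheme (toy frequencies, not real Khmer data) shows the "I" vs. "Image" trap in action:

```python
# Sketch of the length-bucketed greedy idea: at each position, try every
# prefix up to the longest dictionary word, keep the most probable match,
# emit it, and continue. Toy frequencies for illustration only.
WORD_PROBS = {'image': 0.0004, 'i': 0.02, 'mage': 0.0001}

def greedy_segment(text, probs):
    max_len = max(map(len, probs))
    words = []
    while text:
        candidates = [(probs[text[:n]], text[:n])
                      for n in range(1, min(max_len, len(text)) + 1)
                      if text[:n] in probs]
        if not candidates:
            words.append(text[0])  # unknown character: pass it through
            text = text[1:]
            continue
        _, word = max(candidates)  # highest probability wins, regardless of length
        words.append(word)
        text = text[len(word):]
    return words

print(greedy_segment('image', WORD_PROBS))  # ['i', 'mage'] -- the short word wins
```

Because the frequent one-letter word outscores the longer one, the greedy pass splits 'image' wrongly, which is exactly why inflating the probabilities of longer words (or searching over the whole line, Viterbi-style) helps.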

Thanks for the fun question ;) I didn't know of any languages like this, pretty cool.

Lyonnaise answered 1/2, 2011 at 11:42 Comment(1)
Thanks for your input - yes there are quite a few concatenated words in Khmer, but the nice thing is we can ignore them for the most part (because both are legal, and there aren't any visible spaces). I might have bitten off more than I can chew here, but it is good to know the python example wouldn't work well with a bunch of characters - though processing 20 characters at a time is a good idea...Monafo
I think this is a good idea, as it is.

I suggest that, once you have some experience with it, you add some rules that can be very specific: for example, rules depending on the word before, the word after, the surrounding words, or a sequence of words before the current word, just to enumerate the most frequent kinds. You can find a set of rules in the gposttl.sf.net project (a POS-tagging project), in the file data/contextualrulefile.

Rules should be applied AFTER the statistical evaluation is finished; they do some fine tuning and can improve accuracy remarkably.

Trivet answered 2/2, 2011 at 21:16 Comment(0)
