How to split a Thai sentence, which does not use spaces, into words?
S

5

16

How can I split a Thai sentence into words? In English, we can split words on spaces.

Example: I go to school, split = ['I', 'go', 'to', 'school']. The split only needs to look at spaces.

But Thai does not put spaces between words, so I don't know how to do it. For example, I want to split ฉันจะไปโรงเรียน, read from a txt file, into ['ฉัน', 'จะ', 'ไป', 'โรง', 'เรียน'] and write the output to another txt file.

Are there any programs or libraries that identify Thai word boundaries and split?

Sepsis answered 11/12, 2012 at 18:34 Comment(5)
Here's one person's solution: issues.apache.org/jira/browse/LUCENE-503 - Cauca
If there is no token, then you can't split a string like that. - Deni
This will not be a simple tool built into a language like C or Python; it will require a library built with detailed knowledge of the Thai language (which I imagine exists). - Cleisthenes
@Deni - IMO should post as answer (the library link). - Cleisthenes
@Cleisthenes OK, you've twisted my arm. ;-) - Cauca
C
8

In 2006, someone contributed code to the Apache Lucene project to make this work.

Their approach (written in Java) was to use the BreakIterator class, calling getWordInstance() to get a dictionary-based word iterator for the Thai language. Note also that there is a stated dependency on the ICU4J project. I have pasted the relevant section of their code below:

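  // Word-boundary iterator for the Thai locale. Depending on the version of the
  // patch, BreakIterator comes from ICU4J or from the standard java.text package.
  private BreakIterator breaker = null;
  // The Thai token currently being split, or null when none is in progress.
  private Token thaiToken = null;

  public ThaiWordFilter(TokenStream input) {
    super(input);
    breaker = BreakIterator.getWordInstance(new Locale("th"));
  }

  public Token next() throws IOException {
    // Still in the middle of a Thai token: emit its next word.
    if (thaiToken != null) {
      String text = thaiToken.termText();
      int start = breaker.current();
      int end = breaker.next();
      if (end != BreakIterator.DONE) {
        // Offsets are shifted so they stay relative to the original input.
        return new Token(text.substring(start, end),
            thaiToken.startOffset()+start,
            thaiToken.startOffset()+end, thaiToken.type());
      }
      // No more boundaries: this Thai token is fully consumed.
      thaiToken = null;
    }
    // Fetch the next token from the upstream TokenStream.
    Token tk = input.next();
    if (tk == null) {
      return null;
    }
    String text = tk.termText();
    // Non-Thai tokens pass through unchanged apart from lowercasing.
    if (UnicodeBlock.of(text.charAt(0)) != UnicodeBlock.THAI) {
      return new Token(text.toLowerCase(),
                       tk.startOffset(),
                       tk.endOffset(),
                       tk.type());
    }
    // Thai token: start splitting it and return its first word.
    thaiToken = tk;
    breaker.setText(text);
    int end = breaker.next();
    if (end != BreakIterator.DONE) {
      return new Token(text.substring(0, end),
          thaiToken.startOffset(),
          thaiToken.startOffset()+end,
          thaiToken.type());
    }
    return null;
  }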
Cauca answered 11/12, 2012 at 18:49 Comment(1)
FYI, if you read the history on the Lucene tracker, you'll see that the dependency on ICU4J was removed from the final code. It just uses the standard Java libraries. However, if you were going to do this on a non-Java platform, you could use the ICU project. - Cauca
A
4

There are multiple ways to tokenize Thai text into words. One is a dictionary-based or pattern-based approach: the algorithm scans through the characters, and whenever a sequence of characters matches a dictionary entry, it is counted as a word.

There are also more recent libraries that tokenize Thai text with deep learning models trained on the BEST corpus, including rkcosmos/deepcut, pucktada/cutkum, and others.

Example usage of deepcut:

import deepcut
deepcut.tokenize('ฉันจะไปโรงเรียน')
# output as ['ฉัน', 'จะ', 'ไป', 'โรง', 'เรียน']
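
The question also asks for reading the sentence from one txt file and writing the words to another. A minimal sketch of that step, assuming UTF-8 files with one sentence per line: the function name segment_file and the "|" separator are hypothetical choices for illustration, and any tokenizer callable (for example deepcut.tokenize) can be passed in.

```python
from pathlib import Path
from typing import Callable, List


def segment_file(src: str, dst: str,
                 tokenize: Callable[[str], List[str]],
                 sep: str = "|") -> None:
    """Tokenize each line of src and write the words, joined by sep, to dst.

    Hypothetical helper: the file layout and separator are assumptions.
    """
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    segmented = [sep.join(tokenize(line)) for line in lines]
    Path(dst).write_text("\n".join(segmented) + "\n", encoding="utf-8")


# Possible usage, assuming deepcut is installed:
#   import deepcut
#   segment_file("input.txt", "output.txt", deepcut.tokenize)
```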
Avoidance answered 5/7, 2017 at 22:31 Comment(0)
E
1

The simplest segmenter for Chinese and Japanese uses a greedy dictionary-based scheme. This should work just as well for Thai: get a dictionary of Thai words, and at the current character, match the longest string starting from that character that exists in the dictionary. This gets you a pretty decent segmenter, at least in Chinese and Japanese.
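
The greedy longest-match idea can be sketched as follows. This is only an illustration under assumptions: greedy_segment and TOY_DICTIONARY are hypothetical names, the word list is a toy one, and a real segmenter would need a full Thai dictionary and a better strategy for unknown text.

```python
def greedy_segment(text, dictionary):
    """Greedy longest-match segmentation against a set of known words."""
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            # Nothing matched: emit the single character as its own token.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


# Toy dictionary (an assumption for this example only).
TOY_DICTIONARY = {"ฉัน", "จะ", "ไป", "โรง", "เรียน"}
print(greedy_segment("ฉันจะไปโรงเรียน", TOY_DICTIONARY))
# prints ['ฉัน', 'จะ', 'ไป', 'โรง', 'เรียน']
```

If the dictionary also contained the compound โรงเรียน, the greedy rule would prefer it over โรง followed by เรียน, so the result depends heavily on the word list.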

Emerald answered 12/12, 2012 at 12:2 Comment(1)
In Chinese and Japanese, it's very clear where the syllable breaks are. Thai is a bit more complex in that regard. - Interact
A
1

Here's how to split Thai text into words using Kotlin and ICU4J. ICU4J is a better choice than Lucene's version (last updated 6/2011), because ICU4J is constantly updated and has additional related tools. Search for icu4j at mvnrepository.com to see them all.

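import com.ibm.icu.text.BreakIterator
import java.util.Locale

fun splitIntoWords(s: String): List<String> {
    // Word-boundary iterator for the Thai locale, provided by ICU4J.
    val wordBreaker = BreakIterator.getWordInstance(Locale("th"))
    wordBreaker.setText(s)
    var startPos = wordBreaker.first()
    var endPos = wordBreaker.next()

    val words = mutableListOf<String>()

    // Each consecutive pair of boundaries delimits one segment.
    // Note that spaces and punctuation come back as segments too.
    while (endPos != BreakIterator.DONE) {
        words.add(s.substring(startPos, endPos))
        startPos = endPos
        endPos = wordBreaker.next()
    }

    return words
}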
Afterbrain answered 15/7, 2020 at 15:59 Comment(0)
L
1

In a way, you don't need spaces between Thai words because a good 70% of words have their own built-in "demarcators".

This is how I teach foreigners to read Thai.

It does involve some heuristics, though, so it is not as straightforward as splitting on spaces in other languages.

For a start, all the "left hand vowels" (like เ แ โ ไ ใ) signify the start of a word.

The ห letter nearly always starts a word too.

There are several letters that always end a word (like ะ).

And there are combinations of letters that signify the word ends at the next letter, e.g. เบิx เxา เxย บัx บืx vบ็x/บ็อx (where บ is any consonant letter or "cluster", x is any consonant, and v is a vowel).

Unfortunately, when you've used up all the built-in demarcators, you have to do a dictionary search and invoke the heuristics - because some combinations of letters could be read more than one way - and you have to know from the context which is the correct word. If you have a decent vocabulary then it's usually obvious.
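
A rough sketch of the first two rules above, taking them at face value: cut before every "left hand vowel" and after ะ. The function name candidate_chunks is hypothetical, and the ห rule and the letter-combination rules are deliberately left out, so this only produces candidate chunks rather than a final segmentation.

```python
LEADING_VOWELS = set("เแโไใ")  # "left hand vowels": treated as word starts
WORD_FINAL = set("ะ")          # letters treated as always ending a word


def candidate_chunks(text):
    """Split text at the built-in demarcators only (no dictionary lookup)."""
    if not text:
        return []
    cuts = set()
    for i, ch in enumerate(text):
        if ch in LEADING_VOWELS and i > 0:
            cuts.add(i)          # a word starts at this character
        if ch in WORD_FINAL and i + 1 < len(text):
            cuts.add(i + 1)      # a word ends right after this character
    chunks = []
    start = 0
    for cut in sorted(cuts):
        chunks.append(text[start:cut])
        start = cut
    chunks.append(text[start:])
    return chunks


print(candidate_chunks("ฉันจะไปโรงเรียน"))
# prints ['ฉันจะ', 'ไป', 'โรง', 'เรียน']
```

The first chunk, ฉันจะ, still contains two words (ฉัน and จะ). That is the point where the built-in demarcators run out and a dictionary search is needed.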

Leprechaun answered 8/4, 2023 at 11:44 Comment(0)
