I used BreakIterator.getWordInstance to split Chinese text into words. Here is my example:
import java.text.BreakIterator;
import java.util.Locale;

public class Sample {
    public static void main(String[] args) {
        String stringToExamine = "I like to eat apples. 我喜欢吃苹果。";
        // Print each word in order
        BreakIterator boundary = BreakIterator.getWordInstance(new Locale("zh", "CN"));
        boundary.setText(stringToExamine);
        printEachForward(boundary, stringToExamine);
    }

    // Walk the boundaries front to back and print each segment with its start offset
    public static void printEachForward(BreakIterator boundary, String source) {
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            System.out.println(start + ": " + source.substring(start, end));
        }
    }
}
My example text is taken from https://mcmap.net/q/830714/-how-does-breakiterator-work-in-android
The output that I get is:
0: I
1:
2: like
6:
7: to
9:
10: eat
13:
14: apples
20: .
21:
22: 我喜欢吃苹果
28: 。
Whereas the expected output is:
0: I
1:
2: like
6:
7: to
9:
10: eat
13:
14: apples
20: .
21:
22: 我
23: 喜欢
25: 吃
26: 苹果
28: 。
I even tried pure Chinese text, but it is still split only at whitespace and punctuation characters, never between the Chinese words themselves.
I am programming for a server, so the jar file size is not a big concern. I am trying to count the words that differ between a given content and a sample content, using Longest Common Subsequence (on words rather than characters).
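To make the goal concrete, here is a minimal sketch of the word-level comparison, assuming both texts have already been split into word lists. The class name WordDiff and the methods lcsLength and differingWords are hypothetical names chosen for illustration, and counting "different" words as the content words that fall outside the common subsequence is only one possible definition.

```java
import java.util.Arrays;
import java.util.List;

final class WordDiff {
    // Length of the longest common subsequence of two word lists,
    // computed with the standard O(n * m) dynamic-programming table.
    static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.size()][b.size()];
    }

    // Assumed definition: words in content that are not part of the
    // longest common subsequence shared with sample.
    static int differingWords(List<String> sample, List<String> content) {
        return content.size() - lcsLength(sample, content);
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("I", "like", "to", "eat", "apples");
        List<String> content = Arrays.asList("I", "like", "to", "eat", "pears");
        System.out.println(differingWords(sample, content)); // prints 1
    }
}
```

The comparison only gives meaningful results for Chinese if the segmentation step yields real words (for example 我, 喜欢, 吃, 苹果) instead of the whole run 我喜欢吃苹果 as a single token, which is why the word breaking matters here.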
What am I doing wrong?