BreakIterator not working correctly with Chinese text

I used BreakIterator.getWordInstance to split Chinese text into words. Here is my example:

import java.text.BreakIterator;
import java.util.Locale;

public class Sample {
    public static void main(String[] args) {
        String stringToExamine = "I like to eat apples. 我喜欢吃苹果。";

        //print each word in order
        BreakIterator boundary = BreakIterator.getWordInstance(new Locale("zh", "CN"));
        boundary.setText(stringToExamine);

        printEachForward(boundary, stringToExamine);
    }

    public static void printEachForward(BreakIterator boundary, String source) {
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            System.out.println(start + ": " + source.substring(start, end));
        }
    }
}

My example text is taken from https://mcmap.net/q/830714/-how-does-breakiterator-work-in-android

The output that I get is

0: I
1:  
2: like
6:  
7: to
9:  
10: eat
13:  
14: apples
20: .
21:  
22: 我喜欢吃苹果
28: 。

Whereas, the expected output is

0: I
1:  
2: like
6:  
7: to
9:  
10: eat
13:  
14: apples
20: .
21:  
22: 我
23: 喜欢
25: 吃
26: 苹果
28: 。

I even tried pure Chinese text, but it is still split only at whitespace and punctuation characters.

I am programming for a server, so the jar file size is not a big concern. I am trying to count the words in a given text that differ from a sample text, using Longest Common Subsequence (on words rather than characters).
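To make the goal concrete, here is a minimal sketch of that word-level comparison. It assumes the text has already been split into word tokens, which is the step that fails for Chinese. The names WordLcs, lcsLength and differingWords are illustrative only, and counting "different" words as those outside the longest common subsequence is just one possible definition.

```java
import java.util.Arrays;
import java.util.List;

public class WordLcs {
    // Length of the longest common subsequence of two token lists (classic O(n*m) DP).
    public static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.size()][b.size()];
    }

    // Assumed definition: words of the content that are not in the common subsequence.
    public static int differingWords(List<String> sample, List<String> content) {
        return content.size() - lcsLength(sample, content);
    }

    public static void main(String[] args) {
        // Hand-tokenized input; a working word splitter would produce these lists.
        List<String> sample = Arrays.asList("我", "喜欢", "吃", "苹果");
        List<String> content = Arrays.asList("我", "喜欢", "吃", "香蕉");
        System.out.println(differingWords(sample, content)); // prints 1
    }
}
```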

What am I doing wrong?

Behavior answered 12/6, 2017 at 20:4 Comment(1)
@Suragch I am programming for a server, so the jar file size is not a big concern. I am trying to count the words in a given text that differ from a sample text, using Longest Common Subsequence (but on words). - Behavior

The standard java.text.BreakIterator does not detect "word" boundaries within unbroken runs of CJK ideographs, which is why the whole Chinese phrase comes back as a single segment. There is a bug report on this subject, but it was closed in 2006 as "Won't Fix".

Instead, you'll need to use the ICU implementation. If you're developing on Android, you already have this as android.icu.text.BreakIterator. Otherwise, you'll need to download the ICU4J library from http://site.icu-project.org/download, which has it as com.ibm.icu.text.BreakIterator.
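As a rough sketch of what that looks like with ICU4J (assuming the icu4j jar is on the classpath; the class name IcuWordSplit and the helper segments are made up for illustration), the iteration loop is the same as in the question and only the import changes:

```java
import com.ibm.icu.text.BreakIterator; // ICU4J class, not java.text.BreakIterator

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class IcuWordSplit {
    // Collects every segment between consecutive boundaries,
    // including whitespace and punctuation segments.
    public static List<String> segments(String text) {
        BreakIterator boundary = BreakIterator.getWordInstance(Locale.SIMPLIFIED_CHINESE);
        boundary.setText(text);

        List<String> result = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String segment : segments("I like to eat apples. 我喜欢吃苹果。")) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

ICU segments Chinese with a dictionary, so the Chinese run should come back as several segments rather than one, though the exact splits may vary with the ICU version and its dictionary data.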

Gretagretal answered 12/6, 2017 at 21:14 Comment(5)
I wonder how it worked for the person who answered https://mcmap.net/q/830714/-how-does-breakiterator-work-in-android. I have also seen other sites that claim that BreakIterator works with Chinese text. - Behavior
@Behavior It appears that answer’s code is running on Android, which has a different BreakIterator implementation. - Zircon
Updated my answer; there's an alternate implementation that works. - Gretagretal
@srgsanky, although I was programming for Android, I was specifically using the Java BreakIterator and not the ICU version, because the ICU version was not supported until Android API version 24 and I needed to support earlier devices. It is possible to include the ICU jar, but it is rather large, so I decided against it. (But see this.) This may be a viable option for you, though. - Meg
There is probably something about the Android version of the Java BreakIterator, as opposed to the ICU BreakIterator, that made it work for me. - Meg
