Using boost::locale/ICU boundary analysis with Chinese
Using the sample code from the boost::locale documentation, I can't get the following to correctly tokenize Chinese text:

#include <boost/locale.hpp>
#include <iostream>
#include <string>

using namespace boost::locale::boundary;
boost::locale::generator gen;
std::string text = "中華人民共和國";
// Index word boundaries using a Chinese locale
ssegment_index map(word, text.begin(), text.end(), gen("zh_CN.UTF-8"));
for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it)
    std::cout << "\"" << *it << "\", ";
std::cout << std::endl;

This splits 中華人民共和國 into seven separate characters, 中/華/人/民/共/和/國, rather than into 中華/人民/共和國 as expected. The documentation for ICU, which Boost is compiled against, claims that Chinese should work out of the box, using a dictionary-based tokenizer to split phrases correctly. The example Japanese test phrase "生きるか死ぬか、それが問題だ。" does tokenize correctly in the code above with the "ja_JP.UTF-8" locale, but that tokenization does not depend on a dictionary, only on kanji/kana boundaries.

I've also tried the equivalent code directly against ICU, as suggested here, but the results are the same.

#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <cstdio>

// Decode the UTF-8 literal explicitly; the plain char* constructor
// would go through the default codepage converter instead.
UnicodeString text = UnicodeString::fromUTF8("中華人民共和國");
UErrorCode status = U_ZERO_ERROR;
BreakIterator* bi = BreakIterator::createWordInstance(Locale::getChinese(), status);
bi->setText(text);
int32_t p = bi->first();
while (p != BreakIterator::DONE) {
    printf("Boundary at position %d\n", p);
    p = bi->next();
}
delete bi;

Any idea what I'm doing wrong?

Restharrow answered 13/3, 2015 at 17:25

You are most likely using an ICU version older than ICU 50, which was the first release to support dictionary-based Chinese word segmentation.
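
For reference, here is a minimal sketch for checking which ICU version you are compiling and linking against; U_ICU_VERSION, u_getVersion(), and u_versionToString() are part of ICU's public uversion.h API:

#include <unicode/uversion.h>
#include <cstdio>

int main() {
    // Compile-time version string of the ICU headers
    printf("Compiled against ICU %s\n", U_ICU_VERSION);

    // Runtime version of the linked ICU library
    UVersionInfo v;
    char buf[U_MAX_VERSION_STRING_LENGTH];
    u_getVersion(v);
    u_versionToString(v, buf);
    printf("Running against ICU %s\n", buf);
    return 0;
}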

Also, note that Boost.Locale uses ICU as its default localization backend, which is why the two code samples mirror each other's results.
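
If you want to verify which backends your Boost.Locale build was compiled with, a small sketch using boost::locale::localization_backend_manager (a public Boost.Locale API) can list them:

#include <boost/locale/localization_backend.hpp>
#include <iostream>
#include <string>

int main() {
    // List the backends available in this Boost.Locale build;
    // "icu" should appear if Boost was built with ICU support.
    boost::locale::localization_backend_manager mgr =
        boost::locale::localization_backend_manager::global();
    for (const std::string& name : mgr.get_all_backends())
        std::cout << name << "\n";
    return 0;
}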

Sliest answered 27/1, 2017 at 15:18
