Using the sample code from the boost::locale documentation, I can't get the following to correctly tokenize Chinese text:
#include <boost/locale.hpp>
#include <iostream>
#include <string>

int main() {
    using namespace boost::locale::boundary;
    boost::locale::generator gen;
    std::string text = "中華人民共和國";
    // Index the UTF-8 string by word boundaries under a Chinese locale
    ssegment_index map(word, text.begin(), text.end(), gen("zh_CN.UTF-8"));
    for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it)
        std::cout << "\"" << *it << "\", ";
    std::cout << std::endl;
}
This splits 中華人民共和國 into seven single-character segments, 中/華/人/民/共/和/國, rather than 中華/人民/共和國 as expected. The ICU documentation (ICU being the library Boost is compiled against) claims that Chinese should work out of the box, using a dictionary-based tokenizer to split phrases correctly. Substituting the Japanese test phrase "生きるか死ぬか、それが問題だ。" and the "ja_JP.UTF-8" locale into the code above does work, but that tokenization does not depend on a dictionary, only on kanji/kana boundaries.
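For reference, the Japanese run only changes two lines of the program above (a sketch of the substitution just described):

std::string text = "生きるか死ぬか、それが問題だ。";
ssegment_index map(word, text.begin(), text.end(), gen("ja_JP.UTF-8"));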
I've also tried the equivalent code directly against ICU, as suggested here, but the results are the same:
#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <cstdio>

using namespace icu;

int main() {
    // Decode the literal explicitly as UTF-8 rather than the platform default codepage
    UnicodeString text = UnicodeString::fromUTF8("中華人民共和國");
    UErrorCode status = U_ZERO_ERROR;
    BreakIterator* bi = BreakIterator::createWordInstance(Locale::getChinese(), status);
    bi->setText(text);
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        printf("Boundary at position %d\n", p);
        p = bi->next();
    }
    delete bi;
}
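In case it helps narrow things down: on ICU versions recent enough that BreakIterator exposes getRuleStatus(), each boundary can be tagged with the kind of rule that produced it, and a status in the UBRK_WORD_IDEO range would confirm the CJK dictionary engine actually fired. A sketch replacing the printing loop above (the fromDictionary flag is my own naming):

#include <unicode/ubrk.h>  // UBRK_WORD_IDEO / UBRK_WORD_IDEO_LIMIT constants

// getRuleStatus() describes the boundary most recently reached by next()
for (int32_t q = bi->next(); q != BreakIterator::DONE; q = bi->next()) {
    int32_t tag = bi->getRuleStatus();
    bool fromDictionary = (tag >= UBRK_WORD_IDEO && tag < UBRK_WORD_IDEO_LIMIT);
    printf("Boundary at %d, rule status %d (%s)\n",
           q, tag, fromDictionary ? "ideographic/dictionary" : "other");
}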
Any idea what I'm doing wrong?