Why does the SQLite FTS5 unicode61 tokenizer not support CJK (Chinese, Japanese, Korean)?

I had thought the unicode61 tokenizer could support CJK (Chinese, Japanese, Korean). I verified that my SQLite build supports FTS5:

sqlite> pragma compile_options;
BUG_COMPATIBLE_20160819
COMPILER=clang-9.0.0
DEFAULT_CACHE_SIZE=2000
DEFAULT_CKPTFULLFSYNC
DEFAULT_JOURNAL_SIZE_LIMIT=32768
DEFAULT_PAGE_SIZE=4096
DEFAULT_SYNCHRONOUS=2
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
ENABLE_COLUMN_METADATA
ENABLE_DBSTAT_VTAB
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5

But to my surprise, it can't find any CJK word at all. Why is that?

sqlite> CREATE VIRTUAL TABLE ft5_test USING fts5(content, tokenize = 'porter unicode61 remove_diacritics 1');
sqlite> INSERT INTO ft5_test values('为什么不支持中文 fts5 does not seem to work for chinese');
sqlite> select * from ft5_test where ft5_test = '中文';
sqlite>
sqlite> select * from ft5_test where ft5_test = 'Chinese';
为什么不支持中文 fts5 does not seem to work for chinese
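
One way to check what the tokenizer does with the Chinese text is a prefix query. The sketch below uses Python's built-in sqlite3 module (assuming the Python build links an SQLite with FTS5 enabled) and a hypothetical helper, fts5_hits, that recreates the table above. It rests on the assumption that unicode61 treats CJK ideographs as token characters and does no word segmentation, so the whole run 为什么不支持中文 is indexed as a single token. If that holds, 中文 alone should find nothing, while the prefix 为什么* and the full run should both find the row.

```python
import sqlite3

ROW = "为什么不支持中文 fts5 does not seem to work for chinese"


def fts5_hits(query, tokenize="porter unicode61 remove_diacritics 1"):
    """Hypothetical helper: rebuild the one-row table and run one FTS5 query."""
    con = sqlite3.connect(":memory:")
    try:
        # Same table definition as in the question.
        con.execute(
            f"CREATE VIRTUAL TABLE ft5_test USING fts5(content, tokenize = '{tokenize}')"
        )
        con.execute("INSERT INTO ft5_test VALUES (?)", (ROW,))
        cur = con.execute(
            "SELECT content FROM ft5_test WHERE ft5_test MATCH ?", (query,)
        )
        return [r[0] for r in cur]
    finally:
        con.close()


if __name__ == "__main__":
    # Substring of the ideograph run, prefix of the run, the full run, an English word.
    for q in ("中文", "为什么*", "为什么不支持中文", "chinese"):
        print(q, "->", len(fts5_hits(q)), "row(s)")
```

If the prefix query matches while the bare 中文 query does not, the characters are being indexed, just not split into words, which would point at missing word segmentation rather than missing CJK support as such.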

------------- update ----------

I spent quite some time building an ICU version and shared my experience here: https://mcmap.net/q/1481673/-building-sqlite-icu-with-xcode

From what I have learned, using an ICU build is the only way to support CJK, and FTS5 does not yet support the ICU tokenizer.

I'll leave my question here in case others have new ideas about the problem.

Hedvig answered 20/9, 2018 at 9:59 Comment(2)
You'll have better luck asking on the SQLite mailing list, where the people who wrote the thing hang out, but if I'm reading the Tcl script that generates the lookup table used by the unicode61 parser right, it only uses Lu and Ll category codepoints, and I think a lot of your text is Lo. (Laevorotation)
Thanks, I asked it here: mail-archive.com/[email protected]/… (Hedvig)
