How to tokenize a Chinese-language document

I will be receiving documents written in Chinese, which I have to tokenize and store in a database table. I was trying Lucene's CJKBigramFilter, but all it does is join pairs of adjacent characters, and the resulting bigrams mean something different from what is in the document. Suppose a line in the file is "Hello my name is Pradeep", which in Chinese is "你好我的名字是普拉迪普". When I tokenize it, it gets converted to the two-character words below:

你好 - Hello
名字 - Name
好我 - Well I
字是 - Word is
我的 - My
拉迪 - Radi
是普 - Is the S & P
普拉 - Pula
的名 - In the name of
迪普 - Dipp

All I want is for the text to be split into real words matching the English translation. I am using Lucene for this; if you know of another suitable open-source tool, please point me to it. Thanks in advance.
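For reference, this is the sort of setup that produces those overlapping bigrams: CJKAnalyzer chains a StandardTokenizer with CJKBigramFilter. A minimal sketch, assuming a Lucene 4.x-era API; the field name "content" is just a placeholder:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CjkBigramDemo {
    public static void main(String[] args) throws Exception {
        // CJKAnalyzer = StandardTokenizer + CJKBigramFilter, which is what
        // produces the overlapping two-character tokens described above
        Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_47);
        TokenStream ts = analyzer.tokenStream("content", new StringReader("你好我的名字是普拉迪普"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // prints 你好, 好我, 我的, ...
        }
        ts.end();
        ts.close();
    }
}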
This post might be helpful: #7627412 – Gibbons
Well, it is totally different in the sense that Stanford has its own setup for tokenizing Chinese characters, which I cannot use since I am using Lucene. I just wanted to know how, in Lucene, I can tokenize Chinese characters as described in my problem statement above. – Cubit
Though it may be too late, you might try U-Tokenizer, an online API that is available for free. See http://tokenizer.tool.uniwits.com/
Can you please add a bit more to your answer and explain how one could use the site? – Silicon
Please read tokenizer.tool.uniwits.com/qx-cmd-api.html for a guide. If you have detailed questions, I will try to answer specifically. – Natheless
It's dead, gone. – Mccune
Aaand, this is exactly why link-only answers are bad practice. The TLD is just a "build your own homepage" landing page. It would have been nice to know more about the "tokenizer.tool" in case it's worth chasing after elsewhere... – Paternity
If you want a full-blown NLP parser, check out http://nlp.stanford.edu (a usage sketch follows).
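If you do go the Stanford route, invoking its Chinese segmenter looks roughly like the sketch below. This assumes the stanford-segmenter distribution; data/ctb.gz and data/dict-chris6.ser.gz are the model and dictionary files shipped with it, and the paths are placeholders for wherever you unpack them:

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class StanfordSegmenterDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");

        CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
        segmenter.loadClassifierNoExceptions("data/ctb.gz", props);

        // segmentString splits the sentence into dictionary words
        List<String> words = segmenter.segmentString("你好我的名字是普拉迪普");
        System.out.println(words);
    }
}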
If you want a simple, one-off solution for Chinese, here is what I used.
First, load a Chinese dictionary into a trie (prefix tree) to reduce the memory footprint. Then walk through the sentence one character at a time, checking whether each growing substring exists in the dictionary. When matches are found, the longest one is emitted as a token. The algorithm could likely be improved greatly, but it has served me well. :)
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang3.StringUtils; // Apache Commons Lang

public class ChineseWordTokenizer implements WordTokenizer {

    // stop extending a candidate word after this many consecutive dictionary misses
    private static final int MAX_MISSES = 6;

    // example implementation: http://www.kennycason.com/posts/2012-03-20-java-trie-prefix-tree.html
    private StringTrie library;
    private boolean loadTraditional;

    public ChineseWordTokenizer() {
        this(true);
    }

    public ChineseWordTokenizer(boolean loadTraditional) {
        // set the flag before loading, so the traditional dictionary is actually included
        this.loadTraditional = loadTraditional;
        loadLibrary();
    }

    @Override
    public String[] tokenize(String sentence) {
        final List<String> words = new ArrayList<>();
        String word;
        for (int i = 0; i < sentence.length(); i++) {
            int len = 1;
            boolean loop = false;
            int misses = 0;
            int lastCorrectLen = 1;
            boolean somethingFound = false;
            do {
                word = sentence.substring(i, i + len);
                if (library.contains(word)) {
                    // remember the longest dictionary match seen so far
                    somethingFound = true;
                    lastCorrectLen = len;
                    loop = true;
                } else {
                    misses++;
                    loop = misses < MAX_MISSES;
                }
                len++;
                if (i + len > sentence.length()) {
                    loop = false;
                }
            } while (loop);
            if (somethingFound) {
                word = sentence.substring(i, i + lastCorrectLen);
                if (StringUtils.isNotBlank(word)) {
                    words.add(word);
                    i += lastCorrectLen - 1; // skip past the matched word
                }
            }
        }
        return words.toArray(new String[words.size()]);
    }

    private void loadLibrary() {
        library = new StringTrie();
        library.loadFile("classify/nlp/dict/chinese_simple.list");
        if (loadTraditional) {
            library.loadFile("classify/nlp/dict/chinese_traditional.list");
        }
    }
}
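Since StringTrie itself isn't shown above (the linked blog post has a full implementation), here is a minimal sketch of the two operations the tokenizer relies on, assuming loadFile reads a classpath resource with one word per line:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class StringTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if the path from the root spells a complete dictionary word
    }

    private final Node root = new Node();

    public void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    public boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false; // no dictionary entry starts with this prefix
            }
        }
        return node.isWord;
    }

    // assumes the dictionary is a classpath resource with one word per line
    public void loadFile(String resource) {
        InputStream in = getClass().getClassLoader().getResourceAsStream(resource);
        if (in == null) {
            throw new IllegalArgumentException("dictionary not found: " + resource);
        }
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    add(word);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("failed to load dictionary: " + resource, e);
        }
    }
}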
Here is a unit test:
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class TestChineseWordTokenizer {

    @Test
    public void test() {
        long time = System.currentTimeMillis();
        WordTokenizer tokenizer = new ChineseWordTokenizer();
        System.out.println("load time: " + (System.currentTimeMillis() - time) + " ms");

        String[] words = tokenizer.tokenize("弹道导弹");
        print(words);
        assertEquals(1, words.length);

        words = tokenizer.tokenize("美国人的文化.dog");
        print(words);
        assertEquals(3, words.length);

        words = tokenizer.tokenize("我是美国人");
        print(words);
        assertEquals(3, words.length);

        words = tokenizer.tokenize("政府依照法律行使执法权,如果超出法律赋予的权限范围,就是“滥用职权”;如果没有完全行使执法权,就是“不作为”。两者都是政府的错误。");
        print(words);

        words = tokenizer.tokenize("国家都有自己的政府。政府是税收的主体,可以实现福利的合理利用。");
        print(words);
    }

    private void print(String[] words) {
        System.out.print("[ ");
        for (String word : words) {
            System.out.print(word + " ");
        }
        System.out.println("]");
    }
}
And here are the results:
Load Complete: 102135 Entries
load time: 236 ms
[ 弹道导弹 ]
[ 美国人 的 文化 ]
[ 我 是 美国人 ]
[ 政府 依照 法律 行使 执法 权 如果 超出 法律 赋予 的 权限 范围 就是 滥用职权 如果 没有 完全 行使 执法 权 就是 不 作为 两者 都 是 政府 的 错误 ]
[ 国家 都 有 自己 的 政府 政府 是 税收 的 主体 可以 实现 福利 的 合理 利用 ]
Hi Kenny! I'd like to give your solution a try and compare its performance with my current tokenizer. Could you point me to where to find a usable dictionary? I couldn't find one in your GitHub repository or in the Stanford NLP data... Thanks a lot in advance! – Eikon
Try this resource; it's a Chinese tokenizer that uses CC-CEDICT.
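For anyone wiring CC-CEDICT into the trie approach above: each entry line has the form TRADITIONAL SIMPLIFIED [pinyin] /gloss/gloss/, so a loader only needs the first or second column. A minimal sketch; the file path is a placeholder:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CedictLoader {

    // CC-CEDICT format: TRADITIONAL SIMPLIFIED [pinyin] /gloss/gloss/
    public static void load(StringTrie trie, String path, boolean traditional) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("#")) {
                    continue; // skip comment/header lines
                }
                String[] parts = line.split(" ", 3);
                if (parts.length < 2) {
                    continue; // not a well-formed entry
                }
                trie.add(traditional ? parts[0] : parts[1]);
            }
        }
    }
}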