How do tokenization and pattern matching work in Chinese?

This question involves computing as well as knowledge of Chinese. I have Chinese queries and a separate list of Chinese phrases, and I need to find which of the queries contain any of these phrases.

In English, this is a very simple task. I don't understand Chinese at all (its semantics, grammar rules, etc.), so I'd appreciate it if somebody on this forum who also understands Chinese could give me a basic understanding of the language and of how pattern matching is done for Chinese.

I have a basic perception that in Chinese one unit (without any spaces in between) can actually mean more than one word (is this correct?). So are there any rules for how more than one word combines into a single unit? It is confusing because there are spaces in Chinese writing, yet even a unit without spaces has more than one word in it.

Any links that explain Chinese from a computational point of view (pattern matching, etc.) would be very useful.

Khat answered 2/10, 2011 at 14:21 Comment(6)
I didn't understand: are spaces used only with punctuation? - Khat
One Chinese character is not equivalent to one English word; many words are made up of two characters, like "guo1ji4", "international". In addition, one Chinese character may mean something different depending on the surrounding characters (it is context-dependent). - Carnotite
The comment I replied to is no longer there. - Carnotite
@p2pnode I think what that comment meant to say is that you don't often find spaces in Chinese text except after punctuation. It's unfortunate it was deleted. In any case, I'd probably aim at research papers on the topic, because it's... complicated, but a native Chinese speaker will have more useful input :) - Carnotite
Take a look at this question: Is there any good open-source or freely available Chinese segmentation algorithm available? - Deutero
Unfortunately, the best way is probably a dictionary, because a phrase like "President Clinton" is "ke lin dun zong tong", where ke, lin, and dun are three characters that can each form other words but in this case together mean "Clinton". Anything else will not be as accurate as may be needed. - Team
A
10

I have a basic perception that in Chinese one unit (without any spaces in between) can actually mean more than one word (is this correct?).

In Chinese, spaces are rarely used, e.g.:

递归(英语:Recursion),又譯為遞迴,在数学与计算机科学中,是指在函数的定义中使用函数自身的方法。递归一词还较常用于描述以自相似方法重复事物的过程。例如,当两面镜子相互之间近似平行时,镜中嵌套的图像是以无限递归的形式出现的。

You'll notice that what appear to be spaces are actually just Chinese (full-width) punctuation characters, which have more padding around them than their Western counterparts.
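To see this for yourself, here is a minimal Python sketch. It assumes a short sample sentence in which the punctuation is written with explicit escapes (`\uff0c` for the full-width comma, `\u3002` for the ideographic full stop) so there is no doubt about which characters are involved. The variable names are made up for illustration.

```python
import re
import unicodedata

# Sample sentence; the punctuation is escaped explicitly (an assumption
# made for this sketch, so the code points are unambiguous).
text = "例如\uff0c当两面镜子相互之间近似平行时\uff0c镜中嵌套的图像是以无限递归的形式出现的\u3002"

# There is no ordinary ASCII space anywhere in the sentence.
has_ascii_space = " " in text

# Collect every character whose Unicode general category is punctuation
# (categories starting with "P"), together with its official name.
punctuation = [
    (ch, unicodedata.name(ch))
    for ch in text
    if unicodedata.category(ch).startswith("P")
]

# Splitting on those two punctuation characters yields clauses, not words.
clauses = [c for c in re.split("[\uff0c\u3002]", text) if c]

print(has_ascii_space)
for ch, name in punctuation:
    print(f"U+{ord(ch):04X} {name}")
print(clauses)
```

Splitting on punctuation only gets you clauses; the words inside each clause still run together with nothing separating them.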

So are there any rules for how more than one word combines into a single unit? It is confusing because there are spaces in Chinese writing, yet even a unit without spaces has more than one word in it.

Think of it this way: one Chinese character is very, very roughly similar to one English word. Often two or more characters need to be combined to form one word, and each separate character may mean something completely different depending on context.
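For the task in the question (finding which queries contain any phrase from a list), a plain substring test works at the character level without any tokenization, but the context dependence described above means it can report false matches. Here is a minimal Python sketch with made-up queries and phrases; `find_matching_queries` is a hypothetical helper name, not part of any library.

```python
def find_matching_queries(queries, phrases):
    """Map each query to the phrases that occur in it as raw substrings."""
    matches = {}
    for query in queries:
        found = [phrase for phrase in phrases if phrase in query]
        if found:
            matches[query] = found
    return matches


# Made-up data for illustration only.
queries = [
    "他是研究生",    # "He is a graduate student"
    "研究生命起源",  # "Research the origin of life"
    "今天天气很好",  # "The weather is nice today"
]
phrases = ["研究生", "天气"]  # "graduate student", "weather"

result = find_matching_queries(queries, phrases)
for query, found in result.items():
    print(query, found)
```

The second query matches 研究生 ("graduate student") even though the intended words are 研究 ("research") and 生命 ("life"). Whether such false positives matter depends on your phrases; if they do, you need word segmentation before matching.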

To meaningfully tokenize Chinese text, you'd have to segment it into words, taking that into consideration.
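One simple way to segment is the dictionary approach suggested in the comments on the question. The sketch below uses greedy forward maximum matching, which is only one of several possible strategies and not necessarily what production segmenters do. The tiny dictionary and the function name `forward_max_match` are assumptions made up for this example.

```python
def forward_max_match(text, dictionary):
    """Greedy segmentation: at each position take the longest dictionary
    word that matches; fall back to a single character if none does."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary for illustration; a real one has many thousands of entries.
dictionary = {"克林顿", "总统", "研究", "研究生", "生命", "起源"}

print(forward_max_match("克林顿总统", dictionary))    # President Clinton
print(forward_max_match("研究生命起源", dictionary))  # greedy choice goes wrong
```

The first call gives 克林顿 / 总统 ("Clinton" / "president"). The second shows the weakness of greedy matching: it grabs 研究生 first and leaves a stray 命, where the intended reading is 研究 / 生命 / 起源. This kind of ambiguity is why segmentation is usually done with statistical models and not with a dictionary alone.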

See Chinese Natural Language Processing and Speech Processing, from the Stanford NLP group.

Amphipod answered 2/10, 2011 at 14:40 Comment(7)
Perhaps from before you edited to include the stuff I said in my comments; it was more wrong before the edits. - Carnotite
Also, if you happen to know: what is the basic grammar rule like? Subject-verb-object? - Khat
@DaveNewton Does that imply it's still wrong now? If so, what is wrong with it? - Amphipod
@p2pnode rci.rutgers.edu/~rsimmon/chingram For non-programming-related questions your best bet is the web, not SO. - Carnotite
@NullUserExceptionఠ_ఠ Seems reasonable now, IMO, after adding what was already said. - Carnotite
@p2pnode For additional resources, you can Google "Chinese NLP". - Amphipod
@DaveNewton Hmmm, good. This is particularly embarrassing because I am a native Chinese speaker, but my Chinese skills have been steadily deteriorating over the years. My English, Portuguese, and Spanish are probably better than my Chinese at this point. - Amphipod
S
1

Ken Lunde's book CJKV Information Processing is probably worth a look. The basic word order is subject-verb-object, but see also "Topic prominence" in http://en.wikipedia.org/wiki/Chinese_grammar

Shogun answered 2/10, 2011 at 15:41 Comment(0)
