Tokenizing texts in both Chinese and English improperly splits English words into letters

When tokenizing texts that contain both Chinese and English, the result will split English words into letters, which is not what I want. Consider the following code:

from nltk.tokenize.stanford_segmenter import StanfordSegmenter

# Assumes the Stanford Word Segmenter jars and models are installed where NLTK expects them
segmenter = StanfordSegmenter()
segmenter.default_config('zh')
print(segmenter.segment('哈佛大学的Melissa Dell'))

The output is 哈佛大学 的 M e l i s s a D e l l: the Chinese is segmented correctly, but the English name is broken into single letters. How do I change this behavior so English words stay intact?

Knit answered 29/8/2017 at 13:59

You could try jieba.

import jieba
jieba.lcut('哈佛大学的Melissa Dell')
# ['哈佛大学', '的', 'Melissa', ' ', 'Dell']
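
Note that jieba keeps the space between the two English words as a token of its own. If that is unwanted, a small filter takes care of it (a minimal sketch; jieba.lcut simply returns a list of strings):

import jieba

# Drop tokens that are only whitespace, keeping Chinese words and English words intact
tokens = [t for t in jieba.lcut('哈佛大学的Melissa Dell') if t.strip()]
print(tokens)  # ['哈佛大学', '的', 'Melissa', 'Dell']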
Scatter answered 7/2/2020 at 11:18

I can't speak for nltk, but Stanford CoreNLP itself does not exhibit this behavior when run on this sentence.

If you run this command on your example, you get proper tokenization:

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file example.txt -outputFormat text

If you want to access Stanford CoreNLP from Python, you might want to look into stanza (see the sketch below).

More info here: https://github.com/stanfordnlp/stanza
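
As a rough sketch of what that looks like (assuming the Chinese models have already been downloaded via stanza.download('zh'), and that the neural tokenizer handles this mixed sentence the same way CoreNLP does):

import stanza

# One-time download of the Chinese models (assumption: network access available)
stanza.download('zh')

# Build a pipeline that only runs tokenization / word segmentation
nlp = stanza.Pipeline('zh', processors='tokenize')
doc = nlp('哈佛大学的Melissa Dell')

# Collect the surface tokens from every sentence
print([token.text for sentence in doc.sentences for token in sentence.tokens])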

Murther answered 31/8/2017 at 0:32
