Tokenizing texts in both Chinese and English improperly splits English words into letters

When tokenizing texts that contain both Chinese and English, the result will split English words into letters, which is not what I want. Consider the following code:

from nltk.tokenize.stanford_segmenter import StanfordSegmenter

# Assumes the Stanford Word Segmenter jars and models are installed where NLTK expects them
segmenter = StanfordSegmenter()
segmenter.default_config('zh')
print(segmenter.segment('哈佛大学的Melissa Dell'))

The output is 哈佛大学 的 M e l i s s a D e l l: the Chinese is segmented correctly, but the English name is broken into single letters. How do I change this behavior so English words stay intact?

Knit answered 29/8/2017 at 13:59

You could try jieba.

import jieba
jieba.lcut('哈佛大学的Melissa Dell')
# ['哈佛大学', '的', 'Melissa', ' ', 'Dell']
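
Note that jieba keeps the space between the two English words as a token of its own. If that is unwanted, a small filter takes care of it (a minimal sketch; jieba.lcut simply returns a list of strings):

import jieba

# Drop tokens that are only whitespace, keeping Chinese words and English words intact
tokens = [t for t in jieba.lcut('哈佛大学的Melissa Dell') if t.strip()]
print(tokens)  # ['哈佛大学', '的', 'Melissa', 'Dell']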
Scatter answered 7/2/2020 at 11:18

I can't speak for nltk, but Stanford CoreNLP itself does not exhibit this behavior when run on this sentence.

If you run this command on your example, you get proper tokenization:

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file example.txt -outputFormat text

If you want to access Stanford CoreNLP from Python, you might want to look into stanza (see the sketch below).

More info here: https://github.com/stanfordnlp/stanza
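
As a rough sketch of what that looks like (assuming the Chinese models have already been downloaded via stanza.download('zh'), and that the neural tokenizer handles this mixed sentence the same way CoreNLP does):

import stanza

# One-time download of the Chinese models (assumption: network access available)
stanza.download('zh')

# Build a pipeline that only runs tokenization / word segmentation
nlp = stanza.Pipeline('zh', processors='tokenize')
doc = nlp('哈佛大学的Melissa Dell')

# Collect the surface tokens from every sentence
print([token.text for sentence in doc.sentences for token in sentence.tokens])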

Murther answered 31/8/2017 at 0:32
