When I tokenize text that contains both Chinese and English with NLTK's StanfordSegmenter, the English words get split into individual letters, which is not what I want. Consider the following code:
from nltk.tokenize.stanford_segmenter import StanfordSegmenter
segmenter = StanfordSegmenter()
segmenter.default_config('zh')
print(segmenter.segment('哈佛大学的Melissa Dell'))
The output is
哈佛大学 的 M e l i s s a D e l l
whereas I would expect something like 哈佛大学 的 Melissa Dell, with the English words kept intact. How do I change this behavior?
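The only workaround I can think of is to keep the Latin-script spans away from the segmenter and feed it just the Chinese runs, roughly like the sketch below (the regex and the helper name segment_mixed are only for illustration, and it assumes the same default_config setup as above). This loses the sentence context around the script boundary and calls the segmenter once per chunk, so I would much prefer a proper option on the segmenter itself.

import re
from nltk.tokenize.stanford_segmenter import StanfordSegmenter

segmenter = StanfordSegmenter()
segmenter.default_config('zh')

def segment_mixed(text):
    # Split the input into Latin-script runs and everything else;
    # the capturing group keeps the Latin runs in the result list.
    parts = re.split(r'([A-Za-z][A-Za-z ]*[A-Za-z]|[A-Za-z])', text)
    tokens = []
    for part in parts:
        if not part.strip():
            continue
        if re.match(r'[A-Za-z]', part):
            # English/Latin runs: split on whitespace only, keep the words whole
            tokens.extend(part.split())
        else:
            # Chinese runs: send only these through the Stanford segmenter
            tokens.extend(segmenter.segment(part).split())
    return ' '.join(tokens)

print(segment_mixed('哈佛大学的Melissa Dell'))
# hoped-for output: 哈佛大学 的 Melissa Dell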