How to define special "untokenizable" words for nltk.word_tokenize
I'm using nltk.word_tokenize to tokenize sentences that mention programming languages, frameworks, etc., and those names get tokenized incorrectly.

For example:

>>> tokenize.word_tokenize("I work with C#.")
['I', 'work', 'with', 'C', '#', '.']

Is there a way to give the tokenizer a list of "exceptions" like this? I have already compiled a list of all the things (languages, etc.) that I don't want split.

Juju answered 10/8, 2017 at 16:3 Comment(2)
Possible duplicate of Modify python nltk.word_tokenize to exclude "#" as delimiter – Matron
The difference from that question is that the OP is not asking to keep the # in all cases, but only in C# and possibly in hundreds of other specific terms such as F# and similar technical names. – Plait
The Multi-Word Expression Tokenizer (nltk.tokenize.MWETokenizer) should be what you need.

You register each exception as a tuple of its sub-tokens, then pass an already-tokenized sentence through it:

>>> tokenizer = nltk.tokenize.MWETokenizer()
>>> tokenizer.add_mwe(('C', '#'))
>>> tokenizer.add_mwe(('F', '#'))
>>> tokenizer.tokenize(['I', 'work', 'with', 'C', '#', '.'])
['I', 'work', 'with', 'C_#', '.']
>>> tokenizer.tokenize(['I', 'work', 'with', 'F', '#', '.'])
['I', 'work', 'with', 'F_#', '.']
Mardellmarden answered 10/7, 2018 at 12:4 Comment(0)
