I'm using nltk.word_tokenize to tokenize sentences that contain names of programming languages, frameworks, etc., and these names get tokenized incorrectly.
For example:
>>> from nltk import tokenize
>>> tokenize.word_tokenize("I work with C#.")
['I', 'work', 'with', 'C', '#', '.']
Is there a way to pass a list of "exceptions" like this to the tokenizer? I have already compiled a list of all the things (languages, etc.) that I don't want split.
I don't want to keep # always, but instead just in C# and possibly in hundreds of other particular words such as F# and similar technical names. – Plait
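For reference, here is a rough workaround sketch that re-merges the split pieces after the fact with NLTK's MWETokenizer (the exception list here is just illustrative; I'd prefer something built into the tokenizer itself):

from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

# Illustrative exception list -- in practice this would be the full compiled list
exceptions = ["C#", "F#"]

# Split each exception the same way word_tokenize does, so the pieces can be matched back
mwe = MWETokenizer([tuple(word_tokenize(e)) for e in exceptions], separator="")

tokens = mwe.tokenize(word_tokenize("I work with C#."))
print(tokens)  # ['I', 'work', 'with', 'C#', '.']

This works, but it means every exception has to be re-tokenized and re-merged in a second pass, which is what I'm hoping to avoid.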