I have blocks of text I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want to be tokenized as a single token, instead of the regular tokenization.
For example, given the sentence "The West Wing is an American television serial drama created by Aaron Sorkin that was originally broadcast on NBC from September 22, 1999 to May 14, 2006" and the phrase "the west wing" added to the tokenizer, the resulting tokens would be:
- the west wing
- is
- an
- american
- ...
What's the best way to accomplish this? I'd prefer to stay within the bounds of tools like NLTK.
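One way this could be approached is with a single regular expression that tries the registered phrases first and falls back to ordinary word tokens. The sketch below is only an illustration using the standard library: the function name `tokenize_with_phrases` and its parameters are made up for this example, and it assumes lowercased output and word-character tokens are acceptable (punctuation is dropped). Within NLTK itself, `nltk.tokenize.MWETokenizer` offers a similar merge step on an already tokenized list, which may be a better fit if you want to stay inside the library.

```python
import re

def tokenize_with_phrases(text, phrases):
    """Lowercase `text` and split it into word tokens, keeping every
    phrase in `phrases` together as a single token (hypothetical helper)."""
    # Try longer phrases first so a long phrase wins over a shorter prefix.
    ordered = sorted(
        (p.lower() for p in phrases if p.split()), key=len, reverse=True
    )
    # Each phrase becomes: word, flexible whitespace, word, ... with word
    # boundaries on both ends so "the west winger" is not matched.
    phrase_patterns = [
        r"\b" + r"\s+".join(re.escape(w) for w in p.split()) + r"\b"
        for p in ordered
    ]
    # Fallback alternative: any run of word characters.
    pattern = re.compile("|".join(phrase_patterns + [r"\w+"]))
    # Collapse internal whitespace so a matched phrase has single spaces.
    return [" ".join(m.group(0).split()) for m in pattern.finditer(text.lower())]

tokens = tokenize_with_phrases(
    "The West Wing is an American television serial drama",
    ["the west wing"],
)
print(tokens)
# ['the west wing', 'is', 'an', 'american', 'television', 'serial', 'drama']
```

Putting the phrases ahead of `\w+` in the alternation is what makes them take priority: at each position the regex engine tries the alternatives left to right, so a phrase is consumed whole before the single-word fallback gets a chance.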
regex_tokenize and chapter 7 of the NLTK book
links require a login and password. – Schreiber