Python: Tokenizing with phrases

I have blocks of text I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want to be tokenized as a single token, instead of the regular tokenization.

For example, given the sentence "The West Wing is an American television serial drama created by Aaron Sorkin that was originally broadcast on NBC from September 22, 1999 to May 14, 2006" and adding the phrase "the west wing" to the tokenizer, the resulting tokens would be:

the west wing
is
an
american
...

What's the best way to accomplish this? I'd prefer to stay within the bounds of tools like NLTK.

Tugboat answered 3/4, 2011 at 20:42 Comment(0)

If you have a fixed set of phrases that you're looking for, the simple solution is to tokenize your input and then "reassemble" the multi-word tokens. Alternatively, run a regular-expression search and replace before tokenizing that turns "The West Wing" into "The_West_Wing".
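As a rough illustration of the search-and-replace idea, here is a minimal sketch using only the standard library. The phrase list, the function name tokenize_with_phrases, and the simple \w+ tokenizer are all assumptions made for the example; you would swap in your own phrases and your usual tokenizer.

```python
import re

# Hypothetical phrase list; replace with your own fixed set of phrases.
PHRASES = ["the west wing", "aaron sorkin"]


def tokenize_with_phrases(text, phrases=PHRASES):
    # Try longer phrases first so they win over shorter ones they contain.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    # Glue the words of each matched phrase together with underscores.
    joined = pattern.sub(lambda m: m.group(0).replace(" ", "_"), text)
    # Stand-in tokenizer: runs of word characters (underscore counts as one).
    tokens = re.findall(r"\w+", joined.lower())
    # Restore spaces inside phrases. Caveat: this also rewrites any
    # underscores that were already present in the input text.
    return [t.replace("_", " ") for t in tokens]


tokens = tokenize_with_phrases(
    "The West Wing is an American television serial drama created by Aaron Sorkin."
)
print(tokens)
```

With the phrases above, "the west wing" and "aaron sorkin" each come back as a single token, and everything else is split word by word.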

For more advanced options, use regexp_tokenize or see chapter 7 of the NLTK book.

Micrometry answered 3/4, 2011 at 21:4 Comment(1)

The links for regexp_tokenize and chapter 7 of the NLTK book require a login and password. – Schreiber 10/1, 2015 at 3:19

You can use NLTK's multi-word expression tokenizer, MWETokenizer:

from nltk.tokenize import MWETokenizer
tokenizer = MWETokenizer()
tokenizer.add_mwe(('the', 'west', 'wing'))
tokenizer.tokenize('Something about the west wing'.split())

You will get:

['Something', 'about', 'the_west_wing']
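To get the exact output asked for in the question (lowercase tokens, with the phrase joined by a space rather than an underscore), something like the following should work. It relies on the separator argument of MWETokenizer and on the fact that matching is exact, so the text is lowercased first. The use of split() is only a stand-in for whichever word tokenizer you prefer.

```python
from nltk.tokenize import MWETokenizer

# separator=" " joins matched phrases with a space instead of the default "_".
tokenizer = MWETokenizer([("the", "west", "wing")], separator=" ")

sentence = "The West Wing is an American television serial drama"

# Phrase matching is case-sensitive, so lowercase before splitting.
# split() is a placeholder for your usual word tokenizer.
tokens = tokenizer.tokenize(sentence.lower().split())
print(tokens)
# ['the west wing', 'is', 'an', 'american', 'television', 'serial', 'drama']
```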

Halfmast answered 1/12, 2016 at 16:4 Comment(0)

If you don't know the particular phrases in advance, you could use scikit-learn's CountVectorizer class. It lets you specify larger n-gram ranges (ngram_range) and then ignore any terms that do not appear in enough documents (min_df). You might identify a few phrases that you had not realized were common, but you might also find some that are meaningless. It can also filter out English stop words (very common words like 'is') using the stop_words parameter.
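A small sketch of that idea, assuming scikit-learn is installed. The three documents are made up for illustration, and stop_words is left out here so the surviving n-grams are easy to follow by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example documents; use your own corpus here.
docs = [
    "The West Wing was created by Aaron Sorkin",
    "Aaron Sorkin wrote most of The West Wing",
    "NBC broadcast the drama for seven seasons",
]

# Keep only 2- and 3-word n-grams that occur in at least two documents.
vectorizer = CountVectorizer(ngram_range=(2, 3), min_df=2)
vectorizer.fit(docs)

# vocabulary_ maps each surviving n-gram to its column index.
candidates = sorted(vectorizer.vocabulary_)
print(candidates)
# ['aaron sorkin', 'the west', 'the west wing', 'west wing']
```

The result shows both sides of the trade-off: "the west wing" and "aaron sorkin" are useful phrases, while "the west" is a fragment you would probably discard. The surviving candidates could then be fed to MWETokenizer as tuples of words.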

Hoey answered 1/12, 2016 at 16:11 Comment(0)
