Combining text stemming and removal of punctuation in NLTK and scikit-learn

I am using a combination of NLTK and scikit-learn's CountVectorizer for stemming words and tokenization.

Below is an example of the plain usage of the CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)

sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])

print('Vocabulary: %s' %vec.get_feature_names())
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

Which will print

Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]

Now, let's say I want to remove stop words and stem the words. One option would be to do it like so:

from nltk import word_tokenize          
from nltk.stem.porter import PorterStemmer

#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
######## 

vect = CountVectorizer(tokenizer=tokenize, stop_words='english') 

vect.fit(vocab)

sentence1 = vect.transform(['The swimmer likes swimming.'])
sentence2 = vect.transform(['The swimmer swims.'])

print('Vocabulary: %s' %vect.get_feature_names())
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

Which prints:

Vocabulary: ['.', 'like', 'swim', 'swimmer']
Sentence 1: [[1 1 1 1]]
Sentence 2: [[1 0 1 1]]

But how would I best get rid of the punctuation characters in this second version?

import string def tokenize(text): text = "".join([ch for ch in text if ch not in string.punctuation]) tokens = nltk.word_tokenize(text) stems = stem_tokens(tokens, stemmer) return stems

EDITED

The above code will work but it's rather slow because it's looping through the same text multiple times:

Once to remove punctuation

Second time to tokenize

Third time to stem.

If you have more steps like removing digits or removing stopwords or lowercasing, etc.

It would be better to lump the steps together as much as possible, here's several better answers that is more efficient if your data requires more pre-processing steps:

EDITED

Recommended topics

Hot tags