Combining text stemming and removal of punctuation in NLTK and scikit-learn

I am using a combination of NLTK and scikit-learn's CountVectorizer for stemming words and tokenization.

Below is an example of the plain usage of the CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)

sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])

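# get_feature_names() was removed in scikit-learn 1.2; .tolist() keeps the printed format
print('Vocabulary: %s' %vec.get_feature_names_out().tolist())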
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

Which will print

Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]

Now, let's say I want to remove stop words and stem the words. One option would be to do it like so:

import nltk  # tokenize() below calls nltk.word_tokenize
from nltk.stem.porter import PorterStemmer

#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
######## 

vect = CountVectorizer(tokenizer=tokenize, stop_words='english') 

vect.fit(vocab)

sentence1 = vect.transform(['The swimmer likes swimming.'])
sentence2 = vect.transform(['The swimmer swims.'])

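# get_feature_names() was removed in scikit-learn 1.2; .tolist() keeps the printed format
print('Vocabulary: %s' %vect.get_feature_names_out().tolist())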
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

Which prints:

Vocabulary: ['.', 'like', 'swim', 'swimmer']
Sentence 1: [[1 1 1 1]]
Sentence 2: [[1 0 1 1]]

But how would I best get rid of the punctuation characters in this second version?

Quotable answered 30/9, 2014 at 17:14

There are several options. One is to remove the punctuation before tokenization, but this means that "don't" becomes "dont":

import string

def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
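
For longer texts, the character-by-character join can be swapped for str.translate, which deletes all punctuation in one call. This is only a minimal sketch using the standard library; the helper name strip_punctuation is made up for illustration and is not part of NLTK or scikit-learn:

```python
import string

# Translation table that maps every ASCII punctuation character to None (deletion).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)

print(strip_punctuation("The swimmer likes swimming, so he swims."))
# The swimmer likes swimming so he swims
print(strip_punctuation("don't"))
# dont
```

Like the join version, this only knows about the ASCII characters in string.punctuation, so typographic quotes and similar symbols would pass through unchanged.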

Or try removing the punctuation after tokenization:

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    tokens = [i for i in tokens if i not in string.punctuation]
    stems = stem_tokens(tokens, stemmer)
    return stems
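
One caveat with this version: `i not in string.punctuation` is a substring test against the punctuation string, so a multi-character token such as '...' is not dropped. A possible stricter filter is sketched below; the function name drop_punctuation_tokens is hypothetical, and the token list is hand-written to stand in for tokenizer output rather than produced by nltk.word_tokenize:

```python
import string

PUNCT = set(string.punctuation)

def drop_punctuation_tokens(tokens):
    """Keep only tokens that contain at least one non-punctuation character."""
    return [tok for tok in tokens if not all(ch in PUNCT for ch in tok)]

# Hand-written token list, standing in for tokenizer output.
tokens = ["Wait", "...", "he", "does", "n't", "swim", "."]
print(drop_punctuation_tokens(tokens))
# ['Wait', 'he', 'does', "n't", 'swim']
```

Tokens that mix letters and punctuation, such as "n't", are kept, which is usually what you want before stemming.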

EDITED

The code above works, but it is rather slow because it loops over the same text multiple times:

  • Once to remove punctuation
  • Second time to tokenize
  • Third time to stem.

If you have more steps, such as removing digits, removing stopwords, or lowercasing, it is better to lump the steps together as much as possible. Several other answers are more efficient if your data requires more pre-processing steps.
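
As a rough illustration of lumping steps together, the sketch below lowercases the text once and then lets a single regex scan do the tokenizing while skipping punctuation and digits; stopword filtering and stemming happen in the same loop. Everything here is an assumption made for the example: the tiny STOPWORDS set, the WORD_RE pattern, and the stem parameter (an identity function by default so the snippet runs without NLTK).

```python
import re

# Hypothetical, deliberately tiny stopword list; a real one would be much larger.
STOPWORDS = frozenset({"the", "so", "he"})

# Matches runs of letters, optionally with one internal apostrophe (e.g. "don't").
# Digits and punctuation never match, so they are skipped without a separate pass.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def tokenize(text, stem=lambda word: word):
    """Lowercase once, then tokenize, drop stopwords and stem in one loop."""
    return [
        stem(match.group())
        for match in WORD_RE.finditer(text.lower())
        if match.group() not in STOPWORDS
    ]

print(tokenize("The swimmer likes swimming so he swims."))
# ['swimmer', 'likes', 'swimming', 'swims']
```

To get stemming, something like functools.partial(tokenize, stem=stemmer.stem) with the PorterStemmer from the question could be passed as the tokenizer argument of CountVectorizer. I have not benchmarked this against the versions above.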

Stele answered 1/10, 2014 at 1:1 Comment(3)
Simple yet effective. Thanks! - Quotable
Note that the second won't catch ... or other multi-char punctuation symbols. - Katz
@FredFoo and others: How do you rate GENSIM and Scikit for the extracted keywords rather than the plain documents? Can you answer me? https://mcmap.net/q/583149/-rake-with-gensim - Indifferentism
