Stemming unstructured text in NLTK

I tried the regex stemmer, but I get hundreds of unrelated tokens. I'm just interested in the "play" stem. Here is the code I'm working with:

import nltk
from nltk import stem
from nltk.corpus import stopwords

with open('tupac_original.txt') as f:
    text = f.read()
text1 = text.split()
tup = nltk.Text(text1)
lowtup = [w.lower() for w in tup if w.isalpha()]
stop = set(stopwords.words('english'))  # build the stopword set once
tupclean = [w for w in lowtup if w not in stop]
tupstem = stem.RegexpStemmer('az$|as$|a$')
[tupstem.stem(i) for i in tupclean]

The result of the above is:

['like', 'ed', 'young', 'black', 'like'...]

I'm trying to clean up .txt files (all lowercase, remove stopwords, etc), normalize multiple spellings of a word into one and do a frequency dist/count. I know how to do FreqDist, but any suggestions as to where I'm going wrong with the stemming?

Partitive answered 26/9, 2013 at 18:49 Comment(2)
Isn't stemming the normalization you are looking for? You say you are having trouble. What have you tried? - Yarmouth
What is your expected output? Depending on your task, you might need a lemmatizer instead of a stemmer; see https://mcmap.net/q/189870/-stemmers-vs-lemmatizers - Drafty

NLTK ships several well-known stemmers (see http://nltk.org/api/nltk.stem.html). Here is an example:

>>> from nltk import stem
>>> porter = stem.porter.PorterStemmer()
>>> lancaster = stem.lancaster.LancasterStemmer()
>>> snowball = stem.snowball.EnglishStemmer()
>>> tokens = ['player', 'playa', 'playas', 'pleyaz']
>>> [porter.stem(i) for i in tokens]
['player', 'playa', 'playa', 'pleyaz']
>>> [lancaster.stem(i) for i in tokens]
['play', 'play', 'playa', 'pleyaz']
>>> [snowball.stem(i) for i in tokens]
[u'player', u'playa', u'playa', u'pleyaz']

But what you probably need is a regex stemmer:

>>> from nltk import stem
>>> rxstem = stem.RegexpStemmer('er$|a$|as$|az$')
>>> [rxstem.stem(i) for i in tokens]
['play', 'play', 'play', 'pley']
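
Note that a stemmer returns one stem per input token, so the stemmed list still contains every other word in the text. If only the "play" variants matter, one option is to stem each token and keep the ones whose stem equals the target. The sketch below is a minimal illustration and uses only the standard library so it runs without NLTK: `strip_suffix` is a hypothetical helper that applies the same suffix pattern with `re.sub`, standing in for `RegexpStemmer.stem`, and the token list is made up for the example.

```python
import re
from collections import Counter

# Same suffix pattern as the RegexpStemmer example above.
SUFFIXES = re.compile(r'er$|a$|as$|az$')

def strip_suffix(word, min_len=0):
    """Hypothetical stand-in for RegexpStemmer.stem: drop a matching suffix."""
    if len(word) < min_len:
        return word
    return SUFFIXES.sub('', word)

# Made-up tokens, assumed to be lowercased and stopword-filtered already.
tokens = ['like', 'player', 'young', 'playa', 'black', 'playas', 'play', 'playaz']

# Keep only the tokens whose stem equals the target stem.
target = 'play'
variants = [w for w in tokens if strip_suffix(w) == target]

# Count the surface forms that normalize to the target stem.
variant_counts = Counter(variants)
total = sum(variant_counts.values())

print(variants)
print(total)
```

With NLTK available, `strip_suffix(w)` could be swapped for `rxstem.stem(w)`, and `Counter` for `nltk.FreqDist`, without changing the filtering logic.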
Drafty answered 27/9, 2013 at 7:23 Comment(6)
I edited my question. I tried your RegexpStemmer and got multiple tokens back. Not sure where I'm going wrong. - Partitive
Change your last line to [tupstem.stem(i) for i in tupclean if "pl" in i and "y" in tupstem.stem(i)]. Vowels shift between spelling variants, but assuming the diphthong and the onset are preserved, the consonant cluster "pl" will still be present in the orthography. - Drafty
Tried this, but it didn't really do what I was hoping it would do. Thanks anyway! - Partitive
I have nltk installed and can use it in other cases, but I'm getting module import errors on all the above:
>>> from nltk import stem
>>> snowball = stem.snowball.EnglishStemmer()
>>> [snowball(i) for i in ['Playing', "swimming", "dancing"]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'EnglishStemmer' object is not callable
- Monodrama
Have you downloaded all the packages by running >>> import nltk and then >>> nltk.download()? - Drafty
Nice choice of examples that show interesting corner cases for the NLTK stemmers. - Fabre
