Stemming unstructured text in NLTK

Asked 26/9, 2013 at 18:49 Answered 27/9, 2013 at 7:23

nltk tokenize text-analysis lemmatization

I tried the regex stemmer, but I get hundreds of unrelated tokens. I'm just interested in the "play" stem. Here is the code I'm working with:

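import nltk
from nltk import stem
from nltk.corpus import stopwords
# Read the raw text and split it on whitespace ('rU' mode is no longer accepted by recent Python 3).
with open('tupac_original.txt') as f:
    text = f.read()
tup = nltk.Text(text.split())
# Lowercase and keep purely alphabetic tokens.
lowtup = [w.lower() for w in tup if w.isalpha()]
# Build the stopword set once instead of re-reading the list for every token.
stops = set(stopwords.words('english'))
tupclean = [w for w in lowtup if w not in stops]
# Strip the suffixes 'az', 'as' and 'a' from each remaining token.
tupstem = stem.RegexpStemmer('az$|as$|a$')
[tupstem.stem(i) for i in tupclean]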

The result of the above is:

['like', 'ed', 'young', 'black', 'like'...]

I'm trying to clean up .txt files (all lowercase, remove stopwords, etc), normalize multiple spellings of a word into one and do a frequency dist/count. I know how to do FreqDist, but any suggestions as to where I'm going wrong with the stemming?
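For reference, the cleanup-and-count steps described above can be sketched with the standard library alone, so the snippet runs without the NLTK corpora. This is only a sketch under assumptions: the sample string and the tiny `STOPWORDS` set are made-up stand-ins for `tupac_original.txt` and `nltk.corpus.stopwords.words('english')`, and `collections.Counter` plays the role of `FreqDist`.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of tupac_original.txt.
sample = "The playa and the Player like playas but I like the game"

# Tiny illustrative stopword set; the real list would come from NLTK.
STOPWORDS = {"the", "and", "but", "i"}

def clean(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = [w.lower() for w in text.split() if w.isalpha()]
    return [w for w in tokens if w not in STOPWORDS]

tokens = clean(sample)
counts = Counter(tokens)          # frequency of each cleaned token
print(counts.most_common(3))
```

Note that the spelling variants ("playa", "player", "playas") are still counted separately at this point; collapsing them into one stem is the part the question is about.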

Partitive answered 26/9, 2013 at 18:49 Comment(2)

Isn't stemming the normalization you are looking for? You say you are having trouble.. what have you tried? – Yarmouth 26/9, 2013 at 20:22

What is your expected output? Depending on your task, you might need a lemmatizer instead of a stemmer; see https://mcmap.net/q/189870/-stemmers-vs-lemmatizers – Drafty 27/9, 2013 at 7:24

NLTK ships several well-known stemmers (see http://nltk.org/api/nltk.stem.html); the example below compares three of them.

>>> from nltk import stem
>>> porter = stem.porter.PorterStemmer()
>>> lancaster = stem.lancaster.LancasterStemmer()
>>> snowball = stem.snowball.EnglishStemmer()
>>> tokens =  ['player', 'playa', 'playas', 'pleyaz'] 
>>> [porter.stem(i) for i in tokens]
['player', 'playa', 'playa', 'pleyaz']
>>> [lancaster.stem(i) for i in tokens]
['play', 'play', 'playa', 'pleyaz']
>>> [snowball.stem(i) for i in tokens]
[u'player', u'playa', u'playa', u'pleyaz']

But what you probably need is some sort of regex stemmer:

>>> from nltk import stem
>>> rxstem = stem.RegexpStemmer('er$|a$|as$|az$')
>>> [rxstem.stem(i) for i in tokens]
['play', 'play', 'play', 'pley']
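A stemmer returns one stem for every token it is given, so running it over a whole file yields every word in that file, not only the "play" variants. Below is a rough sketch of one way to narrow the result down: strip the same suffixes and keep only the tokens whose stem equals the target. It uses `re.sub` as a stand-in for `RegexpStemmer` (an assumption, so that the snippet runs without NLTK), and the helper names and sample tokens are hypothetical.

```python
import re
from collections import Counter

# Same suffix pattern as the RegexpStemmer example above.
SUFFIXES = re.compile(r"er$|as$|az$|a$")

def strip_suffix(word):
    """Delete a trailing suffix matched by SUFFIXES, if any."""
    return SUFFIXES.sub("", word)

def variants_of(stem, tokens):
    """Count the tokens whose stripped form equals `stem`."""
    return Counter(w for w in tokens if strip_suffix(w) == stem)

# Hypothetical cleaned tokens, as produced by the question's preprocessing.
tupclean = ["like", "playa", "young", "player", "playas", "playa", "pleyaz"]

play_variants = variants_of("play", tupclean)
print(play_variants)                 # which spellings occurred, and how often
print(sum(play_variants.values()))   # total count for the "play" stem
```

With this pattern "pleyaz" reduces to "pley", so it is not grouped with "play"; spellings that change the vowel would need an explicit mapping or a looser match.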

Drafty answered 27/9, 2013 at 7:23 Comment(6)

I edited my question. I tried your regexStem and got multiple tokens back. Not sure where I'm going wrong. – Partitive 27/9, 2013 at 19:46

Change your last line to [tupstem.stem(i) for i in tupclean if "pl" in i and "y" in tupstem.stem(i)]. In linguistics, vowel shift occurs; assuming that the diphthong and the onset both remain, the consonant cluster "pl" will also be present in the orthography. – Drafty 28/9, 2013 at 4:8

tried this but it didn't really do what i was hoping it would do. thanks anyway! – Partitive 30/9, 2013 at 16:43

I have nltk installed and can use it in other cases, but I'm getting module import errors on all the above:
>>> from nltk import stem
>>> snowball = stem.snowball.EnglishStemmer()
>>> [snowball(i) for i in ['Playing', "swimming", "dancing"]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'EnglishStemmer' object is not callable
– Monodrama 25/11, 2013 at 16:28

have you downloaded all the packages when you do >>> import nltk and then >>> nltk.download()? – Drafty 25/11, 2013 at 16:46

nice choice of examples that show interesting corner cases for the nltk stemmers – Fabre 8/2, 2014 at 1:58
