nltk stemmer: string index out of range

Asked 7/1, 2017 at 3:48 Answered 7/1, 2017 at 20:45

I have a set of pickled text documents which I would like to stem using nltk's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside of a django app view.

However, when stemming the documents inside the django view, I receive an IndexError: string index out of range exception from PorterStemmer().stem() for the string 'oed'. As a result, running the following:

# xkcd_project/search/views.py
from django.shortcuts import render
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')
    return render(request, 'list.html')

raises the mentioned error:

Traceback (most recent call last):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner
    response = get_response(request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results
    s.stem('oed')
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem
    stem = self._step1b(stem)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b
    lambda stem: (self._measure(stem) == 1 and
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
    if suffix == '*d' and self._ends_double_consonant(word):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
    word[-1] == word[-2] and
IndexError: string index out of range

What is really odd is that running the same stemmer on the same string outside django (be it a separate python file or an interactive python console) produces no error. In other words:

# test.py
from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')

followed by:

python test.py
# successfully prints 'o'

What is causing this issue?

Lakeshialakey answered 7/1, 2017 at 3:48 Comment(8)

Are you using Python 2? It just might be a character set difference-- just guessing though. – Minatory 7/1, 2017 at 8:42

What version of NLTK are you using? You can check it with nltk.__version__ once you have imported it. Maybe you use two different versions for django and external python. Could you also check the python version that you use in django and to run the external script? I suppose it's always 2.7, given the print statement. – Gretchengrete 7/1, 2017 at 10:28

Almost unrelated to the issue: s = PorterStemmer() should be created at module level, where your global variables are. Putting it in the view means constructing the PorterStemmer object on every request that hits this view function. – Fagaly 7/1, 2017 at 11:27

Within get_results, can you do x = 'oed' and then print x, and see what you get on the console where you run python manage.py runserver? I suspect it's django swallowing words. – Fagaly 7/1, 2017 at 11:30

Also, try adding this to your views.py: # coding: utf-8 on the first line, and from __future__ import unicode_literals. The django and nltk versions should also be reported in the OP as well as in the github issue. – Fagaly 7/1, 2017 at 11:33

Somehow this is also the case when Django gobbles up some str or char in #41503627 =( – Fagaly 7/1, 2017 at 11:35

@KurtBourbaki turns out I was using two different versions of nltk. I was using version 3.2.2 in my django project's virtual environment //anaconda/envs/xkcd/bin/ but I had been running test.py using ipython, not python as stated above. The ipython installation was defined in my root environment //anaconda/bin/ipython, which must have given it access to the nltk version installed in my root environment (version 3.2.0). I downgraded my virtual environment's nltk to version 3.2.0 and ran the code successfully on the django app. Does this mean it is an issue with nltk 3.2.2? – Lakeshialakey 7/1, 2017 at 18:32

@KurtBourbaki also any ideas as to why I was able to access the ipython installation specified in my root environment despite having a project environment activated which did not have ipython? – Lakeshialakey 7/1, 2017 at 18:33
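
A mismatch like the one described above can be checked by printing, from inside each interpreter (the Django process and the console), which executable is running and where a package was imported from. The following is a minimal sketch: describe_module is a hypothetical helper, not part of NLTK or Django, and it relies only on the standard library plus the __version__ and __file__ attributes that the imported module exposes.

```python
import importlib
import sys


def describe_module(name):
    """Return the interpreter path plus the version and location of a module.

    Hypothetical helper for illustration; `name` is any importable module.
    """
    info = {"executable": sys.executable, "module": name}
    try:
        module = importlib.import_module(name)
    except ImportError:
        # The package is not installed for this interpreter.
        info["version"] = None
        info["path"] = None
        return info
    # Not every module defines __version__ or __file__, so fall back to None.
    info["version"] = getattr(module, "__version__", None)
    info["path"] = getattr(module, "__file__", None)
    return info


if __name__ == "__main__":
    print(describe_module("nltk"))
```

If two environments are involved, calling this once inside the view and once in the console should report different executable and path values.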

This is an NLTK bug specific to NLTK version 3.2.2, for which I am to blame. It was introduced by PR https://github.com/nltk/nltk/pull/1261 which rewrote the Porter stemmer.

I wrote a fix which went out in NLTK 3.2.3. If you're on version 3.2.2 and want the fix, just upgrade, e.g. by running:

pip install -U nltk
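
If upgrading is not immediately possible, one conceivable stopgap is to catch the IndexError and fall back to the unstemmed word. The sketch below is an assumption for illustration only: safe_stem is a hypothetical wrapper, not something NLTK provides, and it hides the bug rather than fixing it.

```python
def safe_stem(stem, word):
    """Apply `stem` to `word`; return `word` unchanged if stemming raises IndexError.

    `stem` is any callable that takes a string, e.g. PorterStemmer().stem.
    Hypothetical wrapper: it masks the failure instead of repairing the stemmer.
    """
    try:
        return stem(word)
    except IndexError:
        return word
```

It could be called as safe_stem(PorterStemmer().stem, 'oed'). On an affected version the word would then pass through unstemmed instead of raising, so the results differ from those of a fixed stemmer; upgrading remains the real fix.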

Siren answered 7/1, 2017 at 20:45 Comment(2)

As it stands, this answer is at +20 and so I've effectively received 200 Stack Overflow rep as a reward for breaking an open source library. I feel rather guilty. – Siren 10/5, 2017 at 11:11

Don't be guilty, this is one way to incentivize OSS =) – Fagaly 13/5, 2017 at 3:13

I debugged nltk.stem.porter module using pdb. After a few iterations, in _apply_rule_list() you get:

>>> rule
(u'at', u'ate', None)
>>> word
u'o'

At this point the _ends_double_consonant() method tries to do word[-1] == word[-2] and it fails.

If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:

def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):
        return False
    return self._cons(word, len(word)-1)

As far as I can see, the len(word) < 2 check is missing in the new version.

Changing _ends_double_consonant() to something like this should work:

def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )

I just proposed this change in the related NLTK issue.
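
The effect of the length check can be seen in isolation with a standalone sketch. The functions below are simplified stand-ins, not NLTK's code: is_consonant is an assumption that treats only a, e, i, o, u as vowels and ignores the context-dependent handling of 'y' in the real Porter rules.

```python
VOWELS = "aeiou"


def is_consonant(ch):
    # Simplification: the real Porter rules also treat 'y' as context dependent.
    return ch not in VOWELS


def ends_double_consonant_unguarded(word):
    # Mirrors the failing expression: word[-2] does not exist for 1-char words.
    return word[-1] == word[-2] and is_consonant(word[-1])


def ends_double_consonant_guarded(word):
    # Length check first, so short words never reach the indexing.
    if len(word) < 2:
        return False
    return word[-1] == word[-2] and is_consonant(word[-1])


if __name__ == "__main__":
    print(ends_double_consonant_guarded("o"))
    try:
        ends_double_consonant_unguarded("o")
    except IndexError as exc:
        print("unguarded version failed:", exc)
```

With the one-character input 'o' from the pdb session, the unguarded version raises IndexError while the guarded one simply returns False.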

Gretchengrete answered 7/1, 2017 at 19:35 Comment(0)
