nltk stemmer: string index out of range

I have a set of pickled text documents which I would like to stem using NLTK's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside a Django view.

However, when stemming the documents inside the django view, I receive an IndexError: string index out of range exception from PorterStemmer().stem() for the string 'oed'. As a result, running the following:

# xkcd_project/search/views.py
from django.shortcuts import render
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')  # raises IndexError under NLTK 3.2.2
    return render(request, 'list.html')

raises the mentioned error:

Traceback (most recent call last):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner
    response = get_response(request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results
    s.stem('oed')
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem
    stem = self._step1b(stem)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b
    lambda stem: (self._measure(stem) == 1 and
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
    if suffix == '*d' and self._ends_double_consonant(word):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
    word[-1] == word[-2] and
IndexError: string index out of range

What is really odd is that running the same stemmer on the same string outside Django (be it a separate Python file or an interactive Python console) produces no error. In other words:

# test.py
from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')

followed by:

python test.py
# successfully prints 'o'

What is causing this issue?

Lakeshialakey answered 7/1, 2017 at 3:48 Comment(8)
Are you using Python 2? It might just be a character set difference; just guessing though. – Minatory
What version of NLTK are you using? You can check it with nltk.__version__ once you have imported it. Maybe you use two different versions for Django and the external Python. Could you also check the Python version that you use in Django and to run the external script? I suppose it's always 2.7, given the print statement. – Gretchengrete
Almost unrelated to the issue: s = PorterStemmer() should be created at module level, where your global variables are. Creating it in the view means instantiating a PorterStemmer object on every request that hits this view function. – Fagaly
Within get_results, can you do x = 'oed' and then print x, and see what you get on the console where you run python manage.py runserver? I suspect it's Django swallowing words. – Fagaly
Also, try adding # coding: utf-8 as the first line of your views.py, along with from __future__ import unicode_literals. The Django and NLTK versions should also be reported in the OP as well as in the GitHub issue. – Fagaly
Somehow this is also the case when Django gobbles up some str or char in #41503627 =( – Fagaly
@KurtBourbaki It turns out I was using two different versions of NLTK. I was using version 3.2.2 in my Django project's virtual environment (//anaconda/envs/xkcd/bin/), but I had been running test.py using ipython, not python as stated above. The ipython installation was defined in my root environment (//anaconda/bin/ipython), which must have given it access to the NLTK version installed in my root environment (version 3.2.0). I downgraded my virtual environment's NLTK to version 3.2.0 and ran the code successfully in the Django app. Does this mean it is an issue with NLTK 3.2.2? – Lakeshialakey
@KurtBourbaki Also, any ideas as to why I was able to access the ipython installation from my root environment despite having activated a project environment that did not have ipython? – Lakeshialakey
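The version mismatch described in the comments above can be checked by running the same few lines in both places (for example, once from the Django view or `python manage.py shell`, and once from whichever interpreter runs test.py). This is only a sketch: `describe_module` is a hypothetical helper name, not part of NLTK or Django.

```python
import importlib
import sys


def describe_module(name):
    """Return (version, file path) for an importable module, or None if it is missing."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Not every module defines __version__ or __file__, so fall back to placeholders.
    version = getattr(module, "__version__", "unknown")
    path = getattr(module, "__file__", "built-in")
    return version, path


# Which interpreter is running, and which nltk (if any) it picks up.
print("interpreter: %s" % sys.executable)
print("nltk: %r" % (describe_module("nltk"),))
```

If the two runs print different interpreter paths or different NLTK versions, the two environments are not the same, which matches what the asker eventually found.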

This is an NLTK bug specific to NLTK version 3.2.2, for which I am to blame. It was introduced by PR https://github.com/nltk/nltk/pull/1261 which rewrote the Porter stemmer.

I wrote a fix which went out in NLTK 3.2.3. If you're on version 3.2.2 and want the fix, just upgrade - e.g. by running

pip install -U nltk
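
If upgrading is not possible right away, one stopgap is to catch the exception around the call. The sketch below is an assumption on my part rather than anything NLTK provides: `safe_stem` and `BrokenStemmer` are hypothetical names, and returning the word unstemmed is a design choice that hides the bug rather than fixing it.

```python
def safe_stem(stemmer, word):
    """Stem `word`, returning it unchanged if the stemmer raises IndexError."""
    try:
        return stemmer.stem(word)
    except IndexError:
        # Fall back to the unstemmed word instead of failing the whole request.
        return word


class BrokenStemmer(object):
    """Hypothetical stand-in that fails the way the traceback in the question does."""

    def stem(self, word):
        raise IndexError("string index out of range")


print(safe_stem(BrokenStemmer(), "oed"))
```

With a real stemmer the call would look like `safe_stem(PorterStemmer(), 'oed')`. Upgrading remains the proper fix.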
Siren answered 7/1, 2017 at 20:45 Comment(2)
As it stands, this answer is at +20 and so I've effectively received 200 Stack Overflow rep as a reward for breaking an open source library. I feel rather guilty.Siren
Don't be guilty, this is one way to incentivize OSS =)Fagaly

I debugged the nltk.stem.porter module using pdb. After a few iterations, in _apply_rule_list() you get:

>>> rule
(u'at', u'ate', None)
>>> word
u'o'

At this point the _ends_double_consonant() method tries to do word[-1] == word[-2] and it fails.
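
The failure can be reproduced without NLTK. The sketch below mirrors the shape of the check shown in the traceback; `is_consonant` is a simplified stand-in I am assuming for illustration (it ignores the stemmer's special handling of 'y'), not NLTK's own implementation.

```python
def is_consonant(word, i):
    # Simplified stand-in: treats every letter other than a, e, i, o, u as a consonant.
    return word[i] not in "aeiou"


def ends_double_consonant_unguarded(word):
    # Same shape as the failing check: word[-2] does not exist for a one-letter word.
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


try:
    ends_double_consonant_unguarded("o")
except IndexError as exc:
    print("IndexError: %s" % exc)
```

For a one-character string, `word[-1]` is valid but `word[-2]` is not, so the comparison raises before the consonant test is ever reached.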

If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:

def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):      
        return False        
    return self._cons(word, len(word)-1)

As far as I can see, the len(word) < 2 check is missing in the new version.

Changing _ends_double_consonant() to something like this should work:

def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )

I just proposed this change in the related NLTK issue.

Gretchengrete answered 7/1, 2017 at 19:35 Comment(0)
