NLTK v3.2: Unable to nltk.pos_tag()

Hi text mining champions,

I'm using Anaconda with NLTK v3.2 on Windows 10 (the client's environment).

When I try to POS tag, I keep getting a urllib2 error:

URLError: <urlopen error unknown url type: c>

It seems urllib2 is unable to recognize Windows paths. How can I work around this?

The command is as simple as:

nltk.pos_tag(nltk.word_tokenize("Hello World"))

edit: There is a duplicate question, however I think the answers obtained here by manan and alvas are a better fix.

Marr answered 7/3, 2016 at 5:40 Comment(2)
Possible duplicate of Python NLTK pos_tag throws URLError - Hydrus
Looks like it, yeah. I read that post prior. - Marr
L
10

EDITED

This issue was resolved in NLTK v3.2.1. Upgrading NLTK resolves it, e.g. pip install -U nltk.
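
To tell whether an installed version already has the fix, the version string can be compared against 3.2.1. This is only a rough sketch: the helper names version_tuple and has_urlerror_fix are made up for illustration, and the check only tests the "3.2.1 or later" threshold stated above (it says nothing about which older releases were actually affected).

```python
import re


def version_tuple(version):
    """Turn a version string such as '3.2.1' into a tuple of ints, (3, 2, 1)."""
    return tuple(int(number) for number in re.findall(r"\d+", version)[:3])


def has_urlerror_fix(version):
    """True if the version is 3.2.1 or later, where the issue is reported fixed."""
    return version_tuple(version) >= (3, 2, 1)


print(has_urlerror_fix("3.2"))    # False
print(has_urlerror_fix("3.2.1"))  # True

# With NLTK installed, the running version could be checked like this:
#     import nltk
#     print(has_urlerror_fix(nltk.__version__))
```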


I faced the same issue, and the error I encountered was as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\nltk-3.2-py2.7.egg\nltk\tag\__init__.py", line 110, in pos_tag
    tagger = PerceptronTagger()
  File "C:\Python27\lib\site-packages\nltk-3.2-py2.7.egg\nltk\tag\perceptron.py", line 141, in __init__
    self.load(AP_MODEL_LOC)
  File "C:\Python27\lib\site-packages\nltk-3.2-py2.7.egg\nltk\tag\perceptron.py", line 209, in load
    self.model.weights, self.tagdict, self.classes = load(loc)
  File "C:\Python27\lib\site-packages\nltk-3.2-py2.7.egg\nltk\data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "C:\Python27\lib\site-packages\nltk-3.2-py2.7.egg\nltk\data.py", line 924, in _open
    return urlopen(resource_url)
  File "C:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 391, in open
    response = self._open(req, data)
  File "C:\Python27\lib\urllib2.py", line 414, in _open
    'unknown_open', req)
  File "C:\Python27\lib\urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 1206, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: c>

The URLError that you mentioned is due to a bug in the perceptron.py file of the NLTK library that shows up on Windows. On my machine, the file is at this location:

C:\Python27\Lib\site-packages\nltk-3.2-py2.7.egg\nltk\tag\perceptron.py

(Look for the equivalent location on your machine, wherever your Python27 folder is.)

The bug was in the code that locates the averaged_perceptron_tagger model on your machine: the plain Windows path it produced (starting with a drive letter such as C:) was passed on to be opened as a URL, so the drive letter was read as the URL type, hence "unknown url type: c". See lines 801 and 924 of data.py in the traceback above.
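
The "unknown url type: c" message can be reproduced without NLTK. Below is a minimal sketch, assuming Python 3's urllib.parse (the traceback above comes from Python 2's urllib2) and a made-up model path. The helper name url_scheme is hypothetical. It shows how a drive letter ends up being read as a URL scheme, and why a 'file:' prefix avoids that.

```python
from urllib.parse import urlparse


def url_scheme(location):
    """Return the URL scheme that urlparse reads from a location string."""
    return urlparse(location).scheme


# Hypothetical model path, for illustration only.
windows_path = r"C:\nltk_data\taggers\averaged_perceptron_tagger\averaged_perceptron_tagger.pickle"

# The drive letter before the first colon is read as the scheme.
print(url_scheme(windows_path))            # c

# Prefixing 'file:' gives the opener a scheme it knows how to handle.
print(url_scheme("file:" + windows_path))  # file
```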

I think the NLTK developers recently fixed this bug. Have a look at this commit, made a few days ago:

https://github.com/nltk/nltk/commit/d3de14e58215beebdccc7b76c044109f6197d1d9#diff-26b258372e0d13c2543de8dbb1841252

The snippet where the change was made is as follows:

        self.tagdict = {}
        self.classes = set()
        if load:
            # Initially it was: AP_MODEL_LOC = str(find('taggers/averaged_perceptron_tagger/'+PICKLE))
            AP_MODEL_LOC = 'file:'+str(find('taggers/averaged_perceptron_tagger/'+PICKLE))
            self.load(AP_MODEL_LOC)

    def tag(self, tokens):

Updating the file to the most recent commit worked for me, and I was able to use nltk.pos_tag again. I believe this will resolve your problem as well (assuming you have everything else set up).

Laoighis answered 9/3, 2016 at 21:0 Comment(3)
Works like a dream. Thanks @Laoighis - Marr
FWIW, I had the same error on Win10 with Python 3.4 (64-bit), with nltk installed via pip and up to date as of April 2nd. Finding the perceptron.py file and making the change in the snippet above worked, after a restart for good measure. Wish I had seen this post 4 hours ago, though, because I thought my tokens were the problem. - Shote
Sorry for adding the edit to your answer; this is to avoid cross-platform communication and NLTK users opening new issues on the GitHub repo about this resolved issue. - Hydrus
H
6

EDITED

This issue was resolved in NLTK v3.2.1. Please upgrade your NLTK!


First read @MananVyas's answer for the why:

https://mcmap.net/q/1480347/-nltk-v3-2-unable-to-nltk-pos_tag


Here's the how: to stay on NLTK v3.2 without downgrading to v3.1, you can use this "hack":

>>> from nltk.tag import PerceptronTagger
>>> from nltk.data import find
>>> PICKLE = "averaged_perceptron_tagger.pickle"
>>> AP_MODEL_LOC = 'file:'+str(find('taggers/averaged_perceptron_tagger/'+PICKLE))
>>> tagger = PerceptronTagger(load=False)
>>> tagger.load(AP_MODEL_LOC)
>>> pos_tag = tagger.tag
>>> pos_tag('The quick brown fox jumps over the lazy dog'.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
Hydrus answered 12/3, 2016 at 23:43 Comment(7)
I ran the above code and it worked fine, but when I run my NLTK routine it still gives <<raise URLError('unknown url type: %s' % type)>>. The code I am using is in #36255791. I also ran Sarim Hussain's suggestion successfully, but no luck. - Profligate
Try upgrading your nltk with pip install -U nltk. - Hydrus
Just tried that. Still the same error. From the pip command I get << writing dependency_links to nltk.egg-info\dependency_links.txt warning: manifest_maker: standard file '-c' not found reading manifest template 'MANIFEST.in' warning: no files found matching 'Makefile' under directory '*.txt' warning: no previously-included files matching '*~' found anywhere in distribution writing manifest file 'nltk.egg-info\SOURCES.txt' Successfully installed nltk-3.2>> - Profligate
Which OS are you using? What is your Python version? How did you install Python? How did you install NLTK? Did you install through pip or conda? Where are you running Python? From the command prompt, a terminal or some IDE? Are you running it on a server or in the cloud? Are you running it on your laptop/computer? Or in some school's lab where there might be a firewall? Where are you running the Python script? Do you have any other file named nltk.py in your directory? - Hydrus
After upgrading to NLTK 3.2, did you use the AP_MODEL_LOC = 'file:'+str(find('taggers/averaged_perceptron_tagger/'+PICKLE)) hack? - Hydrus
Sorry for the multiple questions; your short comment isn't enough to help us debug the problem. Please answer each of the questions in the previous 2 comments and we'll try to find a solution afterwards. Actually, it will be easier if you ask another question and state all the answers to those questions there; it looks like it's another problem. - Hydrus
Thanks, I have created a new question for this at #36350255 - Profligate
R
1

I faced the same issue a while back. Solution:

nltk.download('averaged_perceptron_tagger')
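
If the model is simply not installed, the download above could be wrapped so that it only runs when the resource is missing. This is a sketch under assumptions: ensure_resource is a hypothetical helper, not part of NLTK, and it relies on the lookup function raising LookupError for a missing resource, as nltk.data.find does. The lookup and download functions are passed in so the logic can be tried without NLTK installed.

```python
def ensure_resource(find, download, resource_path, package_id):
    """Download package_id only if find(resource_path) raises LookupError.

    Returns True if a download was triggered, False otherwise.
    """
    try:
        find(resource_path)
    except LookupError:
        download(package_id)
        return True
    return False


# With NLTK installed, a call might look like this:
#     import nltk
#     ensure_resource(nltk.data.find, nltk.download,
#                     "taggers/averaged_perceptron_tagger",
#                     "averaged_perceptron_tagger")
```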
Relent answered 22/3, 2016 at 16:14 Comment(0)
