NLTK: set proxy server
R

9

18

I'm trying to learn NLTK (the Natural Language Toolkit, written in Python) and I want to install a sample data set to run some examples.

My web connection uses a proxy server, and I'm trying to specify the proxy address as follows:

>>> nltk.set_proxy('http://proxy.example.com:3128' ('USERNAME', 'PASSWORD'))
>>> nltk.download()

But I get an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object is not callable

I decided to set up a ProxyBasicAuthHandler before calling nltk.download():

import urllib2

auth_handler = urllib2.ProxyBasicAuthHandler(urllib2.HTTPPasswordMgrWithDefaultRealm())
auth_handler.add_password(realm=None, uri='http://proxy.example.com:3128/', user='USERNAME', passwd='PASSWORD')
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)

import nltk
nltk.download()

But now I get HTTP Error 407 - Proxy Authentication Required.

The documentation says that if the proxy is set to None then this function will attempt to detect the system proxy. But it isn't working.

How can I install a sample data set for NLTK?

Radioisotope answered 17/12, 2012 at 5:12 Comment(1)
Ref #41349121: set ssl to false if it gives an error. - Letterpress
C
24

There is an error on the website where you got those lines of code for your first attempt (I have seen that same error).

The line in error is

nltk.set_proxy('http://proxy.example.com:3128' ('USERNAME', 'PASSWORD'))

You need a comma to separate the arguments. The correct line should be

nltk.set_proxy('http://proxy.example.com:3128', ('USERNAME', 'PASSWORD'))

This will work just fine.
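
To see why the missing comma produces that particular traceback: Python reads a string followed directly by a parenthesized tuple as an attempt to call the string. The sketch below reproduces the error without NLTK installed; the variable names are illustrative only.

```python
# A string followed by "(...)" is parsed as a call on that string.
# Writing 'http://proxy.example.com:3128' ('USERNAME', 'PASSWORD')
# is therefore equivalent to the call below.
proxy_url = 'http://proxy.example.com:3128'

error_message = None
try:
    proxy_url('USERNAME', 'PASSWORD')  # same mistake as the missing comma
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

With the comma in place, the URL and the credentials are passed as two separate arguments and no call on the string is attempted.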

Clad answered 24/12, 2012 at 17:16 Comment(6)
Thanks a lot! The documentation of the NLTK project contains errors. - Radioisotope
You can try nltk.set_proxy('http://proxy.example.com:3128', 'USERNAME', 'PASSWORD'). If your password contains special characters, remember to percent-encode them, for example %40 for @. - Spleeny
@Clad +1, saved me a lot of time, thank you. Seemingly NLTK does not use the os.environ['http_proxy'] or os.environ['https_proxy'] proxies, as that method did not work for me. - Filiate
Works fine. Tested in Python 3.5. nltk.set_proxy('proxy.example.com:3128', ('USERNAME', 'PASSWORD')) - Tundra
@Clad Can you help me? For me it works if I use 'domen%5Clogin:[email protected]:3131' but fails if I try 'proxy.com:3131', ('domen%5Clogin' , 'password'). What can be wrong? - Natalienatalina
@Natalienatalina I took a look, but I can't say for your particular case. Have a look at the function if you want to compare the two paths: encoding the username and password in the proxy URL, and passing the username and password as separate parameters. Perhaps the `domen\` portion is the issue? - Clad
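
The percent-encoding mentioned in the comments above can be done with the standard library instead of by hand. This is a minimal sketch: `build_proxy_url` is a hypothetical helper, not part of NLTK, and the host, user, and password are placeholder values.

```python
from urllib.parse import quote


def build_proxy_url(host, port, user, password):
    """Build a proxy URL with the credentials percent-encoded.

    Hypothetical helper for illustration. safe='' makes quote() encode
    every reserved character, including '@', ':', '/' and '\\'.
    """
    return 'http://{}:{}@{}:{}'.format(
        quote(user, safe=''), quote(password, safe=''), host, port
    )


# Placeholder credentials containing a backslash, an '@' and a ':'.
url = build_proxy_url('proxy.example.com', 3128, 'domain\\login', 'p@ss:word')
print(url)
```

The resulting string could then be passed as the single argument to nltk.set_proxy, as in the answers that embed the credentials in the URL.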
S
13

I run NLTK 3.2.5 and Python 3.6 under Windows 10. I use this script:

nltk.set_proxy('http://user:[email protected]:3128')
nltk.download()
Sheritasherj answered 16/1, 2018 at 10:12 Comment(0)
I
11

I was getting the same error too, but I found a working solution. You need to download nltk_data MANUALLY and put it in the /usr/share/nltk_data directory on Linux, or in c:\nltk_data on Windows.
Here are the steps you need to follow:
1. Download the nltk_data zip file from this GitHub link:
https://github.com/nltk/nltk_data/tree/gh-pages
2. Since the data is in zip form, you need to extract it.
3. Ubuntu users in particular can run the following command to browse the filesystem with root permissions, which makes it easier to copy and paste into /usr/share or to create a folder there:
sudo nautilus
4. If you are a Linux user, create a folder named nltk_data in /usr/share; if you use Windows, create the same folder in c:\.
5. Paste all the contents of nltk_data-gh-pages (which you just extracted) into the nltk_data folder you just created.
6. From the nltk_data/packages folder, copy all the folders and paste them into the nltk_data folder. Now you are done.

Since this is my first answer, I might not have explained the process clearly. If you have trouble going through these steps, please comment.

Ingar answered 9/1, 2015 at 23:8 Comment(3)
I am getting the error at https://mcmap.net/q/497933/-what-to-download-in-order-to-make-nltk-tokenize-word_tokenize-work/1352127. Please help. - Gunsmith
Just an extra comment to make the answer more useful: extract the zips recursively, because inside the packages folder there are more zips that need to be unzipped too. - Lully
The directory doesn't necessarily have to be c:/ (on Windows). It can be any path listed in nltk.data.path. - Chromite
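
The extraction of the zips inside the packages folder can be scripted rather than done by hand. The following is a rough sketch using only the standard library: `extract_zips_in_tree` is a hypothetical helper name, and it assumes each archive should be unpacked next to itself (for example corpora/stopwords.zip into corpora/stopwords/).

```python
import zipfile
from pathlib import Path


def extract_zips_in_tree(root_dir):
    """Unzip every .zip file found anywhere under root_dir.

    Each archive is extracted into the folder that contains it.
    Returns the list of archives that were processed.
    """
    # sorted() collects the archive list before anything is extracted.
    archives = sorted(Path(root_dir).rglob('*.zip'))
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
    return archives


# Hypothetical usage; adjust the path to wherever nltk_data was copied:
#   extract_zips_in_tree('/usr/share/nltk_data')
```

This walks subfolders for archives, but it does not unpack a zip that is itself stored inside another zip.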
M
6

The options suggested above did not work for me. Here's what worked for me in my Windows environment: try removing the round brackets. It works now!

nltk.set_proxy('http://proxy.example.com:3128', 'USERNAME', 'PASSWORD')
Maressa answered 17/10, 2014 at 16:37 Comment(0)
W
2

I run NLTK 3.0 and Python 3.4 in a Windows environment, and proxy authentication works well if I remove the brackets, so use this script:

nltk.set_proxy('http://proxy.example.com:3128', 'username', 'password')
Winther answered 27/11, 2014 at 5:16 Comment(0)
H
2

If you want to install an NLTK corpus manually:

1) Go to http://www.nltk.org/nltk_data/ and download the NLTK corpus file you want.

2) In a Python shell, check the value of nltk.data.path.

3) Choose one of the paths that exists on your machine, and unzip the data files into the corpora subdirectory inside it.

4) Now you can import the data, for example: from nltk.corpus import stopwords

Reference: https://medium.com/@satorulogic/how-to-manually-download-a-nltk-corpus-f01569861da9
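
Steps 2 and 3 above amount to filtering a list of candidate folders down to the ones that exist. A small sketch, assuming the list comes from nltk.data.path; `existing_data_dirs` is a hypothetical helper and does not need NLTK itself:

```python
from pathlib import Path


def existing_data_dirs(search_path):
    """Return the entries of search_path that exist as directories."""
    return [p for p in search_path if Path(p).is_dir()]


def corpora_dir(data_dir):
    """Return the corpora subdirectory inside a chosen data directory."""
    return Path(data_dir) / 'corpora'


# Hypothetical usage, assuming NLTK is installed:
#   import nltk
#   candidates = existing_data_dirs(nltk.data.path)
#   print(corpora_dir(candidates[0]))
```

If none of the listed paths exists yet, one of them has to be created before the data files are unzipped into it.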

Haemostatic answered 1/5, 2017 at 14:2 Comment(0)
M
1

Also set the system proxy in bash by setting the appropriate environment variables.

Some of the proxy settings which I keep are:

http_proxy=http://127.0.0.1:3129/
ftp_proxy=http://127.0.0.1:3129/
all_proxy=socks://127.0.0.1:3129/
https_proxy=http://127.0.0.1:3129/

You can make the changes in environment variable permanent by editing your ~/.bashrc file. Sample edit:

export http_proxy=http://127.0.0.1:3129/
Merlenemerlin answered 17/12, 2012 at 5:39 Comment(2)
I already use the http_proxy environment variable, and many programs (such as eclipse, git, wget, etc.) use it. But it seems to me that the NLTK downloader doesn't use the environment variable. - Radioisotope
On my system, it works perfectly. I also use a proxy.
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader>
- Merlenemerlin
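
One way to check what Python itself sees from these variables is to ask the standard library. The sketch below sets the variables inside the process, using the address from this answer, and reads them back with urllib.request.getproxies(). It only shows what urllib detects; whether the NLTK downloader then honors those settings is disputed in the comments above.

```python
import os
import urllib.request

# Same placeholder address as in the answer; set here only for the demo.
os.environ['http_proxy'] = 'http://127.0.0.1:3129/'
os.environ['https_proxy'] = 'http://127.0.0.1:3129/'

# getproxies() returns a dict mapping scheme names to proxy URLs.
detected = urllib.request.getproxies()
print(detected.get('http'))
print(detected.get('https'))
```

If the variables are exported in ~/.bashrc instead, the two os.environ lines are unnecessary and the same call can be used to confirm that the shell settings reached Python.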
A
0

To be honest, the accepted solution doesn't work for me. I'm also worried about leaking my password, since it has to be specified explicitly.

Rather than using nltk.download() inside the Python console, running python -m nltk.downloader all in cmd (on Windows) works great for me!

PS: Windows users, remember to turn off your proxy server before running the command. Go to Internet Explorer -> gear icon at the top right -> Internet Options -> Connections -> LAN settings -> uncheck "Use a proxy server ... VPN connections)." -> OK

This is also described in the official documentation: https://www.nltk.org/data.html#command-line-installation

Augusto answered 8/11, 2018 at 5:39 Comment(0)
A
-2

I could make it work with:

nltk.set_proxy('http://user_name:password@proxy_ip_address:3128')
Anachronous answered 26/7, 2018 at 16:13 Comment(0)
