Fetch a Wikipedia article with Python
Asked Answered

10

40

I'm trying to fetch a Wikipedia article with Python's urllib:

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")           
s = f.read()
f.close()

However, instead of the HTML page I get the following response, titled "Error - Wikimedia Foundation":

Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to ()
Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT 

Wikipedia seems to block requests that are not from a standard browser.

Anybody know how to work around this?

Hygroscopic answered 23/9, 2008 at 9:37 Comment(1)
Wikipedia doesn't block requests that aren't from a standard browser; it blocks requests from standard libraries that don't change their user agent.Thimble
51

You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.

Straight from the examples:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
Coolant answered 23/9, 2008 at 9:50 Comment(2)
Wikipedia attempts to block screen scrapers for a reason. Their servers have to do a lot of work to convert wikicode to HTML, when there are easier ways to get the article content. en.wikipedia.org/wiki/…Opsonin
You shouldn't try to impersonate a browser by using a user agent like Mozilla/5.0. Instead, you should use an informative user agent with some contact information.Thimble
37

This is not a solution to the specific problem, but it might be interesting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be much easier, especially since you directly get the article contents, which removes the need to parse the HTML.

I have used it myself for two projects, and it works very well.
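For illustration, a minimal sketch of what that could look like with a recent mwclient release (the method names have changed between versions, so treat this as an assumption rather than the library's guaranteed API):

import mwclient

# Connect to English Wikipedia; mwclient talks to the MediaWiki API under the hood
site = mwclient.Site('en.wikipedia.org')

# Fetch the article's wikitext directly -- no HTML parsing required
page = site.pages['Albert Einstein']
wikitext = page.text()  # very old releases used page.edit() to fetch the text
print(wikitext[:200])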

Letterperfect answered 23/9, 2008 at 9:49 Comment(2)
Using third-party libraries for what can easily be done with built-in libraries in a couple of lines of code isn't good advice.Nessie
Since mwclient uses the MediaWiki API, it requires no parsing of the content. And I'm guessing the original poster wants the content, not the raw HTML with menus and all.Variscite
15

Rather than trying to trick Wikipedia, you should consider using their High-Level API.
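For example, roughly (a sketch using urllib2 to match the rest of this thread; the User-Agent string is just a placeholder you should replace with something identifying your script):

import json
import urllib2

# Ask the MediaWiki action API to parse the article and return its HTML as JSON
url = ('http://en.wikipedia.org/w/api.php'
       '?action=parse&page=Albert_Einstein&format=json')
req = urllib2.Request(url, headers={'User-Agent': 'MyScript/0.1 (me@example.com)'})
data = json.load(urllib2.urlopen(req))
html = data['parse']['text']['*']  # the rendered body HTML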

Roderica answered 11/6, 2009 at 11:14 Comment(2)
Which will, in turn, still block requests from urllib using the library's default user-agent header. So the OP will still have the very same problem, although the API may be an easier way to interface with the wiki content, depending on what the OP's goals are.Rhinitis
They work fine for me. Don't they work for you? Ex: en.wikipedia.org/w/… or en.wikipedia.org/w/…Roderica
3

If you are trying to access Wikipedia content (and don't need any specific information about the page itself), instead of using the API you should just call index.php with 'action=raw' in order to get the wikitext, as in:

'http://en.wikipedia.org/w/index.php?action=raw&title=Main_Page'

Or, if you want the HTML code, use 'action=render' like in:

'http://en.wikipedia.org/w/index.php?action=render&title=Main_Page'

You can also define a section to get just part of the content with something like 'section=3'.

You can then access it using the urllib2 module (as suggested in the chosen answer). However, if you need information about the page itself (such as revisions), you'll be better off using mwclient, as suggested above.

Refer to MediaWiki's FAQ if you need more information.
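For the 'action=raw' approach above, a rough sketch (Python 2 urllib2, as in the accepted answer; the User-Agent value is a placeholder):

import urllib2

# Fetch just section 3 of the article as raw wikitext via index.php
url = ('http://en.wikipedia.org/w/index.php'
       '?action=raw&title=Albert_Einstein&section=3')
req = urllib2.Request(url, headers={'User-Agent': 'MyScript/0.1 (me@example.com)'})
wikitext = urllib2.urlopen(req).read()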

Schipperke answered 12/11, 2010 at 19:16 Comment(1)
Hello, if I don't know the section number (such as 3) but I do know the section title, say 'Noun', how do I get that particular section?Loosejointed
2

The general solution I use for any site is to access the page using Firefox and, using an extension such as Firebug, record all details of the HTTP request including any cookies.

In your program (in this case in Python) you should try to send an HTTP request as similar as necessary to the one that worked from Firefox. This often includes setting the User-Agent, Referer, and Cookie fields, but there may be others.
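In Python that could look roughly like this (urllib2 again; all the header values below are placeholders standing in for whatever Firebug actually recorded):

import urllib2

# Replay the request with the same headers the browser sent
req = urllib2.Request(
    'http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes',
    headers={
        'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/2008 Firefox/3.0',
        'Referer': 'http://en.wikipedia.org/',
        'Cookie': 'name=value',  # whatever cookies Firebug showed, if any
    })
html = urllib2.urlopen(req).read()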

Cowgirl answered 23/9, 2008 at 9:51 Comment(0)
2

requests is awesome!

Here is how you can get the HTML content with requests:

import requests
html = requests.get('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes').text

Done!

Oaf answered 19/9, 2014 at 5:37 Comment(0)
1

Try changing the user agent header you are sending in your request to something like: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1 (Linux Mint)

Postwar answered 23/9, 2008 at 9:41 Comment(0)
1

You don't need to impersonate a browser user-agent; any user-agent at all will work, just not a blank one.
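So something as simple as an identifying, informative user agent should do (a sketch; the agent name and contact address are made up):

import urllib2

opener = urllib2.build_opener()
# Any non-default user agent works; an informative one with contact info is friendlier
opener.addheaders = [('User-agent', 'MyWikiFetcher/0.1 (me@example.com)')]
html = opener.open('http://en.wikipedia.org/w/index.php'
                   '?title=Albert_Einstein&printable=yes').read()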

Truncheon answered 23/9, 2008 at 9:48 Comment(2)
urllib and urllib2 both send a user agentLitre
s/blank/blank or default/ — the idea is exactly that you should somehow identify your bot through the user-agent header. That's why they block the urllib default one.Rhinitis
1

Requesting the page with ?printable=yes gives you an entire, relatively clean HTML document. ?action=render gives you just the body HTML. Parsing the page through the MediaWiki action API with action=parse likewise gives you just the body HTML, but is a good option if you want finer control; see the parse API help.

If you just want the page HTML so you can render it, it's faster and better to use the new RESTBase API, which returns a cached HTML representation of the page. In this case, https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein.

As of November 2015, you don't have to set your user-agent, but it's strongly encouraged. Also, nearly all Wikimedia wikis require HTTPS, so avoid a 301 redirect and make https requests.
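For example, with requests (a sketch; the endpoint is the one linked above and the User-Agent is a placeholder):

import requests

# RESTBase returns cached, fully rendered HTML for the article over HTTPS
url = 'https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein'
resp = requests.get(url, headers={'User-Agent': 'MyScript/0.1 (me@example.com)'})
html = resp.text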

Frenzy answered 11/11, 2015 at 5:56 Comment(0)
0
import urllib
s = urllib.urlopen('http://en.wikipedia.org/w/index.php?action=raw&title=Albert_Einstein').read()

This seems to work for me without changing the user agent. Without the "action=raw" it does not work for me.

Sunward answered 25/1, 2011 at 15:2 Comment(0)
