urllib2 HTTP error 429

So I have a list of subreddits and I'm using urllib2 to open them. As I go through them, urllib2 eventually fails with:

urllib2.HTTPError: HTTP Error 429: Unknown

Doing some research, I found that reddit limits the number of requests to their servers by IP:

Make no more than one request every two seconds. There's some allowance for bursts of requests, but keep it sane. In general, keep it to no more than 30 requests in a minute.

So I figured I'd use time.sleep() to limit my requests to one page every 10 seconds. This fails just the same.
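
For reference, the loop looks roughly like this (the subreddit list here is just illustrative):

import urllib2
import time

subreddits = ['python', 'learnpython', 'programming']  # example names
for name in subreddits:
    # eventually raises urllib2.HTTPError 429, despite the delay
    html = urllib2.urlopen('http://www.reddit.com/r/' + name).read()
    time.sleep(10)  # one request every 10 seconds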

The quote above is taken from the reddit API page, but I am not using the reddit API. At this point I can think of two explanations: either the limit applies only to the reddit API, or urllib2 itself has a limit.

Does anyone know which of these it is, or how I could get around this issue?

Disorganization answered 3/11, 2012 at 20:10 Comment(2)
There's no limit in urllib2, as you could have found out by trying some other webpage. They might have blocked your IP from API access; try sending them an email.Deformity
@larsmans The requests go through intermittently. Some succeed, then it fails for a while, then it works again. Also, they could not have blocked me from their API, as I am not using their API.Disorganization

From https://github.com/reddit/reddit/wiki/API:

Many default User-Agents (like "Python/urllib" or "Java") are drastically limited to encourage unique and descriptive user-agent strings.

This applies to regular page requests as well, not just the API. You need to supply your own User-Agent header when making the request.

import urllib2

# TODO: change the user agent string to something unique and descriptive
hdr = {'User-Agent': 'super happy flair bot by /u/spladug'}
req = urllib2.Request(url, headers=hdr)  # url is the page to fetch
html = urllib2.urlopen(req).read()

However, this will create a new connection for every request. I suggest using a library that can re-use connections, httplib or requests, for example. That puts less load on the server and speeds up the requests:

import httplib
import time

# subreddit names, one per line
lst = """
science
scifi
"""

hdr = {'User-Agent': 'super happy flair bot by /u/spladug'}
conn = httplib.HTTPConnection('www.reddit.com')  # a single connection, re-used below
for name in lst.split():
    conn.request('GET', '/r/' + name, headers=hdr)
    print conn.getresponse().read()
    time.sleep(2)  # stay within reddit's one-request-per-two-seconds guideline
conn.close()
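
If you would rather use requests, a Session object re-uses the underlying connection automatically. A rough equivalent (assuming the requests package is installed):

import requests
import time

session = requests.Session()  # keeps the TCP connection alive between requests
session.headers.update({'User-Agent': 'super happy flair bot by /u/spladug'})
for name in 'science scifi'.split():
    resp = session.get('http://www.reddit.com/r/' + name)
    print(resp.text)
    time.sleep(2)  # respect reddit's request-rate guideline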
Snodgrass answered 3/11, 2012 at 22:13 Comment(2)
The only reason this appears to work is that you are not using a common user agent. By the API rules, however, you still need to set a unique user agent, and this solution may eventually still result in 429 errors.Cheatham
Thank you for correcting my false assessment. I have changed my answer to reflect this.Snodgrass

reddit performs rate limiting by request (not by connection, as suggested by Anonymous Coward) for both IP addresses and user agents. The issue you are running into is that everyone who accesses reddit through urllib2's default user agent is rate limited as a single user.

The solution is to set a unique user agent, which is covered in another answer to this question.

Alternatively, forgo writing your own code to crawl reddit and use PRAW instead. It supports almost all the features of reddit's API and you needn't worry about following any of the API rules as it takes care of that for you.
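
As a rough illustration with a modern PRAW release (7.x; the client ID and secret are placeholders you obtain by registering an app with reddit, and PRAW's API has changed since this answer was written):

import praw

# read-only instance; PRAW sets a compliant user agent and rate-limits for you
reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',
                     client_secret='YOUR_CLIENT_SECRET',
                     user_agent='descriptive bot by /u/yourusername')
for submission in reddit.subreddit('python').hot(limit=5):
    print(submission.title)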

Cheatham answered 4/11, 2012 at 7:8 Comment(1)
Thanks bboe. I caught you on the reddit IRC and you told me about PRAW. Cheers again.Disorganization

I ran into the same error. Changing the code from

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(url)
bsObj = BeautifulSoup(html)

to

from urllib.request import urlopen
from bs4 import BeautifulSoup
import urllib.request

webRequest = urllib.request.Request(url, headers={"User-Agent": <your username, if you are scraping reddit>})
html = urlopen(webRequest)
bsObj = BeautifulSoup(html)

resolved the issue.

Homey answered 6/6, 2018 at 6:45 Comment(0)
