`open_http': 403 Forbidden (OpenURI::HTTPError) for the string "Steve_Jobs" but not for any other string

I was going through the Ruby tutorials provided at http://ruby.bastardsbook.com/ and I encountered the following code:

require "open-uri"

remote_base_url = "http://en.wikipedia.org/wiki"
r1 = "Steve_Wozniak"
r2 = "Steve_Jobs"
f1 = "my_copy_of-" + r1 + ".html"
f2 = "my_copy_of-" + r2 + ".html"

# read the first url
remote_full_url = remote_base_url + "/" + r1
rpage = open(remote_full_url).read

# write the first file to disk
file = open(f1, "w")
file.write(rpage)
file.close

# read the second url
remote_full_url = remote_base_url + "/" + r2
rpage = open(remote_full_url).read

# write the second file to disk
file = open(f2, "w")
file.write(rpage)
file.close

# open a new file:
compiled_file = open("apple-guys.html", "w")

# reopen the first and second files again
k1 = open(f1, "r")
k2 = open(f2, "r")

compiled_file.write(k1.read)
compiled_file.write(k2.read)

k1.close
k2.close
compiled_file.close

The code fails with the following trace:

/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:277:in `open_http': 403 Forbidden (OpenURI::HTTPError)
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:518:in `open'
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:30:in `open'
    from /Users/arkidmitra/tweetfetch/samecode.rb:11

My problem is not that the code fails but that whenever I change r2 to anything other than Steve_Jobs, it works. What is happening here?

Chalutz asked 7/6, 2012 at 4:16 Comment(5)
Got a proxy or something that might be filtering URLs? Have you tried hitting the 'bad' URL via something else on the same machine, e.g. the lynx browser? – Gentile
Nothing as such. It works even with wget "en.wikipedia.org/wiki/Steve_Jobs". I am amazed. – Chalutz
Can you try setting the User-Agent on your side, like open(remote_full_url, "User-Agent" => "Mozilla/5.0 (Windows NT 6.0; rv:12.0) Gecko/20100101 Firefox/12.0 FirePHP/0.7.1")? – Leodora
Yes, it works now. Can you please explain what the problem was? Should I close this question, or will you be providing the answer rather than just a comment? – Chalutz
Well, the API wiki says that requests without a User-Agent are blocked and a 403 is returned. But I can't really explain why this only applies to the "Steve_Jobs" article (which isn't even accessed through the API). They also have a User-Agent policy, but nothing there indicates that a 403 error code is used. So I don't really have an answer that explains this behaviour. – Leodora

I think this happens for locked-down entries like "Steve Jobs", "Al Gore", etc. This is covered in the same book that you are referring to:

For some pages – such as Al Gore's locked-down entry – Wikipedia will not respond to a web request if a User-Agent isn't specified. The "User-Agent" typically refers to your browser, and you can see this by inspecting the headers you send for any page request in your browser. By providing a "User-Agent" key-value pair, (I basically use "Ruby" and it seems to work), we can pass it as a hash (I use the constant HEADERS_HASH in the example) as the second argument of the method call.

It is covered later at http://ruby.bastardsbook.com/chapters/web-crawling/
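
In code, the book's approach looks roughly like this (just a sketch; the HEADERS_HASH name and the plain "Ruby" value are simply the book's example choices):

require "open-uri"

# The book passes a headers hash as the second argument to open;
# HEADERS_HASH and the "Ruby" value are its example choices.
# (On modern Ruby, use URI.open instead of Kernel#open.)
HEADERS_HASH = { "User-Agent" => "Ruby" }

remote_full_url = "http://en.wikipedia.org/wiki/Steve_Jobs"
rpage = open(remote_full_url, HEADERS_HASH).read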

Selby answered 18/6, 2012 at 17:49 Comment(0)

Your code runs fine for me (Ruby MRI 1.9.3) when I request a wiki page that exists.

When I request a wiki page that does NOT exist, I get a MediaWiki 404 error.

  • Steve_Jobs => success
  • Steve_Austin => success
  • Steve_Rogers => success
  • Steve_Foo => error
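
For reference, a quick way to reproduce that list is a loop along these lines (just a sketch; like the original code it sends no User-Agent, so results may vary with Wikipedia's caching):

require "open-uri"

# Request each title the same way the original script does (no User-Agent)
# and report whether open-uri raises an HTTPError.
%w[Steve_Jobs Steve_Austin Steve_Rogers Steve_Foo].each do |title|
  begin
    open("http://en.wikipedia.org/wiki/#{title}").read
    puts "#{title} => success"
  rescue OpenURI::HTTPError => e
    puts "#{title} => error (#{e.message})"
  end
end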

Wikipedia does a ton of caching, so if you see responses for "Steve_Jobs" that differ from those for other people who do exist, the best guess is that Wikipedia is caching the Steve Jobs article because he's famous, and potentially adding extra checks/verifications to protect the article from rapid changes, defacement, etc.

The solution for you: always open the URL with a User-Agent string.

rpage = open(remote_full_url, "User-Agent" => "Whatever you want here").read

Details from the MediaWiki docs: "When you make HTTP requests to the MediaWiki web service API, be sure to specify a User-Agent header that properly identifies your client. Don't use the default User-Agent provided by your client library, but make up a custom header that includes the name and the version number of your client: something like 'MyCuteBot/0.1'.

On Wikimedia wikis, if you don't supply a User-Agent header, or you supply an empty or generic one, your request will fail with an HTTP 403 error. See our User-Agent policy."
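
Applied to the script in the question, only the two open-uri calls need to change. A rough sketch (the User-Agent string here is just a placeholder; use something that identifies your script):

require "open-uri"

# Placeholder UA string; replace with something identifying your script.
UA = { "User-Agent" => "MyWikiFetcher/0.1 (your-email@example.com)" }

def fetch(url)
  open(url, UA).read
rescue OpenURI::HTTPError => e
  warn "#{url} failed: #{e.message}"   # e.g. "403 Forbidden"
  nil
end

page = fetch("http://en.wikipedia.org/wiki/Steve_Jobs")
File.open("my_copy_of-Steve_Jobs.html", "w") { |f| f.write(page) } if page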

Evanish answered 10/6, 2012 at 0:56 Comment(2)
Thus, I'm betting your initial testing on the other names was done with a browser, and you're seeing cached results for those. When you hit "Steve_Jobs", it is not cached, and since you were using no UA string, you got the 403. – Rett
I can consistently reproduce this with curl. The Jobs page returns 403 w/o a UA. If a UA is provided, then it returns a normal 200 response. I tried a few other pages and none had this behavior. Weird... – Thusly