`open_http': 403 Forbidden (OpenURI::HTTPError) for the string "Steve_Jobs" but not for any other string

I was going through the Ruby tutorials provided at http://ruby.bastardsbook.com/ and I encountered the following code:

require "open-uri"

remote_base_url = "http://en.wikipedia.org/wiki"
r1 = "Steve_Wozniak"
r2 = "Steve_Jobs"
f1 = "my_copy_of-" + r1 + ".html"
f2 = "my_copy_of-" + r2 + ".html"

# read the first url
remote_full_url = remote_base_url + "/" + r1
rpage = open(remote_full_url).read

# write the first file to disk
file = open(f1, "w")
file.write(rpage)
file.close

# read the second url
remote_full_url = remote_base_url + "/" + r2
rpage = open(remote_full_url).read

# write the second file to disk
file = open(f2, "w")
file.write(rpage)
file.close

# open a new file:
compiled_file = open("apple-guys.html", "w")

# reopen the first and second files again
k1 = open(f1, "r")
k2 = open(f2, "r")

compiled_file.write(k1.read)
compiled_file.write(k2.read)

k1.close
k2.close
compiled_file.close

The code fails with the following trace:

/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:277:in `open_http': 403 Forbidden (OpenURI::HTTPError)
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:518:in `open'
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:30:in `open'
    from /Users/arkidmitra/tweetfetch/samecode.rb:11

My problem is not that the code fails but that whenever I change r2 to anything other than Steve_Jobs, it works. What is happening here?

Chalutz asked 7/6, 2012 at 4:16 Comment(5)
Got a proxy or something that might be filtering URLs? Have you tried hitting the 'bad' URL via something else on the same machine, e.g. the lynx browser? – Gentile
Nothing as such. It works even with wget "en.wikipedia.org/wiki/Steve_Jobs". I am amazed. – Chalutz
Can you try setting the User-Agent on your side, like open(remote_full_url, "User-Agent" => "Mozilla/5.0 (Windows NT 6.0; rv:12.0) Gecko/20100101 Firefox/12.0 FirePHP/0.7.1")? – Leodora
Yes, it works now. Can you please explain what the problem was? Should I close this question, or will you be providing the answer rather than just a comment? – Chalutz
Well, the API wiki says that requests without a User-Agent are blocked and a 403 is returned. But I can't really explain why this only applies to the "Steve_Jobs" article (which isn't even accessed through the API). They also have a User-Agent policy, but nothing there indicates that a 403 error code is used. So I don't really have an answer that explains this behaviour. – Leodora

I think this happens for locked-down entries like "Steve Jobs", "Al Gore", etc. This is covered in the same book that you are referring to:

For some pages – such as Al Gore's locked-down entry – Wikipedia will not respond to a web request if a User-Agent isn't specified. The "User-Agent" typically refers to your browser, and you can see this by inspecting the headers you send for any page request in your browser. By providing a "User-Agent" key-value pair, (I basically use "Ruby" and it seems to work), we can pass it as a hash (I use the constant HEADERS_HASH in the example) as the second argument of the method call.

It is covered later at http://ruby.bastardsbook.com/chapters/web-crawling/
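
In code, the book's approach looks roughly like this (just a sketch; the HEADERS_HASH name and the plain "Ruby" value are simply the book's example choices):

require "open-uri"

# The book passes a headers hash as the second argument to open;
# HEADERS_HASH and the "Ruby" value are its example choices.
# (On modern Ruby, use URI.open instead of Kernel#open.)
HEADERS_HASH = { "User-Agent" => "Ruby" }

remote_full_url = "http://en.wikipedia.org/wiki/Steve_Jobs"
rpage = open(remote_full_url, HEADERS_HASH).read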

Selby answered 18/6, 2012 at 17:49 Comment(0)

Your code runs fine for me (Ruby MRI 1.9.3) when I request a wiki page that exists.

When I request a wiki page that does NOT exist, I get a MediaWiki 404 error.

  • Steve_Jobs => success
  • Steve_Austin => success
  • Steve_Rogers => success
  • Steve_Foo => error
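
For reference, a quick way to reproduce that list is a loop along these lines (just a sketch; like the original code it sends no User-Agent, so results may vary with Wikipedia's caching):

require "open-uri"

# Request each title the same way the original script does (no User-Agent)
# and report whether open-uri raises an HTTPError.
%w[Steve_Jobs Steve_Austin Steve_Rogers Steve_Foo].each do |title|
  begin
    open("http://en.wikipedia.org/wiki/#{title}").read
    puts "#{title} => success"
  rescue OpenURI::HTTPError => e
    puts "#{title} => error (#{e.message})"
  end
end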

Wikipedia does a ton of caching, so if you see responses for "Steve_Jobs" that differ from those for other people who do exist, the best guess is that Wikipedia is caching the Steve Jobs article because he's famous, and potentially adding extra checks/verifications to protect the article from rapid changes, defacement, etc.

The solution for you: always open the URL with a User-Agent string.

rpage = open(remote_full_url, "User-Agent" => "Whatever you want here").read

Details from the MediaWiki docs: "When you make HTTP requests to the MediaWiki web service API, be sure to specify a User-Agent header that properly identifies your client. Don't use the default User-Agent provided by your client library, but make up a custom header that includes the name and the version number of your client: something like 'MyCuteBot/0.1'.

On Wikimedia wikis, if you don't supply a User-Agent header, or you supply an empty or generic one, your request will fail with an HTTP 403 error. See our User-Agent policy."
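
Applied to the script in the question, only the two open-uri calls need to change. A rough sketch (the User-Agent string here is just a placeholder; use something that identifies your script):

require "open-uri"

# Placeholder UA string; replace with something identifying your script.
UA = { "User-Agent" => "MyWikiFetcher/0.1 (your-email@example.com)" }

def fetch(url)
  open(url, UA).read
rescue OpenURI::HTTPError => e
  warn "#{url} failed: #{e.message}"   # e.g. "403 Forbidden"
  nil
end

page = fetch("http://en.wikipedia.org/wiki/Steve_Jobs")
File.open("my_copy_of-Steve_Jobs.html", "w") { |f| f.write(page) } if page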

Evanish answered 10/6, 2012 at 0:56 Comment(2)
Thus, I'm betting your initial testing on the other names was done with a browser, and you're seeing cached results for those. When you hit "Steve_Jobs", it is not cached, and since you were using no UA string, you got the 403. – Rett
I can consistently reproduce this with curl. The Jobs page returns 403 w/o a UA. If a UA is provided, then it returns a normal 200 response. I tried a few other pages and none had this behavior. Weird... – Thusly