404 not found, but can access normally from web browser

I tried many URLs with this code and they all worked fine until I came across this particular one:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html"))
puts doc

This is the result:

/Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:353:in `open_http': 404 Not Found (OpenURI::HTTPError)
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:709:in `buffer_open'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:210:in `block in open_loop'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:208:in `catch'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:208:in `open_loop'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:149:in `open_uri'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:689:in `open'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:34:in `open'
    from test.rb:5:in `<main>'  

I can access this from a web browser, I just don't get it at all.

What is going on, and how can I deal with this kind of error? Can I ignore it and let the rest do their work?

Creigh answered 5/9, 2014 at 18:42 Comment(1)
You're using Ruby 2+ so it's not necessary to use require 'rubygems'. That requirement disappeared back in Ruby 1.9. (comment by Total)

You're getting 404 Not Found (OpenURI::HTTPError), so if you want your code to continue, rescue that exception. Something like this should work:

require 'nokogiri'
require 'open-uri'

URLS = %w[
  http://www.moxyst.com/fashion/men-clothing/underwear.html
]

URLS.each do |url|
  begin
    doc = Nokogiri::HTML(open(url))
  rescue OpenURI::HTTPError => e
    puts "Can't access #{ url }"
    puts e.message
    puts
    next
  end
  puts doc.to_html
end

You can rescue more generic exceptions, but then you risk getting confusing output, or handling an unrelated problem in a way that causes more problems, so you'll need to decide how fine-grained your rescue should be.

You could even inspect the HTTP headers or the status of the response, or look at the exception message, if you want more control and want to do something different for a 401 than for a 404.
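As a rough sketch of the status check mentioned above: OpenURI::HTTPError exposes the failed response through its io method, and io.status returns the code and reason phrase as a pair of strings. The helper name classify_http_error and the symbols it returns are made up for illustration.

```ruby
require 'open-uri'

# Hypothetical helper: map an OpenURI::HTTPError to a symbol the caller
# can branch on. error.io.status is a pair such as ["404", "Not Found"].
def classify_http_error(error)
  code, _reason = error.io.status
  case code
  when '401', '403' then :not_authorized
  when '404'        then :not_found
  when /\A5\d\d\z/  then :server_error
  else                   :other
  end
end

# Possible use inside the URLS.each loop:
#
#   rescue OpenURI::HTTPError => e
#     case classify_http_error(e)
#     when :not_found then puts "Skipping missing page #{url}"
#     else raise
#     end
```

Branching on the status code rather than on e.message avoids depending on the exact wording of the server's reason phrase.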

"I can access this from a web browser, I just don't get it at all."

Well, that could be something happening on the server side: perhaps they don't like the User-Agent string you're sending? The OpenURI documentation shows how to change that header:

Additional header fields can be specified by an optional hash argument.

open("http://www.ruby-lang.org/en/",
  "User-Agent" => "Ruby/#{RUBY_VERSION}",
  "From" => "foo@bar.invalid",
  "Referer" => "http://www.ruby-lang.org/") {|f|
  # ...
}
Total answered 5/9, 2014 at 19:11 Comment(0)

You might need to pass a 'User-Agent' header as a parameter to the open method. Some sites require a valid User-Agent; otherwise they simply don't respond, or they return a 404 Not Found error.

doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html", "User-Agent" => "MyCrawlerName (http://mycrawler-url.com)"))
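A minimal sketch combining the header with the rescue approach. It assumes a newer Ruby than the one in the question: URI.open is available from Ruby 2.5, and from Ruby 3.0 plain open no longer accepts URLs. The fetch_page helper, its opener parameter, and the DEFAULT_USER_AGENT constant are illustrative names, not part of OpenURI.

```ruby
require 'open-uri'

# Placeholder identification string; substitute your own.
DEFAULT_USER_AGENT = 'MyCrawlerName (http://mycrawler-url.com)'

# Hypothetical helper: fetch url with a User-Agent header and return the
# body as a String, or nil if the server answers with an HTTP error.
# opener is injectable so the helper can be exercised without a network.
def fetch_page(url, user_agent: DEFAULT_USER_AGENT, opener: URI.method(:open))
  opener.call(url, { 'User-Agent' => user_agent }) { |io| io.read }
rescue OpenURI::HTTPError => e
  warn "Can't access #{url}: #{e.message}"
  nil
end
```

The returned string can then be handed to Nokogiri::HTML when it is not nil, so a single failing URL does not stop the rest of the run.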
Frontality answered 16/10, 2015 at 9:9 Comment(0)

"So what is going on and how can I deal with this kind of error?"

No clue what's going on, but you can deal with it by catching the error.

begin
  doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html"))
  puts doc
rescue => e
  puts "I failed: #{e}"
end

"Can I just ignore it and let the rest do their work?"

Sure! Maybe? Not sure. We don't know your requirements.

Kirsti answered 5/9, 2014 at 18:50 Comment(1)
But when I try this, I get an error saying that the next is invalid. (comment by Michelmichelangelo)