How to fix "HTTP error fetching URL. Status=500" in Java while crawling?
I am trying to crawl users' ratings of movies from IMDb review pages (my database contains about 600,000 movies). I used jsoup to parse the pages as below (sorry, I didn't paste the whole code here since it is too long):

try {
    // connect to the MySQL database
    ResultSet res = st.executeQuery(
            "SELECT id, title, production_year " +
            "FROM title " +
            "WHERE kind_id = 1 " +
            "LIMIT 0, 100000");
    while (res.next()) {
        .......
        .......
        String baseUrl = "http://www.imdb.com/search/title?release_date=" +
                year + "," + year + "&title=" + movieName +
                "&title_type=feature,short,documentary,unknown";
        Document doc = Jsoup.connect(baseUrl)
                .userAgent("Mozilla")
                .timeout(0).get();
        .....
        .....
        // insert ratings into the database
        ...

I tested it for the first 100, then the first 500, and also the first 2,000 movies in my database, and it worked well. But when I ran it over 100,000 movies I got this error:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500,   URL=http://www.imdb.com/search/title?release_date=1899,1899&title='Columbia'%20Close%20to%20the%20Wind&title_type=feature,short,documentary,unknown
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at imdb.main(imdb.java:47)

I searched a lot for this error and found that it is a server-side error (5xx status class).

Then I decided to add a condition: when the connection fails, retry up to 2 more times; if it still can't connect, don't stop, just move on to the next URL. Since I am new to Java, I searched for similar questions and read these answers on Stack Overflow:

Exceptions while I am extracting data from a Web site

Jsoup error handling when couldn't connect to website

Handling connection errors and JSoup

but when I try "Connection.Response" as they suggest, I get "Connection.Response cannot be resolved to a type".

I would appreciate it if someone could help me; I am just a newbie, and I know it might be simple, but I don't know how to fix it.
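The "retry twice, then skip" plan described above can be factored into a small helper. This is only a sketch: `fetchWithRetries` is a hypothetical name, and the jsoup call in the comments shows one assumed way to wire it into the crawl loop.

```java
import java.util.concurrent.Callable;

public class RetryFetch {

    // Run a task up to maxAttempts times; return null once every attempt
    // has failed, so the crawl loop can log the URL and move on instead
    // of aborting the whole run.
    public static <T> T fetchWithRetries(Callable<T> task, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        return null; // caller should treat null as "skip this URL"
    }

    public static void main(String[] args) {
        // Assumed wiring inside the crawl loop (jsoup elided):
        // Document doc = fetchWithRetries(() ->
        //         Jsoup.connect(baseUrl).userAgent("Mozilla").timeout(10000).get(), 3);
        // if (doc == null) continue; // give up on this movie, go to the next one

        // Simulated flaky fetch: fails twice, succeeds on the third attempt.
        int[] calls = {0};
        String result = fetchWithRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("HTTP 500");
            return "ok";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts"); // ok after 3 attempts
    }
}
```

With maxAttempts = 3 this matches the plan above: one try plus two retries, then skip.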


Well, I could fix the HTTP status 500 error by just adding "ignoreHttpErrors(true)", as below:

org.jsoup.Connection con = Jsoup.connect(baseUrl)
        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21");
con.timeout(180000).ignoreHttpErrors(true).followRedirects(true);
org.jsoup.Connection.Response resp = con.execute();
Document doc = null;

if (resp.statusCode() == 200) {
    doc = resp.parse(); // parse the body already fetched; con.get() would refetch the URL
......

Hope it can help those who have the same error.

However, after crawling the review pages of 22,907 movies (about 12 hours), I got another error:
"Read timed out".

I would appreciate any suggestion to fix this error.
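A "Read timed out" on a run this long usually means some requests take longer than the configured socket timeout. Two things typically help: raising the timeout on the jsoup call, and pausing with a growing delay before retrying. A minimal sketch, assuming a hypothetical `backoffMillis` helper; the jsoup wiring is shown only in comments:

```java
public class TimeoutBackoff {

    // Hypothetical backoff schedule: the base delay doubles after each
    // failed attempt (1s, 2s, 4s, ... for a 1000 ms base).
    public static long backoffMillis(int attempt, long baseMillis) {
        return baseMillis * (1L << (attempt - 1));
    }

    public static void main(String[] args) {
        // Assumed wiring inside the crawl loop (jsoup elided):
        // for (int attempt = 1; attempt <= 3; attempt++) {
        //     try {
        //         doc = Jsoup.connect(baseUrl)
        //                 .userAgent("Mozilla")
        //                 .timeout(60000)   // allow up to a minute per request
        //                 .get();
        //         break;                    // success, stop retrying
        //     } catch (java.net.SocketTimeoutException e) {
        //         Thread.sleep(backoffMillis(attempt, 1000));
        //     }
        // }
        System.out.println(backoffMillis(1, 1000)); // 1000
        System.out.println(backoffMillis(3, 1000)); // 4000
    }
}
```

Pausing between requests also reduces the chance that the server throttles the crawler, which can itself show up as 500s or timeouts.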

Joiner answered 18/2, 2014 at 15:49 Comment(10)
What about org.jsoup.Connection.Response? — Loci
I tried, but I receive this error: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) — Joiner
I tried this and it gets me the output: Connection.Response con = Jsoup.connect("http://www.imdb.com/search/title?release_date=1899,1899&title='Columbia'%20Close%20to%20the%20Wind&title_type=feature,short,documentary,unknown").userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21").timeout(10000).execute(); System.out.println(con.body()); — Loci
@PopoFibo: sorry, the error is this: HTTP error fetching URL. Status=403, URL=imdb.com/search/… at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449) at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424) at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178) at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167) — Joiner
403 means Forbidden; some sites do not allow robots, so you must set .userAgent() while fetching the response. Try copying the code in my comment above and see if any JavaScript- or HTML-like code appears in your console, which would mean the connection was a success — Loci
I tried your code and it worked; it returned HTML, so the connection was successful. But why does it not work in my program? :( — Joiner
Take a look at this post; maybe it helps you. — Gumbotil
For me the problem is at the next line, which is: Document doc = Jsoup.connect(baseUrl).get(); — Joiner
Yes, that is because you haven't broken it down into connection, response, and document parts; let me upgrade this comment to an answer — Loci
@eltado: thanks for your comment, but I read that before, and I imported data from imdb.com/interfaces. However, the IMDb dataset has only the overall rating of each movie, and I needed the ratings of each specific user for each movie. That's why I started to crawl the review pages (where I could access their ratings) — Joiner
Upgrading my comments to an answer:

Connection.Response is org.jsoup.Connection.Response

To build a Document instance only when there is a valid HTTP status code (200), break your call into 3 parts: Connection, Response, Document.

Hence, the relevant part of your code above becomes:

while (res.next()) {
    .......
    .......
    String baseUrl = "http://www.imdb.com/search/title?release_date=" +
            year + "," + year + "&title=" + movieName +
            "&title_type=feature,short,documentary,unknown";
    Connection con = Jsoup.connect(baseUrl)
            .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
            .timeout(10000);
    Connection.Response resp = con.execute();
    Document doc = null;
    if (resp.statusCode() == 200) {
        doc = resp.parse(); // parse the body already fetched by execute()
        ....
    }
Loci answered 18/2, 2014 at 16:59 Comment(4)
@PopoFibo: Thanks a lot for your answer; it helped a lot. Since Connection.Response did not work for me, as you suggested before, I tried this for the first 50 movies in my db and it worked: org.jsoup.Connection con = Jsoup.connect(baseUrl).userAgent("....").timeout(10000); Response resp = con.execute(); Now I am testing it on 10,000 movies to see the results ;) Thanks again for your great help :) — Joiner
Well, unfortunately, I received error status 500 again :( — Joiner
@monamona well, that was the whole idea behind using Connection.Response: to get a handle on the status code, and if it's anything other than 200 (like 500 in your case), not continue with the Document instance and move on to the next one — Loci
@monamona alternatively, try increasing the timeout from 10000 to, say, 60000 (a minute) if that's feasible — Loci
