I am trying to crawl the user's ratings of cinema movies of imdb from the review page: (number of movies in my database is about 600,000). I used jsoup to parse pages as below: (sorry, I didn't write the whole code here since it is too long)
try {
//connecting to mysql db
ResultSet res = st
.executeQuery("SELECT id, title, production_year " +
"FROM title " +
"WHERE kind_id =1 " +
"LIMIT 0 , 100000");
while (res.next()){
.......
.......
String baseUrl = "http://www.imdb.com/search/title?release_date=" +
""+year+","+year+"&title="+movieName+"" +
"&title_type=feature,short,documentary,unknown";
Document doc = Jsoup.connect(baseUrl)
.userAgent("Mozilla")
.timeout(0).get();
.....
.....
//insert ratings into database
...
I tested it for the first 100, then first 500 and also for the first 2000 movies in my db and it worked well. But the problem is that when I tested for 100,000 movies I got this error:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500, URL=http://www.imdb.com/search/title?release_date=1899,1899&title='Columbia'%20Close%20to%20the%20Wind&title_type=feature,short,documentary,unknown
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at imdb.main(imdb.java:47)
I searched a lot for this error and I found it is a server side error with 5xx error number.
Then I decided to set a condition that when connection fails, it tries 2 more times and then if still couldn't connect, does not stop and goes to the next url. since I am new to java I tried to search for similar questions and read these answers in stackoverflow:
Exceptions while I am extracting data from a Web site
Jsoup error handling when couldn't connect to website
Handling connection errors and JSoup
but, when I try with "Connection.Response" as they suggest, it tells me that "Connection.Response cannot be resolved to a type".
I appreciate if someone could help me, since I am just a newbie and I know it might be simple but I don't know how to fix it.
Well, I could fix the http error status 500 by just adding "ignoreHttpError(true)" as below:
org.jsoup.Connection con = Jsoup.connect(baseUrl).userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21");
con.timeout(180000).ignoreHttpErrors(true).followRedirects(true);
Response resp = con.execute();
Document doc = null;
if (resp.statusCode() == 200) {
doc = con.get();
......
hope it can help those have the same error.
however, after crawling review pages of 22907 movies (about 12 hours), I got another error:
"READ TIMED OUT".
I appreciate any suggestion to fix this error.
org.jsoup.Connection.Response
? – LociConnection.Response con = Jsoup .connect( "http://www.imdb.com/search/title?release_date=1899,1899&title='Columbia'%20Close%20to%20the%20Wind&title_type=feature,short,documentary,unknown") .userAgent( "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21") .timeout(10000).execute(); System.out.println(con.body());
– Loci.useragent()
while fetching the response. Try copying the code in my comment above and see if you see any javascript or html like code in your console, which would mean the connection was a success – Loci