R: Check existence of URL, problems with httr::GET() and url.exists()

Asked 15/7, 2015 at 1:52 Answered 21/1, 2019 at 12:36

I have a list of about 13,000 URLs that I want to extract information from; however, not every URL actually exists. In fact, the majority don't. I tried passing all 13,000 URLs through html(), but it takes a long time. I am trying to work out how to check whether the URLs actually exist before passing them to html(). I have tried GET() from httr, as well as url.exists() from RCurl. For some reason url.exists() always returns FALSE, even when the URL does exist, and the way I am using GET() always returns a success; I think this is because the page is being redirected.

The following URLs represent the type of pages I am parsing; the second does not exist:

urls <- data.frame('site' = 1:3, 'urls' = c('https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010', 
                            'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
                            'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'))

urls$urls <- as.character(urls$urls)

For GET(), the problem is that the second URL doesn't actually exist but it is redirected and therefore returns a "success".

 urls$urlExists <- sapply(1:length(urls[,1]), 
                     function(x) ifelse(http_status(GET(urls[x, 'urls']))[[1]] == "success", 1, 0))

For url.exists(), I get three FALSE returned even though the first and third urls do exist.

 urls$urlExists2 <- sapply(1:length(urls[,1]), function(x) url.exists(urls[x, 'urls']))

I checked these two posts (1, 2), but I would prefer not to use a user agent, simply because I am not sure how to find mine or whether it would change for different people using this code on other computers, which would make the code harder for others to pick up and use. Both posts' answers suggest using GET() in httr. It seems that GET() is probably the preferred method, but I would need to figure out how to deal with the redirection issue.

Can anyone suggest a good way in R to test the existence of a URL before passing it to html()? I would also be happy with any other suggested workaround for this issue.

UPDATE:

After looking into the value returned by GET(), I figured out a workaround; see the answers for details.

Sidnee answered 15/7, 2015 at 1:52 Comment(2)

You have a conceptual problem here. With many web servers, if you try to access a page which does not exist, you will still get a page! What you really want to do is to check for a 404 error coming back. – Aiaia 15/7, 2015 at 2:1

Thanks Tim, your comment helped me look into what I was getting back from the GET() function. I think I figured out a work around. I have added it to the bottom of the question. – Sidnee 15/7, 2015 at 2:32
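The point made in the comments, that "existence" is really a question about the HTTP status code, can be sketched in base R as follows. This is only an illustration: classify_status() and its labels are hypothetical names, not part of httr or RCurl, and the three-way split is an assumption about what counts as "existing".

```r
# Hypothetical helper: map HTTP status codes to a coarse answer.
# 2xx -> page served, 3xx -> server redirected us, 4xx/5xx -> missing or error.
classify_status <- function(code) {
  code <- as.integer(code)
  ifelse(code >= 200L & code < 300L, "exists",
  ifelse(code >= 300L & code < 400L, "redirected",
  ifelse(code >= 400L, "missing_or_error", "other")))
}

classify_status(c(200, 302, 404))
# "exists" "redirected" "missing_or_error"
```

With httr, the status code of a response can be read with status_code(); whether a 3xx ever shows up depends on whether redirects are being followed.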

With httr, use url_success() and redirect following turned off:

library(httr)

urls <- c(
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010', 
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'
)

sapply(urls, url_success, config(followlocation = 0L), USE.NAMES = FALSE)

Irish answered 20/7, 2015 at 20:15 Comment(1)

Just leaving a note since I came across the same issue. With the present version (1.2.1), we use http_error instead of url_success. – Practise 4/11, 2016 at 2:1

url_success(x) is deprecated; please use !http_error(x) instead.

So the solution from hadley, updated accordingly, becomes:

library(httr)

urls <- c(
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'
)

!sapply(urls, http_error)
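One caveat worth noting: http_error() only reports TRUE for status codes of 400 and above, so a page that redirects (3xx) is not counted as an error, and when redirects are followed the page at the end of the redirect is what gets checked. A minimal sketch that instead treats only a 2xx answer, with redirects turned off, as "exists" might look like the following. The names is_2xx() and url_exists() are hypothetical, and whether the server in the question answers HEAD requests the same way as GET requests is an assumption.

```r
# Requires the httr package for the network call; is_2xx() itself is base R.

# Hypothetical helper: TRUE only for 2xx status codes.
is_2xx <- function(code) {
  !is.na(code) & code >= 200L & code < 300L
}

# Hypothetical wrapper: send a HEAD request without following redirects
# and report whether the server answered with a 2xx status.
url_exists <- function(url) {
  resp <- tryCatch(
    httr::HEAD(url, httr::config(followlocation = 0L)),
    error = function(e) NULL  # connection failures are treated as "missing"
  )
  if (is.null(resp)) return(FALSE)
  is_2xx(httr::status_code(resp))
}

# Usage (needs network access):
# vapply(urls, url_exists, logical(1), USE.NAMES = FALSE)
```

Using vapply() rather than sapply() in the usage line guarantees a logical vector even if the input is empty.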

Nigritude answered 21/1, 2019 at 12:36 Comment(0)

After a suggestion from @TimBiegeleisen I looked at what was returned from the function GET(). It seems that if the URL exists, GET() returns that same URL in its response, but if the request is redirected, a different URL is returned. I just changed the code to check whether the URL returned by GET() matches the one I submitted.

urls$urlExists <- sapply(1:length(urls[,1]), function(x) ifelse(GET(urls[x, 'urls'])[[1]] == urls[x,'urls'], 1, 0))
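The same comparison can be written as a small vectorised helper. This is a sketch only: not_redirected() and fake_fetch() are hypothetical names, the example.org URLs are made up, and the default fetch function assumes that the url element of an httr response holds the final URL after redirects. An exact string comparison may also report a mismatch if the URL comes back in a slightly normalised form.

```r
# Hypothetical helper. `fetch` maps one URL to the URL the server finally
# served; the default uses httr::GET(), which follows redirects.
not_redirected <- function(urls, fetch = function(u) httr::GET(u)$url) {
  final <- vapply(urls, fetch, character(1), USE.NAMES = FALSE)
  final == urls
}

# Offline illustration with a stand-in fetch function (made-up URLs):
fake_fetch <- function(u) if (grepl("missing", u)) "https://example.org/home" else u
not_redirected(c("https://example.org/a", "https://example.org/missing"), fake_fetch)
# TRUE FALSE
```

Passing the fetch step in as an argument keeps the comparison logic checkable without network access; in real use the default would be left in place.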

I would be interested in learning about any better methods that people use for the same thing.

Sidnee answered 15/7, 2015 at 2:44 Comment(0)
