I have a list of about 13,000 URLs that I want to extract information from; however, not every URL actually exists. In fact, the majority don't. I tried passing all 13,000 URLs through html(), but it takes a long time. I am trying to work out how to check whether the URLs actually exist before passing them to html(). I have tried the GET() function from httr, as well as the url.exists() function from RCurl. For some reason url.exists() always returns FALSE, even when the URL does exist, and the way I am using GET() always returns a success. I think this is because the page is being redirected.
The following URLs represent the type of pages I am parsing; the second does not exist:
urls <- data.frame('site' = 1:3, 'urls' = c('https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'))
urls$urls <- as.character(urls$urls)
For GET(), the problem is that the second URL doesn't actually exist, but it is redirected and therefore returns a "success":
urls$urlExists <- sapply(1:length(urls[,1]),
function(x) ifelse(http_status(GET(urls[x, 'urls']))[[1]] == "success", 1, 0))
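One way to tell a redirected "success" from a real one is to compare the URL that was requested with the URL the response finally came from. The following is a minimal sketch, not a definitive implementation: the helper names landed_on_requested() and url_exists_no_redirect() are hypothetical, and it assumes the site redirects missing pages to a different address, and that resp$url (the final URL stored in an httr response) comes back in the same form as the URL that was requested.

```r
# Decide whether a request "landed" on the page that was asked for.
# Pure helper: takes plain values, so it can be checked without network access.
landed_on_requested <- function(requested, final_url, status_code) {
  status_ok <- status_code >= 200 & status_code < 300
  # Ignore trailing slashes when comparing the two addresses.
  same_page <- sub("/+$", "", requested) == sub("/+$", "", final_url)
  status_ok & same_page
}

# Hypothetical wrapper around httr::GET(); needs the httr package and
# network access. resp$url holds the URL after any redirects were followed.
url_exists_no_redirect <- function(url) {
  resp <- tryCatch(httr::GET(url), error = function(e) NULL)
  if (is.null(resp)) return(FALSE)
  landed_on_requested(url, resp$url, httr::status_code(resp))
}
```

If the server rewrites the query string on valid pages as well, the plain string comparison would be too strict and would need loosening.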
For url.exists(), I get three FALSE values returned, even though the first and third URLs do exist:
urls$urlExists2 <- sapply(1:length(urls[,1]), function(x) url.exists(urls[x, 'urls']))
I checked these two posts (1, 2), but I would prefer not to use a user agent, simply because I am not sure how to find mine or whether it would change for different people using this code on other computers, which would make the code harder for others to pick up and use. Both posts' answers suggest using GET() from httr. It seems that GET() is probably the preferred method, but I would need to figure out how to deal with the redirection issue.
Can anyone suggest a good way in R to test the existence of a URL before passing it to html()? I would also be happy with any other suggested workaround for this issue.
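As a rough illustration of filtering before parsing, the sketch below sends a HEAD request with redirect-following switched off, so a redirected page shows up as a 3xx status instead of a success. The helper names classify_status() and head_status() are hypothetical, the 10-second timeout is an arbitrary choice, and it assumes the server answers HEAD requests the same way it answers GET, which not every server does.

```r
# Map HTTP status codes to a coarse label (vectorised over `code`).
classify_status <- function(code) {
  ifelse(is.na(code), "error",
  ifelse(code >= 200 & code < 300, "ok",
  ifelse(code >= 300 & code < 400, "redirect", "missing")))
}

# Hypothetical helper: HEAD request that does not follow redirects.
# Needs the httr package and network access.
# Returns NA when the request itself fails (timeout, DNS error, ...).
head_status <- function(url, seconds = 10) {
  tryCatch(
    as.integer(httr::status_code(
      httr::HEAD(url, httr::config(followlocation = 0L), httr::timeout(seconds))
    )),
    error = function(e) NA_integer_
  )
}

# Usage sketch, with `urls` as defined in the question:
# urls$status <- classify_status(vapply(urls$urls, head_status, integer(1)))
# to_parse <- urls$urls[urls$status == "ok"]
```

Only the URLs labelled "ok" would then be passed on to html(), which avoids downloading and parsing the body of pages that turn out to be redirects or errors.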
UPDATE:
After looking into the value returned by GET(), I figured out a workaround; see the answers for details.