Check if URL exists in R
I want to loop over a list of URLs and I want to find out if these URLs exist or not.

RCurl provides the url.exists() function. However, its output doesn't seem to be right: for example, it says that amazon.com is not registered. It does so because url.exists() only treats a URL as existing when the response status code is in the 200 range, and for amazon.com the response is 405 ("Method Not Allowed").

I also tried HEAD() and GET() from the httr package, but sometimes I get error messages there, for example on timeouts or when the URL is not registered.

Error messages look like this:

Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: Connection timed out after 10000 milliseconds

Error in curl::curl_fetch_memory(url, handle = handle) : Could not resolve host: afsadadssadasf.com

When I get such an error, the whole for loop stops. Is it possible to keep the loop going? I tried tryCatch(), but as far as I understand it, that only helps when the problem is in the dataframe itself.
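The pattern I mean would look roughly like this (a sketch assuming httr; afsadadssadasf.com is the nonexistent host from the error message above). Wrapping each request in tryCatch() should turn a per-URL failure into NA instead of stopping the loop:

```r
library(httr)

urls <- c("https://www.amazon.com", "http://afsadadssadasf.com")

# tryCatch() around each request: a timeout or DNS failure yields NA
# for that URL and the loop simply moves on to the next one
status <- sapply(urls, function(u) {
  tryCatch(
    status_code(HEAD(u, timeout(10))),
    error = function(e) NA_integer_
  )
})
```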

Winch answered 21/10, 2018 at 3:9 Comment(0)

pingr::ping() only uses ICMP which is blocked on sane organizational networks since attackers used ICMP as a way to exfiltrate data and communicate with command-and-control servers.

pingr::ping_port() doesn't use the HTTP Host: header so the IP address may be responding but the target virtual web host may not be running on it and it definitely doesn't validate that the path exists at the target URL.

You should clarify what you want to happen when the server responds, but only with HTTP status codes outside the 200-299 range. The following makes an assumption.

NOTE: You used Amazon as an example and I'm hoping that's the first site that just "came to mind" since it's unethical and a crime to scrape Amazon and I would appreciate my code not being brought into your universe if you are in fact just a brazen content thief. If you are stealing content, it's unlikely you'd be up front here about that, but on the outside chance you are both stealing and have a conscience, please let me know so I can delete this answer so at least other content thieves can't use it.

Here's a self-contained function for checking URLs:

#' @param x a single URL
#' @param non_2xx_return_value what to do if the site exists but the
#'        HTTP status code is not in the `2xx` range. Default is to return `FALSE`.
#' @param quiet if not `FALSE`, then every time the `non_2xx_return_value` condition
#'        arises a warning message will be displayed. Default is `FALSE`.
#' @param ... other params (`timeout()` would be a good one) passed directly
#'        to `httr::HEAD()` and/or `httr::GET()`
url_exists <- function(x, non_2xx_return_value = FALSE, quiet = FALSE,...) {

  suppressPackageStartupMessages({
    require("httr", quietly = FALSE, warn.conflicts = FALSE)
  })

  # you don't need these two functions if you're already using `purrr`,
  # but `purrr` is a heavyweight compiled package that introduces
  # many other "tidyverse" dependencies and this doesn't.

  capture_error <- function(code, otherwise = NULL, quiet = TRUE) {
    tryCatch(
      list(result = code, error = NULL),
      error = function(e) {
        if (!quiet)
          message("Error: ", e$message)

        list(result = otherwise, error = e)
      },
      interrupt = function(e) {
        stop("Terminated by user", call. = FALSE)
      }
    )
  }

  safely <- function(.f, otherwise = NULL, quiet = TRUE) {
    function(...) capture_error(.f(...), otherwise, quiet)
  }

  sHEAD <- safely(httr::HEAD)
  sGET <- safely(httr::GET)

  # Try HEAD first since it's lightweight
  res <- sHEAD(x, ...)

  if (is.null(res$result) || 
      ((httr::status_code(res$result) %/% 200) != 1)) {

    res <- sGET(x, ...)

    if (is.null(res$result)) return(NA) # or whatever you want to return on "hard" errors

    if (((httr::status_code(res$result) %/% 200) != 1)) {
      if (!quiet) warning(sprintf("Requests for [%s] responded but without an HTTP status code in the 200-299 range", x))
      return(non_2xx_return_value)
    }

    return(TRUE)

  } else {
    return(TRUE)
  }

}

Give it a go:

c(
  "http://content.thief/",
  "http://rud.is/this/path/does/not_exist",
  "https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=content+theft", 
  "https://www.google.com/search?num=100&source=hp&ei=xGzMW5TZK6G8ggegv5_QAw&q=don%27t+be+a+content+thief&btnK=Google+Search&oq=don%27t+be+a+content+thief&gs_l=psy-ab.3...934.6243..7114...2.0..0.134.2747.26j6....2..0....1..gws-wiz.....0..0j35i39j0i131j0i20i264j0i131i20i264j0i22i30j0i22i10i30j33i22i29i30j33i160.mY7wCTYy-v0", 
  "https://rud.is/b/2018/10/10/geojson-version-of-cbc-quebec-ridings-hex-cartograms-with-example-usage-in-r/"
) -> some_urls

data.frame(
  exists = sapply(some_urls, url_exists, USE.NAMES = FALSE),
  some_urls,
  stringsAsFactors = FALSE
) %>% dplyr::as_tibble() %>% print() # `%>%` needs magrittr or dplyr attached
##  A tibble: 5 x 2
##   exists some_urls                                                                           
##   <lgl>  <chr>                                                                               
## 1 NA     http://content.thief/                                                               
## 2 FALSE  http://rud.is/this/path/does/not_exist                                              
## 3 TRUE   https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=con…
## 4 TRUE   https://www.google.com/search?num=100&source=hp&ei=xGzMW5TZK6G8ggegv5_QAw&q=don%27t…
## 5 TRUE   https://rud.is/b/2018/10/10/geojson-version-of-cbc-quebec-ridings-hex-cartograms-wi…
## Warning message:
## In FUN(X[[i]], ...) :
##   Requests for [http://rud.is/this/path/does/not_exist] responded but without an HTTP status code in the 200-299 range
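Since extra arguments fall through `...` to `httr::HEAD()`/`httr::GET()`, per-request options such as `httr::timeout()` can be forwarded. A sketch (note that the earlier formals have to be named, otherwise an unnamed `timeout()` would be matched positionally to `non_2xx_return_value`):

```r
# forward a 5-second timeout through `...` to HEAD()/GET();
# naming the earlier arguments routes the unnamed timeout() into `...`
url_exists("https://rud.is/", non_2xx_return_value = NA, quiet = TRUE, httr::timeout(5))

# the same works through sapply(), since it calls FUN(X[[i]], ...)
sapply(some_urls, url_exists, non_2xx_return_value = NA, quiet = TRUE,
       httr::timeout(5), USE.NAMES = FALSE)
```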
Wiencke answered 21/10, 2018 at 12:20 Comment(3)
Thank you very much for your detailed explanation. I'm not a content thief and actually I didn't come up with amazon.com first. Another user just used amazon.com to demonstrate the ping()-function. I do this just for a little university project and scraping is not my intention. However, your provided code helps me a lot as a beginner in R! Unfortunately, I still get an error when a URL does not exist: Error in UseMethod("status_code") : no applicable method for 'status_code' applied to an object of class "NULL" For some reason, the code still breaks before the if (is.null(res)) statementWinch
Errant logical operator fixed and additional functionality added despite the questionable scraping claims.Wiencke
I'm sorry, of course you are right. I did mention it at first. Sorry, it was late at night and today I just looked at the formatted text to find an amazon url in my initial post. It's 100% my bad. If I were a content thief who wanted to scrape amazon, I probably wouldn't point to that particular website. However, thank you again for your support, even though you see a potential content thief in me.Winch

Here is a simple solution to the problem.

urls <-   c("http://www.amazon.com",
            "http://this.isafakelink.biz",
            "https://stackoverflow.com")

valid_url <- function(url_in, t = 2){
  con <- url(url_in)
  # open.connection() returns NULL invisibly on success, so `check` stays NULL
  # when the URL opens; on failure, try() returns a "try-error" whose first
  # element is the error message string
  check <- suppressWarnings(try(open.connection(con, open = "rt", timeout = t), silent = TRUE)[1])
  suppressWarnings(try(close.connection(con), silent = TRUE))
  is.null(check)
}

sapply(urls,valid_url)
Ben answered 11/3, 2020 at 1:8 Comment(1)
Using the HEAD method might be a faster way to check a siteSheaff
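As the comment suggests, a headers-only check can even be done in base R with utils::curlGetHeaders(), which fetches only the response headers and attaches the final HTTP status as a "status" attribute. A sketch (wrapped in tryCatch() so unresolvable hosts return NA rather than erroring):

```r
head_status <- function(u) {
  tryCatch(
    attr(curlGetHeaders(u), "status"), # integer status of the final response
    error = function(e) NA_integer_    # DNS failure etc. becomes NA
  )
}

head_status("https://stackoverflow.com") # 200 if the site is reachable
```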

Try the ping function in the pingr package. It gives the timings of pings.

library(pingr)

ping("amazon.com") # good site
## [1] 45 46 45

ping("xxxyyyzzz.com") # bad site
## [1] NA NA NA
Concurrence answered 21/10, 2018 at 3:24 Comment(2)
Thanks, but using ping() I get only NAs for any site. Even for ping("amazon.com"). Following the pingr documentation, this one works fine: ping_port("www.google.com", port = 80, count = 1). I will try again later with ping()Winch
ICMP is likely (rightfully so) blocked on your network.Wiencke

Here's a function that evaluates an expression and returns TRUE if it works and FALSE if it doesn't. You can also assign variables inside the expression.

try_catch <- function(exprs) {!inherits(try(eval(exprs)), "try-error")}

try_catch(out <- log("a")) # returns FALSE
out # Error: object 'out' not found

try_catch(out <- log(1)) # returns TRUE
out # out = 0

You can use the expression to check for success.

done <- try_catch({
    # try something
})
if(!done) {
    done <- try_catch({
        # try something else
    })
}
if(!done) {
    # default expression
}
Margarite answered 14/8, 2021 at 16:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.