Handling error response to empty webpage from read_html
I'm trying to scrape a web page title but am running into a problem with a website called "tweg.com":

library(httr)
library(rvest)
page.url <- "tweg.com"
page.get <- GET(page.url) # from httr
pg <- read_html(page.get) # from rvest
page.title <- html_nodes(pg, "title") %>% 
  html_text() # from rvest

read_html stops with the error message "Error: Failed to parse text". Inspecting page.get$content, I find that it is empty (raw(0)).

Certainly, I can write a simple check to take this into account and avoid calling read_html at all. However, I feel that a more elegant solution would be to get something back from read_html and, based on that, return an empty page title (i.e., ""). I tried passing "options" to read_html, such as RECOVER, NOERROR and NOBLANKS, but with no success. Any ideas on how to get an "empty page" response back from read_html?
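For reference, the "simple check" mentioned above might look like the sketch below (get_page_title is a hypothetical wrapper name; it inspects the raw response body before handing it to read_html):

```r
library(httr)
library(rvest)

# Hypothetical wrapper: return "" for an empty response body instead of
# letting read_html() fail with "Failed to parse text".
get_page_title <- function(url) {
  resp <- GET(url)                   # from httr
  if (length(resp$content) == 0) {   # body is raw(0): nothing to parse
    return("")
  }
  read_html(resp) %>%
    html_nodes("title") %>%
    html_text()
}
```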

Banquet answered 12/12, 2016 at 4:19 Comment(4)
You could use tryCatch: tryCatch(read_html('http://tweg.com'), error = function(e){'empty page'}) or its tidyverse (purrr) versions, possibly and safely. – Butyl
tryCatch does solve the issue, but it also opens a can of worms: what if a different error is returned and it gets caught as "empty page"? I will look into purrr to see whether it offers a more comprehensive solution. – Banquet
You can store the error with something like tryCatch(read_html('http://tweg.com'), error = function(e){list(result = 'empty page', error = e)}), which returns the same thing as safely. – Butyl
Yep, this is a good solution! Thank you! Do you want to add it as an answer? – Banquet

You can use tryCatch to catch errors and return something in particular (just try(read_html('http://tweg.com'), silent = TRUE) will work if you only want to capture the error and continue). You'll need to pass tryCatch a handler function specifying what to return when an error is caught, which you can structure however you like.

library(rvest)

tryCatch(read_html('http://tweg.com'), 
         error = function(e){'empty page'})    # just return "empty page"
#> [1] "empty page"

tryCatch(read_html('http://tweg.com'), 
         error = function(e){list(result = 'empty page', 
                                  error = e)})    # return error too
#> $result
#> [1] "empty page"
#> 
#> $error
#> <Rcpp::exception in eval(substitute(expr), envir, enclos): Failed to parse text>
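For completeness, the try() variant from the parenthetical above looks like this, with inherits() used to detect the failure (a sketch; the fallback value is arbitrary):

```r
library(rvest)

# try() returns the value on success, or an object of class "try-error"
# on failure, so downstream code can branch on inherits().
res <- try(read_html('http://tweg.com'), silent = TRUE)
if (inherits(res, "try-error")) {
  res <- "empty page"
}
```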

The purrr package also contains two functions possibly and safely that do the same thing, but accept more flexible function definitions. Note that they are adverbs, and thus return a function that still must be called, which is why the URL is in parentheses after the call.

library(purrr)

possibly(read_html, 'empty page')('http://tweg.com')
#> [1] "empty page"

safely(read_html, 'empty page')('http://tweg.com')
#> $result
#> [1] "empty page"
#> 
#> $error
#> <Rcpp::exception in eval(substitute(expr), envir, enclos): Failed to parse text>

A typical usage would be to map the resulting function across a vector of URLs:

c('http://tweg.com', 'http://wikipedia.org') %>% 
    map(safely(read_html, 'empty page'))
#> [[1]]
#> [[1]]$result
#> [1] "empty page"
#> 
#> [[1]]$error
#> <Rcpp::exception in eval(substitute(expr), envir, enclos): Failed to parse text>
#> 
#> 
#> [[2]]
#> [[2]]$result
#> {xml_document}
#> <html lang="mul" dir="ltr" class="no-js">
#> [1] <head>\n  <meta charset="utf-8"/>\n  <title>Wikipedia</title>\n  <me ...
#> [2] <body id="www-wikipedia-org">\n<h1 class="central-textlogo" style="f ...
#> 
#> [[2]]$error
#> NULL
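Building on the call above, the $result elements can then be pulled out with map()'s name-extraction shorthand (a sketch; extracting titles would still require a further html_nodes()/html_text() step on each successful page):

```r
library(purrr)
library(rvest)

pages <- c('http://tweg.com', 'http://wikipedia.org') %>%
    map(safely(read_html, 'empty page'))

# Extract just the $result element of each result/error pair; failed
# URLs keep the 'empty page' default supplied to safely().
map(pages, 'result')
```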
Butyl answered 13/12, 2016 at 0:55 Comment(0)
