My question is similar to this one, but the latter did not receive an answer I can work with. I am scraping thousands of URLs with xml2::read_html. This works fine. But when I try to parse the resulting HTML documents using purrr::map_df and html_nodes, I get the following error:
Error in doc_namespaces(doc) : external pointer is not valid
For some reason, I am unable to reproduce the error with a small example. The example below is not a good one, because it works fine. But if someone could explain to me conceptually what the error means and how to solve it, that would be great (there is a GitHub thread on a similar problem, but I don't follow all the technicalities).
library(rvest)
library(purrr)
library(tibble)

urls_test <- list("https://en.wikipedia.org/wiki/FC_Barcelona",
                  "https://en.wikipedia.org/wiki/Rome")

# Download and parse each page, with a polite random pause between requests
h <- urls_test %>% map(~{
  Sys.sleep(sample(seq(1, 3, by = 0.001), 1))
  read_html(.x)
})

# Extract the page title and the table-of-contents entries from each document
out <- h %>% map_df(~{
  a <- html_nodes(.x, "#firstHeading") %>% html_text()
  a <- if (length(a) == 0) NA else a
  b <- html_nodes(.x, ".toctext") %>% html_text()
  b <- if (length(b) == 0) NA else b
  tibble(a, b)
})
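I can, however, trigger the same message deliberately. My understanding (an assumption on my part, based on the GitHub thread) is that an xml_document is only a thin R handle around a C-level pointer, and that pointer does not survive serialization, e.g. a saveRDS()/readRDS() round trip:

```r
library(xml2)

# Parse a small document entirely in memory
doc <- read_html("<html><body><h1 id='firstHeading'>Rome</h1></body></html>")

# Round-trip the object through a file: the R object is restored,
# but the C-level pointer it wraps is not
tmp <- tempfile(fileext = ".rds")
saveRDS(doc, tmp)
doc2 <- readRDS(tmp)

inherits(doc2, "xml_document")   # still TRUE
# Any actual query on doc2 now fails:
# xml_find_all(doc2, "//h1")
# Error in doc_namespaces(doc) : external pointer is not valid
```

So anything that invalidates the pointer between read_html() and html_nodes() (saving/loading, a crashed or restarted session, parallel workers) would presumably produce this error, even though my minimal example above never does any of those things.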
Session info:
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Devuan GNU/Linux ascii
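For what it's worth, the workaround I am experimenting with (untested at scale, and possibly not the right fix) is to store each downloaded page as a plain character string with as.character() and re-parse it with read_html() just before extracting nodes, so that no document object has to outlive the step that uses it:

```r
library(rvest)

# Stand-in for a downloaded page (in my real code this comes from read_html(url))
doc <- read_html("<html><body><h1 id='firstHeading'>Rome</h1></body></html>")

# Store the source text instead of the document object; a character
# vector survives saveRDS(), session restarts, parallel workers, etc.
src <- as.character(doc)

# Re-parse on demand, immediately before extracting nodes
doc2 <- read_html(src)
title <- html_nodes(doc2, "#firstHeading") %>% html_text()
title
# [1] "Rome"
```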