r rvest error: "Error in doc_namespaces(doc) : external pointer is not valid"

My question is similar to this one, but that question did not receive an answer I can work with. I am scraping thousands of URLs with xml2::read_html. This works fine. But when I try to parse the resulting HTML documents using purrr::map_df and html_nodes, I get the following error:

Error in doc_namespaces(doc) : external pointer is not valid

For some reason, I am unable to reproduce the error with a small example. The example below is not a good one, because it works totally fine. But if someone could explain to me conceptually what the error means and how to solve it, that would be great (here is a GitHub thread on a similar problem, but I don't follow all the technicalities).

library(rvest)
library(purrr)
library(tibble)  # provides tibble(); not attached by rvest or purrr
urls_test <- list("https://en.wikipedia.org/wiki/FC_Barcelona",
                  "https://en.wikipedia.org/wiki/Rome")
# Download and parse each page, pausing 1 to 3 seconds between requests
h <- urls_test %>% map(~{
  Sys.sleep(sample(seq(1, 3, by=0.001), 1))
  read_html(.x)})
# Extract the heading and the table-of-contents entries from each parsed page
out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  a <- if (length(a) == 0) NA else a
  b <- html_nodes(., ".toctext") %>% html_text()
  b <- if (length(b) == 0) NA else b

  tibble(a, b)
})

Session info:

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Devuan GNU/Linux ascii
Tailrace answered 22/5, 2019 at 16:59 Comment(2)
Seems to happen when read_html() is working from a saved environment. I solved it by reading the data fresh. community.rstudio.com/t/… – Lothians
@Tailrace I faced the same problem and found a (not perfect, but working) solution. #61031825 – North

The problem is that xml2 keeps the parsed document in memory and the R object only holds external pointers to it. These external pointers are not stored in .rds files (or in a saved workspace). So once you save the project and reopen it, the pointers no longer refer to anything and you get the error external pointer is not valid.
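The failure can probably be reproduced without any scraping. The following is a minimal sketch, assuming the xml2 package is installed; the inline HTML string and the variable names are made up for illustration. It saves a parsed document with saveRDS(), reads it back, and then tries to query it, which is expected to fail because the restored object no longer points at a live document.

```r
library(xml2)

# Parse a tiny in-memory page (a stand-in for a scraped URL).
doc <- read_html("<html><body><h1 id='firstHeading'>Rome</h1></body></html>")

# Round-trip the parsed object through an .rds file, as happens
# implicitly when a workspace or project is saved and reopened.
rds_path <- tempfile(fileext = ".rds")
saveRDS(doc, rds_path)
restored <- readRDS(rds_path)

# Querying the restored object should raise an error, because its
# external pointers were not preserved by saveRDS().
result <- tryCatch(
  xml_text(xml_find_first(restored, "//h1")),
  error = function(e) e
)

if (inherits(result, "error")) {
  message("Query failed: ", conditionMessage(result))
}
```

The original `doc` remains usable within the same session; only the copy that went through the file is broken.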

Workaround: use xml2::write_html() to save the parsed HTML to an .html file. If you want to use it later, just read it back in with xml2::read_html().
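A minimal sketch of this workaround, assuming xml2 is installed; the inline HTML and the temporary file paths are illustrative only. The second variant, which stores the markup as a character string in an .rds file, is not part of the answer above but an alternative that relies on the same idea: persist the text, not the pointer.

```r
library(xml2)

# Parse a tiny in-memory page (a stand-in for a scraped URL).
doc <- read_html("<html><body><h1 id='firstHeading'>Rome</h1></body></html>")

# Variant 1: write the parsed document to an .html file and
# re-parse that file whenever it is needed again.
html_path <- tempfile(fileext = ".html")
write_html(doc, html_path)
doc_from_file <- read_html(html_path)

# Variant 2 (alternative, not from the answer): keep the markup as
# a plain character string, which saveRDS() can store safely.
rds_path <- tempfile(fileext = ".rds")
saveRDS(as.character(doc), rds_path)
doc_from_rds <- read_html(readRDS(rds_path))

# Both re-parsed documents can be queried as usual.
heading_file <- xml_text(xml_find_first(doc_from_file, "//h1"))
heading_rds <- xml_text(xml_find_first(doc_from_rds, "//h1"))
```

For thousands of pages, the same pattern would apply per URL: write each page to its own file right after downloading it, then parse from the files in a later session.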

See also more information here, and for parallel processing here and in "Parallel processing XML nodes with R".

Jape answered 3/11, 2021 at 11:38 Comment(0)
