Is there a simple way in R to extract only the text elements of an HTML page?

I think this is known as 'screen scraping', but I have no experience with it; I just need a simple way of extracting the text you'd normally see in a browser when visiting a URL.

Candicecandid answered 7/7, 2010 at 14:4 Comment(6)
Duplicate: #1845329Viyella
@Viyella -- The answer given on that page doesn't seem to work (at least not anymore, though I'm sure it did at the time).Candicecandid
Then we should fix it, not start a new one. Or else ask a question directly related to how that old answer no longer works.Viyella
@Shane: I didn't see that original question when I posted mine. I notice you are the same person who answered that question; please know I meant no disrespect, and all help is of course appreciated. I think the answer below by Tony is better for what I would like to do. I am new to Stack Overflow and still getting the hang of it. :)Candicecandid
No worries. Tony's answer is great. Just want to be sure that as you learn SO, that searching before posting becomes part of the routine. And in retrospect, these questions are a little different... :)Viyella
#31424431 Please help with this question.Walleye

I had to do this once upon a time myself.

One way of doing it is to make use of XPath expressions. You will need these packages installed from the repository at http://www.omegahat.org/

library(RCurl)
library(RTidyHTML)
library(XML)

We use RCurl to connect to the website of interest. It has lots of options which, it's fair to say, allow you to access websites that the default functions in base R would have difficulty with. It is an R-interface to the libcurl library.
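
For example (a sketch, not part of the original recipe; the specific options shown are just illustrative), libcurl options such as redirect-following or a custom user agent can be passed straight to getURL():

library(RCurl)

# illustrative libcurl options: follow redirects and send a custom user-agent string
doc.raw <- getURL("http://stackoverflow.com/questions/tagged?tagnames=r",
                  followlocation = TRUE,
                  useragent = "R (RCurl)")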

We use RTidyHTML to clean up malformed HTML web pages so that they are easier to parse. It is an R-interface to the libtidy library.

We use XML to parse the HTML code with our XPath expressions. It is an R-interface to the libxml2 library.

Anyway, here's what you do (minimal code, but options are available; see the help pages of the corresponding functions):

u <- "http://stackoverflow.com/questions/tagged?tagnames=r" 
doc.raw <- getURL(u)
doc <- tidyHTML(doc.raw)
html <- htmlTreeParse(doc, useInternal = TRUE)
txt <- xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
cat(unlist(txt))

There may be some problems with this approach, but I can't remember what they are off the top of my head. I don't think my XPath expression works with all web pages: sometimes it might not filter out script code, or it may just not work on some pages at all, so it's best to experiment.
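
As a rough sketch of that experimenting (not from the original answer; get_page_text is a hypothetical helper), you can wrap the whole pipeline in tryCatch() so a page that fails to download or parse returns NA instead of stopping a loop over many URLs:

# hypothetical helper: returns NA_character_ if anything in the pipeline fails
get_page_text <- function(u) {
  tryCatch({
    doc  <- tidyHTML(getURL(u))
    html <- htmlTreeParse(doc, useInternalNodes = TRUE)
    txt  <- xpathApply(html,
                       "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
                       xmlValue)
    paste(unlist(txt), collapse = " ")
  }, error = function(e) NA_character_)
}

# e.g. sapply(some.urls, get_page_text), where some.urls is a character vector of URLs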

P.S. Another way, which I think works almost perfectly for scraping all the text from HTML, is the following (basically getting Internet Explorer to do the conversion for you):

library(RDCOMClient)
u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
ie <- COMCreate("InternetExplorer.Application")          # start an IE instance via COM
ie$Navigate(u)                                           # load the page
txt <- list()
txt[[u]] <- ie[["document"]][["body"]][["innerText"]]    # grab the rendered text
ie$Quit()
print(txt)

HOWEVER, I've never liked doing this. Not only is it slow, but if you vectorise it over a vector of URLs and Internet Explorer crashes on a bad page, then R might hang or crash itself (I don't think ?try helps much in this case). It's also prone to letting pop-ups through. It's been a while since I've done this, but I thought I should point it out.
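
If you do go down this road anyway, one way to limit the damage (a sketch only, not from the original answer; it assumes the COM Busy property is readable as ie[["Busy"]], and scrape_with_ie is a hypothetical helper) is to start a fresh IE instance per URL and wrap each attempt in tryCatch():

library(RDCOMClient)

# hypothetical helper: one IE instance per URL, always closed, errors become NA
scrape_with_ie <- function(u) {
  tryCatch({
    ie <- COMCreate("InternetExplorer.Application")
    on.exit(try(ie$Quit(), silent = TRUE))          # make sure IE gets closed
    ie$Navigate(u)
    while (isTRUE(ie[["Busy"]])) Sys.sleep(0.5)     # assumption: Busy is exposed this way
    ie[["document"]][["body"]][["innerText"]]
  }, error = function(e) NA_character_)
}

# e.g. lapply(some.urls, scrape_with_ie)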

Wreckfish answered 7/7, 2010 at 14:4 Comment(1)
Great answer, though I'm having problems installing RTidyHTML; I've tried install.packages('http://www.omegahat.net/RTidyHTML/RTidyHTML_0.2-1.tar.gz', repos=NULL) and install_github('omegahat/RTidyHTML') but compilation fails on Windows 10.Liatrice

The best solution is the htm2txt package:

library(htm2txt)
url <- 'https://en.wikipedia.org/wiki/Alan_Turing'
text <- gettxt(url)

For details, see https://CRAN.R-project.org/package=htm2txt.
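
If you already have the HTML in a character string (for example, downloaded earlier), the package also provides htm2txt() to strip the tags without fetching anything; a minimal sketch, assuming the function behaves as described in the CRAN documentation:

library(htm2txt)

# convert an HTML string (rather than a URL) to plain text
html_string <- "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
htm2txt(html_string)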

Renewal answered 3/9, 2018 at 2:43 Comment(0)

Well, it's not exactly an R way of doing it, but it's as simple as they come: the OutWit plugin for Firefox. The basic version is free and helps to extract tables and other page elements.

Ah, and if you really want to do it the hard way in R, this link is for you:

Veta answered 7/7, 2010 at 14:21 Comment(0)

I've had good luck with the readHTMLTable() function of the XML package. It returns a list of all tables on the page.

library(XML)
url <- 'http://en.wikipedia.org/wiki/World_population'
allTables <- readHTMLTable(url)

There can be many tables on each page.

length(allTables)
# [1] 17

So just select the one you want.

tbl <- allTables[[3]]

The biggest hassle can be installing the XML package. It's big, and it needs the libxml2 library (and, under Linux, the development headers that provide xml2-config, e.g. the libxml2-dev package on Debian). The second biggest hassle is that HTML tables often contain junk you don't want alongside the data you do want.
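
For instance, Wikipedia tables often carry footnote markers and thousands separators; a rough cleanup sketch (the Population column name is hypothetical) might look like:

# strip bracketed footnotes such as "[1]" and commas, then convert to numeric
clean_numeric <- function(x) {
  as.numeric(gsub(",", "", gsub("\\[[^]]*\\]", "", x)))
}
# e.g. tbl$Population <- clean_numeric(tbl$Population)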

Tergum answered 8/7, 2010 at 13:31 Comment(0)

You can also use the rvest package: first select all HTML nodes/tags containing text (e.g. p, h1, h2, h3) and then extract the text from those:

library(rvest)
url <- 'https://en.wikipedia.org/wiki/Alan_Turing'
site <- read_html(url)
text <- html_text(html_nodes(site, 'p,h1,h2,h3'))  # comma-separated CSS selector
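
Note that in rvest 1.0.0 and later, html_elements() and html_text2() are the current equivalents, and html_text2() does a better job of reproducing the line breaks you would see in a browser. A small sketch along those lines:

library(rvest)

site <- read_html('https://en.wikipedia.org/wiki/Alan_Turing')
# html_text2() collapses whitespace the way a browser renders it
text <- html_text2(html_elements(site, 'p, h1, h2, h3'))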
Hinda answered 5/5, 2020 at 8:37 Comment(0)

Here is another approach that can be used:

library(pagedown)
library(pdftools)
# print the rendered page to PDF with headless Chrome, then extract the text
chrome_print(input = "http://stackoverflow.com/questions/tagged?tagnames=r", 
             output = "C:/.../test.pdf")
text <- pdf_text("C:/.../test.pdf")
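
pdf_text() returns one character string per PDF page, so if you want a single block of text you can collapse it, e.g.:

all_text <- paste(text, collapse = "\n")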

It is also possible to use RSelenium:

library(RSelenium)
# start a Selenium server with Firefox in Docker, then drive the browser from R
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("http://stackoverflow.com/questions/tagged?tagnames=r")
remDr$getPageSource()[[1]]
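
getPageSource() returns the full HTML, so if you only want the visible text you can hand it to rvest (a small sketch combining the two packages):

library(rvest)

html <- remDr$getPageSource()[[1]]
text <- html_text2(read_html(html))  # strip the tags, keep the rendered text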
Shingly answered 22/1, 2022 at 12:22 Comment(0)
