Identify a weblink in bold in R

Asked 5/5, 2016 at 23:26 Answered 6/5, 2016 at 1:29

The following script takes me to a web page that lists several links with similar names. I want to get only one of them, which can be differentiated from the others because it is printed in bold on the page. However, I could not find a way to select a bold link from the list.

Would anyone have a hint on this? Thanks in advance!

library(httr)
library(rvest)
sp="Alnus japonica"

res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do", 
              body = list(page ="advancedSearch", 
                          AttachmentExist ="", 
                          family ="", 
                          placeOfPub ="", 
                          genus = unlist(strsplit(as.character(sp), split=" "))[1], 
                          yearPublished ="", 
                          species = unlist(strsplit(as.character(sp), split=" "))[2], 
                          author ="", 
                          infraRank ="", 
                          infraEpithet ="", 
                          selectedLevel ="cont"), 
              encode ="form") 
pg <- content(res, as="parsed") 
lnks <- html_attr(html_nodes(pg,"a"),"href")
# how do I get the URL of the link with the accepted name (in bold)?
res2 <- try(GET(sprintf("http://apps.kew.org%s", lnks[grep("id=", lnks)][1])), silent=TRUE)
# this gets a link, but often not the bold one

Koral answered 5/5, 2016 at 23:26 Comment(4)

It depends a lot on how it was made bold. If it's inline styling, that's pretty easy, but it's probably CSS applied to a particular id or class, which means digging through the code. – Thorley 5/5, 2016 at 23:37

If you search manually, you actually do get a <b> tag, but it doesn't seem to show up in the httr results, so it must be inserted after the fact somehow. – Thorley 5/5, 2016 at 23:52

The links are surrounded by <b> tags, so you should be able to get them that way. Like alistaire said, not sure why httr is deleting them (I've no experience with httr, there may be an option...) – Liquor 5/5, 2016 at 23:52

libxml2 (which powers rvest & XML) is not as flexible as a browser. <b> outside a <p> is technically invalid HTML/XML and libxml2 parses it that way. – Tunnell 6/5, 2016 at 1:52

First, grab tidy-html5 (it works on pretty much everything) and install it and ensure it's in your PATH.
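A quick way to confirm that the tidy binary is visible to R before running the code below. This is only a minimal sketch using base R's Sys.which(); the messages are illustrative:

```r
# Look up the tidy executable on the PATH that R sees.
# Sys.which() returns "" when the program cannot be found.
tidy_path <- Sys.which("tidy")

if (nzchar(tidy_path)) {
  message("tidy found at: ", tidy_path)
} else {
  message("tidy not found on PATH; install tidy-html5 or add its folder to PATH")
}
```

If the lookup comes back empty, the system2("tidy", ...) call further down will not be able to run.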

As my comment said, browsers handle <b> outside <p> as they need to be bulletproof. libxml2 does not. So, we need to clean this up first (and I now need to make a new tidyhtml package) and then process the tidied version:

library(xml2)
library(httr)
library(rvest)

res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do", 
              body = list(page ="advancedSearch", 
                          AttachmentExist ="", 
                          family ="", 
                          placeOfPub ="", 
                          genus = "Alnus", 
                          yearPublished ="", 
                          species = "japonica", 
                          author ="", 
                          infraRank ="", 
                          infraEpithet ="", 
                          selectedLevel ="cont"), 
              encode ="form") 

tf <- tempfile(fileext=".html")
cat(content(res, as="text"), file=tf)

tidy <- system2("tidy", c("-q", tf), TRUE)

# join the tidied lines back into one string (paste0 has no sep argument)
pg <- read_html(paste0(tidy, collapse="\n"))

html_nodes(pg, xpath=".//p/b/a[contains(@href, 'name_id')]")

## {xml_nodeset (1)}
## [1] <a href="/wcsp/namedetail.do?name_id=6471" class="onwa ...

If CSS selectors are desired over XPath:

html_nodes(pg, "p > b > a[href*='name_id']")
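To go from the matched node to a URL you can request, pull the href attribute and prepend the host. A minimal sketch, assuming the tidied page has the <p><b><a> shape that the selectors above rely on; the inline HTML is a stand-in for the real response, and the second link (name_id=1) is made up for illustration:

```r
library(xml2)
library(rvest)

# Stand-in for the tidied search result: one bold (accepted) link and one
# plain link. The second entry is hypothetical.
tidied <- paste0(
  '<html><body>',
  '<p><b><a href="/wcsp/namedetail.do?name_id=6471" class="onwardnav">',
  'Alnus japonica</a></b></p>',
  '<p><a href="/wcsp/namedetail.do?name_id=1" class="onwardnav">',
  'Alnus japonica var. example</a></p>',
  '</body></html>'
)

pg <- read_html(tidied)

# Select only the link wrapped in <b>, then read its href attribute.
# html_attr() needs the attribute name as its second argument.
bold_link <- html_nodes(pg, "p > b > a[href*='name_id']")
href <- html_attr(bold_link, "href")

# Build the absolute URL (same host as in the question).
accepted_url <- paste0("http://apps.kew.org", href)
accepted_url
```

accepted_url can then be passed to httr::GET(), as the question does for res2.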

UPDATE

I started a basic pkg wrapper for libtidy. If you're on OS X and use Homebrew you can do: brew install tidy-html5 (which installs the binary above and the libtidy library) and devtools::install_github("hrbrmstr/tidyhtml") to install the pkg. Then, it's just:

library(xml2)
library(httr)
library(rvest)
library(htmltidy)

res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do", 
              body = list(page ="advancedSearch", 
                          AttachmentExist ="", 
                          family ="", 
                          placeOfPub ="", 
                          genus = "Alnus", 
                          yearPublished ="", 
                          species = "japonica", 
                          author ="", 
                          infraRank ="", 
                          infraEpithet ="", 
                          selectedLevel ="cont"), 
              encode ="form") 

tidy_html <- tidy(content(res, as="text"))

pg <- read_html(tidy_html)

html_nodes(pg, "p > b > a[href*='name_id']")

I should be able to get this to work on Windows & Linux and make it a real package (it's a thin wrapper w/o error checking now), but that'll be down on the TODO list for a while.

Tunnell answered 6/5, 2016 at 1:29 Comment(6)

wow, awesome! it would've taken me a week to figure this out. – Liquor 6/5, 2016 at 14:32

Thank you very much! I am trying to install tidy on 64-bit Windows from GitHub using cmake, but it is not so easy... any good tutorial would be appreciated. – Koral 7/5, 2016 at 6:35

They have binaries for Windows. – Tunnell 7/5, 2016 at 11:46

Thanks, that worked for the example, but not for other species, like Abies amabilis. Although it is a valid name, I got this error: lnks <- html_attr(html_nodes(pg, xpath=".//p/b/a[contains(@href, 'name_id')]")) - Error in node_attr(x$node, name = attr, missing = default, nsMap = ns) : argument "name" is missing, with no default. Should I use a list of potential identifiers? – Koral 9/5, 2016 at 14:32

you forgot , "href" before the last ) – Tunnell 9/5, 2016 at 14:56

The package now compiles on Windows and is on CRAN but until CRAN is up to 0.3.0 (I found a nasty bug right after the CRAN submission) it's best to use the github/dev version. – Tunnell 11/9, 2016 at 13:14

Seems to me like there might be a bug with rvest/httr here, as <b> appears to surround <a href...> on the relevant link, but not in the parsed version.

I used:

library(rvest)
sp=strsplit("Alnus japonica", " ")[[1]]

session <- html_session("http://apps.kew.org/wcsp/advsearch.do")
form <- html_form(session)[[1]]

filled_form <- set_values(form, genus = sp[1], species = sp[2])

out <- submit_form(session, filled_form)

Look at the following:

out %>% html_nodes(xpath = "descendant-or-self::*") %>% `[`(81:90)
# {xml_nodeset (10)}
#  [1] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...
#  [2] <a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A?nam ...
#  [3] <i>Alnus</i>
#  [4] <i> japonica</i>
#  [5] <b>\n        </b>
#  [6] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...
#  [7] <a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A?nam ...
#  [8] <i>Alnus</i>
#  [9] <i> japonica</i>
# [10] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...

As you can see, the <b> node appears empty. However, when I enter the search manually and View Source on Chrome, I see:

<b>
    <p><a href="/wcsp/namedetail.do?name_id=6471" class="onwardnav"><i>Alnus</i><i> japonica</i> (Thunb.) Steud., Nomencl. Bot., ed. 2, 1: 55 (1840).</a>
    </p>
</b>

That the relevant <a> is between <b> and </b> tells me it should be a child of that <b>, but this comes up blank:

out %>% html_nodes(xpath = "//b/child::*")

I'm admittedly no xpath expert, so I could be mucking things up here. Hope this helps get you on your way.
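One way to check where the parser actually put the link is to ask for the ancestors of the <a> node instead of the children of <b>. A minimal sketch, assuming xml2 is installed; the inline HTML mimics the View Source snippet above, and the result may depend on the libxml2 version in use:

```r
library(xml2)

# Same nesting as the page source: <b> wrapping <p> wrapping <a>.
src <- paste0(
  '<html><body><b>\n',
  '<p><a href="/wcsp/namedetail.do?name_id=6471" class="onwardnav">',
  '<i>Alnus</i><i> japonica</i></a></p>\n',
  '</b></body></html>'
)

doc <- read_html(src)

# Find the link regardless of where it ended up in the parsed tree.
a <- xml_find_all(doc, "//a[contains(@href, 'name_id')]")

# Names of all ancestor elements of the link.
ancestors <- xml_name(xml_parents(a[[1]]))
print(ancestors)

# TRUE only if the parser kept the link inside <b>.
"b" %in% ancestors
```

If the last line is FALSE, the parser has closed <b> before the <p>, which would explain why selecting through <b> finds nothing.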

Liquor answered 6/5, 2016 at 0:24 Comment(0)
