How do I use html_nodes to select nodes with "attribute = x" in R?

I have a set of HTML pages. I want to extract all table nodes where the attribute "border" = 1. Here is an example:

<table border="1" cellspacing="0" cellpadding="5">
   <tbody><tr><td>
    <table border="0" cellpadding="2" cellspacing="0">
      <tbody><tr>
        <td bgcolor="#ff9999"><strong><font size="+1">CASEID</font></strong></td>
      </tr></tbody>
    </table>
   <tr><td>[tbody]
</table>

In the example, I want to select the table node where border = 1 but not the tables where border = 0. I am using html_nodes() from rvest but can't figure out how to filter on an attribute value:

html_nodes(x, "table")
Trevortrevorr answered 5/12, 2019 at 15:52 Comment(2)
Can you post a snippet of HTML as text?Horsehide
Added HTML as text.Trevortrevorr

Check out the CSS3 selectors documentation that’s linked from the documentation of html_nodes. It provides a thorough explanation of the CSS selector syntax.

For your case, you want

html_nodes(x, "tag[attribute]")

to select all tags with attribute set, or

html_nodes(x, "tag[attribute=value]")

to select all tags with attribute set to value.
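
To make the difference between the two selector forms concrete, here is a minimal sketch. The `page` object is a hypothetical stand-in built from an inline string (an outer table with border="1" wrapping an inner table with border="0"), not the asker's real pages.

```r
library(rvest)

# Hypothetical stand-in for one of the pages: an outer table with
# border="1" wrapping an inner table with border="0".
page <- read_html('
<table border="1"><tr><td>
  <table border="0"><tr><td>CASEID</td></tr></table>
</td></tr></table>')

# [border] matches every table that has the attribute, whatever its value
has_border <- html_nodes(page, "table[border]")
length(has_border)
#> [1] 2

# [border='1'] matches only tables whose attribute value is exactly "1"
border_one <- html_nodes(page, "table[border='1']")
length(border_one)
#> [1] 1
html_attr(border_one, "border")
#> [1] "1"
```

Quoting the value is the safer choice here: per the CSS specification an unquoted attribute value must be a valid identifier, and identifiers cannot start with a digit.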

Orleanist answered 5/12, 2019 at 15:56 Comment(0)

There are two main ways to find nodes in HTML and similar documents: CSS selectors and XPath. CSS selectors are often easier to write but can't handle some more complex cases, whereas XPath has functions that can, for example, search the text within a node. Which one to use is always up for debate, but I think it's worthwhile to try both.

library(rvest)

with_css <- html_nodes(x, css = "table[border='1']")
with_css
#> {xml_nodeset (1)}
#> [1] <table border="1" cellspacing="0" cellpadding="5"><tbody>\n<tr><td>\n     ...

Verifying that the table looks right:

html_table(with_css, fill = TRUE)
#> [[1]]
#>        X1     X2
#> 1  CASEID CASEID
#> 2  CASEID   <NA>
#> 3 [tbody]   <NA>

The equivalent XPath gets the same table.

with_xpath <- html_nodes(x, xpath = "//table[@border=1]")
with_xpath
#> {xml_nodeset (1)}
#> [1] <table border="1" cellspacing="0" cellpadding="5"><tbody>\n<tr><td>\n     ...
html_table(with_xpath, fill = TRUE)
#> [[1]]
#>        X1     X2
#> 1  CASEID CASEID
#> 2  CASEID   <NA>
#> 3 [tbody]   <NA>
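
As a rough illustration of the earlier point about XPath functions, the sketch below combines the attribute test with a text search. The `page` object and the second "OTHER" table are hypothetical and exist only to show the filter excluding something.

```r
library(rvest)

# Hypothetical document: two border="1" tables, only one containing "CASEID"
page <- read_html('
<table border="1"><tr><td>
  <table border="0"><tr><td>CASEID</td></tr></table>
</td></tr></table>
<table border="1"><tr><td>OTHER</td></tr></table>')

# Attribute test alone: both border="1" tables match
all_bordered <- html_nodes(page, xpath = "//table[@border='1']")
length(all_bordered)
#> [1] 2

# contains(., 'CASEID') checks the node's full text, descendants included,
# so only the first outer table survives the second predicate
with_text <- html_nodes(page, xpath = "//table[@border='1'][contains(., 'CASEID')]")
length(with_text)
#> [1] 1
```

CSS selectors have no standard equivalent of the text test, so this is one case where XPath is the more natural fit.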
Horsehide answered 5/12, 2019 at 16:23 Comment(0)
