Using R to "click" a download file button on a webpage

I am attempting to use this webpage http://volcano.si.edu/search_eruption.cfm to scrape data. There are two drop-down boxes that ask for filters of the data. I do not need filtered data, so I leave those blank and continue on to the next page by clicking "Search Eruptions".

What I have noticed, though, is that the resulting table includes only 5 of the 24 columns it should have. However, all 24 columns are there if you click the "Download Results to Excel" button and open the downloaded file. This is what I need.

So, it looks like this has turned from a scraping exercise (using httr and rvest) into something more difficult. I'm stumped on how to actually "click" the "Download Results to Excel" button using R. My guess is that I will have to use RSelenium, but here is my code trying to use httr with POST, in case there is an easier way that any of you kind people can find. I've also tried gdata, data.table, XML, etc. to no avail, which could just be a result of user error.

Also, it might be helpful to know that the download button cannot be right-clicked to show a URL.

url <- "http://volcano.si.edu/database/search_eruption_results.cfm"

searchcriteria <- list(
    eruption_category = "",
    country = ""
)

mydata <- POST(url, body = "searchcriteria")

Using the Inspector in my browser, I was able to see that the two filters are "eruption_category" and "country" and both will be blank since I do not need any filtered data.

Lastly, it would seem that the above code will get me on to the page that has the table with only 5 columns. However, I was still unable to scrape this table using rvest in the code below (using SelectorGadget to scrape just one column). In the end, this part doesn't matter as much because, as I had said above, I need all 24 columns, not just these 5. But, if you find any errors with what I did below as well, I would be grateful.

Eruptions <- mydata %>%
    read_html() %>%
    html_nodes(".td8") %>%
    html_text()
Eruptions

Thank you for any help you can provide.

Danais answered 7/2, 2017 at 21:22 Comment(2)
It looks like the page uses JavaScript to render its content. The easiest and fastest way could be to just download the Excel file and process that. The data looks to be relatively static, so the occasional download shouldn't be a problem.Klong
Thanks @Dave2e. Unfortunately, I do need to be doing this in R. And, as you said, it is mostly static, but it is still updated frequently enough.Danais

Just mimic the POST it does:

library(httr)
library(rvest)
library(purrr)
library(dplyr)

POST("http://volcano.si.edu/search_eruption_results.cfm",
     body = list(bp = "", `eruption_category[]` = "", `country[]` = "", polygon = "",  cp = "1"),
     encode = "form") -> res

content(res, as="parsed") %>%
  html_nodes("div.DivTableSearch") %>%
  html_nodes("div.tr") %>%
  map(html_children) %>%
  map(html_text) %>%
  map(as.list) %>%
  map_df(setNames, c("volcano_name", "subregion", "eruption_type",
                     "start_date", "max_vei", "X1")) %>%
  select(-X1)
## # A tibble: 750 × 5
##    volcano_name            subregion      eruption_type  start_date
##           <chr>                <chr>              <chr>       <chr>
## 1   Chirinkotan        Kuril Islands Confirmed Eruption 2016 Nov 29
## 2   Zhupanovsky  Kamchatka Peninsula Confirmed Eruption 2016 Nov 20
## 3       Kerinci              Sumatra Confirmed Eruption 2016 Nov 15
## 4       Langila          New Britain Confirmed Eruption  2016 Nov 3
## 5     Cleveland     Aleutian Islands Confirmed Eruption 2016 Oct 24
## 6         Ebeko        Kuril Islands Confirmed Eruption 2016 Oct 20
## 7        Ulawun          New Britain Confirmed Eruption 2016 Oct 11
## 8      Karymsky  Kamchatka Peninsula Confirmed Eruption  2016 Oct 5
## 9        Ubinas                 Peru Confirmed Eruption  2016 Oct 2
## 10      Rinjani Lesser Sunda Islands Confirmed Eruption 2016 Sep 27
## # ... with 740 more rows, and 1 more variables: max_vei <chr>
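
For comparison with the attempt in the question: `body = "searchcriteria"` hands httr a character string, which is sent as-is, so the server never sees the form fields. Passing the named list itself together with `encode = "form"` is what mimics the browser's form submission. A minimal sketch of that difference follows; `eruption_form_fields()` and `post_search()` are hypothetical helper names introduced here for illustration, not part of httr, and the field names are the ones used in the request above.

```r
# Hypothetical helper (not part of httr): builds the form fields as a named list.
# Backticks are needed because `[]` is not valid in a bare R name.
eruption_form_fields <- function(category = "", country = "") {
  list(`eruption_category[]` = category, `country[]` = country)
}

# Hypothetical wrapper: sends the list itself (not its name in quotes) as a
# URL-encoded form. It performs a network call, so it is only defined here.
post_search <- function(url, fields = eruption_form_fields()) {
  res <- httr::POST(url, body = fields, encode = "form")
  httr::stop_for_status(res)  # turn 4xx/5xx responses into an R error
  res
}
```

Calling `post_search()` with the results URL used above should return the same kind of response object as `res`, assuming the site still accepts these fields.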

I assumed the "Excel" part could be inferred, but if not:

POST("http://volcano.si.edu/search_eruption_excel.cfm", 
     body = list(`eruption_category[]` = "", 
                 `country[]` = ""), 
     encode = "form",
     write_disk("eruptions.xls")) -> res
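
Since the goal in the question is to get all 24 columns into R, here is a hedged sketch of the download-then-load step. `download_eruptions()` and `read_eruptions()` are hypothetical wrapper names introduced for illustration. Using `readxl::read_excel()` is an assumption: it only works if the server returns a genuine Excel workbook, which is not confirmed here. If the file turns out to be HTML or XML with an `.xls` extension, a different reader would be needed.

```r
# Hypothetical wrapper around the request above; saves the response to `path`.
download_eruptions <- function(path = "eruptions.xls",
                               url = "http://volcano.si.edu/search_eruption_excel.cfm") {
  res <- httr::POST(url,
                    body = list(`eruption_category[]` = "", `country[]` = ""),
                    encode = "form",
                    httr::write_disk(path, overwrite = TRUE))
  httr::stop_for_status(res)  # stop early if the server rejected the request
  invisible(path)
}

# Hypothetical loader. Assumption: the saved file is a real Excel workbook
# that readxl can parse; otherwise read_excel() will raise an error.
read_eruptions <- function(path) {
  if (!file.exists(path)) stop("No file at ", path)
  readxl::read_excel(path)
}
```

Usage would be `eruptions <- read_eruptions(download_eruptions())`, assuming httr and readxl are installed and the endpoint still responds as it did when this answer was written.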
Distinguished answered 7/2, 2017 at 21:51 Comment(8)
Great answer, but the question states that the manually downloaded file is more complete, as it has more columns (24 instead of 5). I'd really like to know how it would be possible to automatically download and load it.Longs
Added it (I assumed it would have been easily extrapolated)Distinguished
Thank you @Distinguished for this answer. It is exactly what I needed.Danais
Hi. In the example above you press the submit button on a form. How would you go about doing it with a JavaScript button that has an "onclick" event like this: <a href="#" class="download" onclick="ExportToExcel('ambviewone'); return false;" title="Download som Excel ark">Download som Excel ark</a> This returns "could not find function": httr::POST(url=url, ExportToExcel('ambviewone')) SincerelyPiper
@Piper should be a new question with an MWEDistinguished
Update of the code: POST("volcano.si.edu/database/search_eruption_results.cfm", body = list(`eruption_category[]` = "", `country[]` = ""), encode = "form", write_disk("GVP_Eruption_Results.xls"))Conservancy
I have a similar problem and stumbled across this answer. However, I couldn't make the code work (even with @Adam's corrections). Is it a problem on my machine, or does the code not work (anymore)?Multiversity
@Multiversity post a new question with an MWE so we can helpRevelry
