Using R to accept cookies to download a PDF file

I'm getting stuck on cookies when trying to download a PDF.

For example, if I have a DOI for a PDF document on the Archaeology Data Service, it resolves to a landing page containing an embedded link to the PDF, but that link actually redirects to a different URL.

library(httr) will handle resolving the DOI and we can extract the pdf URL from the landing page using library(XML) but I'm stuck at getting the PDF itself.

If I do this:

download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf", destfile = "tmp.pdf")

then I receive an HTML file that is the same as http://archaeologydataservice.ac.uk/myads/
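A quick way to confirm what download.file actually saved is to look at the first bytes of the file: a real PDF starts with the signature %PDF-, whereas the cookie page starts with XML or HTML markup. The helper below is a minimal base-R sketch; is_pdf_file is a hypothetical name, not part of any package, and the two temporary files only stand in for a real and a failed download.

```r
# Hypothetical helper: TRUE if the file starts with the PDF signature "%PDF-"
is_pdf_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  header <- readBin(con, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Stand-ins for a real PDF and for the HTML page returned instead
pdf_path  <- tempfile(fileext = ".pdf")
html_path <- tempfile(fileext = ".pdf")
writeBin(charToRaw("%PDF-1.4\n"), pdf_path)
writeBin(charToRaw("<?xml version='1.0' encoding='UTF-8' ?>"), html_path)

is_pdf_file(pdf_path)   # TRUE
is_pdf_file(html_path)  # FALSE
```

Running is_pdf_file("tmp.pdf") after the download.file call would show whether the site handed back the document or the terms page.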

Trying the answer at How to use R to download a zipped file from a SSL page that requires cookies leads me to this:

library(httr)

terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")

# Accept the terms on the form,
# generating the appropriate cookies

POST(terms, body = values)
GET(download, query = values)

# Actually download the file (this will take a while)

resp <- GET(download, query = values)

# write the content of the download to a binary file

writeBin(content(resp, "raw"), "c:/temp/thefile.zip")

But after the POST and GET functions I simply get the HTML of the same cookie page that I got with download.file:

> GET(download, query = values)
Response [http://archaeologydataservice.ac.uk/myads/copyrights?from=2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d79657326743d617263682d313335322d3125324664697373656d696e6174696f6e2532467064662532464479666564253246474c34343030342e706466]
  Date: 2016-01-06 00:35
  Status: 200
  Content-Type: text/html;charset=UTF-8
  Size: 21 kB
<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "h...
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
        <head>
            <meta http-equiv="Content-Type" content="text/html; c...


            <title>Archaeology Data Service:  myADS</title>

            <link href="http://archaeologydataservice.ac.uk/css/u...
...
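The long from= value in the response URL above appears to be the originally requested path, hex-encoded. That suggests the site bounced the request to the copyright page and kept a note of where to return afterwards. A small base-R sketch to check this follows; decode_hex is a hypothetical helper, and only the leading part of the value is decoded here.

```r
# Hypothetical helper: turn a string of hex digit pairs into text
decode_hex <- function(hex) {
  stopifnot(nchar(hex) %% 2 == 0)
  starts <- seq(1, nchar(hex), by = 2)
  bytes <- strtoi(substring(hex, starts, starts + 1), base = 16L)
  rawToChar(as.raw(bytes))
}

# Leading part of the from= value shown in the response URL
from_prefix <- "2f6172636869766544532f61726368697665446f776e6c6f6164"
decode_hex(from_prefix)  # "/archiveDS/archiveDownload"
```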

Looking at http://archaeologydataservice.ac.uk/about/Cookies, it seems that the cookie situation at this site is complicated. This kind of cookie complexity does not seem unusual for UK data providers; see "automating the login to the uk data service website in R with RCurl or httr".

How can I use R to get past the cookies on this website?

Honebein answered 6/1, 2016 at 0:40 Comment(0)

Your plea on rOpenSci has been heard!

There's lots of javascript between those pages that makes it somewhat annoying to try to decipher via httr + rvest. Try RSelenium. This worked on OS X 10.11.2, R 3.2.3 & Firefox loaded.

library(RSelenium)

# check if a server is present, if not, get a server
checkForServer()

# get the server going
startServer()

dir.create("~/justcreateddir")
setwd("~/justcreateddir")

# we need PDFs to download instead of display in-browser
prefs <- makeFirefoxProfile(list(
  `browser.download.folderList` = as.integer(2),
  `browser.download.dir` = getwd(),
  `pdfjs.disabled` = TRUE,
  `plugin.scan.plid.all` = FALSE,
  `plugin.scan.Acrobat` = "99.0",
  `browser.helperApps.neverAsk.saveToDisk` = 'application/pdf'
))
# get a browser going
dr <- remoteDriver$new(extraCapabilities=prefs)
dr$open()

# go to the page with the PDF
dr$navigate("http://archaeologydataservice.ac.uk/archives/view/greylit/details.cfm?id=17755")

# find the PDF link and "hit ENTER"
pdf_elem <- dr$findElement(using="css selector", "a.dlb3")
pdf_elem$sendKeysToElement(list("\uE007"))

# find the ACCEPT button and "hit ENTER"
# that will save the PDF to the default downloads directory
accept_elem <- dr$findElement(using="css selector", "a[id$='agreeButton']")
accept_elem$sendKeysToElement(list("\uE007"))

Now wait for the download to complete. The R console will not be busy while it downloads, so it is easy to close the session accidentally before the download has completed.
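One way to avoid closing the session too early is to poll the download directory until the PDF has arrived. The function below is a rough sketch, not part of RSelenium: wait_for_download is a hypothetical name, and it assumes Firefox's habit of keeping a temporary .part file next to the target while a download is in progress.

```r
# Hypothetical helper: block until a non-empty file matching `pattern`
# exists in `dir` and no Firefox ".part" file remains, or time out.
wait_for_download <- function(dir, pattern = "\\.pdf$", timeout = 60, interval = 1) {
  deadline <- Sys.time() + timeout
  repeat {
    finished <- list.files(dir, pattern = pattern, full.names = TRUE)
    partial  <- list.files(dir, pattern = "\\.part$")
    if (length(finished) > 0 && length(partial) == 0 &&
        all(file.size(finished) > 0)) {
      return(finished)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for a download in ", dir)
    }
    Sys.sleep(interval)
  }
}

# Offline demonstration with a stand-in file in a temporary directory
demo_dir <- tempfile()
dir.create(demo_dir)
writeBin(charToRaw("%PDF-1.4\n"), file.path(demo_dir, "GL44004.pdf"))
basename(wait_for_download(demo_dir))  # "GL44004.pdf"
```

In the session above, calling wait_for_download(getwd()) before closing the browser would hold the script until the file is there.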

# close the session
dr$close()
Suspensory answered 6/1, 2016 at 2:28 Comment(7)
I gave it a try on Ubuntu 14.04, R 3.2.3 and Firefox. dr$open() reports [1] "Connecting to remote server" Undefined error in RCurl call. Error in queryRD(paste0(serverURL, "/session"), "POST", qdata = toJSON(serverOpts)) : - Gunflint
This has always been my biggest nit to pick with Selenium in general (not necessarily the R pkg). Getting consistency between Windows, OS X & *nix is so difficult. Hopefully folks can add to this (all my *nix systems are very thinly configured headless server-y things and I'm not about to try to master the phantomjs driver tonight :-) - Suspensory
OK, I found out how to make it work on my computer. I had to manually start the Selenium standalone server first with java -jar selenium-server-standalone-2.48.0.jar. Then I can connect. - Gunflint
Thanks, rOpenSci to the rescue! That's a great workaround and gets it done. For improved portability it would be ideal to have the download in the R session's working directory; is that possible? I made some minor edits to add a little detail (worked for me on Win 7, R v3.2.3 with Firefox). - Honebein
Aye. I'll figure out the Firefox profile setting string and update the answer in a bit. - Suspensory
That took more effort than expected (initial profile settings weren't working, but the above did). You may need to quote the directory path better given the crazy Windows slashes, but I can confirm the above worked on 2 Macs. - Suspensory
I can confirm this works on my Ubuntu. I just had to skip the checkForServer() step, because it kept trying to download the standalone server even though one was already running after java -jar selenium-server-standalone-2.48.0.jar. - Gunflint

This answer came from John Harrison by email, posted here at his request:

This will allow you to download the PDF:

library(RCurl)

appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"

# create a curl handle that keeps cookies and follows redirects
curl <- getCurlHandle()
curlSetOpt(cookiefile = "cookies.txt", curl = curl, followLocation = TRUE)

# send the ADSCOPYRIGHT cookie with the request to signal acceptance of the terms
pdfData <- getBinaryURL(appURL, curl = curl, .opts = list(cookie = "ADSCOPYRIGHT=YES"))
writeBin(pdfData, "test2.pdf")
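The same idea can be sketched with httr, which the question started from. This is an assumption-laden equivalent rather than something tested against the site: download_with_cookie is a hypothetical wrapper, and it assumes the server still honours the ADSCOPYRIGHT cookie the way it did for the RCurl call above.

```r
# Hypothetical wrapper: request `url` with one pre-set cookie and save the
# body to `destfile`, refusing to write an HTML page in place of the file.
download_with_cookie <- function(url, destfile,
                                 cookie_name = "ADSCOPYRIGHT",
                                 cookie_value = "YES") {
  if (!requireNamespace("httr", quietly = TRUE)) {
    stop("package 'httr' is required")
  }
  cookie <- stats::setNames(cookie_value, cookie_name)
  resp <- httr::GET(url, httr::set_cookies(.cookies = cookie))
  httr::stop_for_status(resp)
  if (identical(httr::http_type(resp), "text/html")) {
    stop("Got an HTML page instead of a file; the cookie was probably not accepted")
  }
  writeBin(httr::content(resp, "raw"), destfile)
  invisible(destfile)
}

# Usage (needs network access, so it is left commented out):
# download_with_cookie(
#   "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf",
#   "test2.pdf"
# )
```

Checking the content type before writing guards against silently saving the terms page under a .pdf name, which is the failure described in the question.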

Here's a longer version showing his working:

appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
library(RCurl)
library(XML)
curl = getCurlHandle()
curlSetOpt(cookiefile="cookies.txt"
           , curl=curl, followLocation = TRUE)
appData <- getURL(appURL, curl = curl)

# get the necessary elements for the POST that is initiated when the ACCEPT button is pressed

doc <- htmlParse(appData)
appAttrs <- doc["//input", fun = xmlAttrs]
postData <- lapply(appAttrs, function(x){data.frame(name = x[["name"]], value = x[["value"]]
                                                    , stringsAsFactors = FALSE)})
postData <- do.call(rbind, postData)

# post your acceptance
postURL <- "http://archaeologydataservice.ac.uk/myads/copyrights.jsf;jsessionid="
# get jsessionid
jsessionid <- unlist(strsplit(getCurlInfo(curl)$cookielist[1], "\t"))[7]

searchData <- postForm(paste0(postURL, jsessionid), curl = curl,
                       "j_id10" = "j_id10",
                       from = postData[postData$name == "from", "value"],
                       "javax.faces.ViewState" = postData[postData$name == "javax.faces.ViewState", "value"],
                       "j_id10:_idcl" = "j_id10:agreeButton"
                       , binary = TRUE
)
con <- file("test.pdf", open = "wb")
writeBin(searchData, con)
close(con)


Pressing the ACCEPT button on the page you gave initiates a POST to "http://archaeologydataservice.ac.uk/myads/copyrights.jsf;jsessionid=......" via some JavaScript.
That POST then redirects to the page with the PDF, after setting some additional cookies.

Checking our cookies we see:

> getCurlInfo(curl)$cookielist
[1] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tJSESSIONID\t3d249e3d7c98ec35998e69e15d3e" 
[2] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tSSOSESSIONID\t3d249e3d7c98ec35998e69e15d3e"
[3] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tADSCOPYRIGHT\tYES"          

So it is probably sufficient to set that last cookie from the start (indicating that we accept the copyright terms), which is what the short version above does.
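The cookie list entries are tab-separated records with seven fields (domain, subdomain flag, path, secure flag, expiry, name, value), which is why the longer version picks element 7 of the first entry. Looking cookies up by name is a little more robust than relying on their position. The following is a minimal sketch using the values printed above; parse_cookie_list and cookie_value are hypothetical helpers, not RCurl functions.

```r
# Hypothetical helper: split curl's tab-separated cookie records into columns
parse_cookie_list <- function(cookielist) {
  cols <- c("domain", "include_subdomains", "path", "secure",
            "expires", "name", "value")
  rows <- lapply(strsplit(cookielist, "\t", fixed = TRUE), function(fields) {
    length(fields) <- length(cols)  # pad a missing (empty) value with NA
    fields
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- cols
  out
}

# Hypothetical helper: look up a cookie's value by its name
cookie_value <- function(cookies, name) {
  cookies$value[cookies$name == name]
}

# The three records shown in the output above
cookielist <- c(
  "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tJSESSIONID\t3d249e3d7c98ec35998e69e15d3e",
  "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tSSOSESSIONID\t3d249e3d7c98ec35998e69e15d3e",
  "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tADSCOPYRIGHT\tYES"
)

cookies <- parse_cookie_list(cookielist)
cookie_value(cookies, "JSESSIONID")    # "3d249e3d7c98ec35998e69e15d3e"
cookie_value(cookies, "ADSCOPYRIGHT")  # "YES"
```

With a live handle, parse_cookie_list(getCurlInfo(curl)$cookielist) would take the place of the hard-coded vector.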
Honebein answered 9/1, 2016 at 21:42 Comment(0)
