I'm getting stuck on cookies when trying to download a PDF.
For example, if I have a DOI for a PDF document on the Archaeology Data Service, it resolves to a landing page that contains an embedded link to the PDF, but that link actually redirects to another URL.
library(httr) will handle resolving the DOI, and we can extract the PDF URL from the landing page using library(XML), but I'm stuck at getting the PDF itself.
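A minimal sketch of the link-extraction step, assuming the landing page exposes the PDF as an ordinary `<a href>` whose target contains `.pdf`. The HTML string and the example.org URLs below are made up for illustration; they are not the real ADS markup. In practice the page source could come from something like `content(GET(doi_url), "text")`, where `doi_url` is a placeholder for the resolved DOI.

```r
library(XML)

# Hypothetical landing-page fragment (NOT the real ADS markup)
landing_html <- '<html><body>
  <a href="http://example.org/about">About</a>
  <a href="http://example.org/archiveDownload?t=some/path/report.pdf">Download PDF</a>
</body></html>'

# Parse the string as HTML
doc <- htmlParse(landing_html, asText = TRUE)

# Collect the href of every link whose target contains ".pdf"
pdf_links <- unlist(
  xpathSApply(doc, "//a[contains(@href, '.pdf')]", xmlGetAttr, "href")
)
print(pdf_links)
```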
If I do this:
download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf", destfile = "tmp.pdf")
then I receive an HTML file that is the same as http://archaeologydataservice.ac.uk/myads/
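One way to confirm what download.file actually saved, without opening the file, is to look at its first bytes: a real PDF starts with the signature `%PDF`, while an HTML page starts with markup. A minimal sketch; the helper name `looks_like_pdf` and the temporary files are illustrative only.

```r
# Hypothetical helper: TRUE if the file begins with the PDF signature "%PDF"
looks_like_pdf <- function(path) {
  header <- readBin(path, what = "raw", n = 4)
  identical(header, charToRaw("%PDF"))
}

# Demonstrate on two throwaway files rather than on a real download
fake_pdf  <- tempfile(fileext = ".pdf")
fake_html <- tempfile(fileext = ".pdf")
writeBin(charToRaw("%PDF-1.4\n"), fake_pdf)
writeBin(charToRaw("<?xml version='1.0' encoding='UTF-8' ?>"), fake_html)

pdf_check  <- looks_like_pdf(fake_pdf)
html_check <- looks_like_pdf(fake_html)
print(c(pdf = pdf_check, html = html_check))
```

Running `looks_like_pdf("tmp.pdf")` on the file saved above would be expected to return FALSE, given that it contains the myADS HTML rather than the PDF.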
Trying the answer at "How to use R to download a zipped file from a SSL page that requires cookies" leads me to this:
library(httr)
terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")
# Accept the terms on the form,
# generating the appropriate cookies
POST(terms, body = values)
GET(download, query = values)
# Actually download the file (this will take a while)
resp <- GET(download, query = values)
# write the content of the download to a binary file
writeBin(content(resp, "raw"), "c:/temp/thefile.zip")
But after the POST and GET functions I simply get the HTML of the same cookie page that I got with download.file:
> GET(download, query = values)
Response [http://archaeologydataservice.ac.uk/myads/copyrights?from=2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d79657326743d617263682d313335322d3125324664697373656d696e6174696f6e2532467064662532464479666564253246474c34343030342e706466]
Date: 2016-01-06 00:35
Status: 200
Content-Type: text/html;charset=UTF-8
Size: 21 kB
<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "h...
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; c...
<title>Archaeology Data Service: myADS</title>
<link href="http://archaeologydataservice.ac.uk/css/u...
...
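The `from` parameter in the response URL above looks like a hex-encoded string. Decoding it shows which request the server bounced to the copyright page. This is a small base-R sketch that assumes nothing about the site beyond that guess:

```r
# The value of the "from" query parameter, copied from the response URL
hex <- paste0(
  "2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d",
  "79657326743d617263682d313335322d3125324664697373656d696e6174696f6e",
  "2532467064662532464479666564253246474c34343030342e706466"
)

# Split into two-character pairs, one per byte
starts <- seq(1, nchar(hex), by = 2)
pairs  <- substring(hex, starts, starts + 1)

# Convert each hex pair to a byte, then the bytes to a string
decoded <- rawToChar(as.raw(strtoi(pairs, base = 16L)))
print(decoded)

# The result is itself percent-encoded; decode once more for readability
readable <- utils::URLdecode(decoded)
print(readable)
```

The decoded value is `/archiveDS/archiveDownload?agree=yes&t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf`, i.e. the GET request itself, including `agree=yes`. That suggests the server redirected the download request back to the terms page rather than treating the terms as already accepted.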
Looking at http://archaeologydataservice.ac.uk/about/Cookies, it seems that the cookie situation at this site is complicated. This kind of cookie complexity does not seem unusual for UK data providers; see "automating the login to the uk data service website in R with RCurl or httr".
How can I use R to get past the cookies on this website?
dr$open() reports: [1] "Connecting to remote server" Undefined error in RCurl call. Error in queryRD(paste0(serverURL, "/session"), "POST", qdata = toJSON(serverOpts)) : – Gunflint