I'm trying to automate the Shibboleth-based login process for the UK Data Service in R. One can sign up for an account to log in here. A previous attempt to automate this process is found in this question, automating the login to the uk data service website in R with RCurl or httr.
I thought the excellent answers to this question, how to authenticate a shibboleth multi-hostname website with httr in R, were going to get me there, but I've run into a wall.
And, yes, RSelenium provides an alternative (which I've actually tried), but my experience with RSelenium is that it is always flaking out (not to mention that it is hard to get to work across platforms), while rvest/httr/RCurl solutions don't break unless or until the website changes and are easy to get working on other people's machines.
Anyway, the site requires you to click through an initial signin page (and get a cookie), then enter your organization (click through and get cookies), then enter your username and password (cookies), and then (because rvest doesn't do JavaScript) click through one more cookie-modifying page, before landing on the "your account" page. It looks to me that the cookies at all steps are necessary: the one that eventually signifies that you've logged in (ASPSESSIONIDSQAQSSQA) is the one created by the initial signin page.
So here's what I have so far. First, get to the organization page and enter the organization, saving the cookies from the initial signin page (using the trick from here, Submit form with no submit button in rvest, to cope with the fact that the submit button doesn't activate until an organization is entered).
    library(tidyverse)
    library(rvest)
    library(httr)     # for handle_reset(), cookies(), set_cookies()
    library(stringr)

    org <- "your_organization"
    user <- "your_username"
    password <- "your_password"

    signin <- "http://esds.ac.uk/newRegistration/newLogin.asp"
    handle_reset(signin)

    # get to org page and enter org
    p0 <- html_session(signin) %>%
        follow_link("Login")
    org_link <- html_nodes(p0, "option") %>%
        str_subset(org) %>%
        str_match('(?<=\\")[^"]*') %>%
        as.character()

    f0 <- html_form(p0) %>%
        first() %>%
        set_values(origin = org_link)
    fake_submit_button <- list(name = "submit-btn",
                               type = "submit",
                               value = "Continue",
                               checked = NULL,
                               disabled = NULL,
                               readonly = NULL,
                               required = FALSE)
    attr(fake_submit_button, "class") <- "btn-enabled"
    f0[["fields"]][["submit"]] <- fake_submit_button

    # carry the cookies collected so far into the next request
    c0 <- cookies(p0)$value
    names(c0) <- cookies(p0)$name
    p1 <- submit_form(session = p0, form = f0, config = set_cookies(.cookies = c0))
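As an aside, the regex step that pulls the option value out of the node's text could probably be replaced by reading the attribute directly. The following is only a sketch on a made-up page fragment: the `<select>` markup, the organization names, and the URLs are all hypothetical, and it assumes the real page keeps the value to submit in each option's `value` attribute.

```r
library(rvest)

# Hypothetical stand-in for the organization <select> on the real page
page <- read_html('<select name="origin">
  <option value="https://idp.example.ac.uk/shibboleth">Example University</option>
  <option value="https://idp.other.ac.uk/shibboleth">Other College</option>
</select>')

org <- "Example University"   # hypothetical organization name

opts <- html_nodes(page, "option")

# Match on the visible label, then read the value attribute of that option
org_link <- html_attr(opts, "value")[grepl(org, html_text(opts), fixed = TRUE)]
org_link
```

On the real session the same two calls would be applied to `p0` instead of `page`; whether the labels match `org` exactly is something to check against the live page.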
Then, enter the username and password:
    # enter user and password
    f1 <- html_form(p1) %>%
        first() %>%
        set_values("j_username" = user,
                   "j_password" = password)
    c1 <- cookies(p1)$value
    names(c1) <- cookies(p1)$name
    p2 <- submit_form(session = p1, form = f1, config = set_cookies(.cookies = c1))

p2$response says "Since your browser does not support JavaScript, you must press the Continue button once to proceed", so:
    # click through
    f2 <- p2 %>%
        html_form() %>%
        first()
    c2 <- cookies(p2)$value
    names(c2) <- cookies(p2)$name
    p3 <- submit_form(p2, f2, config = set_cookies(.cookies = c2))
Sadly, instead of finally being "your account", p3 actually winds us back up at the organization-entry page p0.
One potentially important issue is that c2 contains two JSESSIONID cookies that cookies(p2) shows are for different domains. I don't know what to do about that; I've tried dropping first one then the other from c2 with no luck. Any suggestions? Thanks!
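To make the duplicate-name problem concrete: building a named vector from every row of the cookie table ignores the domain column, so both JSESSIONID values get sent to whatever host the next form posts to. One thing that might be worth trying is selecting cookies by domain rather than dropping them by position. The sketch below uses a made-up cookie table with a subset of the columns that `httr::cookies()` returns. The helper `cookies_for_host()`, the domains, and the values are all hypothetical, and the domain matching is a simplification of what a browser does.

```r
# Hypothetical cookie table shaped like (a subset of) httr::cookies() output;
# domains and values are made up.
ck <- data.frame(
  domain = c(".esds.ac.uk", "idp.example.org", "sp.example.org"),
  name   = c("ASPSESSIONIDSQAQSSQA", "JSESSIONID", "JSESSIONID"),
  value  = c("aaa", "bbb", "ccc"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the cookies whose domain matches `host`,
# so the named vector cannot hold two JSESSIONID entries from different hosts.
cookies_for_host <- function(ck, host) {
  dom <- sub("^#HttpOnly_", "", ck$domain)  # strip an "#HttpOnly_" prefix if present
  dom <- sub("^\\.", "", dom)               # a leading dot means "and subdomains"
  keep <- host == dom | endsWith(host, paste0(".", dom))
  setNames(ck$value[keep], ck$name[keep])
}

cookies_for_host(ck, "idp.example.org")
```

The result could then be passed as `set_cookies(.cookies = ...)`, with `host` taken from the URL the form is about to be submitted to. This is untested against the real site, so it may or may not be what breaks the final redirect.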
RSelenium implements the API of the underlying projects. In my experience it is issues with the underlying projects that predominantly cause problems for users. – Remontant
... RSelenium (using Chrome and keeping file management separate helped) and wrote up my solution as a package, ukds. – Thurgau