R: use rvest (or httr) to log in to a site requiring cookies
I'm trying to automate the Shibboleth-based login process for the UK Data Service in R. One can sign up for an account to log in here. A previous attempt to automate this process is in this question: automating the login to the uk data service website in R with RCurl or httr.

I thought the excellent answers to this question, how to authenticate a shibboleth multi-hostname website with httr in R, were going to get me there, but I've run into a wall.

And, yes, RSelenium provides an alternative—which I've actually tried—but my experience with RSelenium is that it is always flaking out (not to mention that it is hard to get to work across platforms), while rvest/httr/RCurl solutions don't break unless or until the website changes and are easy to get working on other people's machines.

Anyway, the site requires you to click through an initial signin page (and get a cookie), then enter your organization (click through and get cookies), then enter your username and password (cookies), and then (because rvest doesn't do javascript) click through one more cookie-modifying page, before landing on the "your account" page. It looks to me that the cookies at all steps are necessary—the one that eventually signifies that you've logged in (ASPSESSIONIDSQAQSSQA) is the one created by the initial signin page.

So here's what I have so far. First, get to the organization page and enter the organization, saving the cookies from the initial signin page (using the trick from here, Submit form with no submit button in rvest, to cope with the fact that the submit button doesn't activate until an organization is entered).

library(tidyverse)
library(rvest)
library(stringr)
library(httr)  # handle_reset() and set_cookies() come from httr, which rvest does not attach

org <- "your_organization"
user <- "your_username"
password <- "your_password"

signin <- "http://esds.ac.uk/newRegistration/newLogin.asp"
handle_reset(signin)

# get to org page and enter org
p0 <- html_session(signin) %>% 
    follow_link("Login")
org_link <- html_nodes(p0, "option") %>% 
    str_subset(org) %>% 
    str_match('(?<=\\")[^"]*') %>%
    as.character()

f0 <- html_form(p0) %>%
    first() %>%
    set_values(origin = org_link)
fake_submit_button <- list(name = "submit-btn",
                           type = "submit",
                           value = "Continue",
                           checked = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "btn-enabled"
f0[["fields"]][["submit"]] <- fake_submit_button

c0 <- cookies(p0)$value
names(c0) <- cookies(p0)$name
p1 <- submit_form(session = p0, form = f0, config = set_cookies(.cookies = c0))
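
As an aside, pulling the option value out with a regex on the raw HTML is a bit fragile. A minimal sketch of reading the value attribute directly, using made-up markup as a stand-in for the real organization page (the select name, URLs, and labels here are assumptions, not what the site actually serves):

```r
library(rvest)

# Hypothetical stand-in for the organization page; the real markup may differ.
page <- read_html('
  <select name="origin">
    <option value="https://idp.example.ac.uk/shibboleth">Example University</option>
    <option value="https://idp.other.ac.uk/shibboleth">Other College</option>
  </select>')

org <- "Example University"

opts <- html_nodes(page, "option")
# Match on the visible label, then read the value attribute of that option.
org_link <- html_attr(opts[grepl(org, html_text(opts), fixed = TRUE)], "value")
org_link
```

If more than one label contains the organization string, org_link will have several elements, so it may be worth checking its length before passing it to set_values().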

Then, enter the username and password:

# enter user and password
f1 <- html_form(p1) %>%
    first() %>%
    set_values("j_username" = user,
               "j_password" = password)
c1 <- cookies(p1)$value
names(c1) <- cookies(p1)$name
p2 <- submit_form(session = p1, form = f1, config = set_cookies(.cookies = c1))

p2$response says "Since your browser does not support JavaScript, you must press the Continue button once to proceed", so:

# click through
f2 <- p2 %>%
    html_form() %>%
    first()
c2 <- cookies(p2)$value
names(c2) <- cookies(p2)$name

p3 <- submit_form(p2, f2, config = set_cookies(.cookies = c2))
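
For context, in SAML-style logins a "press Continue" page like this is typically just a form of hidden inputs (commonly RelayState and SAMLResponse) that gets posted to the service provider. A minimal sketch of inspecting such a form, again with made-up markup; the action URL and field values are assumptions for illustration only:

```r
library(rvest)

# Hypothetical stand-in for the JavaScript-free "Continue" page.
page <- read_html('
  <form action="https://sp.example.org/Shibboleth.sso/SAML2/POST" method="post">
    <input type="hidden" name="RelayState" value="abc123"/>
    <input type="hidden" name="SAMLResponse" value="PHNhbWw+"/>
    <input type="submit" value="Continue"/>
  </form>')

form <- html_node(page, "form")
action <- html_attr(form, "action")   # where the form posts to

# Collect the hidden inputs as a named list of name = value pairs.
hidden <- html_nodes(form, "input[type='hidden']")
fields <- as.list(setNames(html_attr(hidden, "value"), html_attr(hidden, "name")))

action
names(fields)
```

Checking action and names(fields) on the real p2 would at least confirm which host the final post goes to, and therefore which cookies that request needs.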

Sadly, instead of finally reaching "your account", p3 lands us back at the organization-entry page, p0.

One potentially important issue is that c2 contains two JSESSIONID cookies that cookies(p2) shows are for different domains. I don't know what to do about that—I've tried dropping first one then the other from c2 with no luck. Any suggestions? Thanks!
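
To make the duplicate-name problem concrete, here is a minimal sketch of picking cookies by domain instead of dropping one by hand. The table below is invented, shaped like the data frame that cookies() returns; the domains and values are assumptions, and cookies_for() is a hypothetical helper, not part of httr or rvest:

```r
# Hypothetical cookie table with the same kind of columns cookies() returns.
ck <- data.frame(
  domain = c("esds.ac.uk", "idp.example.ac.uk", "sp.example.org"),
  name   = c("ASPSESSIONIDSQAQSSQA", "JSESSIONID", "JSESSIONID"),
  value  = c("aaa", "bbb", "ccc"),
  stringsAsFactors = FALSE
)

# Keep only cookies whose domain equals the request host or is a parent of it.
cookies_for <- function(ck, host) {
  dom  <- sub("^\\.", "", ck$domain)   # stored domains may carry a leading dot
  keep <- host == dom | endsWith(host, paste0(".", dom))
  setNames(ck$value[keep], ck$name[keep])
}

cookies_for(ck, "sp.example.org")
```

The resulting named vector has a single JSESSIONID and could be passed to set_cookies(.cookies = ...) for a request to that host. Whether that is enough to get past the last step here is exactly what I don't know.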

Thurgau answered 9/3, 2017 at 17:8

Comments (5)

You are right. RSelenium is not always trustworthy. But when I tried to sign up, it asked me to wait three days for my username and password credentials. This is interesting. I have worked on several other websites that require passwords, cookies, and sessions, so I will try. - Immiscible

Great! I really appreciate it! - Thurgau

RSelenium implements the API of the underlying projects. In my experience, it is predominantly issues with the underlying projects that cause problems for users. - Remontant

That's fair enough. - Thurgau

Epilogue: I eventually got over my misgivings with RSelenium (using Chrome and keeping file management separate helped) and wrote up my solution as a package, ukds. - Thurgau
