R httr post-authentication download works in interactive mode but fails in function

the code below works fine in interactive mode but fails when used in a function. it's pretty simple: two authentication POST commands followed by the data download. my goal is to get this working inside a function, not just in interactive mode.

this question is sort of a sequel to an earlier question. icpsr recently updated their website. the minimal reproducible example below requires a free account, available at

https://www.icpsr.umich.edu/rpxlogin?path=ICPSR&request_uri=https%3a%2f%2fwww.icpsr.umich.edu%2ficpsrweb%2findex.jsp

i tried adding Sys.sleep(1) and various httr::GET/httr::POST calls but nothing worked.

my_download <-
    function( your_email , your_password ){

        values <-
            list(
                agree = "yes",
                path = "ICPSR" ,
                study = "21600" ,
                ds = "" ,
                bundle = "rdata",
                dups = "yes",
                email=your_email,
                password=your_password
            )


        httr::POST("https://www.icpsr.umich.edu/cgi-bin/terms", body = values)
        httr::POST("https://www.icpsr.umich.edu/rpxlogin", body = values)

        tf <- tempfile()
        httr::GET( 
            "https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2" , 
            query = values , 
            httr::write_disk( tf , overwrite = TRUE ) , 
            httr::progress()
        )

    }

# fails 
my_download( "[email protected]" , "some_password" )

# stepping through works
debug( my_download )
my_download( "[email protected]" , "some_password" )

EDIT: the failure simply downloads the page below, as if not logged in, instead of the dataset, so it's losing the authentication for some reason. if you are logged in to icpsr, use private browsing to see the page:

https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2?study=21600&ds=1&bundle=rdata&path=ICPSR

thanks!

Unicorn answered 23/2, 2018 at 17:6 Comment(9)
so where / how exactly does it fail when used via the function? - Filigreed
@Filigreed sorry for not including that. see edit. thank you - Unicorn
icpsr.umich.edu/robots.txt suggests this activity is not authorized (and robots.txt is currently a bona fide technical control upheld in civil courts, at least in the U.S.). Unless one has written permission to automate access, it's not a good idea to pursue this. - Dressage
I suggest ignoring @hrbrmstr's hand-wringing about robots.txt. At the very least, it is not clear that a) your script qualifies as a "robot", or b) respecting the restrictions specified in robots.txt is necessarily a good idea. See en.wikipedia.org/wiki/Robots_exclusion_standard for relatively unbiased information on this issue. - Marcello
For me, running the function a second time works. So it's not about running it line by line, but rather whether it's been run before. In practical terms: just run it twice. - Marcello
@Marcello bizarro. yes, running the three POST and GET commands twice triggers the download within the function. happy to award the bounty if you want to make that an answer. thanks very much! - Unicorn
Nice job encouraging unethical and (depending on the jurisdiction) criminal actions, @Marcello - Dressage
Nice job trying to derail this question with irrelevant opinions @hrbrmstr. If you want to talk about legal issues please take it over to law.stackexchange.com - Marcello
@AnthonyDamico I'm going to look into it a bit more to see if I can actually understand what is happening before writing up an answer. It will be a while before I have time to do that; hopefully someone else will beat me to it. - Marcello

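This sort of thing can happen because of the state (such as cookies) that the httr package stores in the handle it keeps and reuses for each host (see ?handle). That would also explain why the function works on a second run: the cookies set during the first run are still in the handle.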
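
A minimal sketch of how that stored state can be inspected and cleared, assuming the behaviour really does come from cookies left in httr's pooled handle for the host. httr::handle_reset() and httr::cookies() are existing httr functions; the helper names reset_icpsr_state and show_cookies are hypothetical, and none of this has been verified against the ICPSR site.

```r
# Hypothetical helpers; nothing here contacts the network until called.
icpsr_host <- "https://www.icpsr.umich.edu"

# Discard the pooled handle for the host, so the next request starts
# with no cookies, as it would in a fresh R session.
reset_icpsr_state <- function(host = icpsr_host) {
    httr::handle_reset(host)
    invisible(host)
}

# Return the cookies attached to a response,
# which shows what state a given request left behind.
show_cookies <- function(response) {
    httr::cookies(response)
}

# usage (requires network access):
# reset_icpsr_state()
# r <- httr::GET("https://www.icpsr.umich.edu/cgi-bin/bob/")
# show_cookies(r)
```

Calling reset_icpsr_state() before my_download() should reproduce the first-run failure without restarting R, which makes it easier to test whether a given change actually helps.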

In this particular case it remains unclear what exactly makes it work, but one strategy is to include a GET request to https://www.icpsr.umich.edu/cgi-bin/bob/ prior to authenticating and requesting the data. For example,

my_download <-
    function( your_email , your_password ){
        ## for some reason this is required ...
        httr::GET("https://www.icpsr.umich.edu/cgi-bin/bob/")
        values <-
            list(
                agree = "yes",
                path = "ICPSR" ,
                study = "21600" ,
                ds = "" ,
                bundle = "rdata",
                dups = "yes",
                email=your_email,
                password=your_password
            )
        httr::POST("https://www.icpsr.umich.edu/rpxlogin", body = values)
        httr::POST("https://www.icpsr.umich.edu/cgi-bin/terms", body = values)
        tf <- tempfile()
        httr::GET( 
            "https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2" , 
            query = values , 
            httr::write_disk( tf , overwrite = TRUE ) , 
            httr::progress()
        )
    }

appears to work correctly, though it remains unclear what exactly the GET request to https://www.icpsr.umich.edu/cgi-bin/bob/ does or why it is needed.
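
Since the failure mode described in the question is a silent one (a web page gets saved in place of the dataset), it may help to check the response before trusting the file. This is only a sketch: is_html_type and check_download are hypothetical names, and it assumes the not-logged-in page is served with a text/html content type, which has not been confirmed for this site. httr::http_type() is an existing httr function that returns the media type of a response.

```r
# Hypothetical check: TRUE when a Content-Type value indicates an HTML page.
is_html_type <- function(content_type) {
    grepl("^text/html", tolower(content_type))
}

# Stop with an error if the response looks like a web page instead of data.
# ASSUMPTION: the not-logged-in page is served as text/html.
check_download <- function(response) {
    if (is_html_type(httr::http_type(response))) {
        stop("received an HTML page instead of data; probably not logged in")
    }
    invisible(response)
}

# usage: wrap the final GET inside my_download()
# check_download( httr::GET( ... ) )
```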

Marcello answered 4/3, 2018 at 22:20 Comment(0)
