how to download a large binary file with RCurl *after* server authentication
i originally asked this question about performing this task with the httr package, but i don't think it's possible using httr. so i've re-written my code to use RCurl instead -- but i'm still tripping up on something probably related to the writefunction.. but i really don't understand why.

you should be able to reproduce my work by using the 32-bit version of R, so you hit memory limits if you read anything into RAM. i need a solution that downloads directly to the hard disk.

to start, this code works -- the zipped file is appropriately saved to the disk.

library(RCurl)
filename <- tempfile()
f <- CFILE(filename, "wb")
url <- "http://www2.census.gov/acs2011_5yr/pums/csv_pus.zip"
curlPerform(url = url, writedata = f@ref)
close(f)
# 2.1 GB file successfully written to disk

now here's some RCurl code that does not work. as stated in the previous question, reproducing this exactly will require creating an extract on ipums.

your.email <- "you@example.com"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"

library(RCurl)

values <- 
    list(
        "login[email]" = your.email , 
        "login[password]" = your.password , 
        "login[is_for_login]" = 1
    )

curl = getCurlHandle()

curlSetOpt(
    cookiejar = 'cookies.txt', 
    followlocation = TRUE, 
    autoreferer = TRUE, 
    ssl.verifypeer = FALSE,
    curl = curl
)

params <- 
    list(
        "login[email]" = your.email , 
        "login[password]" = your.password , 
        "login[is_for_login]" = 1
    )

html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)

and now that i'm logged in, try the same commands as above, but with the curl object to keep the cookies.

filename <- tempfile()
f <- CFILE(filename, mode = "wb")

this line breaks--

curlPerform(url = extract.path, writedata = f@ref, curl = curl)
close(f)

# the error is:
Error in curlPerform(url = extract.path, writedata = f@ref, curl = curl) : 
  embedded nul in string: [[binary gibberish here]]
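
The error message suggests that the downloaded bytes are being converted to an R character string at some point, and R strings cannot hold NUL bytes. Here is a minimal base-R sketch of that limitation. The byte values are made up for illustration, and it does not claim to show what RCurl does internally:

```r
# Illustrative bytes only: a gzip-style header that contains a NUL byte.
bytes <- as.raw(c(0x1f, 0x8b, 0x08, 0x00, 0x41))

# Converting raw bytes that include 0x00 to a string raises an error;
# capture the message instead of stopping the script.
msg <- tryCatch(rawToChar(bytes), error = function(e) conditionMessage(e))
print(msg)

# Writing the same bytes in binary mode works, because no string is built.
tmp <- tempfile()
writeBin(bytes, tmp)
stopifnot(identical(readBin(tmp, "raw", n = 10), bytes))
```

This is why a download path that never turns the response body into a string is needed for binary files.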

the answer to my previous post referred me to this c-level writefunction answer, but i'm clueless about how to re-create that curl_writer C program (on windows?)..

dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
curlPerform(URL=url, writefunction=writer)

..or why it's even necessary, given that the five lines of code at the top of this question work without anything crazy like getNativeSymbolInfo. i just don't understand why passing in that extra curl object that stores the authentication/cookies and tells it not to verify SSL would cause code that otherwise works.. to break?

Unciform answered 26/6, 2013 at 19:56 Comment(5)
What happens if you edit the working code to add curl = getCurlHandle() and call curlPerform(url = url, writedata = f@ref, curl = curl)? Also, are you able to download other content once the session has started, for example by saving https://usa.ipums.org/usa-action/extract_requests/download with curlPerform and writedata? - Brunell
About the C code, you'd need to compile it into a DLL, and then dyn.load("curl_writer.dll") - Brunell
1) i don't understand how your editing getCurlHandle() is any different from my code? 2) yes, i am able to download other content once the session has started. z <- getBinaryURL( extract.path , curl = curl ) works, but it reads everything into RAM and so doesn't solve my problem. 3) is it possible to do this within R on windows? thanks!! :) - Unciform
Compile the code using Visual C++ or Cygwin, or check this page: stat.ethz.ch/R-manual/R-devel/library/utils/html/SHLIB.html - Brunell
@AstDerek any chance i could convince you to provide a working example start-to-finish? :) this non-R stuff is unintelligible to me.. - Unciform

this is now possible with the httr package. thanks hadley!

https://github.com/hadley/httr/issues/44
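
A rough sketch of how this might look with httr, assuming the `write_disk()` helper is what streams the body to a file. The function name `download_after_login` and its arguments are hypothetical, and the login flow has not been tested against ipums:

```r
# Hypothetical helper: log in with a form POST, then stream a large file
# straight to disk. Requires the httr package to be installed.
download_after_login <- function(login_url, login_fields, file_url,
                                 dest = tempfile()) {
  # httr persists cookies between requests to the same host, so the
  # session cookie from the login response is sent with the later GET.
  login <- httr::POST(login_url, body = login_fields, encode = "form")
  httr::stop_for_status(login)

  # write_disk() saves the response body to `dest` instead of holding
  # it in memory; progress() prints a progress bar while downloading.
  resp <- httr::GET(file_url,
                    httr::write_disk(dest, overwrite = TRUE),
                    httr::progress())
  httr::stop_for_status(resp)
  dest
}
```

Here `login_fields` would be a named list of form fields like the `params` list in the question, and `file_url` would be `extract.path`.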

Unciform answered 2/10, 2014 at 8:59 Comment(0)
  1. Using the code from the linked answer, create a file named curl_writer.c with the contents below and save it to C:\<folder where you save your R files>

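    #include <stdio.h>
    
    /**
     * libcurl write callback: append each received chunk to the FILE*
     * passed in via writedata. (The original code just sent a message
     * to stderr.)
     */
    size_t writer(void *buffer, size_t size, size_t nmemb, void *stream) {
        /* fwrite returns the number of items written; returning fewer bytes
           than were received tells libcurl that the write failed. */
        size_t written = fwrite(buffer, size, nmemb, (FILE *)stream);
        return written * size;
    }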
    
  2. Open a command window, go to the folder where you saved curl_writer.c and compile it with R CMD SHLIB (this needs a C compiler to be installed)

    c:> cd "C:\<folder where you save your R files>"
    c:> R CMD SHLIB -o curl_writer.dll curl_writer.c
    
  3. Open R and run your script

    C:> R
    
    your.email <- "you@example.com"
    your.password <- "password"
    extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"
    
    library(RCurl)
    
    values <- 
        list(
            "login[email]" = your.email , 
            "login[password]" = your.password , 
            "login[is_for_login]" = 1
        )
    
    curl = getCurlHandle()
    
    curlSetOpt(
        cookiejar = 'cookies.txt', 
        followlocation = TRUE, 
        autoreferer = TRUE, 
        ssl.verifypeer = FALSE,
        curl = curl
    )
    
    params <- 
        list(
            "login[email]" = your.email , 
            "login[password]" = your.password , 
            "login[is_for_login]" = 1
        )
    
    html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
    dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)
    
    # Load the DLL you created
    # "writer" is the name of the function
    # "curl_writer" is the name of the dll
    dyn.load("curl_writer.dll")
    writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
    
    # Note that the "URL" parameter is upper case here; in your code it is lowercase.
    # I'm not sure whether that matters.
    # "writer" is the symbol defined above; extract.path is the file to download
    f <- CFILE(filename <- tempfile(), "wb")
    curlPerform(URL=extract.path, writedata=f@ref, writefunction=writer, curl=curl)
    close(f)
    
Brunell answered 6/7, 2013 at 20:7 Comment(5)
thanks!! ..but when i run this in windows - setwd( "C:/My Directory" ) ; cwr <- "#include <stdio.h>\n\nsize_t writer(void *buffer, size_t size, size_t nmemb, void *stream) {\nfwrite(buffer,size,nmemb,(FILE *)stream);\nreturn size * nmemb;\n}" ; writeLines( cwr , "curl_writer.c" ) ; shell( "'C:\\Program Files\\R\\R-3.0.0\\bin\\x64\\Rcmd.exe' SHLIB -o 'C:\\My Directory\\curl_writer.dll' 'C:\\My Directory\\curl_writer.c'" ) - i get "The filename, directory name, or volume label syntax is incorrect. [[snip]] execution failed with error code 1". any idea what's wrong? i want to keep it within R :) - Unciform
Use system2(command="R",args="CMD SHLIB -o curl_writer.dll curl_writer.c") instead of shell(...) - Brunell
thank you again, and sorry if i'm missing something obvious here.. R isn't in my PATH, so i used system2( command = "C:\\Program Files\\R\\R-3.0.0\\bin\\x64\\R.exe" , args = "CMD SHLIB -o curl_writer.dll curl_writer.c" ) but that gave the warning "running command '"C:\Program Files\R\R-3.0.0\bin\x64\R.exe" CMD SHLIB -o curl_writer.dll curl_writer.c' had status 1" and didn't create the .dll file.. :/ - Unciform
I'm not sure to be honest, I'm using a Mac and it works with no problem here. Maybe you need to install a compiler or something? Can you ask somebody else to compile the DLL for you? - Brunell
This solution requires a compiler to be installed. That's usually the case on Linux and Mac, but not on Windows. So you will probably need to manually install a C compiler (and maybe tell R where to find it). - Therefore
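
For anyone who wants to stay inside R and does not have R on the PATH, one possible approach is to call the R binary of the running session. This is a sketch only: `build_shared_lib` is a hypothetical helper name, and it still fails if no C compiler is installed (on Windows that usually means installing Rtools).

```r
# Hypothetical helper: compile a C source file into a shared library using
# the R installation that is currently running, so PATH does not matter.
build_shared_lib <- function(c_file) {
  # .Platform$dynlib.ext is ".dll" on Windows and ".so" elsewhere.
  out <- paste0(tools::file_path_sans_ext(c_file), .Platform$dynlib.ext)

  # R.home("bin") points at the directory holding this session's R binary.
  r_bin <- file.path(R.home("bin"), "R")

  # Quote the file names in case the paths contain spaces.
  status <- system2(r_bin,
                    c("CMD", "SHLIB", "-o", shQuote(out), shQuote(c_file)))
  if (status != 0) {
    stop("R CMD SHLIB failed; check that a C compiler is installed")
  }
  out
}
```

Assuming curl_writer.c is in the working directory, `dyn.load(build_shared_lib("curl_writer.c"))` would then build and load the library in one step.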
