Programmatically scraping a response header within R
I am trying to access the highlighted response header (the Location text in the screenshot below) using only R and its curl-based web-scraping libraries. One can easily get to this point in any web browser by visiting http://www.worldvaluessurvey.org/WVSDocumentationWVL.jsp, clicking the download link for any of the data files, and filling out the agreement form; the download then begins automatically.

[Screenshot: browser developer tools showing the download request's response headers, with the Location header highlighted]

I believe that the only way to obtain a valid cookie is with library(curlconverter) (see How to download a file behind a semi-broken javascript asp function with R), but that answer does not appear to be enough to programmatically determine the HTTP URL of the file, only to download the zipped file once the URL is already known.

I've pasted some code below with the different httr and curlconverter calls that I've played around with, but I'm missing something. Again, the only goal is to programmatically determine the highlighted text, entirely within R (cross-platform).

library(curlconverter)
library(httr)

browserPOST <-
    "curl 'http://www.worldvaluessurvey.org/AJDownload.jsp'
    -H 'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
    -H 'Accept-Encoding:gzip, deflate'
    -H 'Accept-Language:en-US,en;q=0.8'
    -H 'Cache-Control:max-age=0'
    --compressed -H 'Connection:keep-alive'
    -H 'Content-Length:188'
    -H 'Content-Type:application/x-www-form-urlencoded'
    -H 'Cookie:ASPSESSIONIDCASQAACD=IBLGBFOAEHFILMMJJCFEOEMI; JSESSIONID=50DABDEDD0B2FC370C415B4BD1855260; __atuvc=13%7C45; __atuvs=58224f37d312c42400c'
    -H 'Host:www.worldvaluessurvey.org'
    -H 'Origin:http://www.worldvaluessurvey.org'
    -H 'Referer:http://www.worldvaluessurvey.org/AJDownloadLicense.jsp'
    -H 'Upgrade-Insecure-Requests:1'
    -H 'User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'"

form_data <-
    list( 
        ulthost = "WVS" ,
        CMSID = "" ,
        LITITLE = "" ,
        LINOMBRE = "fas" ,
        LIEMPRESA = "asf" ,
        LIEMAIL = "asdf" ,
        LIPROJECT = "asfd" ,
        LIUSE = "1" ,
        LIPURPOSE = "asdf" ,
        LIAGREE = "1" ,
        DOID = "3996" ,
        CndWAVE = "-1" ,
        SAID = "-1" ,
        AJArchive = "WVS Data Archive" ,
        EdFunction = "" ,
        DOP = "" 
    )   



getDATA <- (straighten(browserPOST) %>% make_req)[[1]]()

a <- VERB(verb = "POST", url = "http://www.worldvaluessurvey.org/AJDownload.jsp", 
    httr::add_headers(Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
        `Accept-Encoding` = "gzip, deflate", `Accept-Language` = "en-US,en;q=0.8", 
        `Cache-Control` = "max-age=0", Connection = "keep-alive", 
        `Content-Length` = "188", Host = "www.worldvaluessurvey.org", 
        Origin = "http://www.worldvaluessurvey.org", Referer = "http://www.worldvaluessurvey.org/AJDownloadLicense.jsp", 
        `Upgrade-Insecure-Requests` = "1", `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"), 
    httr::set_cookies(ASPSESSIONIDCASQAACD = "IBLGBFOAEHFILMMJJCFEOEMI", 
        JSESSIONID = "50DABDEDD0B2FC370C415B4BD1855260", `__atuvc` = "13%7C45", 
        `__atuvs` = "58224f37d312c42400c"), encode = "form",body=form_data)
Katzir asked 8/11, 2016 at 23:41 Comment(6)
I've added capitalization and punctuation to your question. Please consider doing this yourself in the future, as we try to maintain good quality standards for the dozens to thousands of people who may be reading this over time. – Hormone
One issue here is that the links are embedded in an iframe that is embedded in another iframe. Scraping those isn't easy, to put it mildly. – Pluckless
Voting as unclear, as per #40498777. – Fleshings
Consider rephrasing the question if you really want to get the answer you want. It's impossible to answer a question asking god-knows-what, and there are no telepaths here. – Fleshings
@Anthony - Just making sure I understand what you want: do you want to create an R script that downloads the files without the user having to enter the registration data manually? If that is your goal, then you can do it with the RSelenium package (headless browser). – Gory
@Gory No, I just want to capture the highlighted URL entirely within R (cross-platform and without external installs). The linked SO question already downloads properly once the URL is known. Thanks. – Katzir

This was a nice challenge!

The problem is not specific to the R language; we would get the same result in any language by just POSTing some data to the download script. We have to deal with a kind of security “pattern” here: the site keeps users from retrieving the file URLs directly, asking them to fill out forms before it provides those links. If a browser can retrieve these links, then so can we, by making the proper HTTP calls. The thing is, we need to know exactly which calls to make, and to find that out we have to watch the individual requests the site issues whenever someone clicks to download. Here is what I found a few calls before the successful 302 POST to AJDownload.jsp:

[Screenshot: network log of the HTTP requests made when a download is initiated]

We can see this clearly in the AJDocumentation.jsp source, which makes these calls using jQuery's $.get:

$.get("http://ipinfo.io?token=xxxxxxxxxxxxxx", function (response) {
    var geodatos=encodeURIComponent(response.ip+"\t"+response.country+"\t"+response.postal+"\t"+
    response.loc+"\t"+response.region+"\t"+response.city+"\t"+
    response.org);

    $.get("jdsStatJD.jsp?ID="+geodatos+
        "&url=http%3A%2F%2Fwww.worldvaluessurvey.org%2FAJDocumentation.jsp&referer=null&cms=Documentation",
        function (resp2) {
    });
}, "jsonp");

Then, a few calls later, we can see the successful POST to /AJDownload.jsp with status 302 Moved Temporarily and the wanted Location in its response headers:

[Screenshot: network log showing the POST to AJDownload.jsp returning 302 with a Location header]

HTTP/1.1 302 Moved Temporarily
Content-Length: 0
Content-Type: text/html
Location: http://www.worldvaluessurvey.org/wvsdc/CO00001/F00003724-WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18.zip
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Thu, 01 Dec 2016 16:24:37 GMT

So this is the site's security mechanism: it uses ipinfo.io to record each visitor's IP address, location, and even ISP organization just before the user initiates a download by clicking a link. The script that receives these data is /jdsStatJD.jsp. I didn't use ipinfo.io or the API key for that service (it's hidden in my screenshots); instead I created a dummy but valid-looking sequence of data, just to satisfy the request. The license form data for the “protected” files are not required at all; the files can be downloaded without posting them.
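
For reference, the same dummy geodata string can be assembled and percent-encoded in base R like this (the values are the placeholders used in the script below; only the tab-separated shape from the jQuery snippet above matters):

# Tab-separated dummy geodata in the shape the site's jQuery code sends:
# ip, country, postal, loc, region, city, org
geodatos <- paste("2.72.48.149", "IT", "undefined", "41.8902,12.4923",
                  "Lazio", "Roma", "Orange SA Telecommunications Corporation",
                  sep = "\t")

# Percent-encode it the way JavaScript's encodeURIComponent() would
URLencode(geodatos, reserved = TRUE)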

Also, the curlconverter library is not required; all we need are simple GET and POST requests with the httr library. One important detail: to keep httr's POST function from following the Location header that arrives with the 302 status on the last call, we have to pass the config setting config(followlocation = FALSE), which lets us read the Location from the response headers instead of being redirected to it.
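
To isolate that one detail, here is a minimal sketch (payload and headers abbreviated; the complete working call is in the script below):

library(httr)

# POST without following the 302 redirect, so the response we get back
# still carries the Location header instead of the file body
response <- POST(
    url = "http://www.worldvaluessurvey.org/AJDownload.jsp",
    config(followlocation = FALSE),
    body = list(DOID = "3724"),   # abbreviated; the full payload is below
    encode = "form")

headers(response)$location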

OUTPUT

My R script can be run from the command line and accepts a numeric DOID parameter identifying the file to fetch. For example, to get the link for WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18, we pass its DOID (3724) at the end of the Rscript call:

Rscript wvs_fetch_downloads.r 3724
[1] "http://www.worldvaluessurvey.org/wvsdc/CO00001/F00003724-WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18.zip"

I have created an R function that returns the location of any file given just its DOID:

getFileById <- function(fileId)

You can remove the command line argument parsing and use the function by passing the DOID directly:

#args <- commandArgs(TRUE)
#if(length(args) == 0) {
#   print("No file id specified. Use './script.r ####'.")
#   quit("no")
#}

#fileId <- args[1]
fileId <- "3724"

# DOID=3843 : WVS_EVS_Integrated_Dictionary_Codebook v_2014_09_22 (Excel)
# DOID=3844 : WVS_Values Surveys Integrated Dictionary_TimeSeries_v_2014-04-25 (Excel)
# DOID=3725 : WVS_Longitudinal_1981-2014_rdata_v_2015_04_18
# DOID=3996 : WVS_Longitudinal_1981-2014_sas_v_2015_04_18
# DOID=3723 : WVS_Longitudinal_1981-2014_spss_v_2015_04_18
# DOID=3724 : WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18

getFileById(fileId)

Final working R script

library(httr)

getFileById <- function(fileId) {
    # Step 1: GET the documentation page to obtain a fresh session cookie
    response <- GET(
        url = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1", 
        add_headers(
            `Accept` = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
            `Accept-Encoding` = "gzip, deflate",
            `Accept-Language` = "en-US,en;q=0.8", 
            `Cache-Control` = "max-age=0",
            `Connection` = "keep-alive", 
            `Host` = "www.worldvaluessurvey.org", 
            `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
            `Content-type` = "application/x-www-form-urlencoded",
            `Referer` = "http://www.worldvaluessurvey.org/AJDownloadLicense.jsp", 
            `Upgrade-Insecure-Requests` = "1"))

    # Extract the session cookie from the Set-Cookie response header
    set_cookie <- headers(response)$`set-cookie`
    cookies <- strsplit(set_cookie, ';')
    cookie <- cookies[[1]][1]

    # Step 2: replay the jdsStatJD.jsp tracking call with dummy geodata,
    # which the site logs before it will serve a download
    response <- GET(
        url = "http://www.worldvaluessurvey.org/jdsStatJD.jsp?ID=2.72.48.149%09IT%09undefined%0941.8902%2C12.4923%09Lazio%09Roma%09Orange%20SA%20Telecommunications%20Corporation&url=http%3A%2F%2Fwww.worldvaluessurvey.org%2FAJDocumentation.jsp&referer=null&cms=Documentation", 
        add_headers(
            `Accept` = "*/*", 
            `Accept-Encoding` = "gzip, deflate",
            `Accept-Language` = "en-US,en;q=0.8", 
            `Cache-Control` = "max-age=0",
            `Connection` = "keep-alive", 
            `X-Requested-With` = "XMLHttpRequest",
            `Host` = "www.worldvaluessurvey.org", 
            `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
            `Content-type` = "application/x-www-form-urlencoded",
            `Referer` = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1",
            `Cookie` = cookie))

    # Step 3: minimal form payload; the license-agreement fields are not required
    post_data <- list( 
        ulthost = "WVS",
        CMSID = "",
        CndWAVE = "-1",
        SAID = "-1",
        DOID = fileId,
        AJArchive = "WVS Data Archive",
        EdFunction = "",
        DOP = "",
        PUB = "")  

    # Step 4: POST to AJDownload.jsp without following the 302 redirect
    response <- POST(
        url = "http://www.worldvaluessurvey.org/AJDownload.jsp", 
        config(followlocation = FALSE),
        add_headers(
            `Accept` = "*/*", 
            `Accept-Encoding` = "gzip, deflate",
            `Accept-Language` = "en-US,en;q=0.8", 
            `Cache-Control` = "max-age=0",
            `Connection` = "keep-alive",
            `Host` = "www.worldvaluessurvey.org",
            `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
            `Content-type` = "application/x-www-form-urlencoded",
            `Referer` = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1",
            `Cookie` = cookie),
        body = post_data,
        encode = "form")

    # The Location header of the 302 response is the direct file URL
    location <- headers(response)$location
    location
}

args <- commandArgs(TRUE)
if(length(args) == 0) {
    print("No file id specified. Use './script.r ####'.")
    quit("no")
}

fileId <- args[1]

# DOID=3843 : WVS_EVS_Integrated_Dictionary_Codebook v_2014_09_22 (Excel)
# DOID=3844 : WVS_Values Surveys Integrated Dictionary_TimeSeries_v_2014-04-25 (Excel)
# DOID=3725 : WVS_Longitudinal_1981-2014_rdata_v_2015_04_18
# DOID=3996 : WVS_Longitudinal_1981-2014_sas_v_2015_04_18
# DOID=3723 : WVS_Longitudinal_1981-2014_spss_v_2015_04_18
# DOID=3724 : WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18

getFileById(fileId)
Turgot answered 2/12, 2016 at 1:43 Comment(9)
Well, you did it - this time. For any other case, and every time anything changes at the site, this will need doing all over again, and it'll be the same kind of challenge: back to the drawing board each time. This really is a task for a headless browser. P.S. I now ironically see this in the side panel: security.stackexchange.com/questions/144155/… – Fleshings
@Fleshings You are opening a big subject about security here. The OP's question is about a specific web application and some specific files. Of course, when the application gets upgraded with a stronger security pattern, this fetch code will break; the developers could even have used a captcha service to protect their forms from bot scraping like this. I just reverse-engineered the HTTP requests to find out what has to be sent to the server in order to get back the wanted result, which is the 302 response with the requested file. That is what the OP is seeking. – Turgot
@Fleshings I wonder how a headless browser can keep such a request working if the site changes its pattern, or if the developers add protections like a captcha. I don't think that's possible, but can you please point to some resources with examples of how a headless browser can still return the wanted result after a security method has totally or partially changed? I don't know much about headless browsers and I'd like to dig deeper into this subject. – Turgot
I'm not saying a headless browser would automagically adapt to changes. I'm saying it will be much easier to adapt, because you don't need to emulate JavaScript logic and you can copy all the needed XPaths right from a normal browser. Of course, your solution is just fine as a one-off task, or if changes are presumed to be rare and/or the effort to detect and keep up with them is tolerable. – Fleshings
So a headless browser does something like parsing each HTTP response body, executing its JavaScript DOM manipulation and XMLHttpRequest calls, and ends up with a document containing all the JavaScript-driven changes and additional calls, am I right? – Turgot
@Fleshings I see you have a point. Using a headless browser, it is much more likely that the script will keep working after some changes or additions. For example, if the developers add calls to more validation requests, the code here will fail, but a headless browser will execute everything according to the source and the script will continue to work! I wasn't much into headless browsers, even if I knew a few things about them. I will try to work on a headless-browser solution for this problem and I'll get back with an update! – Turgot
It's also about the ease. I managed to quickly put together a Selenium script to control a browser and get that URL, but there was no way I could manage to get it with curl. There was an iframe, then after you click the link a nested iframe whose source was added with JS. Then there was the form's action, also added with JS. Then an intermediate JSP page that used ipinfo.io. I have to praise your achievement, @ChristosLytras. +1 – Buckskin
Hi, thanks for making this possible: github.com/ajdamico/asdfree/issues/130. Sorry for not reviewing your work in time to award the first bounty; I'll give you the 300+ one. Thanks very, very much. – Katzir
@AnthonyDamico You're welcome, and thank you for the extra bounty you're offering. I really enjoy reversing and solving this kind of problem. – Turgot

According to the source of the underlying httr::request_perform, the object you get from VERB() looks like this:

res <- response(
  url = resp$url,
  status_code = resp$status_code,
  headers = headers,
  all_headers = all_headers,
  cookies = curl::handle_cookies(handle),
  content = resp$content,
  date = date,
  times = resp$times,
  request = req,
  handle = handle
)

So you're interested in its headers or all_headers (a response is just a structure). If a redirect was involved, all_headers will contain multiple sets of headers as returned by curl::parse_headers(); headers is always the final set.
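
For example, assuming a is the response object from the question's code:

# Final set of headers (the ones curl ended up with)
a$headers$location
headers(a)$location                   # same thing via the accessor

# One entry per request/redirect hop; with a 302, the hop that returned
# it carries the Location header
a$all_headers[[1]]$headers$location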

Fleshings answered 24/11, 2016 at 12:32 Comment(6)
Hi, sorry, I am downvoting because the objects a$headers and a$all_headers from my example in the question do not solve the stated problem. – Katzir
@AnthonyDamico The stated problem is "Programmatically scrape a response header within R", and it does that. If the real problem is "scrape a specific page for you" (i.e. determine which set of requests is needed to programmatically get to the file on that specific page), that's a completely different problem from the one stated. – Fleshings
@AnthonyDamico, I personally think that downvotes are often used too lightly on SO. You posted a question and a person made an effort to help. If you simply explain in a comment why an answer didn't solve the problem, most of the time you will just get more help. You can see how this could be taken as a lack of respect for the effort that was put into trying to help out. – Buckskin
@IvanChaer On the contrary, I personally think the system discourages their use too much where they are due (on poor-answer magnets), so I don't hold any grudges here. The hover text on the downvote reads: "this answer is not useful". If my answer has been useless to a member of the target audience, that's neither good nor bad; it's a fact of life. If it's really as good as I've written it to be, it'll get upvotes eventually: the question title is Google-friendly, and the answer addresses exactly what it says. – Fleshings
@IvanChaer As for your "lack of respect" assumption... don't you think that if the system designers felt the same, they wouldn't have introduced downvotes in the first place? – Fleshings
When there is more than one question, sure, to differentiate. Judiciously, I think, as it can sometimes attract negativity. Apparently that wasn't the case. ;) – Buckskin
