How to properly set cookies to get URL content using httr

I need to download information from a website that is protected by cookies. I pass this protection manually (by logging in through the browser) and then supply the resulting cookies to httr.

Here is a similar topic, but it does not solve my problem: (Copying cookie for httr)

library(httr)
url<-"http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ"

cook<-"_SMIDA=9117a9eb136353bd6956651bd59acd37; __utmt=1; __utma=29983421.1729484844.1413489369.1413625619.1413627797.3; __utmb=29983421.7.10.1413627797; __utmc=29983421; __utmz=29983421.1413489369.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"

response <- GET(url, config(cookie= cook))

content(x = response,as = 'text', encoding = "UTF-8")   

But when I call content(), the returned page says that I am not logged in (the same result I get without any cookie).

How can I solve this problem?

(Test credentials are login: mytest2, pass: qwerty12)

Doridoria answered 18/10, 2014 at 16:17 Comment(0)

This is how to pass cookies to GET with httr's set_cookies:

GET("http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ", 
    set_cookies(`_SMIDA` = "7cf9ea4bfadb60bbd0950e2f8f4c279d",
                `__utma` = "29983421.138599299.1413649536.1413649536.1413649536.1",
                `__utmb` = "29983421.5.10.1413649536",
                `__utmc` = "29983421",
                `__utmt` = "1",
                `__utmz` = "29983421.1413649536.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"))

That worked for me, or at least I think it did, as I cannot read the language. A table comes back with the same structure and no prompt to log in.

Unfortunately the captcha on login prevents the use of RSelenium (or other, similar, crawling packages), so you'll have to continue to manually grab those cookies (or use a utility to extract them from the session).
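
If the cookies are copied from the browser as a single "name=value; name2=value2" string, as in the question, they can be split into a named character vector first. The sketch below is only an illustration: parse_cookie_string is a hypothetical helper (not part of httr), and it assumes the string uses ";" as the separator. The result could then be passed through the .cookies argument of set_cookies.

```r
# Hypothetical helper (not part of httr): split a "name=value; name2=value2"
# string into a named character vector.
parse_cookie_string <- function(cookie_string) {
  pairs <- strsplit(cookie_string, ";", fixed = TRUE)[[1]]
  pairs <- trimws(pairs)
  pairs <- pairs[nzchar(pairs)]
  # Split on the first "=" only, because values may contain "=" themselves
  eq <- regexpr("=", pairs, fixed = TRUE)
  keys <- substr(pairs, 1, eq - 1)
  values <- substr(pairs, eq + 1, nchar(pairs))
  stats::setNames(values, keys)
}

# Shortened version of the cookie string from the question
cook <- "_SMIDA=9117a9eb136353bd6956651bd59acd37; __utmt=1; __utmz=29983421.1413489369.1.1.utmcsr=(direct)"
cookies <- parse_cookie_string(cook)

# Not run here (needs network access and a still-valid session):
# response <- httr::GET(url, httr::set_cookies(.cookies = cookies))
```

Splitting on the first "=" matters for values such as __utmz, which contain "=" themselves.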

Finally, you probably want to seriously consider changing those credentials, now :-)


EDIT: @VadymB and I both found that the code didn't work until we rebooted RStudio. Your mileage may vary.

Sapling answered 18/10, 2014 at 16:39 Comment(5)
Thanks, it helped! But it was really strange: this code didn't work until I rebooted RStudio =\ - Doridoria
Can you also explain the following: if I run this code a second time, it doesn't work, because the site rejects these cookies. I tried reset_config() but nothing happened. This is a real problem for me, because I'd like to create 5-10 accounts and download data simultaneously. - Doridoria
+1. This should really be part of the documentation as an example, because this syntax is not discoverable otherwise. - Shipload
@VadymB: Same thing for me. This code didn't work until I rebooted RStudio! - Anthozoan
Not specific to this example, but I ran into it when trying to set cookies: when I grabbed them from Chrome DevTools, the cookie values were often URL-encoded. They need to be passed through URLdecode, otherwise I think httr tries to re-encode them. - Dagall
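
To illustrate the last comment, here is a minimal sketch of decoding a percent-encoded cookie value with utils::URLdecode before handing it to set_cookies. The encoded string is a made-up example modelled on the __utmz value from the question, not one copied from a real session.

```r
# A cookie value as it might look when copied from browser dev tools
# (hypothetical example: "=" appears as %3D and "|" as %7C)
encoded <- "29983421.1413489369.1.1.utmcsr%3D(direct)%7Cutmccn%3D(direct)%7Cutmcmd%3D(none)"

# Decode it back to the raw value
decoded <- utils::URLdecode(encoded)
print(decoded)

# Not run here (needs network access):
# httr::GET(url, httr::set_cookies(`__utmz` = decoded))
```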

You can just try this:

url <- "http://httpbin.org/get"
httr::GET(url)
httr::GET(url, httr::add_headers(a = 1, b = 2))
httr::GET(url, httr::set_cookies(a = 1, b = 2))
httr::GET(url, httr::add_headers(a = 1, b = 2), httr::set_cookies(a = 1, b = 2))
httr::GET(url, httr::add_headers(a = 1, b = 2, cookie = 'c=3;d=4'), httr::set_cookies(a = 1, b = 2))
# code adapted from: https://httr.r-lib.org/reference/GET.html

These are the commands with their output:

httr::GET(url)
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 378 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#|     "X-Amzn-Trace-Id": "Root=1-66a99dfc-3ee62d216a517e6844e8815f"
#|   }, 
#|   "origin": "101.200.73.219", 
#| ...

httr::GET(url, httr::add_headers(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 408 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "A": "1", 
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "B": "2", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#|     "X-Amzn-Trace-Id": "Root=1-66a99dfc-2fddaa4e49a8325309990191"
#| ...

httr::GET(url, httr::set_cookies(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 404 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "Cookie": "a=1;b=2", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#|     "X-Amzn-Trace-Id": "Root=1-66a99dfc-44b9d09700c6b7f87e086e40"
#|   }, 
#| ...

httr::GET(url, httr::add_headers(a = 1, b = 2), httr::set_cookies(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 434 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "A": "1", 
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "B": "2", 
#|     "Cookie": "a=1;b=2", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#| ...

httr::GET(url, httr::add_headers(a = 1, b = 2, cookie = 'c=3;d=4'), httr::set_cookies(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 434 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "A": "1", 
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "B": "2", 
#|     "Cookie": "c=3;d=4", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#| ...

So httr::set_cookies behaves like a wrapper around httr::add_headers, but httr::add_headers takes priority when both are used to set cookies: in the last request above, the Cookie header is "c=3;d=4", not "a=1;b=2".

Still, httr::set_cookies(...) is easier to read than httr::add_headers(cookie = ...), so I think you can just use it.

Countercharge answered 31/7 at 2:24 Comment(0)
