Web scraping password protected website using R

I would like to scrape Yammer data using R, but to do so I first have to log in to this page (which is the authentication page for an app that I created):

https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg

Once I log in to this page I am able to get the Yammer data, but only in the browser, through the standard Yammer URLs (https://www.yammer.com/api/v1/messages/received.json).

I have read through similar questions and tried the suggestions, but I still can't get past this issue.

I have tried httr, RSelenium, and rvest + SelectorGadget.

The end goal is to do everything in R (getting the data, cleaning, sentiment analysis). The cleaning and sentiment analysis parts are done, but for now getting the data is manual, and I would like to automate that from R.

1. Trial using httr:

library(httr)

usinghttr <- GET("https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg",
                 authenticate("Username", "Password"))

Corresponding result:

Response [https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg]
  Date: 2015-04-27 12:25
  Status: 200
  Content-Type: text/html; charset=utf-8
  Size: 15.7 kB

The content of this response showed that the request had reached the login page but had not authenticated.

2. Trial using SelectorGadget + rvest

I tried scraping Wikipedia with this method, but couldn't apply it to Yammer, because authentication is required before I can query the HTML tags that SelectorGadget gives.

3. Trial using RSelenium

I tried this with the standard browsers and with PhantomJS, but got errors:

> startServer()
> remDr <- remoteDriver$new()
> remDr$open()
[1] "Connecting to remote server"
Undefined error in RCurl call.
Error in queryRD(paste0(serverURL, "/session"), "POST", qdata = toJSON(serverOpts)) :

> pJS <- phantom()
Error in phantom() : PhantomJS binary not located.

Apposite answered 23/4, 2015 at 8:43 Comment(6)
R is really not great at that and you will end up jumping through some painful hoops. All this has probably been solved for Python or C# or other more common automation languages. You should think of using Python to create your data files, and have R read them. - Quin
Thanks Mike. I saw that there is a Python package called yampy specifically for Yammer, but for now I would like to know whether I can get a quick-and-dirty solution using R. I completely agree that Python would give a more robust solution (Python is on my "next thing to learn" list as of now). - Apposite
Python is not that hard. The syntax is a bit weird (and god help you if you mix tabs and spaces in the same file), but Python is probably one of the easiest-to-learn and most versatile languages out there. It is worth learning. - Quin
Mind if I write this suggestion up as an answer? :) - Quin
You said you tried "httr, RSelenium, rvest + SelectorGadget" but you didn't show what you have tried. - Looselimbed
@Metrics, I have added the code that I tried. It might look a little clumsy, but that is because I'm trying this for the first time and would like to learn how to improve it. - Apposite

I also spent a very long time trying to access password-protected sites from inside R. I finally managed it by submitting the credentials as an HTML form. I had a quick look at the Yammer login page, and it seems similar to the case where I got access.

Here is the code that I used; you need to adapt it to your context. You first start a session on the login page, then navigate to the form that collects the ID and the password, and finally submit the form. I think in your case the code below would work:

library(rvest)
library(magrittr)  # provides %>% and extract2()

# Start a session on the login page
session <- html_session("https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg")

# Locate the login form and fill in the credentials
login_form <- session %>%
  html_nodes("form") %>%
  extract2(1) %>%  # instructions that lead you to the login form; extract2(1) is only an example, adapt it to the page
  html_form() %>%
  set_values(`login` = YourId, `password` = YourPasswd)

# Submit the filled-in form within the same session
logged_in <- session %>% submit_form(login_form)

logged_in should contain the session information after logging in.
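For readers on rvest 1.0.0 or later, where html_session(), set_values() and submit_form() are deprecated, the same three steps might be sketched roughly as follows. This is only a sketch under assumptions: the helper name login_with_form and its arguments are hypothetical, the field names "login" and "password" are taken from the code above and not checked against the live Yammer page, and the form index may need adjusting.

```r
# Hypothetical helper: log in by filling and submitting an HTML form.
# Uses the rvest >= 1.0.0 functions session(), html_form(),
# html_form_set() and session_submit().
login_with_form <- function(login_url, user, password,
                            user_field = "login",        # assumed field name
                            password_field = "password", # assumed field name
                            form_index = 1) {            # assumed position of the login form
  # Step 1: start a session on the login page
  s <- rvest::session(login_url)

  # Step 2: pick the login form out of all forms on the page
  form <- rvest::html_form(s)[[form_index]]

  # Step 3: fill in the credentials, using the field names given above
  values <- stats::setNames(list(user, password),
                            c(user_field, password_field))
  filled <- do.call(rvest::html_form_set, c(list(form), values))

  # Step 4: submit the form; the returned session carries the login cookies
  rvest::session_submit(s, filled)
}

# Usage (not run here: needs network access and real credentials):
# logged_in <- login_with_form(
#   "https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg",
#   user = YourId, password = YourPasswd
# )
```

If the login succeeds, further pages can be requested inside the same session with rvest::session_jump_to(), so the cookies set at login are reused.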

BR

Misbecome answered 30/11, 2015 at 13:3 Comment(0)

What are you trying to achieve with this? If you are just looking to collect data, you can always use the data export API to download the network data for analysis. This requires an Enterprise network.

Haileyhailfellowwellmet answered 8/5, 2015 at 21:21 Comment(4)
That's the constraint: I don't have an admin account and am not an admin of that particular page. With an admin account it would have been very easy to download the network data with the data export API. (I am just a member of the network, so I have the most basic access.) - Apposite
Couldn't you just hit the normal APIs with a bearer token to get as much data as you have access to? - Haileyhailfellowwellmet
So this would flow like this: 1. Create an application (yammer.com/client_applications). 2. Follow the instructions at developer.yammer.com/v1.0/docs/test-token to obtain a token and start making API calls to the publicly documented API described at developer.yammer.com/v1.0/docs/rest-api-rate-limits - Haileyhailfellowwellmet
Hey Brian, I have already done the above, i.e. created a Yammer application and obtained the access token, and I am able to get the whole network feed in my browser by making API calls with the token. What I want is to do all of this through R, i.e. make those API calls from R so that the data arrives directly in R, rather than getting it in my browser, storing it in a text file, and then reading those files into R. To sum up, I want to make the below call using R: link. I am able to do this in my browser and I get the data. - Apposite
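The bearer-token flow discussed in these comments could be sketched in R with httr along the following lines. This is a minimal sketch, not a tested Yammer client: the helper names bearer_header and fetch_yammer are hypothetical, storing the token in an environment variable called YAMMER_TOKEN is an assumption, and the endpoint URL is the one quoted in the question.

```r
# Build the value of the Authorization header for an OAuth 2 bearer token.
bearer_header <- function(token) {
  stopifnot(is.character(token), length(token) == 1, nzchar(token))
  paste("Bearer", token)
}

# Hypothetical helper: call one REST endpoint with the token and parse
# the JSON body into an R list.
# Assumption: the token obtained from the app is stored in the
# environment variable YAMMER_TOKEN.
fetch_yammer <- function(url, token = Sys.getenv("YAMMER_TOKEN")) {
  resp <- httr::GET(
    url,
    httr::add_headers(Authorization = bearer_header(token))
  )
  httr::stop_for_status(resp)  # turn 4xx/5xx responses into R errors
  body <- httr::content(resp, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(body, simplifyVector = FALSE)
}

# Usage (not run here: needs a valid token and network access):
# received <- fetch_yammer("https://www.yammer.com/api/v1/messages/received.json")
# str(received, max.level = 1)
```

Sending the token in a header skips the login page entirely, which is why authenticate() in the httr trial from the question did not help: that function sends HTTP basic authentication, while the Yammer login page expects a submitted form.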
