I am trying to scrape a webpage:
library(RCurl)
webpage <- getURL("https://somewebpage.com")
webpage
<div class='CredibilityFacts'><span id='qZyoLu'><a class='answer_permalink'
action_mousedown='AnswerPermalinkClickthrough' href='/someurl/answer/my_id'
id ='__w2_yeSWotR_link'>
<a class='another_class' action_mousedown='AnswerPermalinkClickthrough'
href='/ignore_url/answer/some_id' id='__w2_ksTVShJ_link'>
<a class='answer_permalink' action_mousedown='AnswerPermalinkClickthrough'
href='/another_url/answer/new_id' id='__w2_ksTVShJ_link'>
class(webpage)
[1] "character"
I am trying to extract every href value, but only from tags that carry the answer_permalink class.
The output of this should be
[1] "/someurl/answer/my_id" "/another_url/answer/new_id"
/ignore_url/answer/some_id should be ignored because its tag has the class another_class, not answer_permalink.
Right now, I am thinking of a regex approach. I think a pattern like this could be passed to stri_extract_all:
class='answer_permalink'.*href='
but this isn't exactly what I want.
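The pattern above over-matches because the greedy `.*` can run across tag boundaries, so it can start in one tag and end at the href of a later one. A narrower variant is sketched below in base R rather than stringi. It assumes single-quoted attributes and that class appears before href inside the tag, as in the sample; the `webpage` string here is a stand-in built from that sample, not the real page.

```r
# Stand-in for the character string returned by getURL() (copied from the sample)
webpage <- paste0(
  "<div class='CredibilityFacts'><span id='qZyoLu'>",
  "<a class='answer_permalink' action_mousedown='AnswerPermalinkClickthrough' ",
  "href='/someurl/answer/my_id' id ='__w2_yeSWotR_link'>",
  "<a class='another_class' action_mousedown='AnswerPermalinkClickthrough' ",
  "href='/ignore_url/answer/some_id' id='__w2_ksTVShJ_link'>",
  "<a class='answer_permalink' action_mousedown='AnswerPermalinkClickthrough' ",
  "href='/another_url/answer/new_id' id='__w2_ksTVShJ_link'>"
)

# [^>]* cannot cross a '>' so each match stays inside a single tag
pattern <- "class='answer_permalink'[^>]*href='[^']*'"
tags <- regmatches(webpage, gregexpr(pattern, webpage))[[1]]

# Keep only the text between the quotes that follow href=
hrefs <- sub(".*href='([^']*)'$", "\\1", tags)
hrefs
```

This is fragile by design: it breaks if the attribute order changes or double quotes are used, which is the usual argument for an HTML parser over regex.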
How can I achieve this? Also, apart from regex, is there a function in R that extracts elements by class, as in JavaScript?
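For the select-by-class part, a minimal sketch with the rvest package is given below. It assumes rvest is installed and uses a CSS class selector, where the leading dot in ".answer_permalink" means "class", much like `document.querySelectorAll` in JavaScript. The `webpage` string is a stand-in built from the sample above, not the real page.

```r
library(rvest)  # re-exports read_html() and the %>% pipe

# Stand-in for the character string returned by getURL() (copied from the sample)
webpage <- paste0(
  "<div class='CredibilityFacts'><span id='qZyoLu'>",
  "<a class='answer_permalink' action_mousedown='AnswerPermalinkClickthrough' ",
  "href='/someurl/answer/my_id' id ='__w2_yeSWotR_link'>",
  "<a class='another_class' action_mousedown='AnswerPermalinkClickthrough' ",
  "href='/ignore_url/answer/some_id' id='__w2_ksTVShJ_link'>",
  "<a class='answer_permalink' action_mousedown='AnswerPermalinkClickthrough' ",
  "href='/another_url/answer/new_id' id='__w2_ksTVShJ_link'>"
)

# Parse the string as HTML, select nodes by class, then read their href attribute
hrefs <- read_html(webpage) %>%
  html_nodes(".answer_permalink") %>%
  html_attr("href")
hrefs
```

Without the leading dot, the selector "answer_permalink" is read as a tag name, and since no such tag exists the result is empty.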
rvest package, using something like read_html(webpage) %>% html_nodes("answer_permalink") %>% html_attr("href") – Anergy
character(0). – Cheryllches