Get text from href tag after specific class

Asked 15/4, 2018 at 10:27 Answered 16/4, 2018 at 8:57

I am trying to scrape a webpage:

library(RCurl)
webpage <- getURL("https://somewebpage.com")

webpage

<div class='CredibilityFacts'><span id='qZyoLu'><a class='answer_permalink'
action_mousedown='AnswerPermalinkClickthrough' href='/someurl/answer/my_id' 
id ='__w2_yeSWotR_link'>
<a class='another_class' action_mousedown='AnswerPermalinkClickthrough' 
href='/ignore_url/answer/some_id' id='__w2_ksTVShJ_link'>
<a class='answer_permalink' action_mousedown='AnswerPermalinkClickthrough' 
href='/another_url/answer/new_id' id='__w2_ksTVShJ_link'>

class(webpage)
[1] "character"

I am trying to extract all the href values, but only from tags that have the answer_permalink class.

The output of this should be

[1] "/someurl/answer/my_id"  "/another_url/answer/new_id"

/ignore_url/answer/some_id should be ignored because its tag has the class another_class, not answer_permalink.

Right now, I am thinking of a regex approach. I think something like this could be used as the pattern in stri_extract_all:

class='answer_permalink'.*href='

but this isn't exactly what I want.

How can I achieve this? Also, apart from regex, is there a function in R that can extract elements by class, as in JavaScript?

Cheryllches answered 15/4, 2018 at 10:27 Comment(2)

You ought to be able to do this with the rvest package using something like read_html(webpage) %>% html_nodes("answer_permalink") %>% html_attr("href") – Anergy 15/4, 2018 at 10:41

@AndrewGustar that returns me character(0). – Cheryllches 15/4, 2018 at 10:45
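
For what it's worth, the character(0) result is likely down to the selector: in CSS, "answer_permalink" without a leading dot is read as an element name, while ".answer_permalink" selects by class. A minimal sketch of the difference, assuming rvest is installed; the sample string is adapted from the question's markup, with the anchor tags closed and placeholder link text added for illustration:

```r
library(rvest)

# Sample markup adapted from the question (closing tags and link text are assumptions)
webpage <- "<div class='CredibilityFacts'><span id='qZyoLu'>
<a class='answer_permalink' href='/someurl/answer/my_id'>a</a>
<a class='another_class' href='/ignore_url/answer/some_id'>b</a>
<a class='answer_permalink' href='/another_url/answer/new_id'>c</a>
</span></div>"

page <- read_html(webpage)

# No dot: a type selector, which looks for <answer_permalink> elements (there are none)
no_dot <- html_attr(html_nodes(page, "answer_permalink"), "href")

# Leading dot: a class selector, which matches the two links we want
with_dot <- html_attr(html_nodes(page, ".answer_permalink"), "href")

print(no_dot)
print(with_dot)
```

Here no_dot comes back empty and with_dot holds the two answer_permalink links.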

With rvest we could do the following (dplyr is loaded here only for the %>% pipe, which rvest also re-exports):

library(rvest)
library(dplyr)

"https://www.quora.com/profile/Ronak-Shah-96" %>% 
  read_html() %>% 
  html_nodes("[class='answer_permalink']") %>% 
  html_attr("href")

[1] "/How-can-we-adjust-in-engineering-if-we-are-not-in-IITs-or-NITs-How-can-we-enjoy-engineering-if-we-are-pursuing-it-from-a-local-private-college/answer/Ronak-Shah-96"
[2] "/Do-you-think-it-is-worth-it-to-change-my-career-path-For-the-past-2-years-I-was-pursuing-a-career-in-tax-advisory-in-a-BIG4-company-I-just-got-a-job-offer-that-will-allow-me-to-learn-coding-It-is-not-that-well-paid/answer/Ronak-Shah-96"
[3] "/Why-cant-India-opt-for-40-hours-work-a-week-for-all-professions-when-it-is-proved-and-working-well-in-terms-of-efficiency/answer/Ronak-Shah-96"
[4] "/Why-am-I-still-confused-and-thinking-about-my-career-after-working-more-than-one-year-in-software-engineering/answer/Ronak-Shah-96"
[5] "/Would-you-rather-be-a-jack-of-all-trades-or-the-master-of-one-trade/answer/Ronak-Shah-96"

Orfinger answered 16/4, 2018 at 1:23 Comment(0)

Instead of string parsing, you could use a package like rvest or xml2:

library(xml2)
xml <- read_html(webpage)
l <- as_list(xml)[[1]][[1]][[1]][[1]]  #not sure why you need to go this deep.

l2 <- l[sapply(l, attr, ".class") == "answer_permalink"]
sapply(l2, attr, "href")

                       a                            a 
 "/someurl/answer/my_id" "/another_url/answer/new_id"
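
Since the nesting depth of the as_list() result seems to vary between setups (see the comments on this answer), an XPath query may be a more stable route, because it does not depend on how deep the links sit in the tree. A rough sketch, assuming xml2 is installed; the sample string is adapted from the question, with closing tags and link text added for illustration:

```r
library(xml2)

# Sample markup adapted from the question (closing tags and link text are assumptions)
webpage <- "<div class='CredibilityFacts'><span id='qZyoLu'>
<a class='answer_permalink' href='/someurl/answer/my_id'>a</a>
<a class='another_class' href='/ignore_url/answer/some_id'>b</a>
<a class='answer_permalink' href='/another_url/answer/new_id'>c</a>
</span></div>"

doc <- read_html(webpage)

# Find every <a> whose class attribute is exactly 'answer_permalink',
# wherever it sits in the document
links <- xml_find_all(doc, "//a[@class='answer_permalink']")

# Read the href attribute from each matching node
hrefs <- xml_attr(links, "href")
print(hrefs)
```

This returns a plain, unnamed character vector, so there is no need to index into the list by hand.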

Subscapular answered 15/4, 2018 at 10:48 Comment(4)

sapply(l, attr, ".class") gives me output as [[1]] NULL . Am I doing something wrong ? and length(l) is 1. – Cheryllches 15/4, 2018 at 10:57

I don't know, I just read your webpage as a string, and am running the exact code above.. – Subscapular 15/4, 2018 at 10:59

Had to go one level less deep. l <- as_list(xml)[[1]][[1]][[1]]. Not sure what changed there. – Cheryllches 15/4, 2018 at 11:4

Yeah I'm not sure how reliable this is, generally Andrew's comment should be right, but I can't get that to work either. – Subscapular 15/4, 2018 at 11:5

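library(XML)
library(RCurl)

# Download the page source as a single character string
doc <- getURL("https://www.quora.com/profile/Ronak-Shah-96")
# Parse it into an internal document so that XPath queries can be run on it
html <- htmlTreeParse(doc, useInternalNodes = TRUE)
# Select only the <a> nodes whose class attribute is exactly 'answer_permalink'
nodes <- getNodeSet(html, "//a[@class='answer_permalink']")
# Pull the href attribute out of each matching node
sapply(nodes, function(x) xmlAttrs(x)[["href"]])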

[1] "/Do-you-think-it-is-worth-it-to-change-my-career-path-For-the-past-2-years-I-was-pursuing-a-career-in-tax-advisory-in-a-BIG4-company-I-just-got-a-job-offer-that-will-allow-me-to-learn-coding-It-is-not-that-well-paid/answer/Ronak-Shah-96"
[2] "/Why-cant-India-opt-for-40-hours-work-a-week-for-all-professions-when-it-is-proved-and-working-well-in-terms-of-efficiency/answer/Ronak-Shah-96"                                                                                             
[3] "/Why-am-I-still-confused-and-thinking-about-my-career-after-working-more-than-one-year-in-software-engineering/answer/Ronak-Shah-96"                                                                                                         
[4] "/Would-you-rather-be-a-jack-of-all-trades-or-the-master-of-one-trade/answer/Ronak-Shah-96"                                                                                                                                                   
[5] "/Is-software-engineering-a-good-career-choice-I-know-it-pays-well-initially-but-if-you-look-at-the-managing-directors-of-most-companies-they-are-people-with-MBAs/answer/Ronak-Shah-96"
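
One caveat worth keeping in mind: @class='answer_permalink' compares the whole attribute value, so a link carrying additional classes would be skipped. The sketch below shows a token-based XPath test as one possible workaround, assuming the XML package is installed; the sample markup and the extra class name highlighted are made up for illustration and do not come from the real page:

```r
library(XML)

# Hypothetical markup: the second link carries an extra, made-up class
html_text <- "<div>
<a class='answer_permalink' href='/someurl/answer/my_id'>a</a>
<a class='answer_permalink highlighted' href='/another_url/answer/new_id'>b</a>
<a class='another_class' href='/ignore_url/answer/some_id'>c</a>
</div>"

doc <- htmlParse(html_text, asText = TRUE)

# Exact comparison of the class attribute: misses the multi-class link
exact <- xpathSApply(doc, "//a[@class='answer_permalink']", xmlGetAttr, "href")

# Token test: pad the class list with spaces and look for the padded class name
token_xpath <- "//a[contains(concat(' ', normalize-space(@class), ' '), ' answer_permalink ')]"
by_token <- xpathSApply(doc, token_xpath, xmlGetAttr, "href")

print(exact)
print(by_token)
```

With this sample, exact finds only the first link, while by_token finds both answer_permalink links and still skips another_class.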

Cerulean answered 16/4, 2018 at 8:57 Comment(0)
