Scraperwiki + lxml. How to get the href attribute of a child of an element with a class?

The page whose URL contains 'alpha' lists many links (hrefs). I would like to collect them from 20 different pages and append each one to the base URL (second-to-last line of the code). The hrefs are in a table: the td has the class mys-elastic mys-left, and the a element inside it carries the href attribute. Any help would be greatly appreciated, as I have been working on this for about a week.

import scraperwiki
import lxml.html

for i in range(1, 11):
    # The HTML scraper for the 20 pages that list all the exhibitors
    url = 'http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page=' + str(i) + '#GotoResults'
    print url
    list_html = scraperwiki.scrape(url)
    # Convert HTML to an lxml object
    root = lxml.html.fromstring(list_html)
    href_element = root.cssselect('td.mys-elastic mys-left a')

    for element in href_element:
        href = href_element.get('href')
        print href

        page_html = scraperwiki.scrape('http://ahr13.mapyourshow.com' + href)
        print page_html
Sherris answered 2/1, 2013 at 9:30 Comment(3)
What is the problem exactly? - Whereinto
How familiar with XPath are you? - Rojo
rds: The problem is that it does not retrieve the href attribute and save it as a variable to append to the base URL later. Jon Clements: I did not really know about it until I looked it up just now. The term is very helpful, thank you. - Sherris
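
A note on the code in the question, offered as a sketch rather than a definitive diagnosis. In CSS a space means "descendant", so `td.mys-elastic mys-left a` looks for an `a` inside an element named `mys-left`; chaining the classes without a space, `td.mys-elastic.mys-left a`, targets a cell that carries both classes. The loop also calls `.get('href')` on the list `href_element` instead of on the loop variable `element`. The Python 3 snippet below shows the same class-token test written in XPath and reads `href` from each element in turn. The sample markup, its hrefs, and the `has_class` helper are made up for illustration.

```python
import lxml.html

# Hypothetical markup standing in for one results page.
SAMPLE = """
<table>
  <tr><td class="mys-elastic mys-left"><a href="/5_0/first.cfm">First</a></td></tr>
  <tr><td class="mys-left mys-elastic"><a href="/5_0/second.cfm">Second</a></td></tr>
  <tr><td class="mys-right"><a href="/5_0/other.cfm">Other</a></td></tr>
</table>
"""


def has_class(name):
    """Build an XPath 1.0 test for 'the class attribute contains this token'."""
    return "contains(concat(' ', normalize-space(@class), ' '), ' %s ')" % name


root = lxml.html.fromstring(SAMPLE)

# Cells that carry both classes, in any order, then their direct <a> children.
query = "//td[%s and %s]/a" % (has_class("mys-elastic"), has_class("mys-left"))
links = root.xpath(query)

# xpath() returns a list, so the attribute is read from each element.
hrefs = [a.get("href") for a in links]
print(hrefs)
```

Unlike an exact `@class="mys-elastic mys-left"` comparison, the token test still matches if the classes appear in a different order or alongside other classes.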

No need to muck about with JavaScript; it's all there in the HTML:

import scraperwiki
import lxml.html

html = scraperwiki.scrape('http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page=1')

root = lxml.html.fromstring(html)
# get the links
hrefs = root.xpath('//td[@class="mys-elastic mys-left"]/a')

for href in hrefs:
    print 'http://ahr13.mapyourshow.com' + href.attrib['href']
Playwriting answered 3/1, 2013 at 10:17 Comment(2)
Thanks man, just what I needed. A quick question though: how would I run XPath or cssselect with scraperwiki on all the URLs that we just scraped? - Sherris
In essence: use attrib to get the href from each matched element, since xpath() returns a list: [a.attrib['href'] for a in root.xpath('//td[@class="mys-elastic mys-left"]/a')] - Chrissychrist
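
As a variation on the answer above, the XPath expression can end in `/@href` so that it returns the attribute values as strings, and `urljoin` from the standard library can build the absolute URLs instead of string concatenation. This is only a sketch: it is written for Python 3 (unlike the Python 2 snippets in this thread), and the sample markup and hrefs are invented.

```python
from urllib.parse import urljoin

import lxml.html

BASE = "http://ahr13.mapyourshow.com"

# Hypothetical markup: one site-relative link and one already-absolute link.
SAMPLE = (
    '<table>'
    '<tr><td class="mys-elastic mys-left"><a href="/5_0/a.cfm">A</a></td></tr>'
    '<tr><td class="mys-elastic mys-left"><a href="http://example.com/b">B</a></td></tr>'
    '</table>'
)

root = lxml.html.fromstring(SAMPLE)

# Ending the path in /@href yields the attribute strings directly.
raw_hrefs = root.xpath('//td[@class="mys-elastic mys-left"]/a/@href')

# urljoin resolves relative hrefs against BASE and leaves absolute ones alone.
absolute = [urljoin(BASE, href) for href in raw_hrefs]
print(absolute)
```

Each URL in `absolute` could then be passed to `scraperwiki.scrape` and parsed with `lxml.html.fromstring` in the same way as the listing page.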
import lxml.html as lh

URL = 'http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page='
BASE = 'http://ahr13.mapyourshow.com'
# Every href found anywhere inside the matching cells of the second table
path = '//table[2]//td[@class="mys-elastic mys-left"]//@href'

results = []
for page in range(1, 21):
    # lh.parse accepts a URL and fetches the page itself
    doc = lh.parse(URL + str(page))
    results.extend(BASE + href for href in doc.xpath(path))

print results
Torres answered 2/1, 2013 at 10:25 Comment(2)
Selenium is very difficult and problematic to set up on Windows. Is there an alternative, or specifically a way to do it with scraperwiki.com? (I am getting ChromeDriver errors.) - Sherris
@PatrickArtounian: sorry for the initial wrong answer, I was in a hurry when I took a look. I have corrected my answer, and it should be fine now. Note that the XPath gets both regular and bold links from the table. - Torres
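
The remark about regular and bold links comes down to the difference between `/a` (direct children only) and `//@href` (any depth). A small Python 3 sketch of that difference follows; it assumes the bold links are wrapped in a `<b>` element, which the thread does not confirm, and the markup is invented.

```python
import lxml.html

# Hypothetical cells: one plain link, one link wrapped in <b>.
SAMPLE = """
<table>
  <tr><td class="mys-elastic mys-left"><a href="/plain">Plain</a></td></tr>
  <tr><td class="mys-elastic mys-left"><b><a href="/bold">Bold</a></b></td></tr>
</table>
"""

root = lxml.html.fromstring(SAMPLE)
cell = '//td[@class="mys-elastic mys-left"]'

# Only <a> elements that are direct children of the cell.
child_only = root.xpath(cell + '/a/@href')

# Any href attribute at any depth below the cell.
any_depth = root.xpath(cell + '//@href')

print(child_only)
print(any_depth)
```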
