CSS selectors to be used for scraping specific links

Asked 16/7, 2014 at 19:27 Answered 28/7, 2014 at 16:17

I am new to Python and working on a scraping project. I use Firebug to copy the CSS path of the links I need. I am trying to collect the links under the "UPCOMING EVENTS" tab on http://kiascenehai.pk/, purely as an exercise in learning how to get specific links.

I am looking for a fix for this problem, and for suggestions on how to retrieve specific links using CSS selectors.

from bs4 import BeautifulSoup
import requests

url = "http://kiascenehai.pk/"

r = requests.get(url)

data = r.text

soup = BeautifulSoup(data, "html.parser")

for link in soup.select("html body div.body-outer-wrapper div.body-wrapper.boxed-mode div.main-outer-wrapper.mt30 div.main-wrapper.container div.row.row-wrapper div.page-wrapper.twelve.columns.b0 div.row div.page-wrapper.twelve.columns div.row div.eight.columns.b0 div.content.clearfix section#main-content div.row div.six.columns div.small-post-wrapper div.small-post-content h2.small-post-title a"):
    print(link.get('href'))

Absorb answered 16/7, 2014 at 19:27 Comment(3)

Can you help me fix this and learn how to write useful CSS selectors? @Martijn Pieters – Absorb 17/7, 2014 at 7:49

The URL you are loading asks for a city to be picked at http://kiascenehai.pk/select_city?url=http%3A%2F%2Fkiascenehai.pk%2F, and contains no upcoming events for me. When I pick 'Lahore', say, a cookie is set. You need to make sure that requests does the same. – Passade 17/7, 2014 at 11:19

@MartijnPieters how can that be done? – Absorb 17/7, 2014 at 11:34

First of all, that page requires a city to be selected, and the choice is stored in a cookie. Use a Session object to handle this:

import requests

# A Session keeps the cookie set by the city-selection POST for later requests
s = requests.Session()
s.post('http://kiascenehai.pk/select_city/submit_city', data={'city': 'Lahore'})
response = s.get('http://kiascenehai.pk/')

Now the response contains the actual page content instead of a redirect to the city selection page.
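To see offline that a Session really does attach a stored cookie to later requests, here is a minimal sketch. The cookie name `city` and the helper `cookie_header_for` are made up for illustration; the site may well use a different cookie name, and in practice the cookie is set by the server's response to the POST rather than by hand.

```python
import requests


def cookie_header_for(session, url):
    """Return the Cookie header the session would send to `url` (or None)."""
    # prepare_request merges session state (cookies, headers) into the
    # request without sending anything over the network.
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")


s = requests.Session()
# Hypothetical cookie name and value, set manually here only to imitate
# what the city-selection POST would leave behind in the session.
s.cookies.set("city", "Lahore", domain="kiascenehai.pk", path="/")

same_site = cookie_header_for(s, "http://kiascenehai.pk/")
other_site = cookie_header_for(s, "http://example.com/")

print(same_site)   # the cookie is attached for the matching domain
print(other_site)  # no cookie is attached for an unrelated domain
```

A plain requests.get() call has no such stored state, which is why the original code kept landing on the city selection page.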

Next, keep your CSS selector no larger than needed. In this page there isn't much to go on as it uses a grid layout, so we first need to zoom in on the right rows:

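from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
# Locate the upcoming-events header, then the first 'row' element after it
upcoming_events_header = soup.find('div', class_='featured-event')
upcoming_events_row = upcoming_events_header.find_next(class_='row')

for link in upcoming_events_row.select('h2 a[href]'):
    print(link['href'])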
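If you want to experiment with this approach without hitting the site, here is a minimal sketch on a made-up HTML fragment. The markup only imitates the grid layout described above and is an assumption, as is the helper name `upcoming_links`; the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a grid layout; not copied from the real page.
HTML = """
<div class="row">
  <h2><a href="/before">Before the header</a></h2>
</div>
<div class="featured-event">UPCOMING EVENTS</div>
<div class="row">
  <div class="six columns"><h2><a href="/event/1">Event 1</a></h2></div>
  <div class="six columns"><h2><a href="/event/2">Event 2</a></h2></div>
  <div class="six columns"><h2><a>No href</a></h2></div>
</div>
"""


def upcoming_links(html):
    """Collect hrefs from the first 'row' that follows the featured-event div."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("div", class_="featured-event")
    if header is None:
        return []
    # find_next only looks at elements that come after `header`
    # in document order, so the earlier row is never considered.
    row = header.find_next(class_="row")
    if row is None:
        return []
    # A short selector is enough once the search is scoped to the right row;
    # a[href] skips anchors that have no href attribute.
    return [a["href"] for a in row.select("h2 a[href]")]


print(upcoming_links(HTML))  # ['/event/1', '/event/2']
```

The None checks are there because find() and find_next() return None when nothing matches, which would otherwise surface as an AttributeError on the next call.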

Passade answered 17/7, 2014 at 12:59 Comment(4)

Could you explain why you used the .find_next method here? What is it actually for? @Martijn Pieters – Absorb 17/7, 2014 at 13:50

@Flecha: see the .find_next() documentation; it scans through the element tree finding the requested element, but doesn't search the whole document, only after the starting point. – Passade 17/7, 2014 at 13:52

@Flecha: so it is just like a normal find, but it doesn't look at anything before the upcoming_events_header element. – Passade 17/7, 2014 at 13:52

@Flecha: the accept is much appreciated, btw. Did the code in the other answer also work for you? – Passade 17/7, 2014 at 13:53

This is the co-founder of KiaSceneHai.pk. Please don't scrape websites; a lot of effort goes into collecting the data. We offer access through our API, and you can use the contact form to request access. Thank you.

Receivable answered 28/7, 2014 at 16:17 Comment(0)
