Multiple nested request with scrapy
Asked Answered
S

1

2

I try to scrap some airplane schedule information on www.flightradar24.com website for research project.

The hierarchy of json file i want to obtain is something like that :

Object ID
 - country
   - link
   - name
   - airports
     - airport0 
       - code_total
       - link
       - lat
       - lon
       - name
       - schedule
          - ...
          - ...
      - airport1 
       - code_total
       - link
       - lat
       - lon
       - name
       - schedule
          - ...
          - ...

Country and Airport are stored using items, and as you can see on json file the CountryItem (link, name attribute) finally store multiple AirportItem (code_total, link, lat, lon, name, schedule) :

class CountryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    airports = scrapy.Field()
    other_url= scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

class AirportItem(scrapy.Item):
    name = scrapy.Field()
    code_little = scrapy.Field()
    code_total = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    link = scrapy.Field()
    schedule = scrapy.Field()

Here my scrapy code AirportsSpider to do that :

class AirportsSpider(scrapy.Spider):
    name = "airports"
    start_urls = ['https://www.flightradar24.com/data/airports']
    allowed_domains = ['flightradar24.com']

    def clean_html(self, html_text):
        soup = BeautifulSoup(html_text, 'html.parser')
        return soup.get_text()

    rules = [
    # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LxmlLinkExtractor(allow=('data/airports/',)), callback='parse')
    ]


    def parse(self, response):
        count_country = 0
        countries = []
        for country in response.xpath('//a[@data-country]'):
            if count_country > 5:
                break
            item = CountryItem()
            url =  country.xpath('./@href').extract()
            name = country.xpath('./@title').extract()
            item['link'] = url[0]
            item['name'] = name[0]
            count_country += 1
            countries.append(item)
            yield scrapy.Request(url[0],meta={'my_country_item':item}, callback=self.parse_airports)

    def parse_airports(self,response):
        item = response.meta['my_country_item']
        airports = []

        for airport in response.xpath('//a[@data-iata]'):
            url = airport.xpath('./@href').extract()
            iata = airport.xpath('./@data-iata').extract()
            iatabis = airport.xpath('./small/text()').extract()
            name = ''.join(airport.xpath('./text()').extract()).strip()
            lat = airport.xpath("./@data-lat").extract()
            lon = airport.xpath("./@data-lon").extract()

            iAirport = AirportItem()
            iAirport['name'] = self.clean_html(name)
            iAirport['link'] = url[0]
            iAirport['lat'] = lat[0]
            iAirport['lon'] = lon[0]
            iAirport['code_little'] = iata[0]
            iAirport['code_total'] = iatabis[0]

            airports.append(iAirport)

        for airport in airports:
            json_url = 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin\[\]=&plugin-setting\[schedule\]\[mode\]=&plugin-setting\[schedule\]\[timestamp\]={timestamp}&page=1&limit=50&token='.format(code=airport['code_little'], timestamp="1484150483")
            yield scrapy.Request(json_url, meta={'airport_item': airport}, callback=self.parse_schedule)

        item['airports'] = airports

        yield {"country" : item}

    def parse_schedule(self,response):

        item = response.request.meta['airport_item']
        jsonload = json.loads(response.body_as_unicode())
        json_expression = jmespath.compile("result.response.airport.pluginData.schedule")
        item['schedule'] = json_expression.search(jsonload)

Explanation :

  • In my first parse, i call a request on for each country link i found whith the CountryItem created via meta={'my_country_item':item}. Each of these request callback self.parse_airports

  • In my second level of parse parse_airports, i catch CountryItem created using item = response.meta['my_country_item'] and i create a new item iAirport = AirportItem() for each airport i found into this country page. Now i want to get schedule information for each AirportItem created and stored in airports list.

  • In the second level of parse parse_airports, i run a for loop on airports to catch schedule information using a new Request. Because i want to include this schedule information into my AirportItem, i include this item into meta information meta={'airport_item': airport}. The callback of this request run parse_schedule

  • In the third level of parse parse_schedule, i inject the schedule information collected by scrapy into the AirportItem previously created using response.request.meta['airport_item']

But i have a problem in my source code, scrapy correctly scrap all the informations (country, airports, schedule), but my comprehension of nested item seems not correct. As you can see the json i produced contain country > list of (airport), but not country > list of (airport > schedule )

enter image description here

My code is on github : https://github.com/IDEES-Rouen/Flight-Scrapping

Stamp answered 13/1, 2017 at 11:56 Comment(0)
V
4

The issue is that you fork your item, where according to your logic you only want 1 item per country, so you can't yield mutltiple items at any point after parsing the country. What you want to do is stack all of them into one item.
To do that you need to create a parsing loop:

def parse_airports(self, response):
    item = response.meta['my_country_item']
    item['airports'] = []

    for airport in response.xpath('//a[@data-iata]'):
        url = airport.xpath('./@href').extract()
        iata = airport.xpath('./@data-iata').extract()
        iatabis = airport.xpath('./small/text()').extract()
        name = ''.join(airport.xpath('./text()').extract()).strip()
        lat = airport.xpath("./@data-lat").extract()
        lon = airport.xpath("./@data-lon").extract()

        iAirport = dict()
        iAirport['name'] = 'foobar'
        iAirport['link'] = url[0]
        iAirport['lat'] = lat[0]
        iAirport['lon'] = lon[0]
        iAirport['code_little'] = iata[0]
        iAirport['code_total'] = iatabis[0]
        item['airports'].append(iAirport)

    urls = []
    for airport in item['airports']:
        json_url = 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin\[\]=&plugin-setting\[schedule\]\[mode\]=&plugin-setting\[schedule\]\[timestamp\]={timestamp}&page=1&limit=50&token='.format(
            code=airport['code_little'], timestamp="1484150483")
        urls.append(json_url)
    if not urls:
        return item

    # start with first url
    next_url = urls.pop()
    return Request(next_url, self.parse_schedule,
                   meta={'airport_item': item, 'airport_urls': urls, 'i': 0})

def parse_schedule(self, response):
    """we want to loop this continuously for every schedule item"""
    item = response.meta['airport_item']
    i = response.meta['i']
    urls = response.meta['airport_urls']

    jsonload = json.loads(response.body_as_unicode())
    item['airports'][i]['schedule'] = 'foobar'
    # now do next schedule items
    if not urls:
        yield item
        return
    url = urls.pop()
    yield Request(url, self.parse_schedule,
                  meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
Vanillic answered 13/1, 2017 at 12:32 Comment(4)
Hi Granitosaurus, i try the code after some minor typo correction, but any Item is written by pipeline.Stamp
@Stamp hey, I did fix up some typos and tested this code and this does work. These are the results for one country item: pastebin.ubuntu.com/23798326 I got.Vanillic
That is bad solution. Reduces scraping efficency,no exceptions proof, not sutable when there is a lots of child items and you plan to insert in DB. Read more in my questionLesleelesley
@Lesleelesley if you have a better solution feel free to post an answer :)Vanillic

© 2022 - 2024 — McMap. All rights reserved.