Scrapy: how to populate hierarchic items with multiple requests

This is an extension of "Multiple nested request with scrapy". I am asking because the solution presented there has flaws:
1. It eliminates asynchrony, thus heavily reducing scraping efficiency
2. If an exception occurs while processing the link "stack", no item will be yielded
3. What if there is a huge number of child items?

To deal with (1) I considered this:

import threading

from scrapy.loader import ItemLoader


class CatLoader(ItemLoader):

    def __init__(self, item=None, selector=None, response=None, parent=None, **context):
        super(CatLoader, self).__init__(item, selector, response, parent, **context)
        # Guards the counter of outstanding child requests.
        self.lock = threading.Lock()
        self.counter = 0

    def dec_counter(self):
        # Decrement under the lock; the caller checks for zero afterwards.
        with self.lock:
            self.counter -= 1

Then in parser:

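    if len(urls) == 0:
        # Nothing to follow: emit the category item right away and stop.
        self.logger.warning('Cat without items, url: ' + response.url)
        item = cl.load_item()
        yield item
        return
    # One pending child response per URL; the last one yields the item.
    cl.counter = len(urls)
    for url in urls:
        rq = Request(url, self.parse_item)
        rq.meta['loader'] = cl
        yield rq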

And in parse_item() I can do:

def parse_item(self, response):
    l = response.meta['loader']

    l.dec_counter()
    if l.counter == 0:
        yield l.load_item()

BUT! To deal with (2) I need to do this in each function:

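def parse_item(self, response):
    # Look the loader up before the try block so it is always bound in finally.
    l = response.meta['loader']
    try:
        pass  # parse the response and populate the loader here
    finally:
        l.dec_counter()
        if l.counter == 0:
            yield l.load_item()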

I do not consider this an elegant solution. Could anyone suggest a better one? Also, I intend to insert the items into a DB rather than write JSON output, so maybe it would be better to create the item with a promise and write a pipeline that parses the children and checks whether the promise is fulfilled (when the item is inserted into the DB), or something like that?
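One way to keep that bookkeeping in a single place, sketched here under assumptions rather than offered as an established Scrapy pattern, is a decorator around the child callbacks. The names `PendingLoader`, `completes_child`, `DemoSpider` and `run_demo` are hypothetical, the three-argument callback signature is invented for this example, and the block avoids Scrapy imports so that it runs on its own. Unlike the try/finally version, this sketch logs a failing callback and carries on instead of re-raising.

```python
import functools
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


class PendingLoader:
    """Hypothetical stand-in for CatLoader, kept free of Scrapy imports."""

    def __init__(self, pending):
        self.pending = pending  # child responses still outstanding
        self.values = []

    def add_value(self, value):
        self.values.append(value)

    def load_item(self):
        return {"children": list(self.values)}


def completes_child(callback):
    """Keep the counter bookkeeping out of the individual callbacks."""

    @functools.wraps(callback)
    def wrapper(self, response):
        loader = response.meta["loader"]
        try:
            # Pass through anything the wrapped callback yields.
            yield from callback(self, response, loader)
        except Exception:
            # Design choice of this sketch: log the failure and carry on.
            logger.exception("child callback failed: %s", response.url)
        loader.pending -= 1
        if loader.pending == 0:
            # Last outstanding child: emit the assembled parent item.
            yield loader.load_item()

    return wrapper


class DemoSpider:
    @completes_child
    def parse_item(self, response, loader):
        if response.url.endswith("/broken"):
            raise ValueError("unparseable page")
        loader.add_value(response.url)
        yield from ()  # a real callback could yield follow-up requests here


def run_demo():
    """Feed three fake responses through the callback; the second one fails."""
    loader = PendingLoader(pending=3)
    spider = DemoSpider()
    results = []
    urls = ("http://example.com/a", "http://example.com/broken", "http://example.com/b")
    for url in urls:
        response = SimpleNamespace(url=url, meta={"loader": loader})
        results.extend(spider.parse_item(response))
    return results
```

The wrapper only covers errors raised inside the callback. A request that fails before any callback runs (for example a download error) would need separate handling, such as the `errback` argument that Scrapy's `Request` accepts.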

UPD: Hierarchic items: category -> article -> images. All are to be saved in different tables with proper relations. So: 1) Articles must be inserted into their table AFTER the category. 2) An article must know the ID of its category to form the relation. The same applies to image records.
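The ordering constraint above can be illustrated with a minimal sketch using the standard-library `sqlite3` module. The table layout, column names and helper functions are assumptions made for this example, not part of the original spider. The idea is that each parent row is inserted first and its generated ID is handed to the children; in a spider that ID could travel in the request `meta`, as the loader does above, so children would not have to be collected in memory.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical three-level schema: category -> article -> image.
    conn.executescript(
        """
        CREATE TABLE category (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE article (
            id INTEGER PRIMARY KEY,
            category_id INTEGER NOT NULL REFERENCES category(id),
            title TEXT NOT NULL
        );
        CREATE TABLE image (
            id INTEGER PRIMARY KEY,
            article_id INTEGER NOT NULL REFERENCES article(id),
            url TEXT NOT NULL
        );
        """
    )


def insert_category(conn, name):
    cur = conn.execute("INSERT INTO category (name) VALUES (?)", (name,))
    return cur.lastrowid  # ID that child articles will reference


def insert_article(conn, category_id, title):
    cur = conn.execute(
        "INSERT INTO article (category_id, title) VALUES (?, ?)",
        (category_id, title),
    )
    return cur.lastrowid  # ID that child images will reference


def insert_image(conn, article_id, url):
    cur = conn.execute(
        "INSERT INTO image (article_id, url) VALUES (?, ?)",
        (article_id, url),
    )
    return cur.lastrowid


def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the relations
    create_schema(conn)
    # Parents first, so every child can carry its parent's ID.
    category_id = insert_category(conn, "news")
    article_id = insert_article(conn, category_id, "First article")
    insert_image(conn, article_id, "http://example.com/1.png")
    insert_image(conn, article_id, "http://example.com/2.png")
    conn.commit()
    rows = conn.execute(
        """
        SELECT c.name, a.title, i.url
        FROM image AS i
        JOIN article AS a ON a.id = i.article_id
        JOIN category AS c ON c.id = a.category_id
        ORDER BY i.id
        """
    ).fetchall()
    conn.close()
    return rows
```

With this shape, each level is written as soon as it is scraped, and no counter is needed to decide when a parent item is complete.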

Primula answered 23/9, 2017 at 19:30 Comment(4)
Please explain your use case. The question is itself an answer, so I would prefer to see what you want to do and then propose a possible solution. - Circumpolar
I want to scrape hierarchic items and save them to a DB. Yes, the question is an answer. I am just confused, because this is a pretty common situation and I see no official documentation on how to deal with it. I don't like that I need to create my own, not very good-looking, solution for this case. - Primula
Yes, I am trying to understand what you mean by hierarchic items. - Circumpolar
Updated the question. - Primula
