Scrapy Clusters Distributed Crawl Strategy
Scrapy Clusters is awesome. It can be used to perform huge, continuous crawls using Redis and Kafka. It's really durable, but I'm still trying to figure out the finer details of the best logic for my specific needs.

In using Scrapy Clusters I'm able to set up three levels of spiders that sequentially receive urls from one another like so:

site_url_crawler >>> gallery_url_crawler >>> content_crawler

(site_crawler would give something like cars.com/gallery/page:1 to gallery_url_crawler. gallery_url_crawler would give maybe 12 urls to content_crawler that might look like cars.com/car:1234, cars.com/car:1235, cars.com/car:1236, etc. And content_crawler would gather the all-important data we want.)

I can do this by adding the following to gallery_url_crawler.py:

    req = scrapy.Request(url)
    # carry the incoming meta forward onto the new request
    for key in response.meta.keys():
        req.meta[key] = response.meta[key]
    # then retarget the request at the next spider type in the chain
    req.meta['spiderid'] = 'content_crawler1'
    req.meta['crawlid'] = 'site1'

    yield req
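
(As an aside, since the same copy-meta-and-retarget pattern repeats below, it could be wrapped in a small helper. `forward_to` here is just a hypothetical convenience of mine, not part of the Scrapy Cluster API:)

    def forward_to(response, url, spider_id, crawl_id='site1'):
        # Hypothetical helper: copy the incoming meta onto a new request and
        # retarget it at another spider type via the 'spiderid' meta key.
        req = scrapy.Request(url)
        for key in response.meta.keys():
            req.meta[key] = response.meta[key]
        req.meta['spiderid'] = spider_id
        req.meta['crawlid'] = crawl_id
        return req

With that, the snippet above becomes simply `yield forward_to(response, url, 'content_crawler1')`.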

With this strategy I can feed urls from one crawler to another without having to wait for the subsequent crawl to complete. This then creates a queue. To fully utilize Clusters I hope to add more crawlers wherever there is a bottleneck. In this workflow the bottleneck is at the end, when scraping the content. So I experimented with this:

site_url_crawler >>> gallery_url_crawler >>> content_crawler + content_crawler + content_crawler

For lack of a better illustration, the above is just meant to show that I used three instances of the final spider to handle the longer queue.

BUT it seems that each instance of content_crawler waited patiently for the currently running content_crawler to finish. Hence, no boost in throughput.

A final idea I had was something like this:

site_url_crawler >>> gallery_url_crawler >>> content_crawler1 + content_crawler2 + content_crawler3

So I tried to use separate spiders to receive the final queue.

Unfortunately, I could not experiment with this, since I could not pass the kafka message to demo.inbound from gallery_url_crawler.py like so:

    req = scrapy.Request(url)
    for key in response.meta.keys():
        req.meta[key] = response.meta[key]
    req.meta['spiderid'] = 'content_crawler1'
    req.meta['spiderid'] = 'content_crawler2'   # overwrites the line above
    req.meta['crawlid'] = 'site1'

    yield req

(Notice the extra spiderid.) The above did not work, I think because a single message cannot be assigned to two different spiders... And

    req1 = scrapy.Request(url)
    req2 = scrapy.Request(url)

    for key in response.meta.keys():
        req1.meta[key] = response.meta[key]
    req1.meta['spiderid'] = 'content_crawler1'
    req1.meta['crawlid'] = 'site1'

    for key in response.meta.keys():
        req2.meta[key] = response.meta[key]
    req2.meta['spiderid'] = 'content_crawler2'
    req2.meta['crawlid'] = 'site1'

    yield req1
    yield req2

This did not work either, I think because the dupefilter kicked out the second request, seeing it as a dupe of the first.
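
If I revisit this, one thing I would try is Scrapy's standard dont_filter flag on the duplicated request. Whether Scrapy Cluster's distributed scheduler honours that flag is an assumption I have not verified, so treat this as a sketch only:

    req1 = scrapy.Request(url)
    # dont_filter is plain Scrapy's way of asking the dupefilter to let a
    # request through; assuming the cluster's scheduler respects it too.
    req2 = scrapy.Request(url, dont_filter=True)

    for req, spider_id in ((req1, 'content_crawler1'), (req2, 'content_crawler2')):
        for key in response.meta.keys():
            req.meta[key] = response.meta[key]
        req.meta['spiderid'] = spider_id   # send each copy to a different spider type
        req.meta['crawlid'] = 'site1'
        yield req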

Anyway, I just hope to ultimately use Clusters in a way that lets me fire up multiple spider instances at any time, have them pull from the queue, and repeat.

Cockup answered 14/3, 2016 at 6:27
Are you still using scrapy cluster in 2020? I am wondering if there is an alternative, as the project did not receive any commits in the last 2 years. Seems dead. – Indigotin
@Liam Hanninen - Are you still using scrapy-cluster today? If not, what tool? – Morel
Ah, that's too bad. No, I am not using it. But I am not using any tool to scrape. – Cockup

It turns out that distribution of the urls is based on IP addresses. Once I stood up the cluster on separate machines, i.e. a different machine for each spider, the urls flowed and all spiders were pulling from the queue.

http://scrapy-cluster.readthedocs.org/en/latest/topics/crawler/controlling.html

Scrapy Cluster comes with two major strategies for controlling how fast your pool of spiders hit different domains. This is determined by spider type and/or IP Address, but both act upon the different Domain Queues.
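
For reference, the throttle behaviour described above is governed by a handful of settings in the crawler's localsettings.py. The names below come from that docs page, but the values are only examples and defaults may differ between versions:

    # Sketch of the throttle settings from the "Controlling" docs page
    SCHEDULER_TYPE_ENABLED = True   # key the domain queues by spider type
    SCHEDULER_IP_ENABLED = True     # key the domain queues by public IP address
    QUEUE_HITS = 10                 # requests allowed per domain per window
    QUEUE_WINDOW = 60               # rolling window length, in seconds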

Cockup answered 22/3, 2016 at 18:48
