Scrapy Clusters Distributed Crawl Strategy
Scrapy Clusters is awesome. It can be used to perform huge, continuous crawls using Redis and Kafka. It's really durable, but I'm still trying to figure out the finer details of the best logic for my specific needs.

In using Scrapy Clusters I'm able to set up three levels of spiders that sequentially receive urls from one another like so:

site_url_crawler >>> gallery_url_crawler >>> content_crawler

(site_crawler would give something like cars.com/gallery/page:1 to gallery_url_crawler. gallery_url_crawler would give maybe 12 urls to content_crawler that might look like cars.com/car:1234, cars.com/car:1235, cars.com/car:1236, etc. And content_crawler would gather the all-important data we want.)

I can do this by adding the following to gallery_url_crawler.py:

    req = scrapy.Request(url)
    # carry the incoming meta forward onto the new request
    for key in response.meta.keys():
        req.meta[key] = response.meta[key]
    # then retarget the request at the next spider type in the chain
    req.meta['spiderid'] = 'content_crawler1'
    req.meta['crawlid'] = 'site1'

    yield req
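
(As an aside, since the same copy-meta-and-retarget pattern repeats below, it could be wrapped in a small helper. `forward_to` here is just a hypothetical convenience of mine, not part of the Scrapy Cluster API:)

    def forward_to(response, url, spider_id, crawl_id='site1'):
        # Hypothetical helper: copy the incoming meta onto a new request and
        # retarget it at another spider type via the 'spiderid' meta key.
        req = scrapy.Request(url)
        for key in response.meta.keys():
            req.meta[key] = response.meta[key]
        req.meta['spiderid'] = spider_id
        req.meta['crawlid'] = crawl_id
        return req

With that, the snippet above becomes simply `yield forward_to(response, url, 'content_crawler1')`.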

With this strategy I can feed urls from one crawler to another without having to wait for the subsequent crawl to complete. This then creates a queue. To fully utilize Clusters I hope to add more crawlers wherever there is a bottleneck. In this workflow the bottleneck is at the end, when scraping the content. So I experimented with this:

site_url_crawler >>> gallery_url_crawler >>> content_crawler + content_crawler + content_crawler

For lack of a better illustration, the above is just meant to show that I used three instances of the final spider to handle the longer queue.

BUT it seems that each instance of content_crawler waited patiently for the currently running content_crawler to finish. Hence, no boost in throughput.

A final idea I had was something like this:

site_url_crawler >>> gallery_url_crawler >>> content_crawler1 + content_crawler2 + content_crawler3

So I tried to use separate spiders to receive the final queue.

Unfortunately, I could not experiment with this, since I could not pass the kafka message to demo.inbound from gallery_url_crawler.py like so:

    req = scrapy.Request(url)
    for key in response.meta.keys():
        req.meta[key] = response.meta[key]
    req.meta['spiderid'] = 'content_crawler1'
    req.meta['spiderid'] = 'content_crawler2'   # overwrites the line above
    req.meta['crawlid'] = 'site1'

    yield req

(Notice the extra spiderid.) The above did not work, I think because a single message cannot be assigned to two different spiders... And

    req1 = scrapy.Request(url)
    req2 = scrapy.Request(url)

    for key in response.meta.keys():
        req1.meta[key] = response.meta[key]
    req1.meta['spiderid'] = 'content_crawler1'
    req1.meta['crawlid'] = 'site1'

    for key in response.meta.keys():
        req2.meta[key] = response.meta[key]
    req2.meta['spiderid'] = 'content_crawler2'
    req2.meta['crawlid'] = 'site1'

    yield req1
    yield req2

This did not work either, I think because the dupefilter kicked out the second request, seeing it as a dupe of the first.
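
If I revisit this, one thing I would try is Scrapy's standard dont_filter flag on the duplicated request. Whether Scrapy Cluster's distributed scheduler honours that flag is an assumption I have not verified, so treat this as a sketch only:

    req1 = scrapy.Request(url)
    # dont_filter is plain Scrapy's way of asking the dupefilter to let a
    # request through; assuming the cluster's scheduler respects it too.
    req2 = scrapy.Request(url, dont_filter=True)

    for req, spider_id in ((req1, 'content_crawler1'), (req2, 'content_crawler2')):
        for key in response.meta.keys():
            req.meta[key] = response.meta[key]
        req.meta['spiderid'] = spider_id   # send each copy to a different spider type
        req.meta['crawlid'] = 'site1'
        yield req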

Anyway, I just hope to ultimately use Clusters in a way that lets me fire up multiple spider instances at any time, have them pull from the queue, and repeat.

Cockup answered 14/3, 2016 at 6:27
Are you still using scrapy cluster in 2020? I am wondering if there is an alternative, as the project did not receive any commits in the last 2 years. Seems dead. – Indigotin
@Liam Hanninen - Are you still using scrapy-cluster today? If not, what tool? – Morel
Ah, that's too bad. No, I am not using it. But I am not using any tool to scrape. – Cockup

It turns out that distribution of the urls is based on IP addresses. Once I stood up the cluster on separate machines, i.e. a different machine for each spider, the urls flowed and all spiders were pulling from the queue.

http://scrapy-cluster.readthedocs.org/en/latest/topics/crawler/controlling.html

Scrapy Cluster comes with two major strategies for controlling how fast your pool of spiders hit different domains. This is determined by spider type and/or IP Address, but both act upon the different Domain Queues.
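
For reference, the throttle behaviour described above is governed by a handful of settings in the crawler's localsettings.py. The names below come from that docs page, but the values are only examples and defaults may differ between versions:

    # Sketch of the throttle settings from the "Controlling" docs page
    SCHEDULER_TYPE_ENABLED = True   # key the domain queues by spider type
    SCHEDULER_IP_ENABLED = True     # key the domain queues by public IP address
    QUEUE_HITS = 10                 # requests allowed per domain per window
    QUEUE_WINDOW = 60               # rolling window length, in seconds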

Cockup answered 22/3, 2016 at 18:48
