Running Multiple Scrapy Spiders (the easy way) Python
Scrapy is pretty cool; however, I found the documentation to be very bare bones, and some simple questions were tough to answer. After piecing together techniques from various Stack Overflow answers, I have finally come up with an easy and not overly technical way to run multiple Scrapy spiders. I would imagine it's less technical than trying to implement scrapyd, etc.:

So here is one spider that works well at its one job of scraping some data after a FormRequest:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.http import FormRequest
from swim.items import SwimItem

class MySpider(BaseSpider):
    name = "swimspider"
    start_urls = ["swimming website"]

    def parse(self, response):
        return [FormRequest.from_response(
            response,
            formname="AForm",
            formdata={"lowage": "20", "highage": "25"},
            callback=self.parse1,
            dont_click=True,
        )]

    def parse1(self, response):
        # open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []

        for row in rows[4:54]:
            item = SwimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["swimtime"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
        return items
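The parse1 loop above pulls one cell per column out of each table row. The same idea can be sketched with the standard library's ElementTree on a small inline sample, so it runs without Scrapy installed (the sample HTML and column layout here are made up for illustration):

```python
# Stdlib sketch of the row-scraping pattern in parse1: for each <tr>,
# pull one cell per column. Uses xml.etree.ElementTree instead of
# Scrapy's Selector, on an invented sample table.
import xml.etree.ElementTree as ET

SAMPLE = """
<table>
  <tr><td>1</td><td>Ann</td><td>21</td><td>27.80</td><td>x</td><td>Sharks</td></tr>
  <tr><td>2</td><td>Bea</td><td>23</td><td>28.15</td><td>x</td><td>Rays</td></tr>
</table>
"""

def scrape_rows(html):
    root = ET.fromstring(html)
    items = []
    for row in root.findall(".//tr"):  # one item per table row
        cells = row.findall("td")
        items.append({
            "names": cells[1].text,    # XPath td[2] is 1-indexed, so index 1 here
            "age": cells[2].text,
            "swimtime": cells[3].text,
            "team": cells[5].text,
        })
    return items

print(scrape_rows(SAMPLE))
```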

Instead of deliberately writing out the formdata with the form inputs I wanted, i.e. "20" and "25":

formdata={"lowage": "20", "highage": "25"}

I used "self." + a variable name:

formdata={"lowage": self.lowage, "highage": self.highage}
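This works because Scrapy copies any `-a name=value` pairs from the command line onto the spider as instance attributes. A toy class (not a real Scrapy spider) can mimic that behaviour to show why `self.lowage` resolves:

```python
# Sketch of why "self.lowage" works: Scrapy's Spider.__init__ copies
# -a name=value command-line pairs into instance attributes. This toy
# class mimics that behaviour without needing Scrapy installed.
class ToySpider:
    name = "swimspider"

    def __init__(self, **kwargs):
        # Scrapy does essentially this with the -a arguments
        self.__dict__.update(kwargs)

    def form_data(self):
        return {"lowage": self.lowage, "highage": self.highage}

# scrapy crawl swimspider -a lowage='20' -a highage='25' becomes:
spider = ToySpider(lowage="20", highage="25")
print(spider.form_data())  # {'lowage': '20', 'highage': '25'}
```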

This then allows you to call the spider from the command line with the arguments that you want (see below). Use Python's subprocess call() function to run those command lines one after another, easily. It means I can go to my command line, type "python scrapymanager.py", and have all of my spiders do their thing, each with different arguments passed on the command line, and download their data to the correct place:

#scrapymanager

from random import randint
from time import sleep
from subprocess import call

#free
call(["scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='10025' -o free.json -t json"], shell=True)
sleep(randint(15,45))

#breast
call(["scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='30025' -o breast.json -t json"], shell=True)
sleep(randint(15,45))

#back
call(["scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='20025' -o back.json -t json"], shell=True)
sleep(randint(15,45))

#fly
call(["scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='40025' -o fly.json -t json"], shell=True)
sleep(randint(15,45))

So rather than spending hours trying to rig up a complicated single spider that crawls each form in succession (in my case, different swim strokes), this is a pretty painless way to run many spiders "all at once" (with a sleep() delay between each scrapy call).
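The four near-identical call() lines above can also be folded into a loop over a stroke-code table. A sketch, using the same stroke codes as the script (build_command only assembles the argument list; passing the list to subprocess.call(cmd) without shell=True runs each crawl and avoids shell quoting issues):

```python
# Fold the repeated call() lines into a loop over stroke codes.
# build_command returns an argument list suitable for subprocess.call(cmd).
STROKES = {"free": "10025", "breast": "30025", "back": "20025", "fly": "40025"}

def build_command(stroke, code):
    return ["scrapy", "crawl", "swimspider",
            "-a", "lowage=20", "-a", "highage=25",
            "-a", "sex=W", "-a", "StrkDist=%s" % code,
            "-o", "%s.json" % stroke, "-t", "json"]

for stroke, code in STROKES.items():
    print(" ".join(build_command(stroke, code)))
```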

Hopefully this helps someone.

Jam answered 25/1, 2014 at 0:47

Here is the easy way. You need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spider_loader.list():
    print("Running spider %s" % spider_name)
    process.crawl(spider_name, query="dvh")  # "query" is a custom argument used by these spiders

process.start()

Then run it. That's it!

Crosseyed answered 12/5, 2017 at 15:47

Yes, there is an excellent companion to Scrapy called scrapyd that does exactly what you are looking for, among many other goodies. You can also launch spiders through it, like this:

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2
{"status": "ok", "jobid": "26d1b1a6d6f111e0be5c001e648c57f8"}

You can add your custom parameters as well, using -d param=123.

By the way, spiders are scheduled rather than launched immediately, because scrapyd manages a queue with a (configurable) maximum number of spiders running in parallel.

Obadiah answered 25/1, 2014 at 2:52

Your method makes it procedural, which makes it slow, against Scrapy's main principle. To make it asynchronous as always, you can try using CrawlerProcess:

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

from myproject.spiders import spider1, spider2

process = CrawlerProcess(get_project_settings())
process.crawl(spider1.Spider1)  # pass the spider class, not an instance
process.crawl(spider2.Spider2)
process.start()

If you want to see the full log of the crawl, set LOG_FILE in your settings.py.

LOG_FILE = "logs/mylog.log"
Alternant answered 22/3, 2017 at 3:17