Run multiple scrapy spiders at once using scrapyd
I'm using Scrapy for a project where I want to scrape a number of sites - possibly hundreds - and I have to write a specific spider for each site. I can schedule one spider at a time in a project deployed to scrapyd using:

curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2

But how do I schedule all spiders in a project at once?

All help much appreciated!
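
(For reference: scrapyd also exposes a listspiders.json endpoint, so the scheduling can be scripted as a shell loop over that list. A minimal sketch, assuming jq is installed and myproject is a placeholder project name:)

# schedule every spider deployed under "myproject"
for s in $(curl -s "http://localhost:6800/listspiders.json?project=myproject" | jq -r '.spiders[]'); do
    curl http://localhost:6800/schedule.json -d project=myproject -d spider="$s"
done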

Herstein asked 29/5, 2012 at 14:23 Comment(0)
My solution for running 200+ spiders at once has been to create a custom command for the project. See http://doc.scrapy.org/en/latest/topics/commands.html#custom-project-commands for more information about implementing custom commands.

YOURPROJECTNAME/commands/allcrawl.py:

# Python 2 / old Scrapy command API; a Python 3 sketch follows below.
from scrapy.command import ScrapyCommand
import urllib
import urllib2
from scrapy import log

class AllCrawlCommand(ScrapyCommand):

    requires_project = True
    default_settings = {'LOG_ENABLED': False}

    def short_desc(self):
        return "Schedule a run for all available spiders"

    def run(self, args, opts):
        url = 'http://localhost:6800/schedule.json'
        # Ask scrapyd to schedule a run for every spider in the project
        for s in self.crawler.spiders.list():
            values = {'project': 'YOUR_PROJECT_NAME', 'spider': s}
            data = urllib.urlencode(values)
            req = urllib2.Request(url, data)  # supplying data makes this a POST
            response = urllib2.urlopen(req)
            log.msg(response.read())  # log scrapyd's JSON reply, not the response object

Make sure to include the following in your settings.py:

COMMANDS_MODULE = 'YOURPROJECTNAME.commands'

Then from the command line (in your project directory) you can simply type:

scrapy allcrawl
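
The code above targets Python 2 and the old Scrapy command API. A rough equivalent for Python 3 and recent Scrapy releases (a sketch, not tested against every version - the base class moved to scrapy.commands and the spider list now comes from the crawler process's spider loader):

# YOURPROJECTNAME/commands/allcrawl.py -- Python 3 / recent Scrapy sketch
import json
import urllib.parse
import urllib.request

from scrapy.commands import ScrapyCommand

class AllCrawlCommand(ScrapyCommand):

    requires_project = True
    default_settings = {'LOG_ENABLED': False}

    def short_desc(self):
        return "Schedule a run for all available spiders"

    def run(self, args, opts):
        url = 'http://localhost:6800/schedule.json'
        # spider_loader replaces the old crawler.spiders manager
        for s in self.crawler_process.spider_loader.list():
            data = urllib.parse.urlencode(
                {'project': 'YOUR_PROJECT_NAME', 'spider': s}
            ).encode()
            with urllib.request.urlopen(url, data=data) as response:
                print(json.load(response))  # scrapyd's JSON reply

The COMMANDS_MODULE setting and the scrapy allcrawl invocation stay the same.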
Overcharge answered 29/5, 2012 at 18:02 Comment(7)
Great, I will try this first thing in the morning. I have no computer at the moment. Thanks for helping out!Herstein
Hi. I tried your solution but I receive the following error: ImportError: No module named commands. I put the line "COMMANDS_MODULE = 'YOURPROJECTNAME.commands'" in the settings file in the project directory. Is this correct?Herstein
@Herstein Make sure your commands folder has an __init__.pyOvercharge
Thanks for staying with me on this, it really helps. OK, I've included an __init__.py file in the commands directory and uploaded the project to scrapyd again. But now I get the following error: NotImplementedError. Do I have to implement/instantiate the command?Herstein
What is the structure inside schedule.json?Dafodil
Can anyone explain to me how multiple spiders are crawled by this custom command?Fromm
How do I pass arguments, e.g. -a query="myquery", with this?Hospitalization
Sorry, I know this is an old topic, but I've started learning Scrapy recently and stumbled here, and I don't have enough rep yet to post a comment, so I'm posting an answer.

From the Scrapy common practices documentation you'll see that if you need to run multiple spiders at once, you have to start multiple scrapyd service instances and then distribute your spider runs among them.
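
(As the comment below points out, a single scrapyd instance does run jobs in parallel; how many jobs run concurrently is governed by its config file. A sketch of the relevant scrapyd.conf options - the option names are from the scrapyd documentation, the values here are illustrative:)

[scrapyd]
# 0 means: derive the process limit from max_proc_per_cpu
max_proc = 0
# maximum concurrent Scrapy processes per CPU
max_proc_per_cpu = 4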

Regimentals answered 29/11, 2014 at 19:40 Comment(1)
If this was true when posted, it isn't anymore: scrapyd runs jobs in parallel. If it seems it doesn't, check this: #9162224Rollandrollaway