scrapy passing custom_settings to spider from script using CrawlerProcess.crawl()

I am trying to programmatically call a spider from a script. I am unable to override the settings through the constructor using CrawlerProcess. Let me illustrate this with the default spider for scraping quotes from the official scrapy site (the last code snippet of the official scrapy quotes example spider).

from scrapy import Spider, Request


class QuotesSpider(Spider):
    name = "quotes"

    def __init__(self, somestring, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.somestring = somestring
        self.custom_settings = kwargs


    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Here is the script through which I try to run the quotes spider:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings

def main():
    proc = CrawlerProcess(get_project_settings())

    custom_settings_spider = {
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log'
    }
    proc.crawl('quotes', 'dummyinput', **custom_settings_spider)
    proc.start()


if __name__ == '__main__':
    main()
Kezer asked 28/2, 2017 at 14:48 Comment(0)

Scrapy Settings are a bit like Python dicts. So you can update the settings object before passing it to CrawlerProcess:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings

def main():

    s = get_project_settings()
    s.update({
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log'
    })
    proc = CrawlerProcess(s)

    proc.crawl('quotes', 'dummyinput')
    proc.start()


if __name__ == '__main__':
    main()
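
Side note: on newer Scrapy versions (2.1+), FEED_URI and FEED_FORMAT are deprecated in favour of the FEEDS setting; on such a version the same override would look roughly like this (the file names are just examples):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

s = get_project_settings()
s.update({
    # FEEDS replaces FEED_URI/FEED_FORMAT on Scrapy 2.1+
    'FEEDS': {'quotes.csv': {'format': 'csv'}},
    'LOG_FILE': 'quotes.log'
})
proc = CrawlerProcess(s)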

Edit following OP's comments:

Here's a variation using CrawlerRunner, creating a new CrawlerRunner for each crawl and re-configuring logging at each iteration so that each run writes to a different log file:

import logging
from twisted.internet import reactor, defer

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging, _get_handler
from scrapy.utils.project import get_project_settings


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        page = getattr(self, 'page', 1)
        yield scrapy.Request('http://quotes.toscrape.com/page/{}/'.format(page),
                             self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }


@defer.inlineCallbacks
def crawl():
    s = get_project_settings()
    for i in range(1, 4):
        s.update({
            'FEED_URI': 'quotes%03d.csv' % i,
            'LOG_FILE': 'quotes%03d.log' % i
        })

        # manually configure logging for LOG_FILE
        configure_logging(settings=s, install_root_handler=False)
        logging.root.setLevel(logging.NOTSET)
        handler = _get_handler(s)
        logging.root.addHandler(handler)

        runner = CrawlerRunner(s)
        yield runner.crawl(QuotesSpider, page=i)

        # reset root handler
        logging.root.removeHandler(handler)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
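
Note that _get_handler is a private Scrapy helper and may change between versions. If you prefer not to rely on it, a rough equivalent for the two handler lines inside the loop is to build the file handler yourself from the same settings:

handler = logging.FileHandler('quotes%03d.log' % i, encoding='utf-8')
handler.setFormatter(logging.Formatter(fmt=s.get('LOG_FORMAT'),
                                       datefmt=s.get('LOG_DATEFORMAT')))
handler.setLevel(s.get('LOG_LEVEL'))
logging.root.addHandler(handler)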
Retractor answered 28/2, 2017 at 15:25 Comment(7)
For my use case, I need to pass a .csv file for each run of the spider using proc.crawl(). I want to have 1 crawler process (with the common settings) but call crawl successively with different names for the log and csv feed output. Can I achieve this using scrapy?Kezer
@Kezer you can use a for loop when calling CrawlerProcess and update the settings there, instead of overriding custom_settingsBidentate
@hAcKnRoCk, have you looked at the last example in Running multiple spiders in the same process, i.e. running spiders sequentially with CrawlerRunner?Retractor
@eLRuLL: Yes, I already tried with a for loop. The code is at pastebin.com/RTnUWntQ. I receive a 'twisted.internet.error.ReactorNotRestartable' error during the 2nd iteration.Kezer
@paultrmbrth Yes, I did see that example. But I am not sure it will suit my use case. The problem in the question will still persist: I won't be able to run my spider with each run giving me a .csv and a .log file.Kezer
Here's an example running one of the tutorial's example spiders, outputting different quotes*.csv and quotes*.log each run: gist.github.com/redapple/02e17ef4bb7c9998b95412d07a846bbaRetractor
@paultrmbrth I think you nailed it. Thank you very much. I would have never figured out the root handler thing, since in the documentation it is all old-school 'configure_logging' calls. Anyway, I am going to test to make sure the log files don't have any interference among runs as they did last time. I have a few questions about the root handler that I am trying to figure out. I will update at the earliest. If you can change the main code snippet, I will accept your answer of course.Kezer

I think you can't override the custom_settings variable of a Spider class when running it from a script, basically because the settings are loaded before the spider is instantiated.

Now, I don't really see the point in changing the custom_settings variable specifically, as it is only a way to override your default settings, and that's exactly what CrawlerProcess offers too. The following works as expected:

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'simple'
    start_urls = ['http://httpbin.org/headers']

    def parse(self, response):
        for k, v in self.settings.items():
            print('{}: {}'.format(k, v))
        yield {
            'headers': response.body
        }

process = CrawlerProcess({
    'USER_AGENT': 'my custom user agent',
    'ANYKEY': 'any value',
})

process.crawl(MySpider)
process.start()
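
For completeness, custom_settings does work when it is defined as a class attribute (rather than assigned inside __init__ as in the question), because Scrapy reads it from the spider class before the instance is created. A minimal sketch with placeholder values:

import scrapy
from scrapy.crawler import CrawlerProcess


class ClassSettingsSpider(scrapy.Spider):
    name = 'class_settings'
    start_urls = ['http://httpbin.org/headers']

    # Read from the class before __init__ runs, which is why assigning
    # self.custom_settings inside __init__ has no effect
    custom_settings = {
        'USER_AGENT': 'class-level user agent',
    }

    def parse(self, response):
        self.logger.info('USER_AGENT: %s', self.settings.get('USER_AGENT'))
        yield {'headers': response.text}


process = CrawlerProcess()
process.crawl(ClassSettingsSpider)
process.start()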
Bidentate answered 28/2, 2017 at 15:28 Comment(3)
The point in being able to override custom_settings is this: I want to be able to do a crawl('myspider', list1_urlstoscrape, 'list1output.csv', 'list1.log'), then again do a crawl('myspider', list2_urlstoscrape, 'list2output.csv', 'list2.log'). To achieve this, I would have to create multiple CrawlerProcess instances, which is not possible due to the Twisted reactor problem.Kezer
you could change your spider code to receive multiple lists at once, and then process eachBidentate
Yes, but the problem would still exist. The issue is not in passing the inputs list to be scraped but in saying how you want the outputs for each of those lists (that is, for each crawl of the same spider).Kezer

It seems you want to have a custom log for each spider. You need to activate logging like this:

import scrapy
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # other attributes omitted
    def __init__(self):
        configure_logging({'LOG_FILE': "logs/mylog.log"})
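
If each run should get its own file, the log path can be built from a spider argument; a rough sketch (the logname argument is just an example):

import scrapy
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, logname='mylog', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # each run can pass a different log file name via -a logname=...
        configure_logging({'LOG_FILE': 'logs/{}.log'.format(logname)})

which would be run as, for example, scrapy crawl myspider -a logname=run1.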
Frost answered 25/3, 2017 at 6:52 Comment(1)
This actually helped me in a very unique situation where I have a spider that calls an API and multiple "accounts" that can be used with the spider. Thanks!Neural

You can override a setting from the command line

https://doc.scrapy.org/en/latest/topics/settings.html#command-line-options

For example: scrapy crawl myspider -s LOG_FILE=scrapy.log
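
For the use case in the question (a different log file and feed per run), the same spider can simply be launched twice with different values; the file names here are just examples:

scrapy crawl quotes -s LOG_FILE=quotes1.log -o quotes1.csv
scrapy crawl quotes -s LOG_FILE=quotes2.log -o quotes2.csv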

Frost answered 7/11, 2018 at 4:45 Comment(0)
