Hey, so I have about 50 spiders in my project and I'm currently running them via a scrapyd server. I'm running into an issue where some of the resources I use get locked, which makes my spiders fail or run really slowly. I was hoping there was some way to tell scrapyd to run only 1 spider at a time and leave the rest in the pending queue. I didn't see a configuration option for this in the docs. Any help would be much appreciated!
Change number of running spiders scrapyd
What kind of shared resources do you have? –
Jilleen
I have an sqlite file that I write to. Every once in a while I get a "cannot connect" error. Also, I'm using phantomjs and selenium to handle dynamic (javascript) content. Sometimes phantomjs's GhostDriver seems to get blocked due to a race condition. –
Brophy
This can be controlled by scrapyd settings. Set max_proc to 1:
max_proc
The maximum number of concurrent Scrapy process that will be started.
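For reference, a minimal sketch of what that could look like in the scrapyd config file. This assumes you are editing one of the locations scrapyd reads its configuration from (for example a scrapyd.conf in the directory you start scrapyd from, or /etc/scrapyd/scrapyd.conf) and that you restart scrapyd afterwards so the setting is picked up:

```ini
[scrapyd]
# Allow only one Scrapy process at a time; other scheduled jobs stay pending.
max_proc = 1
```

Jobs you schedule while one spider is running should then wait in the pending queue until the running process finishes.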
Does max_proc keep requests from being made asynchronously? That is why I didn't use it; it was unclear to me whether this would be the case. This could be a lack of understanding on my part, so a follow-up question: does scrapy actually spawn new processes or threads to handle requests asynchronously, or is there some kind of twisted framework "magic" making this happen? –
Brophy
@Brophy requests would be async anyway, since there is twisted under the hood. max_proc just makes sure only a single spider is running at a time. This is how I understand it. What kind of resources are shared among the spiders and slowing things down? I think you need to fix that instead of trying to make it run in a blocking mode. –
Jilleen
Answered that one above. Thanks for the quick responses. –
Brophy
@Brophy ok, yeah, first of all, sqlite is really not a good choice here since it blocks the whole database on writes. Switch to postgresql or mysql if you need a classic relational database, or to mongodb or redis if you need a NoSQL solution. Also, please write up the phantomjs problem as a separate question with details. Thanks. –
Jilleen
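To illustrate the point about sqlite blocking on writes, here is a small sketch using Python's standard sqlite3 module. It opens two connections to the same file, starts a write transaction on the first, and checks whether the second is refused the write lock. The function name, table name, and the short 0.1 second timeout are made up for the demo, and it assumes sqlite's default journal mode:

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked(db_path):
    """Return True if a second connection cannot start a write transaction
    while the first connection is in the middle of one."""
    # isolation_level=None lets us issue BEGIN/ROLLBACK ourselves.
    first = sqlite3.connect(db_path, isolation_level=None)
    # Short timeout so the demo fails fast instead of waiting for the lock.
    second = sqlite3.connect(db_path, isolation_level=None, timeout=0.1)
    try:
        first.execute("CREATE TABLE IF NOT EXISTS items (name TEXT)")
        first.execute("BEGIN IMMEDIATE")  # first connection takes the write lock
        first.execute("INSERT INTO items VALUES ('a')")
        try:
            second.execute("BEGIN IMMEDIATE")  # second writer asks for the same lock
        except sqlite3.OperationalError as exc:
            return "locked" in str(exc)
        second.execute("ROLLBACK")
        return False
    finally:
        if first.in_transaction:
            first.execute("ROLLBACK")
        first.close()
        second.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    print(second_writer_is_blocked(path))
```

With several spiders writing to the same file, each writer ends up in the position of the second connection here, which is likely where the intermittent errors come from.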
Thanks for the insight into sqlite. Right now my project is a prototype and I'm just using the sqlite file as a dummy database until I hook my project up to the real database next week. I'll only be using max_proc = 1 until then. I'll make a new question about the phantomjs problem. –
Brophy
Don't know if this is still in your knowledge base but here is the follow up question: #24963020 –
Brophy
@Brophy it is not directly in my knowledge base, but the things you are tackling are very much connected to what I'm doing in one of my projects. I will pay attention to it. –
Jilleen
@Brophy by the way, is phantomjs logging critical? Turning it off can be an option :) –
Jilleen
Not critical at all! If I don't find another solution I'll point phantomjs log_path at /dev/null. And thanks! –
Brophy