I am new to Scrapy and scrapyd. I did some reading and developed my crawler, which crawls a news website and gives me all the news articles from it. If I simply run the crawler with
scrapy crawl dmoz -o something.txt
it correctly gives me all the scraped data in something.txt.
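(If it helps: the same feed-export mechanism can emit JSON directly, e.g.
scrapy crawl dmoz -o something.json -t json
where -t selects the export format independently of the file name.)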
Now I tried deploying my Scrapy crawler project to scrapyd at localhost:6800.
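The deploy step (assuming scrapyd-client is installed and a [deploy] target is defined in scrapy.cfg; older Scrapy versions use scrapy deploy instead) looks roughly like:
scrapyd-deploy -p tutorial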
Then I scheduled the crawler using
curl http://localhost:6800/schedule.json -d project=tutorial -d spider=dmoz_spider
It gives me this on the command line:
{"status": "ok", "jobid": "545dfcf092de11e3ad8b0013d43164b8"}
which I think is correct, and I am even able to see my crawler as a job in the UI view of localhost:6800.
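(The job status can also be checked outside the UI through scrapyd's listjobs endpoint, e.g.
curl "http://localhost:6800/listjobs.json?project=tutorial"
which reports pending, running and finished jobs.)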
But where do I find the data scraped by my crawler, which I previously collected in something.txt?
Please help....
This is my crawler code:
import urlparse

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from tutorial.items import DmozItem  # assuming DmozItem lives in the project's items.py


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["timesofindia.com"]
    start_urls = ["http://mobiletoi.timesofindia.com/htmldbtoi/TOIPU/20140206/TOIPU_articles__20140206.html"]

    def parse(self, response):
        sel = Selector(response)
        # titles of the articles listed on the index page
        for ti in sel.xpath("//a[@class='pda']/text()").extract():
            yield DmozItem(title=ti)
        # follow each article link into my_parse
        for url in sel.xpath("//a[@class='pda']/@href").extract():
            itemLink = urlparse.urljoin(response.url, url)
            yield DmozItem(link=url)
            yield Request(itemLink, callback=self.my_parse)

    def my_parse(self, response):
        sel = Selector(response)
        self.log('A response from my_parse just arrived!')
        for head in sel.xpath("//b[@class='pda']/text()").extract():
            yield DmozItem(heading=head)
        for text in sel.xpath("//a[@class='pda']/text()").extract():
            yield DmozItem(desc=text)
        for url_desc in sel.xpath("//a[@class='pda']/@href").extract():
            itemLinkDesc = urlparse.urljoin(response.url, url_desc)
            yield DmozItem(link=url_desc)
            yield Request(itemLinkDesc, callback=self.my_parse_desc)

    def my_parse_desc(self, response):
        sel = Selector(response)
        self.log('ENTERED ITERATION OF MY_PARSE_DESC!')
        for bo in sel.xpath("//font[@class='pda']/text()").extract():
            yield DmozItem(body=bo)
Check /var/log/scrapyd/. – Danelle
There is a f980130e92e711e3ad8b0013d43164b8.log file inside the /var/log/scrapyd/. – Intersexual
I also looked at /var/log/scrapyd/scrapyd.out, but I am not getting anything in that file.... – Intersexual
Check /etc/scrapyd/scrapyd.conf and see what items_dir is set to. – Danelle
It is set to /var/lib/scrapyd/items. Got your point: if I change this path, I can get my output file at the path I want. But the output file I am getting has a .jl extension, and its name is the Job Id of the crawler job; I want my own file name and a JSON extension instead. – Intersexual
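One way to get a custom file name and JSON output, assuming the scrapyd version in use forwards per-job setting parameters through schedule.json (items_dir may also need to be left blank in scrapyd.conf so it does not impose its own jobid-based feed path; the output path below is only illustrative), is to pass the feed-export settings when scheduling:
curl http://localhost:6800/schedule.json -d project=tutorial -d spider=dmoz_spider -d setting=FEED_URI=file:///tmp/dmoz_output.json -d setting=FEED_FORMAT=json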