Scrapyd: where do I get to see the output of my crawler once I schedule it using scrapyd
I am new to Scrapy and Scrapyd. I did some reading and developed my crawler, which crawls a news website and gives me all the news articles from it. If I run the crawler simply with

scrapy crawl spider_name -o something.txt

it gives me all the scraped data in something.txt correctly.

Now I have deployed my Scrapy crawler project on localhost:6800 using Scrapyd.

I scheduled the crawler using

curl http://localhost:6800/schedule.json -d project=tutorial -d spider=dmoz_spider

which gives me this on the command line:

{"status": "ok", "jobid": "545dfcf092de11e3ad8b0013d43164b8"}

I think this is correct, and I can even see my crawler as a job in the web UI at localhost:6800.

But where do I find the data scraped by my crawler, which I previously collected in something.txt?

Please help.

This is my crawler code:

import urlparse

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import Spider

from tutorial.items import DmozItem  # the project's item class (the project is called "tutorial" above)


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["timesofindia.com"]
    start_urls = ["http://mobiletoi.timesofindia.com/htmldbtoi/TOIPU/20140206/TOIPU_articles__20140206.html"]

    def parse(self, response):
        sel = Selector(response)
        # Section titles and links on the edition's front page.
        for ti in sel.xpath("//a[@class='pda']/text()").extract():
            yield DmozItem(title=ti)
        for url in sel.xpath("//a[@class='pda']/@href").extract():
            itemLink = urlparse.urljoin(response.url, url)
            yield DmozItem(link=url)
            yield Request(itemLink, callback=self.my_parse)

    def my_parse(self, response):
        sel = Selector(response)
        self.log('A response from my_parse just arrived!')
        # Article headings and links within a section page.
        for head in sel.xpath("//b[@class='pda']/text()").extract():
            yield DmozItem(heading=head)
        for text in sel.xpath("//a[@class='pda']/text()").extract():
            yield DmozItem(desc=text)
        for url_desc in sel.xpath("//a[@class='pda']/@href").extract():
            itemLinkDesc = urlparse.urljoin(response.url, url_desc)
            yield DmozItem(link=url_desc)
            yield Request(itemLinkDesc, callback=self.my_parse_desc)

    def my_parse_desc(self, response):
        sel = Selector(response)
        self.log('ENTERED ITERATION OF MY_PARSE_DESC!')
        # Article body text.
        for bo in sel.xpath("//font[@class='pda']/text()").extract():
            yield DmozItem(body=bo)
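
For reference, DmozItem only needs to declare the fields used above; a minimal sketch (the field names are inferred from the spider, and tutorial/items.py is just the usual location for it):

# tutorial/items.py -- minimal sketch; field names inferred from the spider above
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    heading = Field()
    desc = Field()
    body = Field()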
Intersexual answered 11/2, 2014 at 5:43 Comment(7)
Check /var/log/scrapyd/. – Danelle
Thanks, I got the output in the f980130e92e711e3ad8b0013d43164b8.log file inside /var/log/scrapyd/. – Intersexual
@Danelle But as per the Scrapyd tutorial I should get any standard output in /var/log/scrapyd/scrapyd.out, yet I am not getting anything in that file. – Intersexual
@Danelle Though I am getting the output in the logs, I actually need my output in a separate JSON file, as I have further data extraction and processing to do on it on the server side. – Intersexual
Look in /etc/scrapyd/scrapyd.conf and see what items_dir is set to. – Danelle
@Danelle The path is set to /var/lib/scrapyd/items. I get your point that if I change this path I can get my output file at the path I want, but the output file I am getting has a .jl extension and its name is the job id of the crawl job; instead I want my own file name and a JSON extension. – Intersexual
Then subclass some of Scrapyd's modules and do just that. It's not versatile. – Danelle
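
For reference, the setting discussed in these comments lives in /etc/scrapyd/scrapyd.conf; a minimal sketch using the paths reported above (other options omitted):

[scrapyd]
# Scraped items are written under items_dir as .jl files named after the job id
items_dir = /var/lib/scrapyd/items
# Spider logs end up under logs_dir
logs_dir  = /var/log/scrapyd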
When using the feed exports, you define where to store the feed using a URI (through the FEED_URI setting). The feed exports support multiple storage backend types, which are defined by the URI scheme.

curl http://localhost:6800/schedule.json -d project=tutorial -d spider=dmoz_spider -d setting=FEED_URI=file:///path/to/output.json
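
If you prefer not to pass the setting on every schedule call, the same feed configuration can go in the project's settings.py; a minimal sketch (the path below is only an example and must be writable by the user Scrapyd runs as):

# tutorial/settings.py -- example values only
FEED_URI = 'file:///home/yogesh/output.json'
FEED_FORMAT = 'json'   # default is 'jsonlines'; 'json' writes a single JSON array

The FEED_FORMAT override should also work per run by adding an extra -d setting=FEED_FORMAT=json to the schedule.json call.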
Paff answered 11/2, 2014 at 6:25 Comment(7)
What should the path look like? I mean something like file:///home/yogesh/to/output.json? – Intersexual
Sorry, but I am missing something or doing something wrong. I am giving this at the command line: curl http://localhost:6800/schedule.json -d project=tutorial -d spider=dmoz -d setting=FEED_URI=file:///home/yogesh/output.json. With this my crawler runs, but I get this error in the log file: exceptions.IOError: [Errno 13] Permission denied: '/home/yogesh/output.json' – Intersexual
@y.dixit You should give write permission to other users: chmod 777 /path/to/. – Paff
Done: first I did chmod 777 /home/crawled_data and then ran the JSON API command posted in the answer, and it worked successfully. Thanks, kev. – Intersexual
Earlier, when I used to run scrapy crawl [spider_name] -o [something].json -t json, I would get the output as a well-formatted JSON file, but if I remove the -t json part of the command, then even though the extension is .json, the content is not in JSON format. The same thing is happening now: I am getting a file with a .json extension, but it is not well formatted. For example, {"title": "Front Page"}{"title": "Times City"} {"title": "Times Nation"} is not well-formatted JSON compared to [{"body": "WORLD RAP "}]. – Intersexual
@y.dixit It's the JSON Lines format. It works well if the output is large; you can read it line by line. – Paff
I wanted it to be in normal JSON format, as I have a parser that works well with a normal JSON file; the other format adds some overhead when I parse it. So let's see, I will work with it. – Intersexual
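
If a single well-formed JSON document is needed, either schedule with an extra -d setting=FEED_FORMAT=json so the exporter writes a JSON array, or convert the JSON Lines output afterwards; a minimal conversion sketch (the file names are only examples):

import json

# Convert a JSON Lines feed (one object per line) into a single JSON array.
with open('output.jl') as infile:
    items = [json.loads(line) for line in infile if line.strip()]

with open('output.json', 'w') as outfile:
    json.dump(items, outfile, indent=2)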
