How can scrapy export items to separate csv files per item
B

6

21

I am scraping a soccer site and the spider (a single spider) gets several kinds of items from the site's pages: Team, Match, Club etc. I am trying to use the CSVItemExporter to store these items in separate csv files, teams.csv, matches.csv, clubs.csv etc.

I am not sure what the right way to do this is. The only approach I have come up with so far is to write my own custom pipeline, as in the example at http://doc.scrapy.org/en/0.14/topics/exporters.html. In the spider_opened method I would open all the needed csv files, i.e. create one csv exporter per file, and in process_item I would work out what kind of item the "item" parameter is and send it to the corresponding exporter object.

I haven't found any examples of handling multiple csv files (one per item type) in Scrapy, so I am worried that I am using it in a way it was not meant to be used. (This is my first experience with Scrapy.)

diomedes

Bramante answered 1/9, 2012 at 18:34 Comment(0)
S
14

Your approach seems fine to me. Pipelines are a great feature of Scrapy and are, IMO, built for something like your approach.

You could create multiple items (e.g. SoccerItem, MatchItem) and in your MultiCSVItemPipeline just delegate each item to its own CSV exporter by checking the item class.
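
To make the delegation idea concrete, here is a minimal sketch that uses only the standard library, so csv.DictWriter stands in for Scrapy's CsvItemExporter. The item classes, the MultiCsvRouter name and the demo function are all made up for illustration; a real pipeline would do the same routing inside process_item.

```python
import csv
import os
import tempfile
from dataclasses import asdict, dataclass, fields


@dataclass
class TeamItem:  # hypothetical item type
    name: str


@dataclass
class MatchItem:  # hypothetical item type
    home: str
    away: str


class MultiCsvRouter:
    """Routes each item to <ClassName>.csv, opening the file on first use."""

    def __init__(self, out_dir):
        self.out_dir = out_dir
        self._open = {}  # class name -> (file handle, csv.DictWriter)

    def write(self, item):
        key = type(item).__name__  # the item class decides the target file
        if key not in self._open:
            path = os.path.join(self.out_dir, key + ".csv")
            handle = open(path, "w", newline="", encoding="utf-8")
            writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(item)])
            writer.writeheader()
            self._open[key] = (handle, writer)
        self._open[key][1].writerow(asdict(item))

    def close(self):
        for handle, _ in self._open.values():
            handle.close()


def demo():
    """Write three items of two types and return the output directory."""
    out_dir = tempfile.mkdtemp()
    router = MultiCsvRouter(out_dir)
    for item in (TeamItem("Ajax"), MatchItem("Ajax", "PSV"), TeamItem("PSV")):
        router.write(item)
    router.close()
    return out_dir
```

Calling demo() leaves one csv file per item class in a temporary directory. Keying on the class name means a new item type needs no change to the routing code.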

Schoolbook answered 3/9, 2012 at 7:31 Comment(2)
Ok, after writing the MultiCSVItemPipeline I feel better :-). As you suggested, I check the item class to figure out where the item goes. I am posting a self-answer to show the code for anyone who has the same question. – Bramante
@Bramante - Could you please share your code for items.py, the spider, and settings.py? It would be very helpful, as I am getting an empty csv file using the same code below. My Scrapy version is 1.8.0. Thanks – Hindsight
B
28

I am posting here the code I used to produce a MultiCSVItemPipeline based on the answer of drcolossos above.

This pipeline assumes that all the Item classes follow the convention *Item (e.g. TeamItem, EventItem) and creates team.csv, event.csv files and sends all records to the appropriate csv files.

from scrapy import signals
from scrapy.exporters import CsvItemExporter

# Assumption: CSVDir was not defined in the original snippet. It is the output
# directory prefix (with a trailing slash); '' means the current working directory.
CSVDir = ''


def item_type(item):
    return type(item).__name__.replace('Item', '').lower()  # TeamItem => team


class MultiCSVItemPipeline(object):
    SaveTypes = ['team', 'club', 'event', 'match']

    @classmethod
    def from_crawler(cls, crawler):
        # scrapy.xlib.pydispatch is not available in current Scrapy releases,
        # so the signal handlers are connected through the crawler instead.
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signal=signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # One binary file and one exporter per item type.
        self.files = {name: open(CSVDir + name + '.csv', 'w+b') for name in self.SaveTypes}
        self.exporters = {name: CsvItemExporter(self.files[name]) for name in self.SaveTypes}
        for e in self.exporters.values():
            e.start_exporting()

    def spider_closed(self, spider):
        for e in self.exporters.values():
            e.finish_exporting()
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        what = item_type(item)
        if what in self.SaveTypes:
            self.exporters[what].export_item(item)
        return item
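
A pipeline like this only runs if it is enabled in the project's settings.py. A minimal sketch, assuming the project package is called myproject and the class lives in its pipelines.py (both names are placeholders, not taken from the question):

```python
# settings.py (sketch); "myproject" and the module path are placeholders
ITEM_PIPELINES = {
    # The number is the pipeline's order (0-1000); lower values run first.
    "myproject.pipelines.MultiCSVItemPipeline": 300,
}
```

If the files are created but stay empty, one thing worth checking is whether the item class names actually reduce to an entry in SaveTypes, since process_item silently skips every other item.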
Bramante answered 3/9, 2012 at 16:38 Comment(2)
Can you please include the code where you import some modules?Churchless
What is dispatcher used for?Pluckless
A
3

I have tried the answer. It does not seem to work in the latest version (2.21).

I have included my code for your reference:

from scrapy.exporters import CsvItemExporter


class MultiCSVItemPipeline(object):
    # Full item class names; each one gets its own <name>.csv file.
    SaveTypes = ['CentalineTransactionsItem','CentalineTransactionsDetailItem','CentalineBuildingInfo']

    def open_spider(self, spider):
        # Scrapy calls open_spider/close_spider on pipelines itself,
        # so no signal wiring is needed.
        self.files = {name: open(name + '.csv', 'w+b') for name in self.SaveTypes}
        self.exporters = {name: CsvItemExporter(self.files[name]) for name in self.SaveTypes}
        for e in self.exporters.values():
            e.start_exporting()

    def close_spider(self, spider):
        for e in self.exporters.values():
            e.finish_exporting()
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        what = type(item).__name__
        if what in self.SaveTypes:
            self.exporters[what].export_item(item)
        return item

Akel answered 13/8, 2020 at 0:11 Comment(0)
E
0

I am working with Scrapy = "^2.5.0" and had to make a couple of modifications to get it working. I've also made SaveTypes (now defined_items) dynamic, so it picks up every Item defined in the items file.

from scrapy.exporters import CsvItemExporter
from YOUR_PROJECT import items


def item_type(item):
    return type(item).__name__


class MultiCSVItemPipeline(object):
    defined_items = [name for name, _ in items.__dict__.items() if "Item" in name]

    def open_spider(self, spider):
        self.files = dict(
            [
                (name, open("FOLDER_TO_SAVE/" + name + ".csv", "w+b"))
                for name in self.defined_items
            ]
        )
        self.exporters = dict(
            [(name, CsvItemExporter(self.files[name])) for name in self.defined_items]
        )
        [e.start_exporting() for e in self.exporters.values()]

    def close_spider(self, spider):
        [e.finish_exporting() for e in self.exporters.values()]
        [f.close() for f in self.files.values()]

    def process_item(self, item, spider):
        item_name = item_type(item)
        if item_name in set(self.defined_items):
            self.exporters[item_name].export_item(item)
        return item
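
One caveat with the "Item" in name test: it matches every attribute of the items module whose name contains "Item", which includes an imported base class (for example after from scrapy import Item) and any non-class name, so extra empty csv files can appear. A stricter lookup could be sketched with inspect as below. The item_class_names helper and the hand-built fake_items module are made up for illustration; in a real project the base class would typically be scrapy.Item, assuming the items subclass it.

```python
import inspect
import types


def item_class_names(module, base=object):
    """Names of classes defined in `module` itself that subclass `base`."""
    return [
        name
        for name, obj in inspect.getmembers(module, inspect.isclass)
        if obj.__module__ == module.__name__
        and issubclass(obj, base)
        and obj is not base
    ]


class Item:  # stand-in for scrapy.Item so the sketch runs without Scrapy
    pass


# Hand-built stand-in for a project's items.py.
fake_items = types.ModuleType("fake_items")
fake_items.Item = Item      # simulates `from scrapy import Item` in items.py
fake_items.ItemCount = 2    # a non-class name that contains "Item"
for cls_name in ("TeamItem", "ClubItem"):
    # Setting __module__ marks the class as defined in fake_items.
    setattr(fake_items, cls_name, type(cls_name, (Item,), {"__module__": "fake_items"}))

names = item_class_names(fake_items, Item)  # only the two item classes
```

The result could then replace the defined_items list in the pipeline.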
Ensoul answered 15/9, 2021 at 1:6 Comment(0)
C
0

Here is the code I used to leverage Scrapy's Item Pipeline and Exporters to output a separate csv per Item class type scraped. The logic closely resembles this example.

from scrapy.exporters import CsvItemExporter

class cvs_per_itemtype_Pipeline:

    def open_spider(self, spider):
        self.itemType_to_exporterAndCsvFile = {}     

    def process_item(self, item, spider):
        itemType = type(item).__name__ #item class name as str
        if itemType not in self.itemType_to_exporterAndCsvFile:
            csvFile = open(f'{itemType}.csv', 'wb')
            exporter = CsvItemExporter(csvFile)
            exporter.start_exporting()
            self.itemType_to_exporterAndCsvFile[itemType] = (exporter, csvFile)
        exporter = self.itemType_to_exporterAndCsvFile[itemType][0]
        exporter.export_item(item)
        return item

    def close_spider(self, spider):
        for exporter, csvFile in self.itemType_to_exporterAndCsvFile.values():
            exporter.finish_exporting()
            csvFile.close()
Canada answered 28/10, 2021 at 19:4 Comment(0)
C
0

There is now an easy method to achieve this. We can configure feed exports in the settings.py file as shown below:

# The setting is named FEEDS; each key is an output URI.
FEEDS = {
    'team.csv': {
        'format': 'csv',
        'encoding': 'utf8',
        'store_empty': False,
        # Entries may be item classes or import-path strings.
        'item_classes': ['myproject.items.TeamItem'],
        'fields': None,
        'indent': 4,
        'item_export_kwargs': {
           'export_empty_fields': True,
        },
    },
    'club.csv': {
        'format': 'csv',
        'encoding': 'utf8',
        'store_empty': False,
        'item_classes': ['myproject.items.ClubItem'],
        'fields': None,
        'indent': 4,
        'item_export_kwargs': {
           'export_empty_fields': True,
        },
    },
}

This will generate two different .csv files and route each item to the file whose item_classes key lists its class.

The docs are not very clear on this, but here is the link from which I inferred it - Feed Exports
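
With more than a couple of item types, a feed dict like the one above gets repetitive, so it could be generated instead. A minimal sketch, assuming every item class name ends in Item; build_feeds is a made-up helper and the myproject.items paths are placeholders. As far as I know, item_classes needs Scrapy 2.6 or later.

```python
def build_feeds(item_class_paths):
    """Return a FEEDS-style dict with one csv feed per item class path."""
    feeds = {}
    for path in item_class_paths:
        class_name = path.rsplit(".", 1)[-1]  # "myproject.items.TeamItem" -> "TeamItem"
        # Drop the "Item" suffix (assumed naming convention) for the file name.
        stem = class_name[:-4] if class_name.endswith("Item") else class_name
        feeds[stem.lower() + ".csv"] = {
            "format": "csv",
            "encoding": "utf8",
            "store_empty": False,
            "item_classes": [path],  # only this class goes to this file
        }
    return feeds


# In settings.py the result has to be assigned to the name FEEDS.
FEEDS = build_feeds(["myproject.items.TeamItem", "myproject.items.ClubItem"])
```

This keeps the list of item classes in one place; adding a new csv file is then a one-line change.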

Cougar answered 18/1 at 17:52 Comment(0)
