Scrapy: how to ignore items with blank fields using ItemLoader

I would like to know how to ignore (drop) items that don't have all fields filled, because in the scrapyd output I'm getting pages that don't fill every field.

I have this code:

class Product(scrapy.Item):
    source_url = scrapy.Field(
        output_processor = TakeFirst()
    )
    name = scrapy.Field(
        input_processor = MapCompose(remove_entities),
        output_processor = TakeFirst()
    )
    initial_price = scrapy.Field(
        input_processor = MapCompose(remove_entities, clear_price),
        output_processor = TakeFirst()
    )
    main_image_url = scrapy.Field(
        output_processor = TakeFirst()
    )

Parser:

def parse_page(self, response):
    try:
        l = ItemLoader(item=Product(), response=response)
        l.add_value('source_url', response.url)
        l.add_css('name', 'h1.title-product::text')
        l.add_css('main_image_url', 'div.pics a img.zoom::attr(src)')

        l.add_css('initial_price', 'ul.precos li.preco_normal::text')
        l.add_css('initial_price', 'ul.promocao li.preco_promocao::text')

        return l.load_item()

    except Exception as e:
        # self.log() returns None, so log the message and URL directly instead of printing
        self.log("#1 ERROR: %s (%s)" % (e, response.url))

I want to do this with the Loader, without having to create my own Selector (to avoid processing items twice). I guess I could drop them in a pipeline, but that is probably not the best way because these items aren't valid.

Grassquit answered 22/5, 2014 at 15:7 Comment(1)
Dropping items in a pipeline is not a bad way, quite the opposite IMHO. - Miasma

Data validation is one of the typical use cases for pipelines. In your case, you only need a small amount of code to check for required fields, something along the lines of:

from scrapy.exceptions import DropItem

class YourPersonalPipeline(object):
    def process_item(self, item, spider):
        required_fields = [] # your list of required fields
        if all(field in item for field in required_fields):
            return item
        else:
            raise DropItem("your reason")
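
The membership test above only checks that a field is present. If empty strings or empty lists can end up in your items, a slightly stricter check may be useful. The following is a minimal sketch in plain Python: `missing_fields` and `REQUIRED_FIELDS` are hypothetical names, and the sample dicts stand in for items (the helper only relies on a dict-style `.get()`).

```python
def missing_fields(item, required_fields):
    """Return the required field names that are absent or empty in `item`."""
    missing = []
    for field in required_fields:
        value = item.get(field)
        # Treat None, empty strings, and empty lists as "not filled".
        if value is None or value == "" or value == []:
            missing.append(field)
    return missing


# Field names taken from the Product item in the question.
REQUIRED_FIELDS = ["source_url", "name", "initial_price", "main_image_url"]

# Stand-in data for illustration only.
complete = {
    "source_url": "http://example.com/p/1",
    "name": "Chair",
    "initial_price": "10.00",
    "main_image_url": "http://example.com/img/1.jpg",
}
incomplete = {"source_url": "http://example.com/p/2", "name": ""}

print(missing_fields(complete, REQUIRED_FIELDS))
print(missing_fields(incomplete, REQUIRED_FIELDS))
```

Inside `process_item`, you could raise `DropItem` whenever the returned list is non-empty and include the missing names in the message, which makes the log easier to read.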

You need to enable the pipeline in settings.py. Read more in the Scrapy docs.
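
Enabling it might look like the sketch below. The module path `myproject.pipelines` is an assumption, so replace it with wherever your pipeline class actually lives; the number only sets the order relative to other pipelines (lower values run first).

```python
# settings.py (sketch): "myproject.pipelines" is an assumed module path,
# adjust it to match your own project layout.
ITEM_PIPELINES = {
    "myproject.pipelines.YourPersonalPipeline": 300,
}
```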

Bort answered 22/5, 2014 at 16:19 Comment(0)
