Scrapy: Images Pipeline, download images

Asked 26/7, 2016 at 11:53 Answered 6/5, 2021 at 15:41

Following Scrapy's tutorial, I made a simple image crawler (it scrapes images of Bugattis), shown below under EXAMPLE.

However, following the guide has left me with a non-functioning crawler! It finds all of the URLs, but it does not download the images.

I found a duct-tape solution: change ITEM_PIPELINES and IMAGES_STORE as follows:

ITEM_PIPELINES['scrapy.pipelines.files.FilesPipeline'] = 1 and

IMAGES_STORE -> FILES_STORE

But I do not know why this works. I would like to use the ImagesPipeline as documented by Scrapy.

EXAMPLE

settings.py

BOT_NAME = 'imagespider'
SPIDER_MODULES = ['imagespider.spiders']
NEWSPIDER_MODULE = 'imagespider.spiders'
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = "/home/user/Desktop/imagespider/output"

items.py

import scrapy

class ImageItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

imagespider.py

from imagespider.items import ImageItem
import scrapy


class ImageSpider(scrapy.Spider):
    name = "imagespider"

    start_urls = (
        "https://www.find.com/search=bugatti+veyron",
    )

    def parse(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(file_urls=[img_url])

Anfractuosity answered 26/7, 2016 at 11:53 Comment(2)

Could you please post the __main__ stub? How do we enter these functions? – Orianna 4/2, 2019 at 17:3

__main__ would be standard Scrapy boilerplate. It would invoke a spider, which is this code. I agree with you that the code is incomplete; however, one could speculate what the other moving parts would look like. – Psychometrics 21/12, 2020 at 5:55

The item your spider returns must contain the field "file_urls" for files and/or "image_urls" for images. In your code you configure the Images pipeline, but you return the URLs in "file_urls".

Simply change this line:

yield ImageItem(file_urls=[img_url])
# to
yield {'image_urls': [img_url]}

* Scrapy spiders can yield plain dictionaries instead of Item objects, which saves time when you only have one or two fields.
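
To catch this kind of mismatch before running a crawl, one option is a small check that compares the enabled pipelines against the keys on an item. This is only a sketch: `missing_url_fields` is a hypothetical helper, not part of Scrapy, and the example URL is made up. It assumes the default field names ("image_urls" and "file_urls") and ignores overrides through the IMAGES_URLS_FIELD and FILES_URLS_FIELD settings.

```python
# Default URL field each built-in media pipeline reads from an item.
# Assumption: the defaults are not overridden via IMAGES_URLS_FIELD
# or FILES_URLS_FIELD in settings.py.
DEFAULT_URL_FIELDS = {
    "scrapy.pipelines.images.ImagesPipeline": "image_urls",
    "scrapy.pipelines.files.FilesPipeline": "file_urls",
}


def missing_url_fields(item_pipelines, item):
    """Return (pipeline, field) pairs where the item lacks the expected field.

    Hypothetical helper for illustration; not a Scrapy API.
    """
    problems = []
    for pipeline_path in item_pipelines:
        field = DEFAULT_URL_FIELDS.get(pipeline_path)
        if field is not None and field not in item:
            problems.append((pipeline_path, field))
    return problems


pipelines = {"scrapy.pipelines.images.ImagesPipeline": 1}

# Item shaped like the one in the question: URLs stored under "file_urls".
broken = {"file_urls": ["https://example.com/veyron.jpg"]}
# Corrected item: URLs stored under "image_urls".
fixed = {"image_urls": ["https://example.com/veyron.jpg"]}

print(missing_url_fields(pipelines, broken))
print(missing_url_fields(pipelines, fixed))
```

The first call reports that the images pipeline expects "image_urls"; the second returns an empty list because the corrected item carries that field.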

Knopp answered 26/7, 2016 at 12:58 Comment(1)

Thanks! You could also change ImageItem to have image_urls and yield ImageItem(image_urls=[img_url]) – Anfractuosity 26/7, 2016 at 13:53

I spent hours investigating why the built-in ImagesPipeline doesn't work on my local machine. Finally, I found this in the documentation:

The Images Pipeline requires Pillow 4.0.0 or greater. It is used for thumbnailing and normalizing images to JPEG/RGB format.

After installing Pillow, it works normally.
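
Since the failure is easy to miss, a quick check that Pillow is importable in the same environment as the crawler can save time. A minimal sketch using only the standard library; `module_available` is a hypothetical helper name, and "PIL" is the package that Pillow installs.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


# Pillow is imported as "PIL", so that is the name to look for.
if module_available("PIL"):
    print("Pillow found")
else:
    print("Pillow missing: try 'pip install Pillow'")
```

Run it with the same interpreter (or virtualenv) that runs Scrapy; a different environment may give a different answer.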

Fewell answered 6/5, 2021 at 15:41 Comment(1)

Wow, I wish this was more visible in the scraping output - the silent failure is surprising. Docs are here. Thanks for sharing! – Ryurik 8/9, 2022 at 1:6
