Following Scrapy's tutorial, I made a simple image crawler (it scrapes images of Bugattis), which is shown below under EXAMPLE.
However, following the guide has left me with a non-functioning crawler! It finds all of the URLs, but it does not download the images.
I found a duct-tape solution: replace ITEM_PIPELINES and IMAGES_STORE such that
ITEM_PIPELINES['scrapy.pipelines.files.FilesPipeline'] = 1
and
IMAGES_STORE -> FILES_STORE
But I do not know why this works. I would like to use the ImagesPipeline as documented by Scrapy.
EXAMPLE
settings.py
BOT_NAME = 'imagespider'
SPIDER_MODULES = ['imagespider.spiders']
NEWSPIDER_MODULE = 'imagespider.spiders'
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = "/home/user/Desktop/imagespider/output"
items.py
import scrapy


class ImageItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
imagespider.py
from imagespider.items import ImageItem
import scrapy


class ImageSpider(scrapy.Spider):
    name = "imagespider"
    start_urls = (
        "https://www.find.com/search=bugatti+veyron",
    )

    def parse(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(file_urls=[img_url])
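The item your spider returns must contain the field "file_urls" for files and/or "image_urls" for images. In your code you configure the images pipeline, but you return the URLs in "file_urls". The ImagesPipeline never looks at that field, so it has nothing to download. Your workaround works because the FilesPipeline does read "file_urls" and stores its results under FILES_STORE.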
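To make the mismatch concrete, here is a minimal sketch in plain Python (it does not import Scrapy). The table lists the default field and setting names each built-in pipeline reads. The helper find_mismatches is hypothetical and only illustrates the check; it is not part of Scrapy.

```python
# Default item fields and storage setting that each built-in media pipeline reads.
# The names in this table are Scrapy's defaults; the helper below is hypothetical.
PIPELINE_DEFAULTS = {
    "scrapy.pipelines.files.FilesPipeline": {
        "urls_field": "file_urls",
        "result_field": "files",
        "store_setting": "FILES_STORE",
    },
    "scrapy.pipelines.images.ImagesPipeline": {
        "urls_field": "image_urls",
        "result_field": "images",
        "store_setting": "IMAGES_STORE",
    },
}


def find_mismatches(pipeline_path, item_fields, settings):
    """Return a list of human-readable problems for one enabled pipeline."""
    expected = PIPELINE_DEFAULTS[pipeline_path]
    problems = []
    if expected["urls_field"] not in item_fields:
        problems.append(f"item has no '{expected['urls_field']}' field")
    if expected["store_setting"] not in settings:
        problems.append(f"settings lack {expected['store_setting']}")
    return problems


# The setup from the question: ImagesPipeline enabled, but the item uses file_urls.
question_fields = {"file_urls", "files"}
output_dir = "/home/user/Desktop/imagespider/output"

print(find_mismatches("scrapy.pipelines.images.ImagesPipeline",
                      question_fields, {"IMAGES_STORE": output_dir}))

# The workaround: FilesPipeline plus FILES_STORE matches the item's fields.
print(find_mismatches("scrapy.pipelines.files.FilesPipeline",
                      question_fields, {"FILES_STORE": output_dir}))
```

The first call reports the missing "image_urls" field and the second reports nothing, which is why switching both the pipeline and the store setting made the downloads start.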
Simply change this line:
yield ImageItem(file_urls=[img_url])
# to
yield {'image_urls': [img_url]}
* Scrapy spiders can yield plain dictionaries instead of Item objects, which saves time when you only have one or two fields.
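As a rough sketch of the dictionary approach, the item can be built by a small helper. The helper name build_image_item and the example.com URL are hypothetical. Skipping a missing src and resolving a relative src against the page URL are assumptions added here, not something the answer above requires.

```python
from urllib.parse import urljoin


def build_image_item(page_url, src):
    """Return a dict item for the images pipeline, or None if src is missing.

    Hypothetical helper: resolves a possibly relative src against page_url.
    """
    if not src:
        return None
    return {"image_urls": [urljoin(page_url, src)]}


# Inside the spider's parse() this might be used as:
#     item = build_image_item(response.url, elem.xpath("@src").extract_first())
#     if item is not None:
#         yield item
print(build_image_item("https://example.com/search", "/img/veyron.jpg"))
```

Keeping the item construction in a plain function also makes it easy to check without running a crawl.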