python · web-scraping · scrapy · request

Proper way of collecting data from multiple sources for a single item


This is something I've been encountering very often lately: I am supposed to scrape data from multiple requests for a single item.

I've been using the request's meta to accumulate data between requests, like this:

def parse_data(self, response):
    data = 'something'

    yield scrapy.Request(
        url='url for another page for scraping images',
        method='GET',
        callback=self.parse_images,
        meta={'data': data},
    )

def parse_images(self, response):
    images = ['some images']
    data = response.meta['data']

    yield scrapy.Request(
        url='url for another page for scraping more data',
        method='GET',
        callback=self.parse_more,
        meta={'images': images, 'data': data},
    )

def parse_more(self, response):
    more_data = 'more data'
    images = response.meta['images']
    data = response.meta['data']

    yield {'data': data, 'images': images, 'more_data': more_data}

In the last parse method, I scrape the final piece of data and yield the item. However, this approach looks awkward to me. Is there a better way to scrape pages like these, or am I doing this correctly?


Solution

  • This is the proper way of tracking your item across requests. What I would do differently, though, is set the item values as you go, like so:

    item['foo'] = bar
    item['bar'] = foo
    yield scrapy.Request(url, callback=self.parse, meta={'item':item})
    

    With this approach you only have to pass one thing, the item itself, through each time. There will be some instances where this isn't desirable, for example when a later request fails or is filtered out and the partially built item is lost with it.
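
    For concreteness, here is a minimal sketch of that pattern as a complete spider. The URLs, item fields, and values below are placeholders for illustration, not taken from the question:

    import scrapy

    class ProductItem(scrapy.Item):
        data = scrapy.Field()
        images = scrapy.Field()
        more_data = scrapy.Field()

    class ProductSpider(scrapy.Spider):
        name = 'products'
        start_urls = ['https://example.com/products']  # placeholder URL

        def parse(self, response):
            # Create the item once, then keep filling it in across requests.
            item = ProductItem()
            item['data'] = 'something'
            yield scrapy.Request(
                url='https://example.com/images',  # placeholder URL
                callback=self.parse_images,
                meta={'item': item},
            )

        def parse_images(self, response):
            item = response.meta['item']
            item['images'] = ['some images']
            yield scrapy.Request(
                url='https://example.com/more-data',  # placeholder URL
                callback=self.parse_more,
                meta={'item': item},
            )

        def parse_more(self, response):
            item = response.meta['item']
            item['more_data'] = 'more data'
            yield item

    Since the item travels with each request, nothing needs to be reassembled at the end; the last callback simply yields the finished item.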