This is something I've been encountering very often lately: I need to scrape data from multiple requests to build a single item.
I've been using the request meta to accumulate data between requests, like this:
def parse_data(self, response):
    data = 'something'
    yield scrapy.Request(
        url='url for another page for scraping images',
        method='GET',
        callback=self.parse_images,
        meta={'data': data},
    )

def parse_images(self, response):
    images = ['some images']
    data = response.meta['data']
    yield scrapy.Request(
        url='url for another page for scraping more data',
        method='GET',
        callback=self.parse_more,
        meta={'images': images, 'data': data},
    )

def parse_more(self, response):
    more_data = 'more data'
    images = response.meta['images']
    data = response.meta['data']
    item = {'data': data, 'images': images, 'more_data': more_data}
    yield item
In the last parse method, I scrape the final piece of data and yield the item. However, this approach looks awkward to me. Is there a better way to scrape pages like this, or am I doing it correctly?
This is the proper way of tracking your item throughout the requests. What I would do differently, though, is set the item values directly and pass the item itself along, like so:
item['foo'] = bar
item['bar'] = foo
yield scrapy.Request(url, callback=self.parse, meta={'item': item})
With this approach you only have to pass one thing, the item itself, through each request. There will be some cases where this isn't desirable.
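For context, here is a minimal sketch of that pattern end to end. The spider name, URLs, and field names are placeholders for illustration, not from your actual spider:

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'product'
    start_urls = ['http://example.com/product']

    def parse(self, response):
        # Start the item with whatever this first page provides
        item = {'data': 'something'}
        yield scrapy.Request(
            'http://example.com/product/images',
            callback=self.parse_images,
            meta={'item': item},
        )

    def parse_images(self, response):
        # Pull the partially filled item back out of meta and keep adding to it
        item = response.meta['item']
        item['images'] = ['some images']
        yield scrapy.Request(
            'http://example.com/product/details',
            callback=self.parse_more,
            meta={'item': item},
        )

    def parse_more(self, response):
        # Final page: add the last fields and yield the finished item
        item = response.meta['item']
        item['more_data'] = 'more data'
        yield item

If you are on a recent Scrapy version, Request.cb_kwargs is another way to pass data to callbacks and keeps meta free for middleware use, but the meta approach above works the same way.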