I have a list of URLs in a CSV file. I load this file into a pandas DataFrame and use the Links column as start URLs:
start_urls = df['Links'].tolist()  # a plain list, not a pandas Series
Each link has this format:
http://www.bbb.org/search/?type=name&input=%28408%29+998-0983&location=&tobid=&filter=business&radius=&country=USA%2CCAN&language=en&codeType=YPPA
This link corresponds to the phone number (408) 998-0983, which appears URL-encoded in the link as %28408%29+998-0983.
For each of the pages in df['Links'] I scrape some data and save it in an item; so far so good. The problem is that the order in which Scrapy requests the pages is not the same as the order in the DataFrame, so I can't merge the scraped data with the file I already have: the rows don't match. I would also like to handle the case where a page doesn't have the data, and return a string instead. In which part of the code could I do that? This is what I'm doing right now:
def parse(self, response):
    producto = Product()
    producto['BBB_link'] = response.xpath(
        '//*[@id="container"]/div/div[1]/div[3]/table/tbody/tr[1]/td/h4[1]/a'
    ).extract()
The first part of your question is answered here, which suggests overriding start_requests() to attach metadata to each request. In your case you could pass the phone number as meta, but any convenient key back into your DataFrame would do. The order of the scraped data won't change, but you will have enough information to relate each item back to the original data in a database or spreadsheet.
from scrapy import Request
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):

    def start_requests(self):
        ...
        yield Request(url1, meta={'phone_no': '(408) 998-0983'},
                      callback=self.parse)
        ...

    def parse(self, response):
        ...
        item['phone_no'] = response.meta['phone_no']
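Since the start URLs come from the DataFrame anyway, one way to build the requests is to iterate over it in start_requests() and carry the row index and the decoded phone number along as meta. A minimal sketch, assuming Python 3 and a my_links.csv file containing the Links column described above (both names are placeholders):

import pandas as pd
from urllib.parse import parse_qs, urlparse
from scrapy import Request
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'bbb'

    def start_requests(self):
        df = pd.read_csv('my_links.csv')  # placeholder file name
        for idx, url in df['Links'].items():
            # parse_qs decodes the 'input' query parameter,
            # e.g. %28408%29+998-0983 -> '(408) 998-0983'
            phone = parse_qs(urlparse(url).query).get('input', [''])[0]
            yield Request(url,
                          meta={'row': idx, 'phone_no': phone},
                          callback=self.parse)

Either key works for the merge later: phone_no if the phone number is unique per row, or row (the DataFrame index) if it might not be.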
For the case where no data is found, you can test the list returned by your XPath: if it is empty, nothing was found. Note that you need to test the extracted list itself, not the Product item, since an item with any field set is always truthy.
def parse(self, response):
    item = Product()
    item['phone_no'] = response.meta['phone_no']
    link = response.xpath(
        '//*[@id="container"]/div/div[1]/div[3]/table/tbody/tr[1]/td/h4[1]/a'
    ).extract()
    if link:  # non-empty list: the element exists on the page
        item['BBB_link'] = link
        # ... parse the rest of the page as normal ...
        item['status'] = 'found ok'
    else:
        item['status'] = 'not found'
    yield item
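With the phone number (or row index) stored on every item, merging the scraped output back into the original file becomes a plain pandas join, regardless of crawl order. A sketch, assuming the items were exported with scrapy crawl bbb -o scraped.csv and carry the row field from the start_requests sketch above (file names are placeholders):

import pandas as pd

df = pd.read_csv('my_links.csv')      # the original file
scraped = pd.read_csv('scraped.csv')  # the spider's exported items
# Align on the DataFrame row index carried through as 'row' meta,
# so the order Scrapy crawled the pages in no longer matters.
merged = df.join(scraped.set_index('row'), how='left')
merged.to_csv('merged.csv', index=False)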