python loops web-scraping scrapy

Scrape using multiple POST data from the same URL


I have already created a spider that collects a list of company names with matching phone numbers and saves them to a CSV file.

I now want to scrape data from another site, using the phone numbers in the CSV file as POST data. The spider should keep posting to the same start URL, scraping the data that each phone number produces, until there are no more numbers left in the CSV file.

This is what I have got so far:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
from scrapy import log
import sys
from scrapy.shell import inspect_response
from btw.items import BtwItem
import csv

class BtwSpider(BaseSpider):
    name = "btw"
    allowed_domains = ["siteToScrape.com"]
    start_urls = ["http://www.siteToScrape.com/broadband/broadband_checker"] 

    def parse(self, response):
        phoneNumbers = ['01253873647','01253776535','01142726749']

        return [FormRequest.from_response(response,formdata={'broadband_checker[phone]': phoneNumbers[1]},callback=self.after_post)]


    def after_post(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//div[@id="results"]')
       items = []
       for site in sites:
           item = BtwItem()

           fttcText = site.select("div[@class='content']/div[@id='btfttc']/ul/li/text()").extract()

           # Now we will change the text to be a boolean value
           if fttcText[0].count('not') > 0:
               fttcEnabled=0
           else:
               fttcEnabled=1

           item['fttcAvailable'] = fttcEnabled
           items.append(item)
       return items

At the moment I am just trying to get this to loop through a list (phoneNumbers), but I have not even managed to get that to work. Once I know how to do that, I will be able to read the numbers from a CSV file myself. In its current state the spider only uses the phone number at index 1 of the list.


Solution

  • Assuming you have a phones.csv file with one phone number per line:

    01253873647
    01253776535
    01142726749
    

    Here's your spider:

    import csv
    from scrapy.item import Item, Field
    
    from scrapy.spider import BaseSpider
    from scrapy.http import Request
    from scrapy.http import FormRequest
    from scrapy.selector import HtmlXPathSelector
    
    
    class BtwItem(Item):
        fttcAvailable = Field()
        phoneNumber = Field()
    
    
    class BtwSpider(BaseSpider):
        name = "btw"
        allowed_domains = ["samknows.com"]
    
        def start_requests(self):
            yield Request("http://www.samknows.com/broadband/broadband_checker", self.parse_main_page)
    
        def parse_main_page(self, response):
            with open('phones.csv', 'r') as f:
                reader = csv.reader(f)
                for row in reader:
                    phone_number = row[0]
                    yield FormRequest.from_response(response,
                                                    formdata={'broadband_checker[phone]': phone_number},
                                                    callback=self.after_post,
                                                    meta={'phone_number': phone_number})
    
        def after_post(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//div[@id="results"]')
    
            phone_number = response.meta['phone_number']
            for site in sites:
                item = BtwItem()
    
                fttc = site.select("div[@class='content']/div[@id='btfttc']/ul/li/text()").extract()
                item['phoneNumber'] = phone_number
                # FTTC is available only when the result text does not contain 'not'
                item['fttcAvailable'] = 'not' not in fttc[0]
    
                yield item
    

    Here's what was scraped after running it:

    {'fttcAvailable': True, 'phoneNumber': '01253873647'}
    {'fttcAvailable': True, 'phoneNumber': '01253776535'}
    {'fttcAvailable': False, 'phoneNumber': '01142726749'}
    

    The idea is to request the main page in start_requests, then read the CSV file row by row in the callback and yield a new FormRequest for each phone number (one per CSV row). The phone number is also passed to the after_post callback through the meta dictionary so that it can be written to the item; without it you could not tell which result belongs to which number.
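
The per-row loop can be tried out without Scrapy. The following is only a minimal sketch of that step: build_form_payloads is a hypothetical helper (not part of Scrapy or of the spider above), and the in-memory sample stands in for phones.csv. It yields the formdata and meta dictionaries that would be handed to FormRequest.from_response for each number, and additionally skips blank rows, which the spider above does not do.

```python
import csv
import io


def build_form_payloads(csv_file, field_name="broadband_checker[phone]"):
    """Yield one (formdata, meta) pair per phone number in the CSV.

    Hypothetical helper for illustration; it mirrors what parse_main_page
    passes to FormRequest.from_response for every row.
    """
    for row in csv.reader(csv_file):
        # csv.reader returns [] for a blank line; skip those and empty cells.
        if not row or not row[0].strip():
            continue
        # Keep the number as a string so the leading zero is preserved.
        phone_number = row[0].strip()
        formdata = {field_name: phone_number}
        meta = {"phone_number": phone_number}
        yield formdata, meta


# In-memory stand-in for phones.csv (made-up layout with one blank line).
sample = io.StringIO("01253873647\n\n01253776535\n01142726749\n")
payloads = list(build_form_payloads(sample))

for formdata, meta in payloads:
    print(formdata, meta)
```

Because each request carries its own meta dictionary, the order in which responses arrive does not matter: every callback still knows which phone number it belongs to.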

    Hope that helps.
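
One more detail: the availability check in after_post indexes the first extracted text node directly, which raises an IndexError when the XPath matches nothing. A small defensive sketch of that conversion follows. The helper name fttc_available and the sample strings are made up for illustration, and it assumes, as the question's code does, that the result text contains the word "not" when FTTC is unavailable.

```python
def fttc_available(text_nodes):
    """Map the extracted <li> text nodes to an availability flag.

    Returns None when nothing was extracted, so a missing result is not
    silently reported as "available" or "unavailable".
    """
    if not text_nodes:
        return None
    first = text_nodes[0].strip().lower()
    # Assumption carried over from the question: 'not' in the text
    # means FTTC is not available for this number.
    return "not" not in first


# Hypothetical result strings; the real site's wording may differ.
print(fttc_available(["FTTC is not available at your exchange"]))
print(fttc_available(["FTTC is available at your exchange"]))
print(fttc_available([]))
```

In the spider, the return value could be assigned to item['fttcAvailable'] in place of the inline expression.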