python, web-scraping, scrapy

Can't follow links when web-scraping


I realize that others have covered similar topics, but having read these posts, I still can't solve my problem.

I am using Scrapy to write a crawl spider that should scrape search results pages. One example could be the results for all the 1 bed room apartments in the bay area on CraigsList.org. They are found here:

http://sfbay.craigslist.org/search/apa?zoomToPosting=&query=&srchType=A&minAsk=&maxAsk=&bedrooms=1

This shows the first 100 one-bedroom apartments in the Bay Area. The 101st to 200th apartments are on this page:

http://sfbay.craigslist.org/search/apa?bedrooms=1&srchType=A&s=100

For each subsequent page of 100 results, "&s=100" becomes "&s=200", and so on. Let's say I want the name of the first post on each results page. I know that isn't very meaningful in itself, but it keeps the example simple.

My problem is how to write the rule so that "&s=100" is incremented to "&s=200" etc. This is what I have:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Torrent(Item):
    name = Field()

class MySpiderSpider(CrawlSpider):

    name = 'MySpider'
    allowed_domains = ['sfbay.craigslist.org']  # domain only, no scheme
    start_urls = ['http://sfbay.craigslist.org/search/apa?zoomToPosting=&query=&srchType=A&minAsk=&maxAsk=&bedrooms=1']
    rules = [Rule(SgmlLinkExtractor(allow=[r'&s=\d+']), 'parse_torrent', follow=True)]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()
        torrent['name'] = x.select("id('toc_rows')/p[2]/span[1]/a/text()").extract()
        return torrent

Can anyone set my rule straight, so that I get the name of the first post for each of the result pages?

Thanks!


Solution

  • Since you're simply pulling information out of each index page, you could just generate a list of the appropriate start URLs and use a BaseSpider instead. No rules are required, and it's a lot simpler to work with.

    from scrapy.spider import BaseSpider
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item, Field
    
    class Torrent(Item):
        name = Field()
    
    class MySpiderSpider(BaseSpider):
        name = 'MySpider'
        allowed_domains = ['sfbay.craigslist.org']  # domain only -- including the scheme makes the offsite filter drop every request
        start_urls = ['http://sfbay.craigslist.org/search/apa?bedrooms=1&srchType=A&s=%d' % n for n in xrange(0, 2500, 100)]
    
        def parse(self, response):
            x = HtmlXPathSelector(response)
            torrent = Torrent()
            torrent['name'] = x.select("id('toc_rows')/p[2]/span[1]/a/text()").extract()
            return torrent
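As a sanity check, the list comprehension above just expands to one URL per page offset. Here is a quick sketch of the same idea in plain Python (no Scrapy needed), assuming the 2500 upper bound is an arbitrary cap you'd adjust to however many result pages actually exist:

```python
# Generate the paginated search URLs, stepping the s= offset by 100 per page.
base = 'http://sfbay.craigslist.org/search/apa?bedrooms=1&srchType=A&s=%d'
start_urls = [base % n for n in range(0, 2500, 100)]

print(start_urls[0])    # offset 0: the first results page
print(start_urls[2])    # offset 200: the 201st-300th results
print(len(start_urls))  # 25 pages in total
```

If you later need to cover an unknown number of pages, you'd instead follow the site's own "next page" link from each response rather than hard-coding a bound.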