I realize that others have covered similar topics, but having read these posts, I still can't solve my problem.
I am using Scrapy to write a crawl spider that should scrape search results pages. One example could be the results for all the 1-bedroom apartments in the Bay Area on Craigslist. They are found here:
This shows the first 100 1-bedroom apartments in the Bay Area. The 101st to 200th apartments are on this page
For the next 100, "&s=100" is changed to "&s=200", and so on. Let's say I want the name of the first post on each of these results pages. I know it isn't very meaningful, but it's just to have a simple example.
My problem is how to write the rule so that "&s=100" is incremented to "&s=200" etc. This is what I have:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field


class Torrent(Item):
    name = Field()


class MySpiderSpider(CrawlSpider):
    name = 'MySpider'
    allowed_domains = ['http://sfbay.craigslist.org']
    start_urls = ['http://sfbay.craigslist.org/search/apa?zoomToPosting=&query=&srchType=A&minAsk=&maxAsk=&bedrooms=1']
    rules = [Rule(SgmlLinkExtractor(allow=[r'&s=\d+']), 'parse_torrent', follow=True)]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()
        torrent['name'] = x.select("id('toc_rows')/p[2]/span[1]/a/text()").extract()
        return torrent
Can anyone set my rule straight, so that I get the name of the first post for each of the result pages?
Since you're simply pulling information out of each index page, you could just generate a list of the appropriate start URLs and use a BaseSpider instead. No rules are required, and it's a lot simpler to use.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field


class Torrent(Item):
    name = Field()


class MySpiderSpider(BaseSpider):
    name = 'MySpider'
    # allowed_domains takes bare domain names, not URLs with a scheme
    allowed_domains = ['sfbay.craigslist.org']
    # One start URL per results page: s=0, 100, ..., 2400
    start_urls = ['http://sfbay.craigslist.org/search/apa?bedrooms=1&srchType=A&s=%d' % n
                  for n in xrange(0, 2500, 100)]

    def parse(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()
        # XPath taken from the question; meant to select the first post's name
        torrent['name'] = x.select("id('toc_rows')/p[2]/span[1]/a/text()").extract()
        return torrent
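If it helps to see the URL generation on its own, here is a minimal sketch in plain Python 3 (so `range` rather than `xrange`) with no Scrapy dependency. The helper name `build_start_urls` and its parameters are my own, purely for illustration; the base URL and query parameters are the ones used in the spider above, and the total number of results is something you would have to supply yourself.

```python
from urllib.parse import urlencode

# Base search URL taken from the spider above.
BASE_URL = "http://sfbay.craigslist.org/search/apa"


def build_start_urls(total_results, page_size=100):
    """Return one search URL per results page by stepping the `s` offset.

    `build_start_urls` is a hypothetical helper, not part of Scrapy.
    """
    urls = []
    for offset in range(0, total_results, page_size):
        # Same query parameters as the answer's URL, with `s` as the offset.
        params = [("bedrooms", 1), ("srchType", "A"), ("s", offset)]
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    # Print the URLs for the first three results pages.
    for url in build_start_urls(300):
        print(url)
```

The resulting list could be assigned to `start_urls` in the spider. Note that this approach assumes you know (or are happy to overestimate) how many results there are, since nothing here discovers the last page.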