I realize that others have covered similar topics, but having read these posts, I still can't solve my problem.
I am using Scrapy to write a crawl spider that should scrape search results pages. One example would be the results for all the one-bedroom apartments in the Bay Area on Craigslist.org. They are found here:
http://sfbay.craigslist.org/search/apa?zoomToPosting=&query=&srchType=A&minAsk=&maxAsk=&bedrooms=1
This shows the first 100 one-bedroom apartments in the Bay Area. The 101st to 200th apartments are on this page:
http://sfbay.craigslist.org/search/apa?bedrooms=1&srchType=A&s=100
For each subsequent 100 results, "&s=100" is changed to "&s=200", and so on. Let's say I want the name of the first post on each of these result pages. I know it isn't very meaningful, but it's just a simple example.
My problem is how to write the rule so that "&s=100" is incremented to "&s=200" etc. This is what I have:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Torrent(Item):
    name = Field()

class MySpiderSpider(CrawlSpider):
    name = 'MySpider'
    allowed_domains = ['http://sfbay.craigslist.org']
    start_urls = ['http://sfbay.craigslist.org/search/apa?zoomToPosting=&query=&srchType=A&minAsk=&maxAsk=&bedrooms=1']
    rules = [Rule(SgmlLinkExtractor(allow=[r'&s=\d+']), 'parse_torrent', follow=True)]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()
        torrent['name'] = x.select("id('toc_rows')/p[2]/span[1]/a/text()").extract()
        return torrent
Can anyone set my rule straight, so that I get the name of the first post for each of the result pages?
Thanks!
Since you're simply pulling information out of each index page, you can just generate a list of appropriate start URLs and then use a BaseSpider instead. No rules are required, and it's a lot simpler to use.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Torrent(Item):
    name = Field()

class MySpiderSpider(BaseSpider):
    name = 'MySpider'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['sfbay.craigslist.org']
    # one start URL per results page: s=0, 100, 200, ..., 2400
    start_urls = ['http://sfbay.craigslist.org/search/apa?bedrooms=1&srchType=A&s=%d' % n for n in xrange(0, 2500, 100)]

    def parse(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()
        # extract() returns the text of every matching link on this page
        torrent['name'] = x.select("id('toc_rows')/p[2]/span[1]/a/text()").extract()
        return torrent
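Side note: extract() gives back a list of every match on the page, so the item above ends up holding all of them. If you literally only want the first post's name per page, you could keep just the first element; a minimal tweak to the parse method (the names variable and the fallback to None are my own choices, not from the original code) would be something like:

def parse(self, response):
    x = HtmlXPathSelector(response)
    # grab every matching text node, then keep only the first one (if any)
    names = x.select("id('toc_rows')/p[2]/span[1]/a/text()").extract()
    torrent = Torrent()
    torrent['name'] = names[0] if names else None
    return torrent

You can then run the spider and dump the items with something like scrapy crawl MySpider -o results.json -t json (if I remember right, older Scrapy releases need the -t flag to pick the export format).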