[SOLVED] scrapy xpath selector repeats data

I am trying to extract the business name and address from each listing and export them to a CSV file, but the output CSV is wrong. I suspect that bizs = hxs.select("//div[@class='listing_content']") is causing the problem.

yp_spider.py

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from yp.items import Biz

class MySpider(BaseSpider):
    name = "ypages"
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/sanfrancisco/restaraunts"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        bizs = hxs.select("//div[@class='listing_content']")
        items = []

        for biz in bizs:
            item = Biz()
            item['name'] = biz.select("//h3/a/text()").extract()
            item['address'] = biz.select("//span[@class='street-address']/text()").extract()
            print item
            items.append(item)

        return items

items.py

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class Biz(Item):
    name = Field()
    address = Field()

    def __str__(self):
        return "Website: name=%s address=%s" %  (self.get('name'), self.get('address'))

The output of 'scrapy crawl ypages -o list.csv -t csv' is a long list of business names followed by a long list of addresses, and the same data is repeated several times.

Solution

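Add a leading "." to each XPath inside the loop so that it is evaluated relative to the current node, for example ".//h3/a/text()" instead of "//h3/a/text()". The Scrapy documentation (http://doc.scrapy.org/en/0.16/topics/selectors.html) explains: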

At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:

>>> for p in divs.select('//p'):  # this is wrong - gets all <p> from the whole document
...     print p.extract()

This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.select('.//p'):  # extracts all <p> inside
...     print p.extract()
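
To see why the missing dot produces repeated rows, here is a minimal sketch that mirrors the structure of the listings in the question. It is not Scrapy code: it uses Python 3 and the standard library's xml.etree.ElementTree as a stand-in so that it runs without any installation, and the markup and business names are made up for illustration. ElementTree refuses absolute paths on an element, so the document-wide search is emulated by querying the root, which is roughly what "//h3/a" does when called on a Scrapy selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real listings page.
PAGE = """
<html><body>
  <div class="listing_content">
    <h3><a>Alpha Diner</a></h3>
    <span class="street-address">1 Main St</span>
  </div>
  <div class="listing_content">
    <h3><a>Beta Cafe</a></h3>
    <span class="street-address">2 Oak Ave</span>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
listings = root.findall(".//div[@class='listing_content']")

# Document-wide search: every listing sees every name, so the rows repeat.
# The root is queried directly to emulate an absolute '//h3/a' lookup.
repeated = [[a.text for a in root.findall(".//h3/a")] for _ in listings]

# Relative search: the leading '.' starts at the current listing only.
scoped = [
    {
        "name": [a.text for a in biz.findall(".//h3/a")],
        "address": [
            s.text for s in biz.findall(".//span[@class='street-address']")
        ],
    }
    for biz in listings
]

print(repeated)
print(scoped)
```

In the sketch, repeated holds the full list of names once per listing, which matches the symptom in the question, while scoped holds exactly one name and one address per listing. Applied to the spider, the same change means using biz.select(".//h3/a/text()") and biz.select(".//span[@class='street-address']/text()") inside the loop.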