I'm having trouble scraping links and article names from Google Scholar. I'm unsure whether the issue is with my code, with the XPath I'm using to retrieve the data, or with both.
I've spent the past few hours debugging and consulting other Stack Overflow questions, without success.
import scrapy
from scrapyproj.items import ScrapyProjItem

class scholarScrape(scrapy.Spider):
    name = "scholarScraper"
    allowed_domains = "scholar.google.com"
    start_urls=["https://scholar.google.com/scholar?hl=en&oe=ASCII&as_sdt=0%2C44&q=rare+disease+discovery&btnG="]

    def parse(self,response):
        item = ScrapyProjItem()
        item['hyperlink'] = item.xpath("//h3[class=gs_rt]/a/@href").extract()
        item['name'] = item.xpath("//div[@class='gs_rt']/h3").extract()
        yield item
The error message I keep receiving is "AttributeError: xpath", so I believe the issue lies with the XPath I'm using to retrieve the data, but I could be mistaken.
Adding my comment as an answer, as it solved the problem:
The issue is with scrapyproj.items.ScrapyProjItem objects: they do not have an xpath attribute. Is this an official scrapy class? I think you meant to call xpath on response:
item['hyperlink'] = response.xpath("//h3[class=gs_rt]/a/@href").extract()
item['name'] = response.xpath("//div[@class='gs_rt']/h3").extract()
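Also, the first path expression needs an @ in front of the attribute name and a set of quotes around the attribute value "gs_rt". Without them, class and gs_rt are both read as child element names, so the predicate matches nothing:
item['hyperlink'] = response.xpath("//h3[@class='gs_rt']/a/@href").extract()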
Apart from that, the XPath expressions are fine.
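To check how a predicate on class behaves without hitting Google Scholar, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector. The markup and the example.org URLs are made up for illustration and are not the real page structure. ElementTree supports only a subset of XPath, so the href is read with .get() instead of a trailing /@href step.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: NOT the real Google Scholar page.
SAMPLE = """
<div>
  <div>
    <h3 class="gs_rt"><a href="https://example.org/paper-1">First paper</a></h3>
  </div>
  <div>
    <h3 class="gs_rt"><a href="https://example.org/paper-2">Second paper</a></h3>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)

# Without "@", class is treated as a child element name of h3.
# No h3 here has a <class> child, so nothing matches.
no_at = root.findall(".//h3[class='gs_rt']/a")

# With "@", class is treated as an attribute of h3.
with_at = root.findall(".//h3[@class='gs_rt']/a")

# Collect the link targets and the link texts from the matched <a> elements.
links = [a.get("href") for a in with_at]
names = [a.text for a in with_at]

print(no_at)
print(links)
print(names)
```

The same predicate, [@class='gs_rt'], can then be passed to response.xpath() in the spider.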