python xpath href google-scholar

Identifying issue in retrieving href from Google Scholar


I'm having trouble scraping links and article names from Google Scholar. I'm unsure whether the issue is with my code, the XPath I'm using to retrieve the data, or both.

I've spent the past few hours debugging and consulting other Stack Overflow questions, with no success.

import scrapy
from scrapyproj.items import ScrapyProjItem

class scholarScrape(scrapy.Spider):

    name = "scholarScraper"
    allowed_domains = "scholar.google.com"
    start_urls=["https://scholar.google.com/scholar?hl=en&oe=ASCII&as_sdt=0%2C44&q=rare+disease+discovery&btnG="]

    def parse(self,response):
        item = ScrapyProjItem()
        item['hyperlink'] = item.xpath("//h3[class=gs_rt]/a/@href").extract()
        item['name'] = item.xpath("//div[@class='gs_rt']/h3").extract()
        yield item

The error message I keep getting is "AttributeError: xpath", so I believe the issue lies with the XPath I'm using to retrieve the data, but I could be mistaken.


Solution

  • Adding my comment as an answer, as it solved the problem:

    The issue is with scrapyproj.items.ScrapyProjItem objects: they do not have an xpath attribute, which is why calling item.xpath(...) raises the AttributeError. Is this an official Scrapy class? I think you meant to call xpath on response:

    item['hyperlink'] = response.xpath("//h3[class=gs_rt]/a/@href").extract()
    item['name'] = response.xpath("//div[@class='gs_rt']/h3").extract()
    

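    Also, the first path expression needs two fixes in its predicate: an @ in front of class, so that it tests the attribute rather than a child element named class, and quotes around the attribute value "gs_rt":

    item['hyperlink'] = response.xpath("//h3[@class='gs_rt']/a/@href").extract()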
    

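    Apart from that, the XPath expressions are syntactically fine. Note that they disagree about which element carries the gs_rt class (h3 in the first, div in the second), so check both against the page's actual markup.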
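
To see why the predicate matters, here is a minimal sketch using only the standard library. In XPath, `[@class='gs_rt']` tests the class attribute, while `[class='gs_rt']` tests a child element named class. The sample markup below is made up for illustration (it is an assumption, not Google Scholar's real HTML), and `xml.etree.ElementTree` is used as a stand-in for Scrapy's selectors: it supports only a subset of XPath, so the href is read with `.get()` instead of `/@href`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page structure may differ.
SAMPLE = """
<div>
  <div class="gs_ri">
    <h3 class="gs_rt"><a href="https://example.org/paper-1">First paper</a></h3>
  </div>
  <div class="gs_ri">
    <h3 class="gs_rt"><a href="https://example.org/paper-2">Second paper</a></h3>
  </div>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# Without "@": looks for an <h3> that has a CHILD ELEMENT <class>
# whose text is "gs_rt". No such element exists, so nothing matches.
without_at = root.findall(".//h3[class='gs_rt']/a")

# With "@": looks for an <h3> whose class ATTRIBUTE equals "gs_rt".
with_at = root.findall(".//h3[@class='gs_rt']/a")

# ElementTree cannot select attributes with "/@href", so read them directly.
links = [a.get("href") for a in with_at]
names = ["".join(a.itertext()) for a in with_at]

print(len(without_at), links, names)
```

In a Scrapy spider the same predicate would go into `response.xpath(...)`, where `/@href` works directly.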