html scrapy tripadvisor

response.css html on Scrapy


I'm trying to extract the link (href) behind the "Website" button on this page: https://www.tripadvisor.com/Restaurant_Review-g982688-d1140717-Reviews-Strandkanten-Karlskoga_Orebro_County.html (screenshot: https://monosnap.com/file/agSNP29XoLDlG4HZtntaaifAtFPzcH). I tried response.css('a.dOGcA::attr(href)').extract(), but it returns an empty list. What am I doing wrong? Thanks!


Solution

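  • The link you are trying to scrape is loaded dynamically. If you disable JavaScript in your browser, you will see that the href disappears from the HTML DOM, so a plain Scrapy request never receives it and your selector matches nothing. That is why I use SeleniumRequest (from the scrapy-selenium package) with Scrapy: it renders the page in a real browser first, and I get the desired output.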

    Code:

    import scrapy
    from scrapy_selenium import SeleniumRequest
    
    class LinkSpider(scrapy.Spider):
    
        name = 'link'
    
        def start_requests(self):
            url = 'https://www.tripadvisor.com/Restaurant_Review-g982688-d1140717-Reviews-Strandkanten-Karlskoga_Orebro_County.html'
            yield SeleniumRequest(
                url=url,
                wait_time=3,
                callback=self.parse)
    
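        def parse(self, response):
            # The href only exists in the browser-rendered DOM, which SeleniumRequest provides.
            yield {'Link': response.xpath('//a[@class="dOGcA Ci Wc _S C fhGHT"]/@href').get()}

        # No spider_closed() hook is needed here: the scrapy-selenium middleware
        # quits the browser itself when the spider closes, and the spider has no
        # self.driver attribute to close.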
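
The spider above only works if the scrapy-selenium middleware is enabled in the project settings, which the answer does not show. A minimal sketch of the relevant part of settings.py, assuming Chrome with chromedriver on the PATH (the browser choice and the headless flag are assumptions, not something stated in the answer):

```python
# settings.py (sketch): enable scrapy-selenium so SeleniumRequest is handled.
from shutil import which

# Assumption: Chrome is used; 'firefox' with geckodriver is set up the same way.
SELENIUM_DRIVER_NAME = 'chrome'
# which() returns None if chromedriver is not on the PATH.
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
# Run the browser without a visible window.
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

# Route requests through the Selenium middleware.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}
```

Without the DOWNLOADER_MIDDLEWARES entry, a SeleniumRequest is downloaded like a normal request and the href will still be missing.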
    

    Output:

    2021-10-24 01:25:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.com/Restaurant_Review-g982688-d1140717-Reviews-Strandkanten-Karlskoga_Orebro_County.html>
    {'Link': 'http://www.strandkanten.nu'}
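
To see why the original response.css(...) call came back empty, it can help to compare the markup Scrapy downloads with the markup a browser ends up with. The sketch below uses only the standard library; both HTML snippets are hypothetical stand-ins for the two versions of the page, and LinkFinder and find_links are illustrative names, not part of Scrapy:

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect href values of <a> tags that carry a given class token."""

    def __init__(self, class_token):
        super().__init__()
        self.class_token = class_token
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute is a space-separated list of tokens.
        classes = (attr_map.get("class") or "").split()
        if self.class_token in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def find_links(html, class_token):
    finder = LinkFinder(class_token)
    finder.feed(html)
    return finder.hrefs


# Hypothetical markup as a plain HTTP client might receive it (no href yet).
raw_html = '<a class="dOGcA Ci Wc _S C fhGHT" target="_blank">Website</a>'
# Hypothetical markup after the page's JavaScript has run in a browser.
rendered_html = (
    '<a class="dOGcA Ci Wc _S C fhGHT" '
    'href="http://www.strandkanten.nu" target="_blank">Website</a>'
)

print(find_links(raw_html, "dOGcA"))       # []
print(find_links(rendered_html, "dOGcA"))  # ['http://www.strandkanten.nu']
```

The selector is the same in both calls; only the input differs. That is the situation in the question: the selector was fine, but the HTML Scrapy received did not contain the attribute yet.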