scrapy google-news

How do I grab the headline titles from the Google News webpage with Scrapy?


I saved an offline file of https://news.google.com/search?q=amazon&hl=en-US&gl=US&ceid=US%3Aen

I'm having trouble working out how to extract the titles of the listed articles.

import scrapy

class newsSpider(scrapy.Spider):
    name = "news"
    start_urls = ['file:///127.0.0.1/home/toni/Desktop/crawldeez/googlenewsoffline.html/'
                  ]

    def parse(self, response):
        for xrnccd in response.css('a.MQsxIb.xTewfe.R7GTQ.keNKEd.j7vNaf.Cc0Z5d.EjqUne'):
            yield {
                'ipQwMb.ekueJc.RD0gLb': xrnccd.css('h3.ipQwMb.ekueJc.RD0gLb::ipQwMb.ekueJc.RD0gLb').get(),
            }

Solution

  • The problem seems to lie in the fact that the page content is rendered dynamically with JavaScript, so it can't be extracted from the HTML with CSS or XPath selectors. However, the titles are present in the response body, so you can extract them with regular expressions. Here's a Scrapy shell session that shows how:

    $ scrapy shell "https://news.google.com/search?q=amazon&hl=en-US&gl=US&ceid=US%3Aen"
    ...
    >>> import re
    >>> from pprint import pprint
    >>>
    >>> titles = re.findall(r'<h3 class="[^"]+?"><a[^>]+?>(.+?)</a>', response.text)
    >>> pprint(titles)
    ['Amazon will no longer sell Chinese goods in China',
     'YouTube is finally coming back to Amazon’s Fire TV devices',
     'Amazon Plans to Use Digital Media to Expand Its Advertising Business',
     'Amazon flooded with fake reviews; Learn how to spot them',
     'How To Win in Today&#39;s Amazon World',
     'Amazon Day: How to schedule Amazon deliveries',
     'Bezos Disputes Amazon’s Market Power. But His Merchants Feel the Pinch',
     '20 Best Action Movies to Stream on Amazon Prime',
     ...]
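
The shell session above can be folded into a small helper, sketched here under a few assumptions. It reuses the same regular expression and also decodes HTML entities, since the raw matches still contain sequences such as `&#39;` (visible in the output above). The name `extract_titles` and the sample markup, including its class names and links, are made up for illustration; the real page markup may differ or may have changed since.

```python
import html
import re

# Same pattern as in the shell session: an <h3> wrapping an <a> whose text is the title.
TITLE_RE = re.compile(r'<h3 class="[^"]+?"><a[^>]+?>(.+?)</a>')


def extract_titles(page_source):
    """Return the titles matched in the raw page source, with HTML entities decoded."""
    return [html.unescape(raw) for raw in TITLE_RE.findall(page_source)]


# Made-up snippet imitating the markup the regex expects (not real Google News output).
sample = (
    '<h3 class="ipQwMb ekueJc RD0gLb"><a href="./articles/1" class="DY5T1d">'
    'How To Win in Today&#39;s Amazon World</a></h3>'
    '<h3 class="ipQwMb ekueJc RD0gLb"><a href="./articles/2" class="DY5T1d">'
    'Amazon Day: How to schedule Amazon deliveries</a></h3>'
)

titles = extract_titles(sample)
print(titles)
```

Inside a spider, the same helper could be called from `parse` as `for title in extract_titles(response.text): yield {'title': title}`. Keep in mind that a regex tied to the page's markup is fragile and will stop matching if the markup changes.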