python web-scraping google-news

Web scraping articles from Google News


I am trying to scrape Google News with the gnews package. However, I don't know how to retrieve older articles, for example articles from 2010.

from gnews import GNews
from newspaper import Article
import pandas as pd
import datetime

google_news = GNews(language='es', country='Argentina', period = '7d')
argentina_news = google_news.get_news('protesta clarin')
print(len(argentina_news))

This code works perfectly for recent articles, but I need older ones. I saw https://github.com/ranahaani/GNews#todo, which shows something like the following:

google_news = GNews(language='es', country='Argentina', period='7d', start_date='01-01-2015', end_date='01-01-2016', max_results=10, exclude_websites=['yahoo.com', 'cnn.com'],
                    proxy=proxy)

But when I try start_date I get:

TypeError: __init__() got an unexpected keyword argument 'start_date'

Can anyone help me get articles for specific dates? Thank you very much, guys!


Solution

  • The example code is incorrect for gnews==0.2.7, which is the latest version you can install from PyPI via pip. The documentation describes the unreleased mainline code, which you can get directly from the project's git source.

    I confirmed this by inspecting the GNews.__init__ method, which has no keyword arguments for start_date or end_date:

    In [1]: import gnews
    
    In [2]: gnews.GNews.__init__??
    Signature:
    gnews.GNews.__init__(
        self,
        language='en',
        country='US',
        max_results=100,
        period=None,
        exclude_websites=None,
        proxy=None,
    )
    Docstring: Initialize self.  See help(type(self)) for accurate signature.
    Source:
        def __init__(self, language="en", country="US", max_results=100, period=None, exclude_websites=None, proxy=None):
            self.countries = tuple(AVAILABLE_COUNTRIES),
            self.languages = tuple(AVAILABLE_LANGUAGES),
            self._max_results = max_results
            self._language = language
            self._country = country
            self._period = period
            self._exclude_websites = exclude_websites if exclude_websites and isinstance(exclude_websites, list) else []
            self._proxy = {'http': proxy, 'https': proxy} if proxy else None
    File:      ~/src/news-test/.venv/lib/python3.10/site-packages/gnews/gnews.py
    Type:      function
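
The same check can be done without IPython. The following is a minimal sketch using the standard library's inspect module; `accepts_keyword` is a hypothetical helper name introduced here for illustration, not part of gnews.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` declares a parameter called `name`."""
    return name in inspect.signature(func).parameters


# Only run the check if gnews is actually installed.
try:
    from gnews import GNews
except ImportError:
    GNews = None

if GNews is not None:
    for keyword in ("start_date", "end_date"):
        supported = accepts_keyword(GNews.__init__, keyword)
        print(f"{keyword} supported: {supported}")
```

If both lines print False, the installed version predates the start_date/end_date support.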
    

    The start_date and end_date functionality was added only recently, so to use it you will need to install the module from their git source.

    # use whatever you use to uninstall any pre-existing gnews module
    pip uninstall gnews
    
    # install from the project's git main branch
    pip install git+https://github.com/ranahaani/GNews.git
    

    Now you can use the start/end functionality:

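    import datetime
    
    # import the class itself; "import gnews" alone leaves GNews undefined
    from gnews import GNews
    
    # dates are passed as datetime.date objects
    start = datetime.date(2015, 1, 15)
    end = datetime.date(2015, 1, 16)
    
    google_news = GNews(language='es', country='Argentina', start_date=start, end_date=end)
    rsp = google_news.get_news("protesta")
    print(rsp)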
    

    I get this as a result:

    [{'title': 'Latin Roots: The Protest Music Of South America - NPR',
      'description': 'Latin Roots: The Protest Music Of South America  NPR',
      'published date': 'Thu, 15 Jan 2015 08:00:00 GMT',
      'url': 'https://www.npr.org/sections/world-cafe/2015/01/15/377491862/latin-roots-the-protest-music-of-south-america',
      'publisher': {'href': 'https://www.npr.org', 'title': 'NPR'}}]
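
Since max_results caps how many articles a single query returns, one way to cover a longer period (such as a whole year) might be to split it into shorter windows and query each in turn. The sketch below is only an illustration under that assumption: `date_windows`, `fetch_range` and `make_client` are hypothetical names, and `make_client` is assumed to return an object with a `get_news(query)` method, such as a GNews instance built from the git version.

```python
import datetime


def date_windows(start, end, days=30):
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    if days < 1:
        raise ValueError("days must be at least 1")
    step = datetime.timedelta(days=days)
    current = start
    while current < end:
        window_end = min(current + step, end)
        yield current, window_end
        current = window_end


def fetch_range(make_client, query, start, end, days=30):
    """Query each window and collect articles, skipping duplicate URLs."""
    seen_urls = set()
    articles = []
    for window_start, window_end in date_windows(start, end, days):
        # make_client is a hypothetical factory supplied by the caller
        client = make_client(window_start, window_end)
        for item in client.get_news(query) or []:
            url = item.get("url")
            if url in seen_urls:
                continue
            seen_urls.add(url)
            articles.append(item)
    return articles
```

With the git version installed, the factory could be something like `lambda s, e: GNews(language='es', country='Argentina', start_date=s, end_date=e)`, and the returned list of dicts can be passed to `pd.DataFrame` if a table is wanted.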
    

    Also note: