
Python Newspaper with the Web Archive (Wayback Machine)


I'm trying to use the Python library newspaper with the Wayback Machine, which stores archived snapshots of old versions of websites. In theory, old news articles could be queried and downloaded from these archives.

For instance, the following code queries the CNBC archive for a specific capture date.

import newspaper
url = 'http://web.archive.org/web/20161201123529/http://www.cnbc.com/'
paper = newspaper.build(url, memoize_articles=False)

Although the archived page itself contains links to actual news articles from 2016-12-01, the newspaper module does not seem to pick them up. Instead, you get URLs such as:

https://blog.archive.org/2016/10/23/defining-web-pages-web-sites-and-web-captures/

which are not actual articles from this archived version of CNBC. However, newspaper works great with today's version of CNBC.

I suppose that it gets confused by the format of the URL, which contains two http:// prefixes (one for web.archive.org and one for the archived site). Does anyone have any suggestions on how to extract articles from the Wayback Machine archives?


Solution

  • This was an interesting problem, which I will add to my Newspaper Usage Overview document available on GitHub.

    I attempted to use newspaper.build, but I couldn't get it to work correctly, so I used newspaper's Source class directly.

    from time import sleep
    from random import randint
    from newspaper import Config
    from newspaper import Source
    
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    
    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10
    
    wayback_cnbc = Source(url='https://web.archive.org/web/20180301012621/https://www.cnbc.com/', config=config,
                      memoize_articles=False, language='en', number_threads=20, thread_timeout_seconds=2)
    
    wayback_cnbc.build()
    for article_extract in wayback_cnbc.articles:
        article_extract.download()
        article_extract.parse()

        print(article_extract.publish_date)
        print(article_extract.title)
        print(article_extract.url)
        print('')

        # Pause for a random 1-3 seconds between requests; this helped
        # with some timeout issues that were happening when querying.
        sleep(randint(1, 3))
    

    The example above outputs this:

    None
    Media
    https://web.archive.org/web/20180301012621/https://www.cnbc.com/media/
        
    None
    CNBC Video
    https://web.archive.org/web/20180301012621/https://www.cnbc.com/video/
    
    2017-11-08 00:00:00
    CNBC Healthy Returns
    https://web.archive.org/web/20180301012621/https://www.cnbc.com/2017/11/08/healthy-returns.html
    
    2018-02-28 00:00:00
    Markets in Asia decline as dollar steadies; Nikkei falls 307 points 
    https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/asia-markets-stocks-dollar-and-china-caixin-pmi-in-focus.html
    
    2018-02-28 00:00:00
    S&P 500 rises, but on track to snap longest monthly win streak since 1959
    https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/us-stocks-interest-rates-fed-markets.html
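
The first two entries in the output above (Media, CNBC Video) appear to be section pages rather than articles, and the question's blog.archive.org link is not a CNBC capture at all. As a rough sketch, the URLs could be filtered before downloading. The helper names (split_wayback_url, is_archived_article, WaybackCapture) and both patterns are hypothetical and not part of newspaper. The sketch assumes the 14-digit timestamp form of the capture URLs shown here, and that CNBC article paths start with /YYYY/MM/DD/, which holds for the sample output but is not guaranteed in general.

```python
import re
from typing import NamedTuple, Optional

# Assumed capture URL layout, taken from the URLs in this answer:
#   http(s)://web.archive.org/web/<14-digit timestamp>/<original URL>
WAYBACK_RE = re.compile(
    r'^https?://web\.archive\.org/web/(?P<timestamp>\d{14})/(?P<original>https?://.+)$'
)

# Assumption: CNBC article paths begin with /YYYY/MM/DD/ (true for the sample output).
DATED_PATH_RE = re.compile(r'^https?://(?:www\.)?cnbc\.com/\d{4}/\d{2}/\d{2}/')


class WaybackCapture(NamedTuple):
    """Hypothetical container for the two parts of a capture URL."""
    timestamp: str
    original_url: str


def split_wayback_url(url: str) -> Optional[WaybackCapture]:
    """Split a capture URL into timestamp and original URL, or return None."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    return WaybackCapture(match.group('timestamp'), match.group('original'))


def is_archived_article(url: str) -> bool:
    """True if the URL is a capture whose original URL looks like a dated CNBC article."""
    capture = split_wayback_url(url)
    return capture is not None and DATED_PATH_RE.match(capture.original_url) is not None
```

With these helpers, the loop above could iterate over [a for a in wayback_cnbc.articles if is_archived_article(a.url)] instead of every entry in wayback_cnbc.articles, which would skip the section pages and any links that point back to archive.org itself.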
         
    

    Hopefully, this answer helps with your use case of querying the Wayback Machine for articles. If you have any questions, please let me know.