Tags: python, python-3.x, web-scraping, python-newspaper

How to access cached articles in newspaper3k


Newspaper is a fantastic library for scraping web data, but I am a little confused about article caching. It caches articles to speed up operations, but how do I access those cached articles?

I have something like this. When I run it twice with the same set of articles, I get None back the second time. How do I access the previously cached articles for processing?

from newspaper import Article

newspaper_articles = [Article(url) for url in links]
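
For completeness, the usual next steps with those Article objects would be something like this (download() and parse() are the standard newspaper3k calls; this is just a sketch of the typical workflow):

for article in newspaper_articles:
    article.download()   # fetch the article HTML
    article.parse()      # extract title, text, authors, etc.
    print(article.title)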


Solution

  • Looking at https://github.com/codelucas/newspaper/issues/481, it seems the caching method 'cache_disk' in https://github.com/codelucas/newspaper/blob/master/newspaper/utils.py may have a bug: it does cache results to disk (look for a folder named '.newspaper_scraper'; see the second sketch below for a quick way to locate it), but it never loads them back afterwards.

    A workaround is to set memoize_articles=False when building your newspaper, either directly or via the Config class (a fuller Config example follows below).

    import newspaper

    newspaper.build(url, memoize_articles=False)
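
    If you prefer the Config route, a minimal sketch looks like this (the source URL is just a placeholder; Config and memoize_articles are part of newspaper's public API):

    import newspaper

    config = newspaper.Config()
    config.memoize_articles = False   # don't skip article URLs seen on a previous build

    paper = newspaper.build('https://example.com', config=config)

    for article in paper.articles:
        article.download()
        article.parse()
        print(article.title)

    If you also want to see what the memoization actually wrote to disk, something like the sketch below will list the cache files. The location is an assumption based on newspaper's default settings, which put '.newspaper_scraper' under the system temp directory; adjust the path if your install differs.

    import os
    import tempfile

    # Assumed default cache location: <system temp dir>/.newspaper_scraper
    cache_dir = os.path.join(tempfile.gettempdir(), '.newspaper_scraper')

    if os.path.isdir(cache_dir):
        for root, _, files in os.walk(cache_dir):
            for name in files:
                print(os.path.join(root, name))
    else:
        print('No newspaper cache found at', cache_dir)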