Newspaper is a fantastic library for scraping web data, but I am a little confused about article caching. It caches articles to speed up operations, but how do I access those cached articles?

I have something like this:

newspaper_articles = [Article(url) for url in links]

When I run this twice with the same set of articles, the second run returns None. How do I access the previously cached articles for processing?
Looking at https://github.com/codelucas/newspaper/issues/481, it seems the caching function cache_disk in https://github.com/codelucas/newspaper/blob/master/newspaper/utils.py may have a bug: it does cache results to disk (look for a folder named '.newspaper_scraper'), but it never loads them back afterwards.
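If you want to confirm that the on-disk cache exists on your machine, a small stdlib-only sketch like the following can help. Note the assumption here: based on the linked utils.py, I believe the cache folder lives under the system temp directory and is named '.newspaper_scraper', but the exact location may differ across versions, so the path is parameterized.

```python
import os
import tempfile

def list_cached_entries(cache_dir=None):
    """List files in newspaper's on-disk cache folder, if it exists.

    The default path (system temp dir + '.newspaper_scraper') is an
    assumption based on newspaper/utils.py; pass cache_dir explicitly
    if your installation caches somewhere else.
    """
    if cache_dir is None:
        cache_dir = os.path.join(tempfile.gettempdir(), '.newspaper_scraper')
    if not os.path.isdir(cache_dir):
        # No cache folder found at this location.
        return []
    return sorted(os.listdir(cache_dir))
```

Running list_cached_entries() after a newspaper.build call should show the cached files; an empty list means no cache folder was found at the assumed path.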
A workaround is to disable memoization entirely by passing memoize_articles=False when building your newspaper, or by using the Config class:
newspaper.build(url, memoize_articles=False)
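If you prefer the Config route, a minimal sketch might look like this. This is a configuration fragment rather than a runnable test: it needs the newspaper3k package installed and a live network connection, and the URL below is a placeholder.

```python
import newspaper
from newspaper import Config

# Disable memoization so that repeat builds re-fetch every article
# instead of silently skipping ones seen on a previous run.
config = Config()
config.memoize_articles = False

# Placeholder URL; substitute the site you are actually scraping.
paper = newspaper.build('https://example-news-site.com', config=config)
```

With memoization off you trade speed for completeness: every build re-downloads the full article list, which avoids the None results on the second run.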