Dear Stack Overflow community!
This is a follow-up to a previous question I posted here.
I would like to extract newspaper URLs with the newspaper library from MULTIPLE sources into one SINGLE list. This worked well for one source, but as soon as I add a second source link, only the URLs of the second one are extracted.
import feedparser as fp
import newspaper
from newspaper import Article

website = {"cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"}, "cnbc": {"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"}}

for source, value in website.items():
    if 'rss' in value:
        d = fp.parse(value['rss'])
        # if there is an RSS value for a company, it will be extracted into d
        article_list = []
        for entry in d.entries:
            if hasattr(entry, 'published'):
                article = {}
                article['link'] = entry.link
                article_list.append(article['link'])
                print(article['link'])
The output is as follows; only the links from the second source are appended:
['https://www.cnbc.com/2019/10/23/why-china-isnt-cutting-lending-rates-like-the-rest-of-the-world.html', 'https://www.cnbc.com/2019/10/22/stocks-making-the-biggest-moves-after-hours-snap-texas-instruments-chipotle-and-more.html' , ...]
I would like all the URLs from both sources to be extracted into the list. Does anyone know a solution to this problem? Thank you very much in advance!!
article_list is being overwritten in your first for loop. Each time you iterate over a new source, article_list is set to a new empty list, which discards everything collected from the previous source. That is why you end up with links from only one source, the last one.
You should initialize article_list once, before the loop, and not reassign it inside the loop.
import feedparser as fp
import newspaper
from newspaper import Article

website = {"cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"}, "cnbc": {"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"}}

article_list = []  # INIT ONCE
for source, value in website.items():
    if 'rss' in value:
        d = fp.parse(value['rss'])
        # if there is an RSS value for a company, it will be extracted into d
        # article_list = [] THIS IS WHERE IT WAS BEING OVERWRITTEN
        for entry in d.entries:
            if hasattr(entry, 'published'):
                article = {}
                article['link'] = entry.link
                article_list.append(article['link'])
                print(article['link'])
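
If it helps, the same idea can be wrapped in a small function so the list cannot be reset by accident. This is only a sketch: collect_links, fake_parse and demo_sources are hypothetical names that are not in the original code, and fake_parse is a stand-in for fp.parse so the example runs without network access.

```python
from types import SimpleNamespace


def collect_links(sources, parse):
    """Gather entry links from every source with an 'rss' key into one list."""
    links = []  # created once, before the loop, so it survives every iteration
    for name, info in sources.items():
        if 'rss' not in info:
            continue  # skip sources without a feed
        feed = parse(info['rss'])
        # extend() adds this feed's links to the shared list instead of replacing it
        links.extend(
            entry.link for entry in feed.entries if hasattr(entry, 'published')
        )
    return links


def fake_parse(url):
    """Hypothetical stand-in for fp.parse: returns two made-up entries per feed."""
    entries = [
        SimpleNamespace(link=f"{url}/story-{i}", published="today") for i in (1, 2)
    ]
    return SimpleNamespace(entries=entries)


# Made-up demo data; the third source has no 'rss' key and is skipped.
demo_sources = {
    "a": {"rss": "feed-a"},
    "b": {"rss": "feed-b"},
    "c": {"link": "no-feed"},
}

all_links = collect_links(demo_sources, fake_parse)
print(all_links)
```

With the real feeds, the call would presumably be collect_links(website, fp.parse), and the returned list would contain the links from every source rather than only the last one.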