I am a Python novice working on an exercise to practice importing data in Python. Eventually I want to analyze data from different podcasts (information on the podcasts themselves and on every episode) by putting the data into a coherent dataframe and working on it with NLP.
So far I have managed to read a list of RSS feeds and get the information on every single episode of the RSS feed (a post).
But I am having trouble finding an integrated workflow in Python to gather both the podcast-level and the episode-level information.
Code: This is what I have got so far:
import feedparser
import pandas as pd

rss_feeds = ['http://feeds.feedburner.com/TEDTalks_audio',
             'https://joelhooks.com/rss.xml',
             'https://www.sciencemag.org/rss/podcast.xml',
             ]
# number of feeds is reduced for testing
posts = []
feed = []
for url in rss_feeds:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])
Output: The dataframe contains 652 non-null entries in each of the three columns (as intended), basically every post made in every podcast. The title column refers to the title of the episode, not to the title of the podcast (which in this example is 'TED Talks Daily').
| | title | link | summary |
|---|---|---|---|
| 0 | 3 questions to ask yourself about everything y... | https://www.ted.com/talks/stacey_abrams_3_ques... | How you respond to setbacks is what defines yo... |
| 1 | What your sleep patterns say about your relati... | https://www.ted.com/talks/tedx_shorts_what_you... | Wendy Troxel looks at the cultural expectation... |
| 2 | How we can actually pay people enough -- with ... | https://www.ted.com/talks/ted_business_how_we_... | Capitalism urgently needs an upgrade, says Pay... |
I am struggling to find a way to include the title of the podcast in this dataframe, too. I always get an error when selecting parts of the whole feed information, e.g. ['feed']['title'].
Thanks for any hints on this!
Source: I adapted what I have so far from this source: Get Feeds from FeedParser and Import to Pandas DataFrame
The feed title can be accessed in this case with feed.feed.title:
# ...
for url in rss_feeds:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((feed.feed.title, post.title, post.link, post.summary))
df = pd.DataFrame(posts, columns=['feed_title', 'title', 'link', 'summary'])
df
Output:
feed_title title link summary
0 TED Talks Daily 3 ways compa... https://www.... When we expe...
1 TED Talks Daily How we could... https://www.... Concrete is ...
2 TED Talks Daily 3 questions ... https://www.... How you resp...
3 TED Talks Daily What your sl... https://www.... Wendy Troxel...
4 TED Talks Daily How we can a... https://www.... Capitalism u...
.. ... ... ... ...
649 Science Maga... Science Podc... https://traf... Fear-enhance...
650 Science Maga... Science Podc... https://traf... Discussing t...
651 Science Maga... Science Podc... https://traf... Talking kids...
652 Science Maga... Science Podc... https://traf... The minimum ...
653 Science Maga... Science Podc... https://traf... The origin o...
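One caveat with the loop above: attribute access such as post.summary raises an AttributeError when a feed does not provide that element. The objects feedparser returns are dict-like, so .get() can be used instead to fall back to None. Below is a minimal sketch of that idea, not a definitive implementation. The helper names episode_rows and build_dataframe and the sample dictionary are made up for illustration; the sample only stands in for what feedparser.parse(url) would return.

```python
def episode_rows(parsed):
    """Build one row per episode from a parsed feed.

    `parsed` is assumed to be dict-like with 'feed' and 'entries' keys,
    as returned by feedparser.parse(). Missing fields become None.
    """
    feed_title = parsed.get("feed", {}).get("title")
    rows = []
    for post in parsed.get("entries", []):
        rows.append({
            "feed_title": feed_title,
            "title": post.get("title"),
            "link": post.get("link"),
            "summary": post.get("summary"),
        })
    return rows


def build_dataframe(urls):
    """Hypothetical helper: parse every URL and collect all rows in one DataFrame."""
    # Imported here so episode_rows can be tried without these packages installed.
    import feedparser
    import pandas as pd

    rows = []
    for url in urls:
        rows.extend(episode_rows(feedparser.parse(url)))
    return pd.DataFrame(rows, columns=["feed_title", "title", "link", "summary"])


# Hand-built stand-in for a parsed feed (made-up data, no network access needed).
sample = {
    "feed": {"title": "Example Podcast"},
    "entries": [
        {"title": "Episode 1", "link": "https://example.com/1", "summary": "First."},
        {"title": "Episode 2", "link": "https://example.com/2"},  # no summary
    ],
}
rows = episode_rows(sample)
```

With the real feeds, build_dataframe(rss_feeds) should give the same four columns as above, with None in any cell where a feed omits that element.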