When I do
import newspaper
cnn_paper = newspaper.build(news_source_url, memoize_articles=False)
for article in cnn_paper.articles:
    print(article.url)
exit()
I get a list of URLs for articles that I can download from news_source_url (e.g., 'http://cnn.com') using the newspaper3k package. Is there any way to get the timestamps for the various articles?
For CNN specifically, the dates seem to be encoded in the URLs of many of the articles, but I want to get the article timestamps for any news source. I would also like to get both the date and the time, if possible.
You can pull the publication dates for the articles using Newspaper with the code below. I reformatted the date output to show only the date, because the values all came back with 00:00:00 as the time.
import newspaper
cnn_paper = newspaper.build('http://cnn.com', memoize_articles=False)
for item in cnn_paper.articles:
    article = newspaper.Article(item.url)
    article.download()
    article.parse()
    if article.url and article.publish_date is not None:
        print(article.url)
        # publish_date is already a datetime, so format it directly;
        # a str()/strptime() round trip fails on timezone-aware values
        publish_date = article.publish_date.strftime('%Y-%m-%d')
        print(publish_date)
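To keep the time whenever a source does provide one, the formatting step could be made conditional. This is only a minimal sketch: format_publish_date is a hypothetical helper name, and it assumes publish_date is a datetime whose time is exactly midnight when no real time was found.

```python
from datetime import datetime, time


def format_publish_date(value: datetime) -> str:
    """Format a publication datetime, dropping the time part if it is midnight.

    Hypothetical helper: it treats an exact 00:00:00 time as "no time
    available", which is an assumption and not a guarantee of the library.
    """
    if value.time() == time(0, 0):
        return value.strftime('%Y-%m-%d')
    return value.strftime('%Y-%m-%d %H:%M:%S')


if __name__ == '__main__':
    print(format_publish_date(datetime(2019, 3, 15)))
    print(format_publish_date(datetime(2019, 3, 15, 14, 30, 5)))
```

In the loop above it would be called as format_publish_date(article.publish_date) in place of the strftime call.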
If you need the articles' exact publication dates with timestamps, you have to obtain them from the pages at the articles' URLs. After looking into the Newspaper source code, I found a meta tag extractor, which exposes each page's meta tags as article.meta_data.
import newspaper
cnn_paper = newspaper.build('http://cnn.com', memoize_articles=False)
for item in cnn_paper.articles:
    article = newspaper.Article(item.url)
    article.download()
    article.parse()
    if article.url and article.publish_date is not None:
        # meta_data maps meta tag names to their content values
        article_published_date = article.meta_data.get('pubdate')
        if article_published_date:
            print(article_published_date)
        else:
            print('no published date provided')
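The 'pubdate' key is what worked for CNN here; other news sources may use different meta tags. As a rough sketch of how the lookup could be generalized, the helper below tries several candidate keys and parses the first ISO 8601 value it finds. The key list, the nested layout for colon-separated names such as article:published_time, and the function names are all assumptions for illustration, not something confirmed by the Newspaper documentation.

```python
from collections.abc import Mapping
from datetime import datetime
from typing import Any, Optional, Tuple

# Hypothetical candidate paths into meta_data; which tags a site actually
# uses, and how nested names are stored, are assumptions to verify per source.
CANDIDATE_KEYS: Tuple[Tuple[str, ...], ...] = (
    ('pubdate',),
    ('article', 'published_time'),
    ('og', 'published_time'),
    ('date',),
)


def lookup(meta: Mapping, path: Tuple[str, ...]) -> Optional[str]:
    """Walk a (possibly nested) mapping and return a string value or None."""
    node: Any = meta
    for part in path:
        if not isinstance(node, Mapping) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def parse_timestamp(raw: str) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    text = raw.strip()
    if text.endswith('Z'):
        text = text[:-1] + '+00:00'
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


def published_timestamp(meta: Mapping) -> Optional[datetime]:
    """Return the first candidate meta value that parses, else None."""
    for path in CANDIDATE_KEYS:
        raw = lookup(meta, path)
        if raw:
            parsed = parse_timestamp(raw)
            if parsed is not None:
                return parsed
    return None
```

Inside the loop above, published_timestamp(article.meta_data) would then return a datetime with both date and time when one of the candidate tags is present, and None otherwise.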