I am using the newspaper Python library to extract some data from news stories. The problem is that I am not getting this data for some URLs, even though these URLs work fine and all return 200. I am doing this for a very large dataset; the URL below is one of those for which the date extraction did not work. The code works for some links and not for others from the same domain, so I know the problem isn't something like my IP being blocked for too many requests. I also tried it on just this one URL and got the same result (no data).
import os
import sys
from newspaper import Article

def split(link):
    try:
        story = Article(link)
        story.download()
        story.parse()
        date_time = str(story.publish_date)
        split_date = date_time.split()
        date = split_date[0]
        if date != "None":
            print(date)
    except:
        print("This URL did not return a published date. Try a different URL.")
        print(link)

if __name__ == "__main__":
    link = "https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one"
    split(link)
I am getting this output:
This URL did not return a published date. Try a different URL.
https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one
Try adding some error handling to your code to catch URLs that return a 404, such as this one: https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one
from newspaper import Config
from newspaper import Article
from newspaper.article import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one'
try:
    article = Article(base_url, config=config)
    article.download()
    article.parse()
except ArticleException as error:
    # Print the underlying reason (here, the 404) instead of hiding it
    print(error)
Output:
Article `download()` failed with 404 Client Error: Not Found for url: https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one on URL https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one
Newspaper3k has multiple ways to extract the publish dates from articles. Take a look at this document that I wrote on how to use Newspaper3k.
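As one illustration of a fallback that does not depend on the page content, the URLs in this question carry the date in their path (`/2020/12/29/`). The sketch below is only an assumption-laden example, not part of Newspaper3k's API: `date_from_url` is a hypothetical helper name, and the pattern assumes a `/YYYY/M/D/` path segment, which will not hold for every site.

```python
import re
from datetime import date
from typing import Optional

# Assumed URL layout: .../YYYY/M/D/... (as in the Al Jazeera links above).
_URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_url(url: str) -> Optional[date]:
    """Return the date embedded in the URL path, or None if there is none."""
    match = _URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return None


if __name__ == "__main__":
    link = "https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one"
    print(date_from_url(link))  # prints 2020-12-29
```

A date taken from the URL only says what the path claims; it does not confirm that the page still exists, so it is best treated as a last resort after the download and parse steps have been checked.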
Here is an example for this valid URL, https://www.aljazeera.com/program/featured-documentaries/2022/3/31/lords-of-water, that extracts data elements from the page's meta tags.
from newspaper import Config
from newspaper import Article
from newspaper.article import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.aljazeera.com/program/featured-documentaries/2022/3/31/lords-of-water'
try:
    article = Article(base_url, config=config)
    article.download()
    article.parse()

    # meta_data is a dictionary built from the page's meta tags
    article_meta_data = article.meta_data

    article_title = [value for (key, value) in article_meta_data.items() if key == 'pageTitle']
    print(article_title)

    article_published_date = str([value for (key, value) in article_meta_data.items() if key == 'publishedDate'])
    print(article_published_date)

    article_description = [value for (key, value) in article_meta_data.items() if key == 'description']
    print(article_description)
except ArticleException as error:
    print(error)
Output:
['Lords of Water']
['2022-03-31T06:08:59']
['Is water the new oil? We expose the financialisation of water.']
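The published date above comes back as a string. If a real date object is needed (as in the question, which only wants the date part), a minimal sketch could look like the following. It assumes the value is an ISO 8601 string stored under the `publishedDate` key, as on this particular page; `parse_published_date` is a hypothetical helper name, and the plain dictionary stands in for `article.meta_data`. Other sites may use different meta tag names or date formats.

```python
from datetime import datetime
from typing import Mapping, Optional


def parse_published_date(
    meta_data: Mapping[str, object], key: str = "publishedDate"
) -> Optional[datetime]:
    """Return the meta tag value as a datetime, or None if missing or malformed."""
    raw = meta_data.get(key)
    if not isinstance(raw, str):
        # Key absent, or the value is not a plain string.
        return None
    try:
        return datetime.fromisoformat(raw)
    except ValueError:
        # The string is present but is not in ISO 8601 form.
        return None


if __name__ == "__main__":
    # Stand-in for article.meta_data, using the value shown in the output above.
    sample_meta_data = {"pageTitle": "Lords of Water", "publishedDate": "2022-03-31T06:08:59"}
    published = parse_published_date(sample_meta_data)
    if published is not None:
        print(published.date())  # prints 2022-03-31
```

Returning `None` instead of raising keeps the "no date found" case separate from download failures, which are still reported through `ArticleException`.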