pythonpython-newspapernewspaper3k

Python library newspaper is not returning the published date


I am using newspaper python library to extract some data from new stories. The problem is that I am not getting this data for some URLs. These URLs work fine. They all return 200. I am doing this for a very large dataset but this is one of the URLs for which the date extraction did not work. The code works for some links and not others (from the same domain) so I know that the problem isn't something like my IP being blocked for too many requests. I tried it on just one URL and getting the same result (no data).

import os
import sys
from newspaper import Article   

def split(link):
        try:
            story = Article(link)
            story.download()
            story.parse()
            date_time = str(story.publish_date)
            split_date = date_time.split()  
            date = split_date[0]
            if date != "None":
                print(date)
        except:
            print("This URL did not return a published date. Try a different URL.")
            print(link)

if __name__ == "__main__":
        link = "https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one"
        split(link)

I am getting this output:

This URL did not return a published date. Try a different URL. https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one


Solution

  • Try adding some error handling to your code to catch URLs that return a 404, such as this one: https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one

    from newspaper import Config
    from newspaper import Article
    from newspaper.article import ArticleException
    
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    
    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10
    
    base_url = 'https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one'
    try:
        article = Article(base_url, config=config)
        article.download()
        article.parse()
    except ArticleException as error:
        print(error)
    

    Output:

    Article `download()` failed with 404 Client Error: Not Found for url: https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one on URL https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one
    

    Newspaper3k has multiple ways to extract the publish dates from articles. Take a look at this document that I wrote on how to use Newspaper3k.

    Here is an example for this valid URL https://www.aljazeera.com/program/featured-documentaries/2022/3/31/lords-of-water that extracts data elements from the page's meta tags.

    from newspaper import Config
    from newspaper import Article
    from newspaper.article import ArticleException
    
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    
    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10
    
    base_url = 'https://www.aljazeera.com/program/featured-documentaries/2022/3/31/lords-of-water'
    try:
        article = Article(base_url, config=config)
        article.download()
        article.parse()
        article_meta_data = article.meta_data
    
        article_title = [value for (key, value) in article_meta_data.items() if key == 'pageTitle']
        print(article_title)
    
        article_published_date = str([value for (key, value) in article_meta_data.items() if key == 'publishedDate'])
        print(article_published_date)
    
        article_description = [value for (key, value) in article_meta_data.items() if key == 'description']
        print(article_description)
    
    except ArticleException as error:
        print(error)
    

    Output

    ['Lords of Water']
    ['2022-03-31T06:08:59']
    ['Is water the new oil? We expose the financialisation of water.']