python, url, text, local, newspaper3k

newspaper3k: do its functions work on stored data? I have already downloaded the contents of the URL


The newspaper3k library on GitHub is quite useful, and it currently works with Python 3. I wonder whether it can handle downloaded/stored text. We have already downloaded the contents of the URL and do not want to repeat the download every time we use certain functions (keywords, summary, date, ...). For example, we would like to query the stored data for the date and authors. The obvious execution flow is: 1. download, 2. parse, 3. extract various info (text, title, images, ...). It looks like a chain reaction to me that always starts with a download:

>>> from newspaper import Article
>>> url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
>>> article = Article(url)
>>> article.download()
>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'
>>> article.parse()
>>> article.authors
['Leigh Ann Caldwell', 'John Honway']
>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)
>>> article.text
"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."
>>> article.top_image
'http://someCDN.com/blah/blah/blah/file.png'

Solution

  • After your comment about using Ctrl+S to save pages from news sources, I removed my first answer and added this one.

    I downloaded the content of this article -- https://www.latimes.com/business/story/2021-02-08/tesla-invests-in-bitcoin -- to my file system.
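
If the saving step should be scripted rather than done by hand in the browser, one possible approach is a small on-disk cache keyed by URL. The sketch below uses only the standard library; the helper names (`cache_path`, `save_html`, `load_html`), the hashed file-name layout, and the `example.com` URL are all assumptions made for illustration and are not part of newspaper3k.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_path(url: str, cache_dir: Path) -> Path:
    """Map a URL to a stable file name inside cache_dir (hypothetical layout)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def save_html(url: str, html: str, cache_dir: Path) -> Path:
    """Write the page source to disk once and return the file path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(url, cache_dir)
    path.write_text(html, encoding="utf-8")
    return path


def load_html(url: str, cache_dir: Path):
    """Return the stored page source, or None if the URL was never saved."""
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_text(encoding="utf-8")
    return None


if __name__ == "__main__":
    # Demo with a throwaway directory and placeholder HTML (not a real article).
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp)
        url = "https://example.com/story"
        save_html(url, "<html><title>demo</title></html>", cache)
        print(load_html(url, cache))
```

With newspaper3k, the string to store would presumably be `article.html` after a single `article.download()`, and the string returned by `load_html` could later be handed back through the `input_html` argument.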

    The example below shows how I can query this article from my local file system.

    from newspaper import Article

    # Read the saved page from disk instead of fetching it over the network.
    with open("Elon Musk's Bitcoin embrace is a bit of a head-scratcher - Los Angeles Times.htm", 'r') as f:
        html = f.read()

    # Note the empty URL string: the HTML is supplied through input_html.
    article = Article('', language='en')
    article.download(input_html=html)
    article.parse()
    article_meta_data = article.meta_data

    # Publication time from the page's "article" meta tags.
    article_published_date = article_meta_data['article'].get('published_time', '')
    print(article_published_date)
    # output
    # 2021-02-08T15:52:56.252

    print(article.title)
    # output
    # Elon Musk’s Bitcoin embrace is a bit of a head-scratcher

    # The author value appears to be a URL-like path; keep only its last segment.
    article_author = article_meta_data['article'].get('author', '')
    print(article_author.rsplit('/', 1)[-1])
    # output
    # russ-mitchell

    article_summary = article_meta_data['og'].get('description', '')
    print(article_summary)
    # output
    # The Tesla CEO says climate change is a threat to humanity, but his endorsement is driving demand for a cryptocurrency with a massive carbon footprint.
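
The `article` and `og` keys used above look like they come from `<meta property="group:key" ...>` tags in the saved page. To make that mapping concrete, here is a rough standard-library sketch that groups such tags into nested dictionaries. It is not newspaper3k's implementation; the `MetaTagCollector` class, the `collect_meta` helper, and the sample HTML are hypothetical, with only the timestamp borrowed from the output above.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta property="group:key" content="..."> tags into nested dicts."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Some pages use name= instead of property=, so accept either.
        name = attrs.get("property") or attrs.get("name")
        content = attrs.get("content")
        if not name or content is None or ":" not in name:
            return
        group, key = name.split(":", 1)
        self.meta.setdefault(group, {})[key] = content


def collect_meta(html):
    """Parse an HTML string and return the grouped meta tags."""
    parser = MetaTagCollector()
    parser.feed(html)
    return parser.meta


# Placeholder page, not the real article source.
sample = """
<html><head>
<meta property="article:published_time" content="2021-02-08T15:52:56.252">
<meta property="og:description" content="Placeholder description.">
</head><body></body></html>
"""

meta = collect_meta(sample)
print(meta["article"]["published_time"])
print(meta["og"]["description"])
```

If a saved page lacks these tags, lookups such as `meta["article"]` would raise `KeyError` in this sketch, so `.get()` with a default is the safer way to query it.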