python python-newspaper newspaper3k

Can't find publish_date with newspaper3k


I want to scrape an article from a website with the newspaper library (newspaper3k). However, it doesn't find the publish date for the article, which is in div.source-date in the page source, or the authors (or rather the source), which is in div.delfi-source-name. How can I scrape the date and the author/source?

Website/URL example: https://www.delfi.lt/en/politics/foreign-ministry-tsikhanouskayas-consultation-needed-for-treating-belarusians-in-lithuania.d?id=91531501

My code:

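import pandas as pd
from newspaper import Article

# The example article linked above
url = "https://www.delfi.lt/en/politics/foreign-ministry-tsikhanouskayas-consultation-needed-for-treating-belarusians-in-lithuania.d?id=91531501"

article = Article(url)
article.download()
article.parse()
# nlp() is only needed for article.keywords and article.summary
article.nlp()

# One row per article; publish_date and authors come back empty for this site
df = pd.DataFrame([{'Title': article.title, 'Author': article.authors, 'Text': article.text,
                    'published_date': article.publish_date, 'Source': article.source_url}])

df.to_excel('Delfi-1.xlsx')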

Any suggestions?


Solution

  • The date appears in two places in the page source. The visible one (Wednesday, October 19, 2022) sits in a div tag that newspaper3k cannot extract on its own; you would need BeautifulSoup for that.

    The second date is in the page's meta tags, which newspaper3k exposes through article.meta_data and which can be read with some additional code.

    from newspaper import Config
    from newspaper import Article
    from newspaper.article import ArticleException
    
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    
    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10
    
    base_url = 'https://www.delfi.lt/en/politics/foreign-ministry-tsikhanouskayas-consultation-needed-for-treating-belarusians-in-lithuania.d?id=91531501'
    try:
        article = Article(base_url, config=config)
        article.download()
        article.parse()
        article_meta_data = article.meta_data
    
        article_title = [value['title'] for (key, value) in article_meta_data.items() if key == 'og']
        print(article_title)
    
        article_published_date = [value['recs']['publishtime'] for key, value in article_meta_data.items()
                                  if key == 'cXenseParse']
        print(article_published_date)
    
        article_description = [value['description'] for (key, value) in article_meta_data.items() if key == 'og']
        print(article_description)
    
    except ArticleException as error:
        print(error)
    

    Output

    ["Foreign Ministry: Tsikhanouskaya's consultation needed for treating Belarusians in Lithuania"]
    ['2022-10-19T11:38:07+0300']
    ["As Belorus, a Belarus-owned sanatorium in Lithuania's southern resort of Druskininkai, complaints over the fact that Lithuania fails to issue visas to Belarusian citizens, forcing the sanatorium to fire a quarter of its staff, Lithuania's Foreign Ministry suggests coordinating the list of arrivals with Belarusian opposition leaders Sviatlana Tsikhanouskaya's office in Vilnius."]
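The publish time above comes back as a plain string, so it may be worth converting it to a datetime before writing it to a DataFrame. The following is only a sketch: it assumes the meta data has the nested shape used in the code above (cXenseParse, then recs, then publishtime) and that the timestamp keeps the format shown in the output. The helper names first_meta_value and parse_publish_time and the sample_meta dict are made up for illustration; in practice you would pass article.meta_data.

```python
from datetime import datetime


def first_meta_value(meta_data, *keys):
    """Walk nested dict keys and return None if any level is missing."""
    node = meta_data
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def parse_publish_time(raw):
    """Parse a timestamp shaped like '2022-10-19T11:38:07+0300'."""
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S%z")


# Hypothetical stand-in for article.meta_data, mirroring the keys used above
sample_meta = {
    "cXenseParse": {"recs": {"publishtime": "2022-10-19T11:38:07+0300"}}
}

raw = first_meta_value(sample_meta, "cXenseParse", "recs", "publishtime")
published = parse_publish_time(raw) if raw else None
print(published.isoformat())  # 2022-10-19T11:38:07+03:00
```

Returning None for a missing key avoids a KeyError on pages that do not carry the same meta tags.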
    
    

    P.S. Newspaper3k has multiple ways to extract the publish dates from articles. Take a look at this document that I wrote on how to use Newspaper3k.
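
For the visible date and the source name, which the question locates in div.source-date and div.delfi-source-name, a BeautifulSoup pass over the downloaded HTML could look roughly like this. It is a sketch under assumptions: the class names are taken from the question and not verified against the live page, the function name extract_date_and_source is made up, and sample_html is an invented fragment (including the placeholder source name) so the snippet runs without a network request. With newspaper3k you would pass article.html after article.download().

```python
from bs4 import BeautifulSoup


def extract_date_and_source(html):
    """Return (date_text, source_text); either is None if its div is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Class names as given in the question; adjust if the site markup differs
    date_tag = soup.select_one("div.source-date")
    source_tag = soup.select_one("div.delfi-source-name")
    date_text = date_tag.get_text(strip=True) if date_tag else None
    source_text = source_tag.get_text(strip=True) if source_tag else None
    return date_text, source_text


# Invented fragment standing in for article.html
sample_html = """
<div class="delfi-source-name">Example Source</div>
<div class="source-date">Wednesday, October 19, 2022</div>
"""

date_text, source_text = extract_date_and_source(sample_html)
print(date_text)
print(source_text)
```

The two values can then go into the 'published_date' and 'Author' columns of the DataFrame from the question in place of article.publish_date and article.authors.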