python parsing web author newspaper3k

newspaper3k: am I doing something wrong? The authors attribute did not pick up the author of a news article


This is about the authors attribute of the newspaper3k library. I have a list of URLs for news articles, and ">>> article.authors" sometimes does not pick up the authors. An example is here: authors missing


Solution

  • Newspaper3k parses each page's HTML (Beautiful Soup and lxml are among its dependencies) to extract items, such as author names, from a news website. The tags and attributes that Newspaper3k queries are predefined in the Newspaper3k source code. Newspaper3k makes a best effort to extract content from these standard locations on a news site.

    However, not all news sources are structured the same way, so Newspaper3k will miss certain content when an item (e.g., the author) sits in a different place in the HTML structure.

    For instance, Newspaper3k looks for the author name in elements whose attributes (name, rel, itemprop, class, or id) carry one of these values:

    VALS = ['author', 'byline', 'dc.creator', 'byl']

    The dc.creator value is located in the META tag section of a news source's HTML. If your news source uses a different author tag, such as article.author, which the LA Times uses, then Newspaper3k will not pick it up automatically and you must query that tag yourself through the article's meta data, like this:

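    # meta_data is a nested dict built from the page's META tags
    article_meta_data = article.meta_data
    # collect the value stored under the 'author' key of the 'article' group
    article_author = {value for (key, value) in article_meta_data['article'].items() if key == 'author'}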
    

    I cover many of these harvesting issues in my Newspaper3k overview document, which I have shared on my GitHub page.
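
As a minimal sketch of the META-tag lookup described above, the helper below returns article.authors when it is populated and otherwise falls back to the nested article.meta_data dictionary. The function name authors_with_fallback, the list of candidate keys, and the sample data are illustrative assumptions, not part of Newspaper3k; the right keys depend on the news source, so inspect article.meta_data for your site first.

```python
def authors_with_fallback(authors, meta_data):
    """Return `authors` if non-empty, otherwise search the META-tag data.

    `authors` is the list from article.authors and `meta_data` is the
    nested dict from article.meta_data.
    """
    if authors:
        return list(authors)

    # Assumed candidate locations; adjust these for your news source.
    candidate_paths = [
        ("article", "author"),  # the case discussed above
        ("author",),
        ("dc.creator",),
        ("byl",),
    ]

    found = []
    for path in candidate_paths:
        node = meta_data
        for key in path:
            # Walk one level down; stop if the path does not exist.
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, str) and node.strip() and node.strip() not in found:
            found.append(node.strip())
    return found


# Hypothetical META data, shaped like article.meta_data, for demonstration.
sample_meta = {"article": {"author": "Jane Doe", "section": "news"}}
print(authors_with_fallback([], sample_meta))  # ['Jane Doe']

# With a real article (requires network access and a URL of your own):
# from newspaper import Article
# article = Article(url)
# article.download()
# article.parse()
# print(authors_with_fallback(article.authors, article.meta_data))
```

Keeping the fallback in a plain function that takes the two values, rather than the Article object itself, makes it easy to try against saved meta data without downloading anything.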