web-scraping, python-newspaper

Newspaper3k, User Agents and Scraping


I'm making text files consisting of the author, date of publication, and main text of news articles. I already have code to write the files, but first I need Newspaper3k to extract the relevant information from the articles. Since user-agent specification has been an issue before, I also specify the user agent. Here's my code so you can follow along; I'm running Python 3.9.0.

    import time, os, random, nltk, newspaper
    from newspaper import Article, Config

    # Spoof a regular browser user agent, since that has been an issue before
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

    config = Config()
    config.browser_user_agent = user_agent

    url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'

    article = Article(url, config=config)
    article.download()
    # article.html  # uncomment to inspect the raw HTML
    article.parse()
    article.nlp()

    # Inspect what Newspaper extracted
    print(article.authors)
    print(article.publish_date)
    print(article.text)

To see why this case is particularly puzzling, substitute the link in the code above with this one and re-run the code. With that link, the code runs correctly and returns the author, date, and text; with the link in the code above, it doesn't. What am I overlooking here?
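As a diagnostic, the snippet below (a minimal sketch, not part of my actual script) checks whether the download returns HTML at all, or whether it's the parsing step that comes up empty:

    from newspaper import Article, Config

    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'

    config = Config()
    config.browser_user_agent = user_agent

    article = Article(url, config=config)
    article.download()

    # A length of 0 (or close to it) means the request itself is being blocked;
    # a large value combined with empty fields below points at the parsing step.
    print(len(article.html or ''))

    article.parse()
    print(repr(article.authors))
    print(repr(article.publish_date))
    print(len(article.text))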


Solution

  • Apparently, Newspaper needs the article's language to be specified. For some reason the code still doesn't extract the author, but this is good enough for me. Here's the code in case anyone else finds it useful (a variant that keeps the user-agent setting and writes the output to a file is sketched after the code).

    
    #
    # Imports
    #

    import time, os, random, nltk, newspaper
    from newspaper import Article
    from googletrans import Translator

    translator = Translator()

    # The link we're interested in
    url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'

    #
    # Download and parse the article, telling Newspaper it's in Spanish
    #

    article = Article(url, language='es')
    article.download()
    article.parse()
    article.nlp()

    #
    # Convert everything to a string so it can go into the list;
    # googletrans translates the Spanish summary into English
    #

    authors = str(article.authors)
    date = str(article.publish_date)
    maintext = translator.translate(article.summary).text

    # Collect the pieces to print (or write to a file)
    elements = [authors + "\n", date + "\n", maintext + "\n", url]

    for x in elements:
        print(x)
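Below is a minimal sketch that keeps the user-agent configuration from the question while also setting the language, and writes the author, date, and main text to a text file as described at the top. The file name article.txt is just a placeholder, and I haven't tested this exact combination against this site; if your newspaper3k version doesn't accept config.language, passing language='es' to Article alongside config should behave the same.

    from newspaper import Article, Config

    # Keep the user agent from the question and add the language setting
    config = Config()
    config.browser_user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    config.language = 'es'

    url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'

    article = Article(url, config=config)
    article.download()
    article.parse()

    # Write the extracted fields to a text file ("article.txt" is a placeholder name)
    with open('article.txt', 'w', encoding='utf-8') as f:
        f.write(str(article.authors) + "\n")
        f.write(str(article.publish_date) + "\n")
        f.write(article.text + "\n")
        f.write(url + "\n")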