I'm building text files that contain the author, publication date and main text of news articles. I already have code to write the files, but first I need Newspaper3k to extract that information from each article. Because the user agent has caused problems before, I also set it explicitly. Here's my code so you can follow along. I'm using Python 3.9.0.
import time, os, random, nltk, newspaper
from newspaper import Article, Config
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent
url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'
article = Article(url, config=config)
article.download()
# article.html  # uncomment to inspect the raw HTML that was downloaded
article.parse()
article.nlp()
article.authors
article.publish_date
article.text
To see why this case is puzzling, substitute the link above with this one and re-run the code. With that link, the code runs correctly and returns the author, date and text. With the link in the code above, it doesn't. What am I overlooking here?
Apparently, Newspaper needs to be told which language the article is in (here, language='es' for Spanish). The code below still doesn't extract the author for some reason, but that's enough for my purposes. Here's the code, in case anyone else would benefit from it.
#
# Imports our modules
#
import time, os, random, nltk, newspaper
from newspaper import Article
from googletrans import Translator
translator = Translator()
# The link we're interested in
url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'
#
# Extracts the meta-data
#
article = Article(url, language='es')
article.download()
article.parse()
article.nlp()
#
# Makes these into strings so they'll get into the list
#
authors = str(article.authors)
date = str(article.publish_date)
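# Translates the summary produced by article.nlp() (googletrans targets English by default)
maintext = translator.translate(article.summary).text
# Makes the list we'll print, one element per line
elements = [authors + "\n", date + "\n", maintext + "\n", url]
for x in elements:
    print(x)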
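Since the end goal is a text file per article, here is a minimal sketch of how the extracted fields could be written out instead of printed. The helper names `format_record` and `save_record`, the "Unknown author" / "Unknown date" placeholders and the date format are my own assumptions, not part of Newspaper3k. The sketch relies only on `article.authors` being a list and `article.publish_date` being a `datetime` or `None`.

```python
from datetime import datetime
from pathlib import Path


def format_record(authors, publish_date, text, url):
    """Build the file contents: authors, date, main text, URL (one per line)."""
    # Hypothetical helper. Newspaper3k returns authors as a list, which may be empty.
    author_line = ", ".join(authors) if authors else "Unknown author"
    # publish_date is a datetime when a date was found; anything else is treated as missing.
    if isinstance(publish_date, datetime):
        date_line = publish_date.strftime("%Y-%m-%d")
    else:
        date_line = "Unknown date"
    return "\n".join([author_line, date_line, text.strip(), url]) + "\n"


def save_record(path, authors, publish_date, text, url):
    """Write one article record to a UTF-8 text file and return its path."""
    # Hypothetical helper. UTF-8 keeps accented Spanish characters intact.
    path = Path(path)
    path.write_text(format_record(authors, publish_date, text, url), encoding="utf-8")
    return path
```

With the `article` object from the code above, a call such as `save_record("article.txt", article.authors, article.publish_date, article.text, url)` should produce one file per article. Joining the author list avoids the `['...']` list formatting that `str(article.authors)` gives.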