There is a Python library, [Newspaper3k][1], that makes it easier to extract the content of web pages.
For title retrieval:

```python
from newspaper import Article

a = Article(url)
a.download()   # fetch the page
a.parse()      # extract title, text, authors, etc.
print(a.title)
```
For content retrieval:

```python
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
print(article.text)
```
I want to get information about web pages (sometimes the title, sometimes the actual content). Here is my code to fetch the content/text of web pages:
```python
from newspaper import Article
import nltk

nltk.download('punkt')

fil = open("laborURLsml2.csv", "r")
# read every line in fil
Lines = fil.readlines()
for line in Lines:
    print(line)
    article = Article(line)
    article.download()
    article.html
    article.parse()
    print("[[[[[")
    print(article.text)
    print("]]]]]")
```
The content of the "laborURLsml2.csv" file is: [laborURLsml2.csv][2]
My issue is: my code reads the first URL and prints its content, but fails from the second URL onwards.
I noted that some of the URLs in your CSV file have trailing whitespace, which was causing the issue. I also noted that one of your links is no longer available, and that others are the same story distributed to subsidiaries for publication.
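A quick way to confirm the whitespace problem is to print the raw lines with `repr()`, which makes hidden characters such as the trailing newline visible (the file name is taken from your question):

```python
with open('laborURLsml2.csv', 'r') as file:
    for line in file:
        # repr() exposes the trailing '\n' that breaks Article()
        print(repr(line))
```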
The code below handles the first two issues, but it doesn't handle the data redundancy issue; see the sketch after it for one way to deal with that.
```python
from newspaper import Config
from newspaper import Article
from newspaper import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

with open('laborURLsml2.csv', 'r') as file:
    csv_file = file.readlines()
    for url in csv_file:
        try:
            article = Article(url.strip(), config=config)
            article.download()
            article.parse()
            print(article.title)
            # the replace is used to remove newlines
            article_text = article.text.replace('\n', ' ')
            print(article_text)
        except ArticleException:
            print('***FAILED TO DOWNLOAD***', article.url)
```
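For the redundancy issue, one possible approach (a sketch of my own, not something newspaper provides) is to track the titles you have already processed and skip syndicated copies that repeat a headline:

```python
from newspaper import Article, ArticleException, Config

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

seen_titles = set()

with open('laborURLsml2.csv', 'r') as file:
    for url in file:
        try:
            article = Article(url.strip(), config=config)
            article.download()
            article.parse()
            # syndicated copies of a story usually share the headline,
            # so skip any title we have already seen
            if article.title in seen_titles:
                continue
            seen_titles.add(article.title)
            print(article.title)
            print(article.text.replace('\n', ' '))
        except ArticleException:
            print('***FAILED TO DOWNLOAD***', article.url)
```

This assumes the duplicate stories share an identical title; if the subsidiaries rewrite the headline slightly, you would need fuzzier matching (for example, comparing the opening of `article.text`).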
You might also find useful the newspaper3k overview document that I created and shared on my GitHub page.