Tags: python, python-3.x, web-scraping, newspaper3k

Get web article information (content, title, ...) from multiple web pages - Python code


There is a Python library, Newspaper3k, which makes it easy to get the content of web pages: [newspaper][1]

For title retrieval:

from newspaper import Article
a = Article(url)
a.download()
a.parse()
print(a.title)

For content retrieval:

from newspaper import Article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
print(article.text)

I want to get info about web pages (sometimes the title, sometimes the actual content). Here is my code to fetch the content/text of the web pages:

from newspaper import Article
import nltk
nltk.download('punkt')
fil=open("laborURLsml2.csv","r") 
# read every line in fil
Lines = fil.readlines()
for line in Lines:
    print(line)
    article = Article(line)
    article.download()
    article.html
    article.parse()
    print("[[[[[")
    print(article.text)
    print("]]]]]")

The content of the "laborURLsml2.csv" file is: [laborURLsml2.csv][2]

My issue is: my code reads the first URL and prints its content, but fails from the second URL onward.


Solution

  • I noticed that some of the URLs in your CSV file have trailing whitespace, which was causing the failure. I also noticed that one of your links is unavailable, and that several others are the same story syndicated to subsidiary outlets for publication.

    The code below handles the first two issues, but it doesn't handle the data redundancy issue.

    from newspaper import Config
    from newspaper import Article
    from newspaper import ArticleException
    
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    
    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10
    
    with open('laborURLsml2.csv', 'r') as file:
        csv_file = file.readlines()
        for url in csv_file:
            try:
                article = Article(url.strip(), config=config)
                article.download()
                article.parse()
                print(article.title)
                # the replace is used to remove newlines
                article_text = article.text.replace('\n', ' ')
                print(article_text)
            except ArticleException:
                print('***FAILED TO DOWNLOAD***', article.url)
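
    To see why the trailing whitespace broke the original loop: `readlines()` keeps the newline character on every line, so each raw line handed to `Article()` ended with `\n` and was not a valid URL. A quick sketch (using an in-memory file and made-up URLs) illustrates this:

    ```python
    import io

    # readlines() keeps the trailing "\n" on every line, so each raw
    # line is not a valid URL until it is stripped.
    fake_file = io.StringIO("http://example.com/a\nhttp://example.com/b\n")
    lines = fake_file.readlines()
    print(lines)  # ['http://example.com/a\n', 'http://example.com/b\n']

    # .strip() removes the newline, yielding clean URLs:
    clean = [u.strip() for u in lines]
    print(clean)  # ['http://example.com/a', 'http://example.com/b']
    ```

    This is exactly what `url.strip()` does in the loop above.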
    

    You might find the newspaper3k overview document that I created and shared on my GitHub page useful.
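
    For the data redundancy issue (the same story syndicated to several outlets), one simple approach is to skip articles whose normalized title has already been seen. This is only a sketch with made-up titles — the helper name and the normalization rule are my own, and title matching will miss stories republished under different headlines:

    ```python
    def dedupe_by_title(titles):
        """Keep the first occurrence of each title, compared case-insensitively."""
        seen = set()
        unique = []
        for title in titles:
            key = title.strip().lower()
            if key not in seen:
                seen.add(key)
                unique.append(title)
        return unique

    titles = [
        "New year, new laws",
        "New Year, New Laws",   # same story republished by a subsidiary
        "A different story",
    ]
    print(dedupe_by_title(titles))  # ['New year, new laws', 'A different story']
    ```

    In the loop above, you would collect `article.title` after `article.parse()` and skip the article when its normalized title is already in the set.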