python python-3.x web-scraping python-newspaper

ArticleException error when web scraping news articles with Python

I am trying to scrape news articles that contain certain keywords, using Python 3. However, I cannot get all the articles from the newspaper: after some articles have been written to the CSV output file, the script stops with an ArticleException. Could anyone help me with this? Ideally, I would like to fix the problem and download all the relevant articles from the newspaper's website. Failing that, it would also be useful to just skip any URL that raises the error and continue with the next one. Thanks in advance for your help.

This is the code I am using:

import csv
import os
import urllib.request

from bs4 import BeautifulSoup
from newspaper import Article

req_keywords = ['coronavirus', 'covid-19']

newspaper_base_url = 'http://www.thedailystar.net'
category = 'country'

def checkif_kw_exist(list_one, list_two):
    common_kw = set(list_one) & set(list_two)
    if len(common_kw) == 0: return False, common_kw
    else: return True, common_kw

def get_article_info(url):
    a = Article(url)
    a.download()
    a.parse()
    a.nlp()
    success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
    if success:
        return [url, a.publish_date, a.title, a.text]
    else: return False


output_file = "J:/B/output.csv"
if not os.path.exists(output_file):
    open(output_file, 'w').close() 


for index in range(1,50000,1):

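    # NOTE: page_url is never defined in the code as posted; presumably it is
    # built from newspaper_base_url, category and index.
    page_soup = BeautifulSoup(urllib.request.urlopen(page_url).read(), "html.parser")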

    primary_tag = page_soup.find_all("h4", attrs={"class": "pad-bottom-small"})

    for tag in primary_tag:

        url = tag.find("a")
        #print (url)
        url = newspaper_base_url + url.get('href')
        result = get_article_info(url)
        if result is not False:
            # the with block closes the file on exit, so no explicit close is needed
            with open(output_file, 'a', encoding='utf-8') as f:
                writeFile = csv.writer(f)
                writeFile.writerow(result)

This is the error I am getting:

---------------------------------------------------------------------------
ArticleException                          Traceback (most recent call last)
<ipython-input-1-991b432d3bd0> in <module>
     65         #print (url)
     66         url = newspaper_base_url + url.get('href')
---> 67         result = get_article_info(url)
     68         if result is not False:
     69             with open(output_file, 'a', encoding='utf-8') as f:

<ipython-input-1-991b432d3bd0> in get_article_info(url)
     28     a = Article(url)
     29     a.download()
---> 30     a.parse()
     31     a.nlp()
     32     success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())

~\Anaconda3\lib\site-packages\newspaper\article.py in parse(self)
    189 
    190     def parse(self):
--> 191         self.throw_if_not_downloaded_verbose()
    192 
    193         self.doc = self.config.get_parser().fromstring(self.html)

~\Anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
    530         elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
    531             raise ArticleException('Article `download()` failed with %s on URL %s' %
--> 532                   (self.download_exception_msg, self.url))
    533 
    534     def throw_if_not_parsed_verbose(self):

ArticleException: Article `download()` failed with HTTPSConnectionPool(host='www.thedailystar.net', port=443): Read timed out. (read timeout=7) on URL http://www.thedailystar.net/ugc-asks-private-universities-stop-admissions-grades-without-test-for-coronavirus-pandemic-1890151

Solution

The quickest way to 'skip' failures related to the downloaded content is to use a try/except as follows:

def get_article_info(url):
  a = Article(url)
  try:
    a.download()
    a.parse()
    a.nlp()
    success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
    if success:
      return [url, a.publish_date, a.title, a.text]
    else: return False
  except:
    return False

Catching every possible exception with a bare except and ignoring it isn't recommended, and this answer would be downvoted if I didn't suggest that you handle exceptions a little better. You also asked about solving the issue. Unless you read the documentation for the libraries you import, you won't know which exceptions might occur, so printing the details of each exception as you skip it will tell you what went wrong, as with the ArticleException you are getting now. You can then start adding individual except clauses for the exceptions you have already encountered:

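from newspaper.article import ArticleException

def get_article_info(url):
  a = Article(url)
  try:
    a.download()
    a.parse()
    a.nlp()
    success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
    if success:
      return [url, a.publish_date, a.title, a.text]
    else:
      return False
  except ArticleException as ae:
    # raised by parse() when download() failed, e.g. on a read timeout
    print(ae)
    return False
  except Exception as e:
    # anything else: print it so you can see what to handle next
    print(e)
    return False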

The ArticleException tells you that the download timed out: the Daily Star did not finish responding within the time limit (the traceback shows a read timeout of 7 seconds). Maybe the site is very busy :) You could try downloading several times before giving up.
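
Retrying could look something like the sketch below. The retry_call helper and its parameters (attempts, delay, exceptions) are names I made up for illustration; they are not part of newspaper. It is plain Python, so you can test it without hitting the site.

```python
import time

def retry_call(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times.

    Returns the first successful result. If every attempt raises one of
    `exceptions`, the last error is re-raised. (Hypothetical helper, not
    part of the newspaper library.)
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as err:
            last_error = err
            print(f"attempt {attempt}/{attempts} failed: {err}")
            if attempt < attempts:
                # wait a little longer after each failure (linear backoff)
                time.sleep(delay * attempt)
    raise last_error

# Hypothetical use with the code from the question:
#
# def fetch(url):
#     a = Article(url)
#     a.download()
#     a.parse()   # this is where the ArticleException surfaces
#     return a
#
# a = retry_call(lambda: fetch(url), attempts=3, delay=2,
#                exceptions=(ArticleException,))
```

Note that, going by your traceback, the exception is raised in parse() rather than download(), so the function you retry has to do both steps. If all attempts fail the last exception is re-raised, so you would still want the try/except from get_article_info around it to skip that URL. Separately, the 7-second limit may be configurable: I believe newspaper's Config object has a request_timeout attribute, but check the documentation for your installed version.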