I am trying to scrape news articles that match certain keywords, using Python 3. However, I am not able to get all the articles from the newspaper. After some articles have been scraped and written to the CSV output file, I get an ArticleException error. Could anyone help me with this? Ideally, I would like to solve the problem and download all the related articles from the newspaper website. Otherwise, it would also be useful to just skip the URL that raises the error and continue with the next one. Thanks in advance for your help.
This is the code I am using:
import urllib.request
import newspaper
from newspaper import Article
import csv, os
from bs4 import BeautifulSoup

req_keywords = ['coronavirus', 'covid-19']
newspaper_base_url = 'http://www.thedailystar.net'
category = 'country'

def checkif_kw_exist(list_one, list_two):
    common_kw = set(list_one) & set(list_two)
    if len(common_kw) == 0: return False, common_kw
    else: return True, common_kw

def get_article_info(url):
    a = Article(url)
    a.download()
    a.parse()
    a.nlp()
    success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
    if success:
        return [url, a.publish_date, a.title, a.text]
    else: return False

output_file = "J:/B/output.csv"
if not os.path.exists(output_file):
    open(output_file, 'w').close()

for index in range(1,50000,1):
    # NOTE: page_url is not defined in the code as posted; presumably it is
    # built from newspaper_base_url, category and index.
    page_soup = BeautifulSoup( urllib.request.urlopen(page_url).read())
    primary_tag = page_soup.find_all("h4", attrs={"class": "pad-bottom-small"})
    for tag in primary_tag:
        url = tag.find("a")
        #print (url)
        url = newspaper_base_url + url.get('href')
        result = get_article_info(url)
        if result is not False:
            # newline='' stops the csv module writing blank rows on Windows;
            # the with block closes the file, so no explicit close is needed
            with open(output_file, 'a', encoding='utf-8', newline='') as f:
                writeFile = csv.writer(f)
                writeFile.writerow(result)
This is the error I am getting:
---------------------------------------------------------------------------
ArticleException Traceback (most recent call last)
<ipython-input-1-991b432d3bd0> in <module>
65 #print (url)
66 url = newspaper_base_url + url.get('href')
---> 67 result = get_article_info(url)
68 if result is not False:
69 with open(output_file, 'a', encoding='utf-8') as f:
<ipython-input-1-991b432d3bd0> in get_article_info(url)
28 a = Article(url)
29 a.download()
---> 30 a.parse()
31 a.nlp()
32 success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
~\Anaconda3\lib\site-packages\newspaper\article.py in parse(self)
189
190 def parse(self):
--> 191 self.throw_if_not_downloaded_verbose()
192
193 self.doc = self.config.get_parser().fromstring(self.html)
~\Anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
530 elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
531 raise ArticleException('Article `download()` failed with %s on URL %s' %
--> 532 (self.download_exception_msg, self.url))
533
534 def throw_if_not_parsed_verbose(self):
ArticleException: Article `download()` failed with HTTPSConnectionPool(host='www.thedailystar.net', port=443): Read timed out. (read timeout=7) on URL http://www.thedailystar.net/ugc-asks-private-universities-stop-admissions-grades-without-test-for-coronavirus-pandemic-1890151
The quickest way to skip failures related to the downloaded content is to use a try/except, as follows:
def get_article_info(url):
    a = Article(url)
    try:
        a.download()
        a.parse()
        a.nlp()
        success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
        if success:
            return [url, a.publish_date, a.title, a.text]
        else: return False
    except:
        return False
Using a bare except to catch every possible exception and ignore it isn't recommended, and this answer would be downvoted if I didn't suggest that you deal with exceptions a little better. You also asked about solving the issue. Unless you read the documentation for the libraries you import, you won't know which exceptions might occur, so printing each exception as you skip it will show you what is going wrong, like the ArticleException you are getting now. You can then start adding individual except clauses for the ones you have already encountered:
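# ArticleException is defined in newspaper/article.py (see the traceback)
from newspaper.article import ArticleException

def get_article_info(url):
    a = Article(url)
    try:
        a.download()
        a.parse()
        a.nlp()
        success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
        if success:
            return [url, a.publish_date, a.title, a.text]
        else:
            return False
    except ArticleException as ae:
        # download or parse failure reported by newspaper
        print(ae)
        return False
    except Exception as e:
        # anything else: print it so you can see what to handle next
        print(e)
        return False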
The ArticleException you are getting tells you that the download timed out: the response from the Daily Star did not complete within the time limit (the traceback shows a read timeout of 7 seconds). Maybe the site is very busy :) You could try downloading several times before giving up.
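A minimal sketch of what retrying could look like. The helper names download_with_retries and fetch_article, and the attempts, delay and timeout parameters, are made up for illustration and are not part of newspaper. The sketch also assumes that raising request_timeout on newspaper's Config (its default of 7 seconds matches the traceback) gives a slow server more time to respond.

```python
import time


def download_with_retries(fetch, url, attempts=3, delay=2.0):
    """Call fetch(url) up to `attempts` times, pausing between failures.

    `fetch` is any callable that raises on failure. The pause grows with
    each failed attempt (delay, 2 * delay, ...) to give a busy server time.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as err:  # narrow this once you know what to expect
            last_error = err
            print(f"attempt {attempt}/{attempts} failed for {url}: {err}")
            if attempt < attempts:
                time.sleep(delay * attempt)
    # every attempt failed: re-raise the last error so the caller can skip the URL
    raise last_error


def fetch_article(url, timeout=20):
    """Hypothetical fetcher: download and parse one article with a longer timeout."""
    # Imported here so the retry helper above works without newspaper installed.
    from newspaper import Article, Config

    config = Config()
    config.request_timeout = timeout  # seconds; assumed to replace the default of 7
    article = Article(url, config=config)
    article.download()
    article.parse()  # raises ArticleException if the download failed
    return article
```

Inside get_article_info, a call such as a = download_with_retries(fetch_article, url) could stand in for the Article(url), a.download() and a.parse() lines. Keeping that call inside the try block means a URL that fails on every attempt is still skipped rather than stopping the whole run.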