python, web-scraping, nlp, python-newspaper

Handling Article Exceptions in Newspaper


I have a bit of code that uses newspaper to look at various media outlets and download articles from them. This has been working fine for a long time but has recently started acting up. I can see what the problem is, but as I'm new to Python I'm not sure about the best way to address it. Basically (I think) I need to make a change so that the occasional malformed web address doesn't crash the script entirely, and is instead skipped so the script can move on to the other addresses.

The origin of the error is when I attempt to download an article using:

article.download()

Some articles (they change every day obviously) will throw the following error but the script continues to run:

    Traceback (most recent call last):
      File "C:\Anaconda3\lib\encodings\idna.py", line 167, in encode
        raise UnicodeError("label too long")
    UnicodeError: label too long

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "C:\Anaconda3\lib\site-packages\newspaper\mthreading.py", line 38, in run
        func(*args, **kargs)
      File "C:\Anaconda3\lib\site-packages\newspaper\source.py", line 350, in download_articles
        html = network.get_html(url, config=self.config)
      File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 39, in get_html
        return get_html_2XX_only(url, config, response)
      File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 60, in get_html_2XX_only
        url=url, **get_request_kwargs(timeout, useragent))
      File "C:\Anaconda3\lib\site-packages\requests\api.py", line 72, in get
        return request('get', url, params=params, **kwargs)
      File "C:\Anaconda3\lib\site-packages\requests\api.py", line 58, in request
        return session.request(method=method, url=url, **kwargs)
      File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 502, in request
        resp = self.send(prep, **send_kwargs)
      File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 612, in send
        r = adapter.send(request, **kwargs)
      File "C:\Anaconda3\lib\site-packages\requests\adapters.py", line 440, in send
        timeout=timeout
      File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
        chunked=chunked)
      File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 356, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "C:\Anaconda3\lib\http\client.py", line 1107, in request
        self._send_request(method, url, body, headers)
      File "C:\Anaconda3\lib\http\client.py", line 1152, in _send_request
        self.endheaders(body)
      File "C:\Anaconda3\lib\http\client.py", line 1103, in endheaders
        self._send_output(message_body)
      File "C:\Anaconda3\lib\http\client.py", line 934, in _send_output
        self.send(msg)
      File "C:\Anaconda3\lib\http\client.py", line 877, in send
        self.connect()
      File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 166, in connect
        conn = self._new_conn()
      File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 141, in _new_conn
        (self.host, self.port), self.timeout, **extra_kw)
      File "C:\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 60, in create_connection
        for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
      File "C:\Anaconda3\lib\socket.py", line 733, in getaddrinfo
        for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)

The next bit is then supposed to parse each article, run natural language processing on it, and write certain elements to a dataframe, so I have:

for paper in papers:
    for article in paper.articles:
        article.parse()
        print(article.title)
        article.nlp()
        if article.publish_date is None:
            d = datetime.now().date()
        else:
            d = article.publish_date.date()
        stories.loc[i] = [paper.brand, d, datetime.now().date(), article.title,
                          article.summary, article.keywords, article.url]
        i += 1

(This might be a little sloppy too but that's a problem for another day)

This runs fine until it gets to one of the URLs with the download error; parse() then throws an ArticleException and the script crashes:

    C:\Anaconda3\lib\site-packages\PIL\TiffImagePlugin.py:709: UserWarning: Corrupt EXIF data.  Expecting to read 2 bytes but only got 0.
      warnings.warn(str(msg))

    ArticleException                          Traceback (most recent call last)
    <ipython-input-17-2106485c4bbb> in <module>()
          4 for paper in papers:
          5     for article in paper.articles:
    ----> 6         article.parse()
          7         print(article.title)
          8         article.nlp()

    C:\Anaconda3\lib\site-packages\newspaper\article.py in parse(self)
        183 
        184     def parse(self):
    --> 185         self.throw_if_not_downloaded_verbose()
        186 
        187         self.doc = self.config.get_parser().fromstring(self.html)

    C:\Anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
        519         if self.download_state == ArticleDownloadState.NOT_STARTED:
        520             print('You must `download()` an article first!')
    --> 521             raise ArticleException()
        522         elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
        523             print('Article `download()` failed with %s on URL %s' %

    ArticleException: 

So what's the best way to keep this from terminating my script? Should I address it at the download stage, where I'm getting the UnicodeError, or at the parse stage, by telling it to overlook those bad addresses? And how would I go about implementing that correction?
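
For what it's worth, my (possibly wrong) guess from the traceback is that I could check each article's download_state before parsing and just skip anything that didn't download, something like this, though I don't know if that's the right place to handle it:

from newspaper.article import ArticleDownloadState

for paper in papers:
    for article in paper.articles:
        # Only parse articles that actually downloaded (just my guess at a check,
        # using the two failure states I can see in the traceback above)
        if article.download_state in (ArticleDownloadState.NOT_STARTED,
                                      ArticleDownloadState.FAILED_RESPONSE):
            continue
        article.parse()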

Really appreciate any advice.


Solution

  • I had the same issue and, although using a bare except: pass is generally not recommended, the following worked for me:

        try:
            a.parse()
            file.write(a.title + '\n')
        except:
            pass
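
    Applied to the loop from the question, the same pattern can be made a little more targeted by catching newspaper's ArticleException specifically. This is just a sketch, assuming ArticleException is importable from newspaper.article (the traceback shows it being raised there) and that papers, stories and i are set up as in the original script:

        from datetime import datetime
        from newspaper.article import ArticleException

        for paper in papers:
            for article in paper.articles:
                try:
                    article.parse()
                    print(article.title)
                    article.nlp()
                except ArticleException:
                    # parse() raises this when the earlier download() failed
                    # (e.g. the IDNA UnicodeError above); skip this URL and move on
                    continue
                if article.publish_date is None:
                    d = datetime.now().date()
                else:
                    d = article.publish_date.date()
                stories.loc[i] = [paper.brand, d, datetime.now().date(), article.title,
                                  article.summary, article.keywords, article.url]
                i += 1

    Catching the specific exception rather than using a bare except: keeps genuine bugs (typos, bad dataframe writes, and so on) from being silently swallowed, while still letting the occasional bad URL be skipped.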