I'm trying to scrape a number of webpages using newspaper3k,
and my program is throwing 503 exceptions. Can anyone help me identify why this is happening and how to get around it? To be exact, I'm not looking to catch these exceptions but to understand why they occur and prevent them if possible.
from newspaper import Article
dates = list()
titles = list()
urls = ['https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-02',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/fec-mps-hearing-may-21',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-05-06',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/fec-fsr-hearing-may-21',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-03-04',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/fec-2019-20-reserve-bank-annual-review',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-12-02',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-10-28',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-10-22',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-10-19',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-09-14']
for url in urls:
    speech = Article(url)
    speech.download()
    speech.parse()
    dates.append(speech.publish_date)
    titles.append(speech.title)
Here's my Traceback:
---------------------------------------------------------------------------
ArticleException Traceback (most recent call last)
<ipython-input-5-217a6cafe26a> in <module>
20 speech = Article(url)
21 speech.download()
---> 22 speech.parse()
23 dates.append(speech.publish_date)
24 titles.append(speech.title)
/opt/anaconda3/lib/python3.8/site-packages/newspaper/article.py in parse(self)
189
190 def parse(self):
--> 191 self.throw_if_not_downloaded_verbose()
192
193 self.doc = self.config.get_parser().fromstring(self.html)
/opt/anaconda3/lib/python3.8/site-packages/newspaper/article.py in throw_if_not_downloaded_verbose(self)
529 raise ArticleException('You must `download()` an article first!')
530 elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
--> 531 raise ArticleException('Article `download()` failed with %s on URL %s' %
532 (self.download_exception_msg, self.url))
533
ArticleException: Article `download()` failed with 503 Server Error: Service Temporarily Unavailable
for url: https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29
on URL https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29
Here is how you can troubleshoot the 503 Server Error: Service Temporarily Unavailable
error with the Python package requests.
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
}
base_url = 'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29'
req = requests.get(base_url, headers=headers)
print(req.status_code)
# output
503
Why are we getting a 503 Server Error?
Let's look at the content being returned by the server.
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
}
base_url = 'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29'
req = requests.get(base_url, headers=headers)
print(req.text)
# output
truncated...
<title>Website unavailable - Reserve Bank of New Zealand - Te Pūtea Matua</title>
truncated...
<p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
truncated...
<form class="challenge-form" id="challenge-form" action="/research-and-publications/speeches/2021/speech2021-06-29?__cf_chl_jschl_tk__=73ad3f68fb15cc9284b25b7802626dd4ebe102cd-1625840173-0-ATQAZ5g7wCwLU2Q7agCqc1p59qs6ghpsYPVhDNwDN5r7vefk0P1UbjR4AJOUl0kUCZmDi-EVWX8XekL6VkqOgKTd1zqd5QWWlT3f2Dp_aUWQgCAH3bnS4x0wyc8-xGOLm-tcMKCXcTXH-OpiGoUX8u__bk1TIZ0gI_TYMB-oy0nJi7dMYLgJnvJhwhTllDoYUbCzmo2h2idIJPqIjNaAwupvbdpvHnrogPDnFhCe8Cco9-eKlq4w0G563f_OJ3M7YQChBjCoHYlT8baMoOLzP-Kb33rNmlG0uXhzoiIBROsPw9pavOrO1vsbqf31ZArDRuy0y7rsfrhAD7iU113zmypN81tgqgL_F8YTzygRvI_z3Cs2YOMxjB53-jq1pWwqsW_ItTaY7I3vh5lg_12EUzEddcwmuIj1wI2NbnA7EU06QNHYYn_Ye4TKM0gu9k4031hGybszE3nRKCdTXgMSgJbYhTJ6bJYPSb_2IHMUHlYyHksxePJ4C_5-5X8qIdJApSTFBfCLLLAZLrkFnBk7ep4" method="POST" enctype="application/x-www-form-urlencoded">
truncated...
var a = document.getElementById('cf-content');
truncated...
<p>Your access to the Reserve Bank website has been restricted. If you think you should be able to access our website please email <a href="mailto:web@rbnz.govt.nz">web@rbnz.govt.nz</a>.
Looking at the returned text, we can see that the website is asking your browser to complete a challenge form. If you look at the additional data points in the markup (e.g. cf-content, or the __cf_chl_jschl_tk__ token in the form action), you can see that the website is protected by Cloudflare.
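Before concluding anything from a 503 alone, it can help to confirm programmatically that the failure is a Cloudflare challenge rather than a genuine outage. Here is a minimal sketch; the marker strings are taken from the response body above and are an assumption about this particular site, so other Cloudflare deployments may use different markup:

```python
# Marker strings observed in the challenge page returned by this site.
CLOUDFLARE_MARKERS = ('challenge-form', 'cf-content', '__cf_chl_jschl_tk__')

def looks_like_cloudflare_challenge(html: str) -> bool:
    """Return True if the HTML appears to be a Cloudflare JS challenge page."""
    return any(marker in html for marker in CLOUDFLARE_MARKERS)

print(looks_like_cloudflare_challenge('<form id="challenge-form" method="POST">'))  # True
print(looks_like_cloudflare_challenge('<html><body>An actual speech</body></html>'))  # False
```

You could run this check on `req.text` after any 503 to distinguish "the server is overloaded, retry later" from "a bot-protection layer is blocking me, retrying will not help".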
Bypassing this protection is extremely difficult. Here is one of my recent answers on the complexity of bypassing this protection.
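If you still want to try, one commonly suggested workaround is the third-party package cloudscraper (pip install cloudscraper), which attempts to solve Cloudflare's JavaScript challenge. You can then hand the fetched HTML to newspaper through Article.download()'s input_html argument, so newspaper skips its own HTTP request. This is only a sketch under the assumption that cloudscraper can solve this site's challenge, which is not guaranteed, and newer Cloudflare configurations may still block it:

```python
import cloudscraper
from newspaper import Article

url = 'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29'

# create_scraper() returns a requests.Session-like object that tries to
# solve the Cloudflare JS challenge before returning the real page.
scraper = cloudscraper.create_scraper()
resp = scraper.get(url)

# Feed the HTML we already fetched into newspaper instead of letting it
# download the page itself (which is what produced the 503).
speech = Article(url)
speech.download(input_html=resp.text)
speech.parse()
print(speech.title, speech.publish_date)
```

Also note the page's own message: access has been restricted, and the site asks you to email web@rbnz.govt.nz if you think you should have access. For a small, fixed list of official publications like this, asking for access (or an API/feed) may be more reliable than fighting the bot protection.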