I have a URL for downloading a zip file. The file downloads fine when I click the link in a browser.
But the following code does not work:
from datetime import datetime, timedelta
import requests, zipfile, io
from zipfile import BadZipFile
TwoMonthsAgo = datetime.now() - timedelta(60)
zip_file_url = 'https://www.statssa.gov.za/timeseriesdata/Excel/P6420%20Food%20and%20beverages%20('+ datetime.strftime(TwoMonthsAgo, '%Y%m') +').zip'
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()
This returns an error:
BadZipFile: File is not a zip file
When I instead download the zip file to the current working directory and then extract it, I get the same error:
import urllib.request
filename = 'P6420 Food and beverages ('+ datetime.strftime(TwoMonthsAgo, '%Y%m') +').zip'
urllib.request.urlretrieve(zip_file_url, filename)
import zipfile
with zipfile.ZipFile(filename, 'r') as zip_ref:
    zip_ref.extractall()
BadZipFile: File is not a zip file
The downloaded file also fails to extract when I try to open it manually, so it seems to get corrupted during the download. How can I fix this?
You are not downloading a zip file at all.
Printing the content returned from the URL shows that it returns HTML with an embedded script tag rather than a zip file:
from datetime import datetime, timedelta
import requests, zipfile, io
from zipfile import BadZipFile
TwoMonthsAgo = datetime.now() - timedelta(60)
zip_file_url = 'https://www.statssa.gov.za/timeseriesdata/Excel/P6420%20Food%20and%20beverages%20('+ datetime.strftime(TwoMonthsAgo, '%Y%m') +').zip'
r = requests.get(zip_file_url)
print(r.content)
The output is:
b'<html>\r\n<head>\r\n<META NAME="robots" CONTENT="noindex,nofollow">\r\n<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">\r\n</script>\r\n<body>\r\n</body></html>\r\n'
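A quick way to catch this case in code, before ZipFile raises BadZipFile, is to check whether the returned bytes are a zip archive at all. The following is a minimal sketch using only the standard library; the helper name is_zip_payload and the sample HTML body are made up for illustration and are not part of the original code.

```python
import io
import zipfile


def is_zip_payload(data: bytes) -> bool:
    """Return True if the bytes look like a zip archive (hypothetical helper)."""
    return zipfile.is_zipfile(io.BytesIO(data))


# Build a tiny archive in memory to compare against an HTML response.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hello")

# Shortened stand-in for the kind of HTML page shown above.
html_body = b"<html>\r\n<head>\r\n</head><body>\r\n</body></html>\r\n"

print(is_zip_payload(buffer.getvalue()))  # True
print(is_zip_payload(html_body))          # False
```

With the code above, the same check would be is_zip_payload(r.content). If it returns False, printing the first part of r.content usually shows what the server actually sent.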
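The script points at an Incapsula resource, which suggests the site's bot protection is rejecting the default requests User-Agent. Adding a browser-like User-Agent header, as suggested by Mark Setchell, fixes the issue: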
from datetime import datetime, timedelta
import requests, zipfile, io
from zipfile import BadZipFile
TwoMonthsAgo = datetime.now() - timedelta(60)
zip_file_url = 'https://www.statssa.gov.za/timeseriesdata/Excel/P6420%20Food%20and%20beverages%20('+ datetime.strftime(TwoMonthsAgo, '%Y%m') +').zip'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
r = requests.get(zip_file_url, headers=headers)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()
Running that creates this file:
$ ls Excel/
'Food and beverages from 2005.xlsx'
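The second attempt in the question used urllib.request.urlretrieve, which has no parameter for headers. The same idea can probably be applied there by building a urllib.request.Request with a User-Agent and saving the response yourself. This is a sketch only and has not been tested against this server. The names BROWSER_UA, build_request and download_to_file are hypothetical, and the timeout value is an arbitrary assumption.

```python
import shutil
import urllib.request

# Browser-like User-Agent string, copied from the requests example above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Create a Request that sends a browser-like User-Agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download_to_file(url, filename, timeout=30):
    """Stream the response body to a local file and return the filename."""
    request = build_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with open(filename, "wb") as out_file:
            shutil.copyfileobj(response, out_file)
    return filename
```

Calling download_to_file(zip_file_url, filename) would then stand in for the urlretrieve call, after which the zipfile.ZipFile(filename, 'r') block from the question can be used unchanged.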