python python-zipfile

Extracting the contents of a zip file downloaded from a URL in Python


I have a URL for downloading a zip file. Clicking the link in a browser downloads the file without problems.

But the following code does not work:

from datetime import datetime, timedelta
import requests, zipfile, io
from zipfile import BadZipFile

TwoMonthsAgo = datetime.now() - timedelta(60)
zip_file_url = 'https://www.statssa.gov.za/timeseriesdata/Excel/P6420%20Food%20and%20beverages%20('+ datetime.strftime(TwoMonthsAgo, '%Y%m') +').zip'

r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

This returns an error:

BadZipFile: File is not a zip file

When I download the zip file to the current working directory first and then extract it, I get the same error:

import urllib.request 
filename = 'P6420 Food and beverages ('+ datetime.strftime(TwoMonthsAgo, '%Y%m') +').zip'
urllib.request.urlretrieve(zip_file_url, filename)

import zipfile
with zipfile.ZipFile(filename, 'r') as zip_ref:
    zip_ref.extractall()

BadZipFile: File is not a zip file

The downloaded file also fails to extract when I try to open it manually, so it seems to get corrupted during the download. How can I fix this?


Solution

  • You are not downloading a zip file at all.

    Printing the content returned from the URL shows that it returns HTML with an embedded script tag rather than a zip file:

    from datetime import datetime, timedelta
    import requests, zipfile, io
    from zipfile import BadZipFile
    
    TwoMonthsAgo = datetime.now() - timedelta(60)
    zip_file_url = 'https://www.statssa.gov.za/timeseriesdata/Excel/P6420%20Food%20and%20beverages%20('+ datetime.strftime(TwoMonthsAgo, '%Y%m') +').zip'
    
    r = requests.get(zip_file_url)
    
    print(r.content)
    

    The output is:

    b'<html>\r\n<head>\r\n<META NAME="robots" CONTENT="noindex,nofollow">\r\n<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">\r\n</script>\r\n<body>\r\n</body></html>\r\n'
    

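    The `_Incapsula_Resource` script in that response suggests the site serves a bot-protection page to clients that do not identify as a browser. Adding a browser-like User-Agent header, as suggested by Mark Setchell, resolves the issue: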

    from datetime import datetime, timedelta
    import requests, zipfile, io
    
    # Build the URL for the release from roughly two months (60 days) ago
    TwoMonthsAgo = datetime.now() - timedelta(60)
    zip_file_url = 'https://www.statssa.gov.za/timeseriesdata/Excel/P6420%20Food%20and%20beverages%20('+ datetime.strftime(TwoMonthsAgo, '%Y%m') +').zip'
    
    # Identify as a regular browser so the server returns the zip file instead of the HTML page
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
    
    r = requests.get(zip_file_url, headers=headers)
    
    # Open the downloaded bytes as an in-memory zip archive and extract into the cwd
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()
    

    Running that creates this file:

    $ ls Excel/
    'Food and beverages from 2005.xlsx'
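
Because the server can return an HTML page with no obvious sign of failure, it may help to check that the downloaded bytes really are a zip archive before extracting them. The following is a minimal sketch, not part of the original answer: `extract_zip_bytes` and `download_and_extract` are hypothetical helper names, and it uses `urllib.request` from the standard library instead of `requests` so that it runs without third-party packages. The same check should work on `r.content` from `requests`.

```python
import io
import urllib.request
import zipfile

# Assumption: a common browser User-Agent string is enough; this one is
# copied from the answer above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
)


def extract_zip_bytes(payload, dest="."):
    """Extract payload into dest, or raise ValueError if it is not a zip archive."""
    buffer = io.BytesIO(payload)
    if not zipfile.is_zipfile(buffer):
        # Include the first bytes so an HTML challenge page is easy to recognise.
        raise ValueError(
            f"Response is not a zip archive; it starts with {payload[:80]!r}"
        )
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        archive.extractall(dest)
        return archive.namelist()


def download_and_extract(url, dest="."):
    """Hypothetical helper: fetch url with a browser-like User-Agent, then extract it."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        payload = response.read()
    return extract_zip_bytes(payload, dest)
```

Calling `download_and_extract(zip_file_url)` would then either extract the archive or fail with a message showing the start of whatever the server actually sent, which is usually easier to diagnose than a bare `BadZipFile`.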