Tags: python, download, aiohttp, python-aiofiles

How to Download a Large Zip (>4GB) Through aiofiles and aiohttp


What happens? I get a timeout error most of the time, and sometimes "Response payload is not completed". If I download the file through the site's button instead, it takes around an hour with no issue. I would like to do this through code because there are multiple files, each of which takes at least an hour to download and another hour to unzip.

This is my first time attempting an async download, and I am not really sure what I am doing wrong. I set the timeout to 2 hours, and it uses the whole two hours yet only downloads 70% of the zip (even though that is double the time of just dropping the link in a browser!).

I read in another post that increasing the TCPConnector limit could help, but it still times out.

Update: I commented out the logger lines, as I did not want to reproduce how to set up a logger; it's not relevant.

# import logger  # commented out along with the logger lines below
import asyncio
import aiohttp
import aiofiles
import time

async def retrieve_data(base_link, payload, filename):
    # connector=aiohttp.TCPConnector(limit=200)
    async with aiohttp.ClientSession() as session:
        try:
            chunksize = 8 * 1024 * 1024  # 8 MB
            # , timeout=10800
            async with session.get(base_link, params=payload, timeout=7200) as response:
                file_size = int(response.headers['Content-Length'])
                async with aiofiles.open(filename, 'wb') as fd:
                    progress = 0
                    async for chunk in response.content.iter_chunked(chunksize):
                        await asyncio.sleep(0)
                        await fd.write(chunk)
                        progress += len(chunk)
                        percentage = (progress / file_size) * 100
                        # logger.info(f"Download Progress: {percentage:.2f}%")
                # logger.info(f"{time.strftime('%X')} - Downloaded file: <{filename}>")
                return filename
        except asyncio.TimeoutError:
            # logger.info(f"{time.strftime('%X')} - Error: asyncio.TimeoutError")
            return ""
        except Exception as e:
            # logger.info(f"{time.strftime('%X')} - Error: {e}")
            return ""

I have also tried readany() and read(), but I still have the same issues.

while True:
    chunk = await response.content.readany()
    # chunk = await response.content.read(chunksize)
    await asyncio.sleep(0)
    if not chunk:
        break
    await fd.write(chunk)

Solution

  • [response edit]

    Ultimately, it is a server-side issue: a discrepancy between what the headers indicate and what the server actually sends. One way to proceed is to catch the error and keep whatever data was sent, effectively ignoring the Content-Length header. You may receive corrupt contents, because ideally this discrepancy should not exist - but that is not a concern if you are sure the remote server provides valid data when downloaded through other means - here: through the browser.
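
    As a minimal sketch of that approach (the function name is hypothetical, and it assumes the failure surfaces as aiohttp.ClientPayloadError, which is the exception behind "Response payload is not completed"):

    import aiohttp
    import aiofiles

    async def download_ignoring_length(session, url, filename):
        # Hypothetical sketch: write whatever bytes arrive and swallow the
        # payload error caused by the Content-Length mismatch.
        async with session.get(url) as response:
            async with aiofiles.open(filename, 'wb') as fd:
                try:
                    async for chunk in response.content.iter_chunked(8 * 1024 * 1024):
                        await fd.write(chunk)
                except aiohttp.ClientPayloadError:
                    # Server sent fewer bytes than Content-Length promised;
                    # keep what arrived (contents may be truncated/corrupt).
                    pass
        return filename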

    If the issue seems to be "Not enough data to satisfy content length" or the like, try the following, particularly read_until_eof=True, which seems to only be available with the request primitive of aiohttp, as mentioned in the documentation:

    Basic API
    
    While we encourage ClientSession usage we also provide simple coroutines for making HTTP requests.
    
    Basic API is good for performing simple HTTP requests without keepaliving, cookies and complex connection stuff like properly configured SSL certification chaining.
    
    async aiohttp.request(method, url, *, params=None, data=None, json=None, cookies=None, headers=None, skip_auto_headers=None, auth=None, allow_redirects=True, max_redirects=10, compress=False, chunked=None, expect100=False, raise_for_status=None, read_until_eof=True, proxy=None, proxy_auth=None, timeout=sentinel, ssl=True, server_hostname=None, proxy_headers=None, trace_request_ctx=None, read_bufsize=None, auto_decompress=None, max_line_size=None, max_field_size=None, version=aiohttp.HttpVersion11, connector=None)
    

    It should default to True:

    read_until_eof (bool) – Read response until EOF if response does not have Content-Length header. True by default (optional).
    

    But I figured it was maybe worth trying explicitly anyway.
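
    A sketch with the Basic API, spelling the flag out (the function name is mine; read_until_eof=True is already the default per the quote above):

    import aiohttp

    async def fetch_via_basic_api(url, filename):
        # Sketch using aiohttp.request() instead of a ClientSession;
        # read_until_eof=True is the default, shown here for emphasis.
        async with aiohttp.request('GET', url, read_until_eof=True) as response:
            with open(filename, 'wb') as fd:
                async for chunk in response.content.iter_chunked(8 * 1024 * 1024):
                    fd.write(chunk)
        return filename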

    {thanks to sam_bull for pointing out the incompleteness of my initial response and generally being extremely helpful}

    [latest - edit3]

    Browsers implement a ton of optimizations in this context. For downloading a singular large file, asynchronicity does not provide large benefits, and certainly nothing over the browser's implementation as far as optimizing for network and I/O goes. All that aside, such a huge disparity in speeds (1 hour vs. only 70% in 2 hours) isn't really attributable to minor I/O inefficiencies, at least not at this scale. I did initially think the culprit was more than likely on the network side, i.e. server-side load balancing, ISP throttling, etc. But Sam Bull's comment [below] led me to this: another question, including the aiofiles author's response on the matter. aiofiles might be quite inefficient, and for your purposes you would certainly be just fine doing this synchronously. There is also the consideration that your chunk size, being too small, isn't utilizing your network connection as well as a browser would, which makes the download process very inefficient (and that is on top of these inefficiencies compounding the buffer/caching-related slowdowns of aiofiles - at least eliminating any gain).

    I recommend experimenting with the chunk size to make proper use of your network speed, and then trying aio.. - if you do see at least similar speeds to your browser, you might be gaining some speed from asynchronicity. You could also test that against a synchronous download, as sketched below, but the difference should be minimal. (And, as you mentioned, the requests module is probably your best bet in case aiofiles really is causing a lot of the ruckus.)
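
    For the synchronous comparison, a minimal sketch with requests (the function name and the 60-second timeout are my own choices):

    import requests

    def download_sync(url, filename, chunksize=8 * 1024 * 1024):
        # Plain synchronous download; stream=True keeps the multi-GB body
        # off the heap by reading it chunk by chunk.
        with requests.get(url, stream=True, timeout=60) as response:
            response.raise_for_status()
            with open(filename, 'wb') as fd:
                for chunk in response.iter_content(chunk_size=chunksize):
                    fd.write(chunk)
        return filename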

    [older]
    ---

    You aren't setting a timeout for the time between chunks.

    timeout = aiohttp.ClientTimeout(
        total=300.0,          # the overall timeout you were setting
        sock_connect=10.0,    # connection timeout
        sock_read=20.0        # time allowed between chunks
    )
    

    Pass the above to ClientSession instead of passing 7200 to the request.
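
    Wired into the question's code, it might look like this (a sketch; the timeout values are just the ones above):

    import aiohttp

    timeout = aiohttp.ClientTimeout(
        total=300.0,
        sock_connect=10.0,
        sock_read=20.0,
    )

    async def retrieve_data(base_link, payload, filename):
        # The session-wide timeout replaces the per-request timeout=7200
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.get(base_link, params=payload) as response:
                ...  # stream and write chunks as before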

    EDIT:

    But are you sure that the timeout is the problem? Using up the entire two hours while downloading only about 70% is a pretty clear sign that async downloading is, in fact, slower for you. I do think some sources simply aren't well suited to async downloading; you might be better off otherwise, but I'm not sure.

    EDIT 2:

    After some light digging: when downloading a single large file, async might actually end up slower, because it doesn't magically improve your network metrics, and the small overhead of async is likely to make this a worse way to download singular large files. Also, apologies for being hasty in my earlier responses.