python, python-3.x, multithreading, curl, python-requests

Downloading a large file in parts using multiple parallel threads


I have a use case where a large remote file needs to be downloaded in parts by multiple threads. The threads must run simultaneously (in parallel), each grabbing a specific part of the file. The expectation is to combine the parts into a single (original) file once all of them have been successfully downloaded.

Perhaps the requests library could handle the individual range requests, but I am not sure how I would multithread this into a solution that also combines the chunks back together.

from requests import get

url = 'https://url.com/file.iso'
headers = {"Range": "bytes=0-999999"}  # first megabyte
r = get(url, headers=headers)
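
What I can already do is check whether the server even supports range requests, with a HEAD request (a rough sketch below, still against the placeholder URL); it's the parallel downloading and the reassembly that I'm unsure about:

import requests

url = 'https://url.com/file.iso'

# A HEAD request exposes the total size (Content-Length) and whether the
# server accepts byte ranges (Accept-Ranges: bytes).
response = requests.head(url, allow_redirects=True)
supports_ranges = response.headers.get('Accept-Ranges') == 'bytes'
file_size = int(response.headers.get('Content-Length', 0))
print(supports_ranges, file_size)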

I was also thinking of using curl, with Python orchestrating the downloads, but I am not sure that's the right way to go. It just seems too complex and strays away from a vanilla Python solution. Something like this:

curl --range 200000000-399999999 -o file.iso.part2
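
If I went that route, I suppose Python would just spawn one curl process per range with subprocess and wait for them all, roughly like this sketch (the byte ranges and part names here are made up for illustration):

import subprocess

url = 'https://url.com/file.iso'

# One curl process per byte range, each writing its own part file.
ranges = ['0-199999999', '200000000-399999999']
processes = [
    subprocess.Popen(['curl', '--silent', '--range', byte_range,
                      '-o', f'file.iso.part{i + 1}', url])
    for i, byte_range in enumerate(ranges)
]
for process in processes:
    process.wait()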

Can someone explain how you'd go about something like this? Or post a code example of something that works in Python 3? I usually find the Python-related answers quite easily, but the solution to this problem seems to be eluding me.


Solution

  • Here is a version using Python 3 with asyncio. It's just an example and can be improved, but it should give you everything you need.

    import asyncio
    import concurrent.futures
    import functools
    import requests
    import os
    
    
    # WARNING:
    # Here I'm pointing to a publicly available sample video.
    # If you are planning on running this code, make sure the
    # video is still available as it might change location or get deleted.
    # If necessary, replace it with a URL you know is working.
    URL = 'https://download.samplelib.com/mp4/sample-30s.mp4'
    OUTPUT = 'video.mp4'
    
    
    async def get_size(url):
        # Ask the server for the total file size via a HEAD request.
        response = requests.head(url)
        size = int(response.headers['Content-Length'])
        return size
    
    
    def download_range(url, start, end, output):
        # Request a single byte range and stream it into its own part file.
        headers = {'Range': f'bytes={start}-{end}'}
        response = requests.get(url, headers=headers)
    
        with open(output, 'wb') as f:
            for part in response.iter_content(1024):
                f.write(part)
    
    
    async def download(run, loop, url, output, chunk_size=1000000):
        # Split the file into chunk_size ranges and download each part
        # concurrently in the thread pool executor.
        file_size = await get_size(url)
        chunks = range(0, file_size, chunk_size)
    
        tasks = [
            run(
                download_range,
                url,
                start,
                start + chunk_size - 1,
                f'{output}.part{i}',
            )
            for i, start in enumerate(chunks)
        ]
    
        await asyncio.wait(tasks)
    
        # Reassemble the parts in order, then remove the temporary files.
        with open(output, 'wb') as o:
            for i in range(len(chunks)):
                chunk_path = f'{output}.part{i}'
    
                with open(chunk_path, 'rb') as s:
                    o.write(s.read())
    
                os.remove(chunk_path)
    
    
    if __name__ == '__main__':
        executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
        loop = asyncio.new_event_loop()
        run = functools.partial(loop.run_in_executor, executor)
    
        asyncio.set_event_loop(loop)
    
        try:
            loop.run_until_complete(
                download(run, loop, URL, OUTPUT)
            )
        finally:
            loop.close()
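
  • If you would rather avoid asyncio altogether, the same idea works with plain concurrent.futures. Below is a minimal sketch under the assumption that it lives in the same module as the version above, so the download_range helper, URL and OUTPUT are already defined.

    import concurrent.futures
    import os
    import requests


    def download_threaded(url, output, chunk_size=1000000, max_workers=3):
        # Assumes download_range() from the asyncio example above is
        # defined in this module.
        file_size = int(requests.head(url).headers['Content-Length'])
        starts = range(0, file_size, chunk_size)

        # One Range request per chunk, executed by a small thread pool.
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [
                pool.submit(download_range, url, start,
                            start + chunk_size - 1, f'{output}.part{i}')
                for i, start in enumerate(starts)
            ]
            for future in futures:
                future.result()  # propagate any download error

        # Stitch the parts back together in order and clean up.
        with open(output, 'wb') as o:
            for i in range(len(starts)):
                part = f'{output}.part{i}'
                with open(part, 'rb') as s:
                    o.write(s.read())
                os.remove(part)

    Calling download_threaded(URL, OUTPUT) produces the same video.mp4, just without the event loop plumbing.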