python, large-files, minio, large-file-upload

Upload a large file to Minio from Minio objects


I have the following case:

Get some number N of objects from Minio, create a zip archive from them, and upload that archive back to Minio as a single object.

Problem:

  1. I have many objects, some up to 40 GB in size
  2. I can't load all the object bytes into memory - the server has only 4 GB of RAM
  3. The server hard drive is 240 GB

I use miniopy-async to work with Minio.

Does anyone have any ideas?


Solution

  • You’re right to be concerned about memory here. If you try to get_object and just read() it all into memory, you’ll blow past your 4 GB limit very quickly with 40 GB objects. The trick is to stream both the download from Minio and the writing into the zip file, instead of buffering everything.

    A couple of ideas that should fit your case:

    1. Use the streaming API from miniopy-async
      With miniopy-async, get_object gives you a streaming response rather than raw bytes. Instead of calling .read() on it, iterate over the body in chunks and feed those chunks straight into your zip writer.
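      A minimal version of just the download side, streaming one object into a local file chunk by chunk (this assumes the object get_object returns is the underlying aiohttp response, which is what the miniopy-async docs describe; newer releases may also expect an aiohttp ClientSession argument, so adjust to your version):

      import aiofiles
      from miniopy_async import Minio

      client = Minio("localhost:9000", access_key="xxx", secret_key="xxx", secure=False)

      async def download_to_file(bucket, key, path):
          resp = await client.get_object(bucket, key)
          try:
              async with aiofiles.open(path, "wb") as out:
                  # read the body in ~1 MiB chunks instead of loading it all at once
                  async for chunk in resp.content.iter_chunked(1024 * 1024):
                      await out.write(chunk)
          finally:
              resp.close()  # return the connection; the exact cleanup call varies by version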

    2. Zip files without holding them fully in memory
      Python’s built-in zipfile module works with any file-like object, but it is simplest and most compatible with a seekable target. So instead of a BytesIO buffer in RAM, write the archive to a temporary file on disk. Since you mentioned you have ~240 GB of disk, you can safely buffer your zip archive there and then upload it back to Minio.

      Something like:

      import os
      import tempfile
      import zipfile
      from miniopy_async import Minio

      client = Minio("localhost:9000", access_key="xxx", secret_key="xxx", secure=False)

      async def stream_to_zip(bucket, keys, output_bucket, output_key):
          # create an empty temp file on disk to hold the archive
          with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp:
              zip_path = tmp.name

          # write the objects into the zip one by one, chunk by chunk
          with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
              for key in keys:
                  # note: newer miniopy-async releases also expect an aiohttp
                  # ClientSession argument here; adjust to your version
                  resp = await client.get_object(bucket, key)
                  with zf.open(key, "w") as dest:
                      # the response is the underlying aiohttp response, so stream
                      # its body in ~1 MiB chunks instead of reading it all at once
                      async for chunk in resp.content.iter_chunked(1024 * 1024):
                          dest.write(chunk)
                  resp.close()  # give the connection back; close() is synchronous

          # now upload the finished zip file back to Minio as one object
          await client.fput_object(output_bucket, output_key, zip_path)
          os.remove(zip_path)  # free the disk space once the upload is done
      

      This way you never hold the full 40 GB object in RAM — only the chunk you’re currently processing. The zip is streamed to a temp file on disk, and when finished you upload that single file back to Minio.
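      Running it is just a matter of driving the coroutine with asyncio; the bucket and object names here are placeholders:

      import asyncio

      asyncio.run(
          stream_to_zip(
              bucket="source-bucket",
              keys=["big-object-1.bin", "big-object-2.bin"],
              output_bucket="archives",
              output_key="batch-001.zip",
          )
      )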

    3. Alternative: multipart upload
      If you can’t afford to store the full zip on disk either (e.g. if the N objects, at up to 40 GB each, add up to more than 240 GB), then you’d need to stream the archive into a multipart upload to Minio while it is being built. That’s more complex: you need a zip writer that can handle a non-seekable target, either a third-party streaming library like zipstream or the stdlib zipfile pointed at a pipe (on modern Python it falls back to writing data descriptors when the output isn’t seekable). A rough sketch of the pipe approach is below.
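      Something like this, assuming the synchronous minio client for brevity (its put_object accepts length=-1 together with a part_size and then performs a multipart upload of a stream whose total size isn’t known up front). Local files stand in for the streamed downloads here, and all names are placeholders; wiring in the async miniopy-async downloads would need extra plumbing between the event loop and the writer thread:

      import os
      import threading
      import zipfile

      from minio import Minio  # synchronous client, just for this sketch

      sync_client = Minio("localhost:9000", access_key="xxx", secret_key="xxx", secure=False)

      def zip_into_pipe(write_fd, sources):
          # write the archive into the pipe; the pipe isn't seekable, so zipfile
          # falls back to data descriptors (Python 3.6+)
          with os.fdopen(write_fd, "wb") as pipe_out:
              with zipfile.ZipFile(pipe_out, "w", compression=zipfile.ZIP_STORED) as zf:
                  for arcname, path in sources:
                      with open(path, "rb") as src, zf.open(arcname, "w") as dest:
                          while chunk := src.read(1024 * 1024):
                              dest.write(chunk)

      def zip_and_upload_without_temp_file(sources, output_bucket, output_key):
          read_fd, write_fd = os.pipe()
          writer = threading.Thread(target=zip_into_pipe, args=(write_fd, sources))
          writer.start()
          with os.fdopen(read_fd, "rb") as pipe_in:
              # length=-1 plus part_size tells the client to multipart-upload a
              # stream of unknown total size
              sync_client.put_object(
                  output_bucket, output_key, pipe_in,
                  length=-1, part_size=64 * 1024 * 1024,
              )
          writer.join()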

    So the simplest practical approach: use a temp file on disk, write the zip to it chunk by chunk, then push it back to Minio. With your 240 GB disk and objects of up to 40 GB, you should be fine as long as you’re not zipping too many at once; a quick size check up front (sketched below) makes that easy to enforce.
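    For example, something along these lines, reusing the client and stream_to_zip from above (it assumes miniopy-async mirrors minio-py's stat_object, whose result carries a size attribute; the 200 GB budget is just a placeholder that leaves headroom on the 240 GB disk):

    DISK_BUDGET = 200 * 1024**3  # bytes, leave headroom on the 240 GB drive

    async def total_size(bucket, keys):
        # sum the object sizes up front so the finished zip is guaranteed to fit on disk
        sizes = [(await client.stat_object(bucket, key)).size for key in keys]
        return sum(sizes)

    async def zip_batch(bucket, keys, output_bucket, output_key):
        if await total_size(bucket, keys) > DISK_BUDGET:
            raise ValueError("batch too large for the temp disk, split it into smaller batches")
        await stream_to_zip(bucket, keys, output_bucket, output_key)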