azure, azure-blob-storage, serverless, cloud-storage

Inter-container blob transfer is very slow


My ETL pipeline moves blobs (zip files) from one storage container to a temporary storage container by unzipping them in memory and uploading the extracted contents to that second container. The unzipping happens in memory because the pipeline runs serverlessly (Azure Functions).

My blobs in the storage container are structured as follows:

└── blob.zip
    ├── folder1
    │   ├── file1.ext
    │   └── file2.ext
...

After running the following portion of my script, the temporary container contains the unzipped contents of the blob.

import io
import zipfile
# from azure.storage.blob import ContainerClient
# container_client_in / container_client_out are ContainerClient instances
# for the source and destination containers, created elsewhere
blob_client_in = container_client_in.get_blob_client('blob.zip')  # the zipped blob shown above
with io.BytesIO() as b:
    # download the entire zipped blob into an in-memory buffer
    download_stream = blob_client_in.download_blob()
    download_stream.readinto(b)
    # the compression argument only matters when writing; entries are
    # decompressed according to their own method when reading
    with zipfile.ZipFile(b, compression=zipfile.ZIP_LZMA) as z:
        for name in z.namelist():
            # re-upload each archive member under its path inside the zip
            with z.open(name, mode='r') as f:
                container_client_out.get_blob_client(name).upload_blob(f)

This works for blobs containing a small number of items, but when the blob is large (around 3 GB), the transfer takes several hours.

Where might the bottleneck be taking place, and what can be done to address it?


Solution

  • I would suggest using an Azure Data Factory pipeline to copy and unzip files between containers; a sketch of triggering such a pipeline from code follows below.

    If you are using the Consumption plan, Azure Functions are limited to one CPU and 1.5 GB of RAM per instance. An instance of the host is the entire function app, meaning all functions within a function app share resources within an instance and scale at the same time. Buffering a roughly 3 GB zip entirely in memory pushes well past that limit, which is likely a large part of the slowdown.

    Video instructions on YouTube for using Azure Data Factory
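
    If you take this route, the function itself can stay lightweight and simply kick off a Data Factory pipeline run rather than unzipping in-process. Below is a minimal sketch using the azure-mgmt-datafactory SDK; the subscription, resource group, factory, pipeline, and parameter names are all hypothetical placeholders and would need to match a pipeline you have already defined in ADF (for instance, a Copy activity whose binary source dataset uses ZipDeflate compression to unzip while copying).

    # A minimal sketch, assuming an ADF pipeline that copies and unzips between
    # the two containers already exists; all names below are hypothetical.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    subscription_id = "<subscription-id>"
    resource_group = "my-resource-group"        # hypothetical
    factory_name = "my-data-factory"            # hypothetical
    pipeline_name = "unzip-and-copy-pipeline"   # hypothetical

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    # Trigger the pipeline, passing the blob to process as a pipeline parameter
    # (the parameter name must match the one declared on the ADF pipeline).
    run = adf_client.pipelines.create_run(
        resource_group,
        factory_name,
        pipeline_name,
        parameters={"sourceBlobName": "blob.zip"},
    )

    # Check the run status so the function can log whether the copy succeeded.
    status = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
    print(status.status)  # e.g. "InProgress", "Succeeded", "Failed"

    Because Data Factory does the heavy copy and unzip work, the function only orchestrates the run and stays well within the Consumption-plan memory limit.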