Tags: azure · pdf · azure-functions · file-processing

Splitting large documents on Azure


We are trying to implement a pipeline that loads files (PDFs, Word documents) from an Azure Storage data lake, splits those documents into pages (possibly), and stores the resulting pages in another storage account. Whenever a new document arrives, it must trigger the splitting process.

What Azure services can be used to implement this pipeline?

Is Azure Functions suitable for this purpose, or should we go with Azure Data Factory?

This piece will be part of an LLM architecture, so the files must ultimately be indexed in a vector database.


Solution

  • Is Azure Functions suitable for this purpose, or should we go with Azure Data Factory?

    Yes, you can simply use Azure Functions. With a Blob trigger, your function fires whenever a new blob arrives in the source container.
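
    As a minimal sketch (assuming the Python v2 programming model; the container name `incoming` and the app setting `SourceStorageConnection` are placeholders, not values from the question):

    ```python
    import logging
    import azure.functions as func

    app = func.FunctionApp()

    # Fires whenever a new blob lands in the "incoming" container of the
    # storage account referenced by the "SourceStorageConnection" app setting.
    @app.blob_trigger(arg_name="blob",
                      path="incoming/{name}",
                      connection="SourceStorageConnection")
    def on_new_document(blob: func.InputStream):
        logging.info("New document received: %s (%s bytes)", blob.name, blob.length)
        # Splitting and re-upload would happen here (see the sketch further below).
    ```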


    You can use C#, Python, Java, or any other supported language to split the blob and re-upload it within a single function. You could also use Azure Logic Apps for these actions, but I would prefer Functions, since anything is possible through code.
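
    For the splitting step itself, a rough Python sketch using pypdf and azure-storage-blob might look like the following. The `TARGET_STORAGE_CONNECTION` setting, the `pages` container, and the output naming scheme are assumptions for illustration (Word documents would need a different library, e.g. python-docx):

    ```python
    import io
    import os
    from pypdf import PdfReader, PdfWriter
    from azure.storage.blob import BlobServiceClient

    def split_and_upload(pdf_bytes: bytes, source_name: str) -> None:
        """Split a PDF into single-page files and upload each to the target container."""
        # Connection string and container name are placeholders for this sketch.
        target = BlobServiceClient.from_connection_string(
            os.environ["TARGET_STORAGE_CONNECTION"]
        ).get_container_client("pages")

        reader = PdfReader(io.BytesIO(pdf_bytes))
        for page_number, page in enumerate(reader.pages, start=1):
            writer = PdfWriter()
            writer.add_page(page)

            buffer = io.BytesIO()
            writer.write(buffer)
            buffer.seek(0)

            # e.g. "contract.pdf" -> "contract/page-001.pdf"
            stem = os.path.splitext(os.path.basename(source_name))[0]
            target.upload_blob(f"{stem}/page-{page_number:03d}.pdf",
                               buffer, overwrite=True)
    ```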

    Alternatively, you can use one Blob-triggered function to detect the new blob and a second, HTTP-triggered function to split and upload it. In the HTTP-triggered function you have the storage SDK clients, which let you upload, download, rename blobs, and so on.
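
    If you split the work across two functions, the HTTP-triggered one could look roughly like this sketch (again the Python v2 model; the `split` route, the `name` parameter, and the `incoming` container are placeholders). It downloads the blob with the storage SDK and would hand it to the splitting code above:

    ```python
    import os
    import azure.functions as func
    from azure.storage.blob import BlobServiceClient

    app = func.FunctionApp()

    @app.route(route="split", methods=["POST"])
    def split_document(req: func.HttpRequest) -> func.HttpResponse:
        # The blob-triggered function would call this endpoint with the blob name.
        blob_name = req.params.get("name")
        if not blob_name:
            return func.HttpResponse("Missing 'name' parameter", status_code=400)

        source = BlobServiceClient.from_connection_string(
            os.environ["SOURCE_STORAGE_CONNECTION"]
        ).get_blob_client(container="incoming", blob=blob_name)

        pdf_bytes = source.download_blob().readall()
        # split_and_upload(pdf_bytes, blob_name)  # reuse the splitting sketch above
        return func.HttpResponse(f"Split requested for {blob_name}", status_code=200)
    ```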

    References for splitting the blobs: