python, azure, pdf, azure-blob-storage, pymupdf

How to edit a PDF in Azure Blob Storage without downloading it locally? (using Fitz)


I have a PDF that is already in blob storage. I need to highlight a few lines in it and store the result as a new PDF (again in blob storage). I looked through the answers linked below but couldn't find a solution. Here is the pseudocode:

import fitz


def edit_pdfs(path_to_pdf_from_blob, new_path_to_pdf_from_blob):

    ### READ pdf from blob storage
    doc = fitz.open(path_to_pdf_from_blob)

    ## EDIT doc (fitz.fitz.Document) - I already have working code to edit the doc, but won't put it here to avoid complexity

    ### WRITE pdf to blob storage
    doc.save(new_path_to_pdf_from_blob)

Answers already seen:

Access data within the blob storage without downloading
How can I read a text file from Azure blob storage directly without downloading it to a local file(using python)?
Azure Blobstore: How can I read a file without having to download the whole thing first?


Solution

  • I tried this in my environment and got the results below:

    Initially, I had one PDF document in my container, named important.pdf.

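    You can use the code below to edit the PDF without saving it to local disk. The blob is downloaded into memory as bytes, opened with fitz from that byte stream, edited, saved into an in-memory buffer, and uploaded as a new blob.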

    Code:

    from io import BytesIO

    import fitz  # PyMuPDF
    from azure.storage.blob import BlobServiceClient

    connection_string = "your-connection-string"
    container_name = "test"
    blob_name = "important.pdf"
    new_blob_name = "demo.pdf"

    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

    # Download the PDF into memory as bytes (nothing is written to local disk)
    pdf_bytes = blob_client.download_blob().readall()

    # Open the document from the in-memory bytes
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    page = doc[0]

    # Highlight a rectangular region on the first page
    rect = fitz.Rect(50, 50, 200, 200)
    highlight = page.add_highlight_annot(rect)
    # Set the color of the highlight annotation (yellow) and apply it
    highlight.set_colors(stroke=(1, 1, 0))
    highlight.update()

    # Save the modified PDF into an in-memory buffer
    modified_pdf_stream = BytesIO()
    doc.save(modified_pdf_stream)
    doc.close()
    modified_pdf_bytes = modified_pdf_stream.getvalue()

    # Upload the modified PDF as a new blob
    new_blob_client = blob_service_client.get_blob_client(container=container_name, blob=new_blob_name)
    new_blob_client.upload_blob(modified_pdf_bytes, overwrite=True)

    # Optional: delete the original blob
    blob_client.delete_blob()
    

    Output:

    The container now holds demo.pdf, which has the highlight annotation on its first page, and important.pdf has been deleted.
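
Since the question asks for a function and mentions highlighting a few lines of text, the same steps can be wrapped as shown below. This is only a minimal sketch under some assumptions: the names `edit_blob_pdf`, `highlight_text`, and `needle` are hypothetical and not part of either library, the source and destination blobs are assumed to live in the same container, and `page.search_for` is the method name in recent PyMuPDF releases (older releases may spell it differently). The storage helper takes the edit step as a callable, so your existing editing code can be plugged in instead of `highlight_text`.

```python
from io import BytesIO
from typing import Callable


def edit_blob_pdf(service_client, container: str, src_name: str,
                  dst_name: str, edit: Callable[[bytes], bytes]) -> None:
    """Read a blob into memory, apply `edit` to its bytes, upload the result.

    `service_client` is expected to behave like azure.storage.blob's
    BlobServiceClient (hypothetical helper, not an Azure SDK function).
    """
    src = service_client.get_blob_client(container=container, blob=src_name)
    pdf_bytes = src.download_blob().readall()  # held in memory only

    new_bytes = edit(pdf_bytes)

    dst = service_client.get_blob_client(container=container, blob=dst_name)
    dst.upload_blob(new_bytes, overwrite=True)


def highlight_text(pdf_bytes: bytes, needle: str) -> bytes:
    """Highlight every occurrence of `needle` and return the new PDF bytes."""
    # Imported here so the storage helper above works without PyMuPDF installed.
    import fitz  # PyMuPDF

    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    for page in doc:
        # search_for returns the rectangles that enclose each match
        for rect in page.search_for(needle):
            page.add_highlight_annot(rect)

    buffer = BytesIO()
    doc.save(buffer)  # save into memory instead of a file path
    doc.close()
    return buffer.getvalue()
```

To use it, build the client with `BlobServiceClient.from_connection_string(connection_string)` as in the answer above and call `edit_blob_pdf(client, "test", "important.pdf", "demo.pdf", lambda data: highlight_text(data, "some phrase"))`, where the search phrase is a placeholder. Unlike the answer's code, this sketch leaves the original blob in place.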