python python-3.x azure-storage azure-data-lake-gen2 pdfminer

Read a PDF file from a storage account (Azure Data Lake) without downloading it, using Python


I am trying to read a PDF file that I have uploaded to an Azure storage account, using Python. I have tried passing the file's SAS token/URL to PDFMiner, but I cannot get a path for the file that PDFMiner will accept. I am using something like the code below:

from azure.storage.filedatalake import DataLakeServiceClient
from azure.storage.filedatalake import generate_file_sas
import os
storage_account_name = "mystorageaccount"
storage_account_key = "mystoragekey"
container_name = "mycontainer"
directory_name = 'mydirectory'

service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
        "https", storage_account_name), credential=storage_account_key)
file_system_client = service_client.get_file_system_client(file_system=container_name)
directory_client   = file_system_client.get_directory_client(directory_name)
file_client = directory_client.get_file_client('XXX.pdf')
download = file_client.download_file()
downloaded_bytes = download.readall()

file_sas = generate_file_sas(account_name= storage_account_name,file_system_name= container_name,directory_name= directory_name,file_name= dir_name,credential= storage_account_key)

from pdfminer.pdfpage import PDFPage
with open(downloaded_bytes, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)

from pdfminer.pdfpage import PDFPage
with open(file_sas, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)

Neither option works. Initially input_dir was set up locally, so the code could fetch the PDF file and read it. Is there a different way to pass the file's URL/path from the storage account to the PDF read function? Any help is appreciated.


Solution

  • I tried this in my environment and got the results below:

    Initially, I tried the same process, without downloading the PDF file from the Azure Data Lake storage account, and got no results. As far as I know, the file's content has to be downloaded before it can be read.

    I then used the code below to read the PDF file with the PyPDF2 module, and it ran successfully and printed the content.

    Code:

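    from azure.storage.filedatalake import DataLakeFileClient
    import PyPDF2

    # Client for a single file: file system "test", path "dem.pdf"
    file_client = DataLakeFileClient.from_connection_string("<your storage connection string>", file_system_name="test", file_path="dem.pdf")

    # Download the file from the storage account to a local copy
    with open("dem.pdf", "wb") as local_file:
        data = file_client.download_file()
        data.readinto(local_file)

    # Read the local copy. These are the older PyPDF2 names; PyPDF2 3.0 and later
    # replace them with PdfReader, len(reader.pages), reader.pages[0] and extract_text().
    with open("dem.pdf", "rb") as pdf_file:
        pdfread = PyPDF2.PdfFileReader(pdf_file)
        print("Number of pages:", pdfread.numPages)
        pageObj = pdfread.getPage(0)
        print(pageObj.extractText())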
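
    Both attempts in the question fail for the same reason: open() expects a path to a local file, but downloaded_bytes holds the file's content and file_sas holds a token string. If the goal is only to avoid writing a local copy, a minimal sketch is to wrap the downloaded bytes in io.BytesIO and hand that stream to PDFMiner. The helper names as_pdf_stream and count_pages_in_datalake_pdf are hypothetical, and the sketch assumes the same connection string, file system and file path as the code above.

```python
import io


def as_pdf_stream(data: bytes) -> io.BytesIO:
    """Wrap downloaded PDF bytes in a seekable, file-like object."""
    # A PDF starts with a "%PDF-" header; some files carry a few junk bytes first.
    if b"%PDF-" not in data[:1024]:
        raise ValueError("downloaded content does not look like a PDF")
    return io.BytesIO(data)


def count_pages_in_datalake_pdf(connection_string: str, file_system: str, file_path: str) -> int:
    """Download a PDF into memory and count its pages with pdfminer."""
    # Third-party imports are kept local so as_pdf_stream works without them.
    from azure.storage.filedatalake import DataLakeFileClient
    from pdfminer.pdfpage import PDFPage

    file_client = DataLakeFileClient.from_connection_string(
        connection_string, file_system_name=file_system, file_path=file_path
    )
    # readall() returns the content as bytes; nothing is written to disk.
    pdf_bytes = file_client.download_file().readall()
    stream = as_pdf_stream(pdf_bytes)
    # get_pages expects an open binary file object, not bytes and not a path.
    return sum(1 for _ in PDFPage.get_pages(stream, check_extractable=False))
```

    The content is still transferred from the storage account, so this avoids the local file rather than the download itself. Calling count_pages_in_datalake_pdf("<your storage connection string>", "test", "dem.pdf") would mirror the example above.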
    


    You can also read the PDF file in a browser by appending the SAS token to the file URL:

     https://<storage account name>.dfs.core.windows.net/test/dem.pdf?<sas-token>
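
    The same URL can in principle be fetched from Python instead of a browser. The sketch below only illustrates the URL shape shown above and an in-memory fetch with the standard library; build_sas_url and fetch_pdf_stream are hypothetical helper names, and it assumes the SAS token grants read permission and has not expired.

```python
import io
from urllib.parse import quote
from urllib.request import urlopen


def build_sas_url(account_name: str, file_system: str, file_path: str, sas_token: str) -> str:
    """Join the file URL and the SAS token with a single '?'."""
    token = sas_token.lstrip("?")  # tolerate tokens copied with a leading '?'
    return "https://{}.dfs.core.windows.net/{}/{}?{}".format(
        account_name, quote(file_system), quote(file_path), token
    )


def fetch_pdf_stream(url: str) -> io.BytesIO:
    """Fetch the file over HTTPS and return it as an in-memory binary stream."""
    with urlopen(url) as response:
        return io.BytesIO(response.read())
```

    The stream returned by fetch_pdf_stream can be passed to a PDF reader in place of an opened local file. The URL itself cannot be passed to open(), which is why the SAS attempt in the question fails.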
    
