I am trying to read a pdf file which I have uploaded on an Azure storage account. I am trying to do this using python. I have tried using the SAS token/URL of the file and pass it thorugh PDFMiner but I am not able get the path of the file which will be accepted by PDFMiner. I am using something like the below code:
from azure.storage.filedatalake import DataLakeServiceClient
from azure.storage.filedatalake import generate_file_sas
import os
storage_account_name = "mystorageaccount"
storage_account_key = "mystoragekey"
container_name = "mycontainer"
directory_name = 'mydirectory'
service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
"https", storage_account_name), credential=storage_account_key)
file_system_client = service_client.get_file_system_client(file_system=container_name)
directory_client = file_system_client.get_directory_client(directory_name)
file_client = directory_client.get_file_client('XXX.pdf')
download = file_client.download_file()
downloaded_bytes = download.readall()
file_sas = generate_file_sas(account_name= storage_account_name,file_system_name= container_name,directory_name= directory_name,file_name= dir_name,credential= storage_account_key)
from pdfminer.pdfpage import PDFPage
with open(downloaded_bytes, 'rb') as infile:
PDFPage.get_pages(infile, check_extractable=False)
from pdfminer.pdfpage import PDFPage
with open(file_sas, 'rb') as infile:
PDFPage.get_pages(infile, check_extractable=False)
Neither of the options are working. Initially the input_dir was setup locally, so the code was able to fetch the pdf file and read it. Is there a different way to pass the URL/path of the file from the storage account to the pdf's read function? Any help is appreciated.
I tried in my environment and got below results:
Initially, I tried with same process without downloading the Pdf files from azure Datalake
storage account and got no results. But AFAIK, to read the pdf file with downloading is possible way.
I tried with below code to read pdf file with Module PyPDF2
, and it executed with content successfully.
Code:
from azure.storage.filedatalake import DataLakeFileClient
import PyPDF2
service_client = DataLakeFileClient.from_connection_string("<your storage connection string>",file_system_name="test",file_path="dem.pdf")
with open("dem.pdf", 'wb') as file:
data = service_client.download_file()
data.readinto(file)
object=open("dem.pdf",'rb')
pdfread=PyPDF2.PdfFileReader(object)
print("Number of pages:",pdfread.numPages)
pageObj = pdfread.getPage(0)
print(pageObj.extractText())
Console:
You can also read the pdf file through browser using file URL:
https://<storage account name >.dfs.core.windows.net/test/dem.pdf+? sas-token
Browser: