I built a script using textract that reads the content of PDF files. It contains the following function:
import textract
import tempfile

def read_file(bytes):
    with tempfile.NamedTemporaryFile('wb', delete=True) as temp:
        temp.write(bytes)
        temp.flush()
        context = textract.process(temp.name, encoding='utf-8', extension=".pdf")
        return context.decode('utf-8')
This script works locally, but not when deployed to a function app. This is the error message it returns:
The command `pdf2txt.py /tmp/tmpe3yo9gax` failed because the executable
`pdf2txt.py` is not installed on your system. Please make
sure the appropriate dependencies are installed before using
textract:
http://textract.readthedocs.org/en/latest/installation.html
Both textract and pdf2text are listed in the requirements.txt of the function app, so they should be installed on deployment. Does anyone have an idea why this does not work? It seems as if the pdf2text library fails to install via pip on the function app.
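Judging by the error message, textract runs `pdf2txt.py` as an external command, so the script has to be reachable on the PATH of the process and not merely importable as a Python package. A minimal diagnostic sketch, using only the standard library, that could be logged from inside the deployed function to confirm this (the helper names `executable_on_path` and `report_missing` are hypothetical, not part of textract):

```python
import shutil


def executable_on_path(name: str) -> bool:
    """Return True if `name` resolves to an executable on the current PATH."""
    return shutil.which(name) is not None


def report_missing(names):
    """Return the subset of `names` that cannot be found on the PATH."""
    return [name for name in names if not executable_on_path(name)]


if __name__ == "__main__":
    # "pdf2txt.py" is the executable named in the error message above.
    missing = report_missing(["pdf2txt.py"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required executables were found on PATH.")
```

If the script is reported as missing in the function app but found locally, the problem is likely the PATH of the deployed environment and not the pip install itself.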
Create an HTTP trigger function with the code below. It imports textract, but extracts the text from the PDF file with PyPDF2, which does not rely on an external executable.
My function_app.py:-
import azure.functions as func
import logging
import textract
import os
import PyPDF2

app = func.FunctionApp(http_auth_level=func.AuthLevel.ANONYMOUS)

@app.route(route="http_trigger")
def http_trigger(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    name = req.params.get('name')
    if not name:
        try:
            req_body = req.get_json()
            name = req_body.get('name')
        except ValueError:
            pass
    if name:
        pdf_file_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "HelloTest.pdf")
        text = extract_text_from_pdf(pdf_file_path)
        response = f"Hello, {name}. This HTTP triggered function executed successfully.\nExtracted text from PDF: {text}"
        return func.HttpResponse(response)
    else:
        return func.HttpResponse(
            "This HTTP triggered function executed successfully. Pass a name in the query string or in the request body for a personalized response.",
            status_code=200
        )

def extract_text_from_pdf(pdf_file_path):
    try:
        with open(pdf_file_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            text = ''
            for page_num in range(len(pdf_reader.pages)):
                page = pdf_reader.pages[page_num]
                text += page.extract_text()
            return text
    except Exception as e:
        logging.error(f"Error extracting text from PDF: {e}")
        return "Error extracting text from PDF"
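The function in the question receives the PDF as bytes, not as a file path. A minimal sketch of how the same PyPDF2 approach might be adapted to in-memory content, so that no temporary file or external executable is needed. The name `read_pdf_bytes` and its `reader_factory` parameter are hypothetical, introduced only for illustration; by default the sketch assumes `PyPDF2.PdfReader`, which accepts a binary stream:

```python
import io


def read_pdf_bytes(data: bytes, reader_factory=None) -> str:
    """Extract text from PDF content held in memory.

    `reader_factory` is a hypothetical hook: any callable that takes a
    binary stream and returns an object with a `.pages` iterable whose
    items provide `extract_text()`. It defaults to PyPDF2.PdfReader.
    """
    if reader_factory is None:
        # Imported lazily so the helper can be exercised without PyPDF2.
        import PyPDF2
        reader_factory = PyPDF2.PdfReader

    reader = reader_factory(io.BytesIO(data))
    # Treat a missing result as empty text so one blank page does not fail the call.
    return "".join(page.extract_text() or "" for page in reader.pages)
```

Inside an HTTP trigger this could be called as `read_pdf_bytes(req.get_body())`, assuming the PDF is sent as the raw request body. The injectable factory is a design choice that makes the helper testable with a stub reader; it can be dropped if that is not needed.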
My requirements.txt:-
azure-functions
textract
pdf2text
PyPDF2
My Function Folder with PDF File:-
Deployed this Function successfully:-
When I triggered the URL, I received the text from my PDF file:-