python azure azure-functions text-extraction pdf-reader

Pdf2text not working in Azure function app


I built a script using textract that reads the content of PDF files. It contains the following function:

import textract
import tempfile

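def read_file(data):
    # Write the PDF bytes to a temporary file so textract can open it by path.
    with tempfile.NamedTemporaryFile('wb', delete=True) as temp:
        temp.write(data)
        temp.flush()
        # textract.process returns bytes; extension selects the PDF parser.
        content = textract.process(temp.name, encoding='utf-8', extension=".pdf")

    return content.decode('utf-8')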

This script works locally, but it fails when deployed to a function app. This is the error message it returns:

`pdf2txt.py /tmp/tmpe3yo9gax` failed because the executable
`pdf2txt.py` is not installed on your system. Please make
sure the appropriate dependencies are installed before using
textract:

    http://textract.readthedocs.org/en/latest/installation.html

Both textract and pdf2text are in the requirements.txt of the function app, so they should be installed on deployment. Does anyone have an idea why this does not work? It seems like the pdf2text library refuses to install via pip on the function app.
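
One way to test that suspicion from inside the function app, as a minimal sketch: the error is about an executable that cannot be found, not about a Python import, so it helps to log whether the program names resolve on PATH. The helper name missing_executables is hypothetical, and the two names passed to it are an assumption based on my understanding that textract tries pdftotext first and then falls back to pdf2txt.py (a script that ships with pdfminer).

```python
import shutil


def missing_executables(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed list of external programs used for PDFs; adjust as needed.
    print(missing_executables(["pdftotext", "pdf2txt.py"]))
```

If both names come back as missing, the packages may well be installed while their scripts are simply not reachable on PATH in the deployed environment.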


Solution

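  • Create an HTTP trigger function with the code below. It extracts the text with PyPDF2, a pure-Python library, so it does not depend on the pdf2txt.py executable that textract calls; textract is still imported and listed in requirements.txt, but the extraction itself does not use it.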

    My function_app.py:-

    import azure.functions as func
    import logging
    import textract
    import os
    import PyPDF2
    
    app = func.FunctionApp(http_auth_level=func.AuthLevel.ANONYMOUS)
    
    @app.route(route="http_trigger")
    def http_trigger(req: func.HttpRequest) -> func.HttpResponse:
        logging.info('Python HTTP trigger function processed a request.')
    
        name = req.params.get('name')
        if not name:
            try:
                req_body = req.get_json()
                name = req_body.get('name')
            except ValueError:
                pass
    
        if name:
            pdf_file_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "HelloTest.pdf")
            text = extract_text_from_pdf(pdf_file_path)
            response = f"Hello, {name}. This HTTP triggered function executed successfully.\nExtracted text from PDF: {text}"
            return func.HttpResponse(response)
        else:
            return func.HttpResponse(
                 "This HTTP triggered function executed successfully. Pass a name in the query string or in the request body for a personalized response.",
                 status_code=200
            )
    
    def extract_text_from_pdf(pdf_file_path):
        try:
            with open(pdf_file_path, 'rb') as pdf_file:
                pdf_reader = PyPDF2.PdfReader(pdf_file)
                text = ''
                for page in pdf_reader.pages:
                    # Guard against pages that yield no text (e.g. image-only pages).
                    text += page.extract_text() or ''
                return text
        except Exception as e:
            logging.error(f"Error extracting text from PDF: {e}")
            return "Error extracting text from PDF"
    

    My requirements.txt:-

    azure-functions
    textract
    pdf2text
    PyPDF2
    

    My function folder, with HelloTest.pdf next to function_app.py (screenshot omitted).

    The function deployed successfully (screenshot omitted).

    When I triggered the URL, I received the text from my PDF file (screenshots omitted).
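
The question's read_file starts from raw bytes rather than a file on disk. As a minimal sketch (not part of the answer above), the same PyPDF2 approach can read from memory through io.BytesIO, which also removes the need for a temporary file. The function name read_pdf_bytes and the reader_factory parameter are hypothetical; the parameter only exists so the function can be tried without PyPDF2 installed, and it defaults to PyPDF2.PdfReader.

```python
import io


def read_pdf_bytes(data, reader_factory=None):
    """Return the text of a PDF given as raw bytes, without a temp file.

    reader_factory is a hypothetical hook (not part of PyPDF2): any callable
    that takes a binary stream and returns an object with a `.pages` sequence
    whose items provide `extract_text()`. It defaults to PyPDF2.PdfReader.
    """
    if reader_factory is None:
        # Imported lazily so this module loads even where PyPDF2 is absent.
        from PyPDF2 import PdfReader as reader_factory

    reader = reader_factory(io.BytesIO(data))
    # Pages that yield no text contribute an empty string.
    return "".join(page.extract_text() or "" for page in reader.pages)
```

In an HTTP trigger this could be called as read_pdf_bytes(req.get_body()), assuming the PDF is sent as the raw request body.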