Tags: azure, azure-cognitive-services, azure-form-recognizer

Azure Form Recognizer: reading PDF files in a custom format


Can anyone guide me on how to use a trained model to extract data from a custom PDF with Azure Form Recognizer in Python? I have the Python code that the trained model generated. I set the parameters to my own values, but I still cannot read the document successfully. Any pointers would be appreciated.

(Note: I have been through all the related Stack Overflow questions, but none of them clearly answers my scenario.)

"""
This code sample shows Custom Extraction Model operations with the Azure Form Recognizer client library.
The async versions of the samples require Python 3.6 or later.

To learn more, please visit the documentation - Quickstart: Form Recognizer Python client library SDKs
https://learn.microsoft.com/azure/applied-ai-services/form-recognizer/quickstarts/get-started-v3-sdk-rest-api?view=doc-intel-3.1.0&pivots=programming-language-python
"""

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

"""
Remember to remove the key from your code when you're done, and never post it publicly. For production, use
secure methods to store and access your credentials. For more information, see 
https://docs.microsoft.com/en-us/azure/cognitive-services/cognitive-services-security?tabs=command-line%2Ccsharp#environment-variables-and-application-configuration
"""
endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
key = "YOUR_FORM_RECOGNIZER_KEY"

model_id = "YOUR_CUSTOM_BUILT_MODEL_ID"
formUrl = "YOUR_DOCUMENT"

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

# Make sure your document's type is included in the list of document types the custom model can analyze
poller = document_analysis_client.begin_analyze_document_from_url(model_id, formUrl)
result = poller.result()

for idx, document in enumerate(result.documents):
    print("--------Analyzing document #{}--------".format(idx + 1))
    print("Document has type {}".format(document.doc_type))
    print("Document has confidence {}".format(document.confidence))
    print("Document was analyzed by model with ID {}".format(result.model_id))
    for name, field in document.fields.items():
        field_value = field.value if field.value else field.content
        print("......found field of type '{}' with value '{}' and with confidence {}".format(field.value_type, field_value, field.confidence))


# iterate over tables, lines, and selection marks on each page
for page in result.pages:
    print("\nLines found on page {}".format(page.page_number))
    for line in page.lines:
        print("...Line '{}'".format(line.content.encode('utf-8')))
    for word in page.words:
        print(
            "...Word '{}' has a confidence of {}".format(
                word.content.encode('utf-8'), word.confidence
            )
        )
    for selection_mark in page.selection_marks:
        print(
            "...Selection mark is '{}' and has a confidence of {}".format(
                selection_mark.state, selection_mark.confidence
            )
        )

for i, table in enumerate(result.tables):
    print("\nTable {} can be found on page:".format(i + 1))
    for region in table.bounding_regions:
        print("...{}".format(region.page_number))
    for cell in table.cells:
        print(
            "...Cell[{}][{}] has content '{}'".format(
                cell.row_index, cell.column_index, cell.content.encode('utf-8')
            )
        )
print("-----------------------------------")
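
When a custom model appears not to read a document, it can help to reduce the result to a plain dictionary and flag the fields that came back with low confidence. The following is a minimal sketch, assuming the result has the shape the loops above rely on (`result.documents`, each with a `fields` mapping whose entries expose `value`, `content`, and `confidence`). The names `summarize_fields` and `min_confidence`, and the field names `InvoiceId` and `Total`, are hypothetical. The `SimpleNamespace` objects only stand in for a real response so that the snippet runs without an Azure resource.

```python
from types import SimpleNamespace


def summarize_fields(result, min_confidence=0.5):
    """Collect extracted fields per document and list the low-confidence ones.

    `result` is assumed to look like the object returned by poller.result():
    it has a `documents` list, and each document has a `fields` dict.
    """
    summaries = []
    for document in result.documents:
        values = {}
        low_confidence = []
        for name, field in document.fields.items():
            # Fall back to the raw text only when no typed value was produced,
            # so legitimate values such as 0 or False are kept.
            values[name] = field.value if field.value is not None else field.content
            # A missing confidence (None) is treated as low confidence here.
            if field.confidence is None or field.confidence < min_confidence:
                low_confidence.append(name)
        summaries.append({"values": values, "low_confidence": sorted(low_confidence)})
    return summaries


# Stand-in for a real analysis result (hypothetical field names and values).
fake_result = SimpleNamespace(
    documents=[
        SimpleNamespace(
            fields={
                "InvoiceId": SimpleNamespace(value="INV-001", content="INV-001", confidence=0.98),
                "Total": SimpleNamespace(value=None, content="1,200.00", confidence=0.31),
            }
        )
    ]
)

for summary in summarize_fields(fake_result):
    print(summary)
```

With a real response, `summarize_fields(result)` would be called right after `result = poller.result()`. An empty `values` dictionary or a long `low_confidence` list suggests that the document does not match what the model was trained on.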

I am using the following to read the contents of my blob.


Solution

  • In the code below, the comments mark where the endpoint, API key, and model ID from Azure go.

    Replace the placeholder values (YOUR_FORM_RECOGNIZER_ENDPOINT, YOUR_FORM_RECOGNIZER_API_KEY, YOUR_TRAINED_MODEL_ID, and the PDF path) with your actual values.

    from azure.core.credentials import AzureKeyCredential
    from azure.ai.formrecognizer import FormRecognizerClient
    from azure.ai.formrecognizer import FormRecognizerApiVersion
    
    # Set your Azure Form Recognizer endpoint and API key
    endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
    api_key = "YOUR_FORM_RECOGNIZER_API_KEY"
    
    # Set the model ID for your trained model
    model_id = "YOUR_TRAINED_MODEL_ID"
    
    # Initialize the Form Recognizer client
    form_recognizer_client = FormRecognizerClient(
        endpoint=endpoint,
        credential=AzureKeyCredential(api_key),
        api_version=FormRecognizerApiVersion.V2_1
    )
    
    # Specify the path to the PDF file you want to analyze
    pdf_path = "path/to/your/custom.pdf"
    
    # Extract data from the PDF using the trained model
    with open(pdf_path, "rb") as pdf_file:
        poller = form_recognizer_client.begin_recognize_custom_forms(
            model_id=model_id,
            form=pdf_file,
            content_type="application/pdf"
        )
        result = poller.result()
    
    # Output the extracted data
    for recognized_form in result:
        for name, field in recognized_form.fields.items():
            print("Field: {} with value {}".format(name, field.value))

    Also check that the document URL follows this pattern:

    https://yourstorageaccount.blob.core.windows.net/yourcontainer/yourfile.txt?<storage-account-key>
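
The URL pattern above can also be assembled in code. This is a minimal sketch, assuming the query string is a shared access signature (SAS) token rather than the raw account key, since a private blob is normally shared by URL through a SAS token. The helper name `build_blob_url` and all sample values are hypothetical placeholders.

```python
from urllib.parse import quote


def build_blob_url(account, container, blob_name, sas_token):
    """Return https://<account>.blob.core.windows.net/<container>/<blob>?<sas>."""
    # Percent-encode the blob name but keep "/" so virtual folders survive.
    encoded_name = quote(blob_name, safe="/")
    # Accept the token with or without its leading "?".
    token = sas_token.lstrip("?")
    return "https://{}.blob.core.windows.net/{}/{}?{}".format(
        account, container, encoded_name, token
    )


# Placeholder values only; a real SAS token comes from the Azure portal or SDK.
form_url = build_blob_url(
    "yourstorageaccount",
    "yourcontainer",
    "invoices/sample form.pdf",
    "?sv=PLACEHOLDER&sig=PLACEHOLDER",
)
print(form_url)
```

The resulting string could then be passed as `formUrl` to `begin_analyze_document_from_url` in the question's code. Pasting the same URL into a browser is a quick way to confirm that the blob is actually reachable.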