python azure azure-document-intelligence

Azure Document Intelligence Python SDK doesn't separate pages


When extracting content from an MS Word .docx file with Azure Document Intelligence, I expected the response to contain one page element per page in the document, and each page element to contain multiple lines, as the documentation describes.

Instead, I always receive a single page with no lines (page.lines is None) and the entire document's contents as a list of words.

Minimal reproducible example:

from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import DocumentAnalysisFeature, AnalyzeResult, AnalyzeDocumentRequest

def main():
    client = DocumentIntelligenceClient(
        'MY ENDPOINT',
        AzureKeyCredential('MY KEY')
    )

    document = 'small_test_document.docx'

    with open(document, "rb") as f:
        poller = client.begin_analyze_document(
            "prebuilt-layout",
            analyze_request=f,
            content_type="application/octet-stream"
        )
    result = poller.result()

    print(f'Found {len(result.pages)} page(s)')
    for page in result.pages:
        print(f'Page #{page.page_number}')
        print(f'  {page.lines=}')
        print(f'  {len(page.words)=}')

if __name__ == '__main__':
    main()

Expected output:

Found 2 page(s)
Page #1
  page.lines=6
  len(page.words)=58
Page #2
  page.lines=1
  len(page.words)=8

Actual output:

Found 1 page(s)
Page #1
  page.lines=None
  len(page.words)=66

My question is: Why, and what should I do differently to get the expected output?


Solution

  • As your actual output shows, all 66 words in your document are treated as a single page.

    This is the expected behavior. The docs on how page units are computed state that, for a Word document, up to 3,000 characters count as one page unit:

    File format | Computed page unit                                                             | Total pages
    Word (DOCX) | Up to 3,000 characters = 1 page unit; embedded or linked images not supported | Total pages of up to 3,000 characters each

    So every 3,000 characters count as one page, and the page breaks in your document are ignored. Additionally, the documentation lists features that are not supported for Microsoft Office (DOCX, XLSX, PPTX) and HTML files.

    So Document Intelligence has limited support for DOCX files. Your best option is to use PDF files, whose content is analyzed page by page. It's not ideal, but if you do need to work with DOCX files, first convert them to PDF (using a relevant API) and then process them with Document Intelligence.
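
To make the two points above concrete, here is a minimal sketch. It estimates how many page units a DOCX of a given character count would produce under the 3,000-character rule, and it converts a DOCX to PDF before analysis. The conversion step assumes LibreOffice is installed and its `soffice` binary is on the PATH, which is only one of several possible converters. The helper names (`estimate_docx_page_units`, `build_convert_command`, `convert_docx_to_pdf`) are made up for illustration and are not part of the Azure SDK. Treating every document as at least one page unit is also an assumption.

```python
from __future__ import annotations

import math
import shutil
import subprocess
from pathlib import Path

# Taken from the page-unit table quoted in the answer.
DOCX_CHARS_PER_PAGE_UNIT = 3000


def estimate_docx_page_units(char_count: int) -> int:
    """Estimate the number of page units reported for a DOCX file.

    Assumption: any document counts as at least one page unit.
    """
    return max(1, math.ceil(char_count / DOCX_CHARS_PER_PAGE_UNIT))


def build_convert_command(docx_path: str, out_dir: str,
                          soffice: str = "soffice") -> list[str]:
    """Build the LibreOffice headless command that writes a PDF into out_dir."""
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", out_dir, docx_path]


def convert_docx_to_pdf(docx_path: str, out_dir: str = ".") -> Path:
    """Convert a DOCX file to PDF and return the expected PDF path.

    Assumption: LibreOffice is installed locally; swap in any other
    converter if that does not hold in your environment.
    """
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice ('soffice') was not found on PATH")
    subprocess.run(build_convert_command(docx_path, out_dir), check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")


if __name__ == "__main__":
    # A short document stays within a single page unit.
    print(estimate_docx_page_units(400))
    pdf_path = convert_docx_to_pdf("small_test_document.docx")
    print(f"Converted file written to {pdf_path}")
```

The resulting PDF path can then be opened and passed to the same `begin_analyze_document` call shown in the question, in place of the .docx file.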