When trying to extract content from a MS Word .docx file using Azure Document Intelligence, I expected the returned response to contain a page element for each page in the document and for each of those page elements to contain multiple lines in line with the documentation.
Instead, I always receive a single page with no lines (`page.lines` is `None`) and the entire document's contents as a list of words.
Minimal reproducible example:
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import DocumentAnalysisFeature, AnalyzeResult, AnalyzeDocumentRequest

def main():
    client = DocumentIntelligenceClient(
        'MY ENDPOINT',
        AzureKeyCredential('MY KEY')
    )
    document = 'small_test_document.docx'
    with open(document, "rb") as f:
        poller = client.begin_analyze_document(
            "prebuilt-layout",
            analyze_request=f,
            content_type="application/octet-stream"
        )
    result = poller.result()
    print(f'Found {len(result.pages)} page(s)')
    for page in result.pages:
        print(f'Page #{page.page_number}')
        print(f' {page.lines=}')
        print(f' {len(page.words)=}')

if __name__ == '__main__':
    main()
Expected output:
Found 2 page(s)
Page #1
page.lines=6
len(page.words)=58
Page #2
page.lines=1
len(page.words)=8
Actual output:
Found 1 page(s)
Page #1
page.lines=None
len(page.words)=66
My question is: Why, and what should I do differently to get the expected output?
As your actual output shows, all 66 words in your document are treated as a single page.
This is the expected behavior. As the docs on how page units are computed explain, a Word document is split into page units of up to 3,000 characters each:
File format | Computed page unit | Total pages |
---|---|---|
Word (DOCX) | Up to 3,000 characters = 1 page unit, embedded or linked images not supported | Total pages of up to 3,000 characters each |
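
To make the rule in the table concrete, here is a minimal sketch of the arithmetic. The function name is hypothetical, and how the service actually counts characters (whitespace, headers, and so on) is not something the table specifies, so treat this as an estimate only.

```python
import math

# Figure quoted in the table above for Word (DOCX) files.
CHARS_PER_PAGE_UNIT = 3000


def estimate_docx_page_units(char_count):
    """Estimate how many page units a DOCX with `char_count` characters yields.

    Hypothetical helper: it only applies the "up to 3,000 characters = 1 page
    unit" rule and ignores any page breaks in the document.
    """
    return math.ceil(char_count / CHARS_PER_PAGE_UNIT)


# A short document (say, a few hundred characters) fits in one page unit,
# no matter how many page breaks it contains.
print(estimate_docx_page_units(400))
print(estimate_docx_page_units(3001))
```

A 66-word document is far below 3,000 characters, which is consistent with the single page you received.
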
So every 3,000 characters count as one page, and the page breaks in your document are ignored. Additionally, some layout features are not supported for Microsoft Office (DOCX, XLSX, PPTX) and HTML files, which is why `page.lines` comes back as `None`.
In short, Document Intelligence has limited support for DOCX files. Your best option is to use PDF files, which are analyzed page by page. It's not ideal, but if you do need to work with DOCX files, convert them to PDF first (using a suitable conversion API) and then process the PDFs with Document Intelligence.
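
The convert-then-analyze workflow could look roughly like the sketch below. It assumes LibreOffice is installed and its `soffice` binary is on your PATH (any other DOCX-to-PDF converter would do equally well), and it reuses the `begin_analyze_document` call from your question unchanged. All helper names here are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir):
    """Return the LibreOffice command line that converts one DOCX to PDF."""
    # Assumption: `soffice` (LibreOffice) is installed and on PATH.
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_docx_to_pdf(docx_path, out_dir="."):
    """Run the conversion and return the path of the PDF that gets written."""
    subprocess.run(build_convert_command(docx_path, out_dir), check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")


def analyze_pdf(client, pdf_path):
    """Analyze a PDF with the same call used in the question."""
    with open(pdf_path, "rb") as f:
        poller = client.begin_analyze_document(
            "prebuilt-layout",
            analyze_request=f,
            content_type="application/octet-stream",
        )
        return poller.result()


def summarize_pages(result):
    """Return (page_number, line_count, word_count) for each page.

    Treats a missing `lines` or `words` list (None) as empty, so the same
    code also works on DOCX results.
    """
    return [
        (page.page_number, len(page.lines or []), len(page.words or []))
        for page in result.pages
    ]
```

With the `client` from your example, usage would be along the lines of `summarize_pages(analyze_pdf(client, convert_docx_to_pdf('small_test_document.docx')))`. Keep in mind that the page and line counts then reflect how the converter lays out the document, which may differ slightly from what Word shows.
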