google-cloud-platform cloud-document-ai

Document AI - Effect of multi-page files on model performance



I’ve noticed that it’s possible to upload multi-page files to Document AI, so that all pages stay connected to each other by being associated with the same file.

My use case is invoice files that I would like to extract data from, using a custom extractor.
Most of the invoices are 1-pagers, but some of them span 2 pages. In those cases the second page is usually leaner than the first and contains little of the information I need.

My question is: will a trained model's performance differ between the following file upload mechanisms?
  1. Uploading each page as a separate file, even when an invoice spans multiple pages (I split it in a preprocessing step)
  2. Uploading each file as-is, without splitting it into pages

I assume that the performance of option 2 will be equal to or greater than that of option 1. My question is mainly whether it makes a difference at all, as uploading pages separately has its own advantages for us (our use case is a bit more complicated; I simplified it for the explanation).


Solution

  • In terms of performance, with only one or two pages per invoice you likely won't see a noticeable difference between the two options. However, option 1 requires more processing time overall because of the added pre-processing step of splitting multi-page documents.

    While option 2 offers a simpler workflow, it may process irrelevant information beyond the first page, potentially impacting accuracy and efficiency compared to option 1, which analyzes each page individually.

    Ultimately, the best choice depends on your specific needs. While option 2 is straightforward, option 1 might yield better accuracy for your custom extractor due to its focus on individual pages.

    You might find this discussion interesting as it touches on processing multi-page invoices in Document AI.
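
If you go with option 1, the per-page results have to be regrouped into one record per invoice afterwards. The following is a minimal sketch of that step in plain Python, not Document AI's own API: the `Entity` class and the `merge_page_entities` helper are hypothetical names introduced for illustration, and it assumes you have already copied each extracted field's name, value, and confidence out of the processor response for every page.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entity:
    """Hypothetical, simplified stand-in for one extracted field."""
    name: str          # field name, e.g. "invoice_id"
    value: str         # extracted text
    confidence: float  # score between 0.0 and 1.0
    page: int          # 1-based page number within the original invoice


def merge_page_entities(pages: List[List[Entity]]) -> Dict[str, Entity]:
    """Merge per-page results into one record per invoice.

    For every field name, keep the entity with the highest confidence
    across all pages. On a tie, the entity from the earlier page wins.
    """
    merged: Dict[str, Entity] = {}
    for page_entities in pages:
        for entity in page_entities:
            best = merged.get(entity.name)
            if best is None or entity.confidence > best.confidence:
                merged[entity.name] = entity
    return merged


if __name__ == "__main__":
    # Made-up values, purely to show the call shape.
    page_1 = [Entity("invoice_id", "INV-1", 0.98, 1),
              Entity("total_amount", "100.00", 0.60, 1)]
    page_2 = [Entity("total_amount", "120.00", 0.91, 2)]
    for name, entity in merge_page_entities([page_1, page_2]).items():
        print(name, entity.value, "from page", entity.page)
```

This only suits fields that occur once per invoice. Repeated fields such as line items would need to be collected into a list per field name instead of being reduced to a single best value. With option 2 this regrouping step is unnecessary, since all pages already belong to the same document.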