[SOLVED] How to automatically extract data from large texts with queries

I have large PDF files (hundreds of pages, in French) that describe a set of rules for my sector of activity.

I am looking for a service that would let me query the PDFs (or the text I extract from them) one at a time and get the information back automatically.

(Example: What is the maximum authorized length of x?)

I looked at OpenAI's ChatGPT and ran into maximum-token limits because, as mentioned above, the texts are huge.

I looked at Amazon's Textract, which does have a query system, but it seems built for image processing. Converting my text into images doesn't seem optimal, especially since the images would need to be very large: I haven't yet found software that can merge those PDFs into one huge image without running into memory issues, and I'm fairly certain Textract could not handle such an image anyway.

I looked at other solutions online, but nothing seemed to handle large texts combined with complex queries.

Solution

Amazon Textract supports PDFs as input, so you wouldn't need to convert your PDFs to text and back to images.

 PDF and TIFF files have a 500 MB limit and a 3,000-page limit.
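Before uploading, it can help to check a file against those two limits. The following is only a minimal sketch: the function name `check_textract_limits` is hypothetical, the page count is assumed to come from whatever PDF library you already use, and treating "500 MB" as 500 * 1024 * 1024 bytes is an assumption.

```python
import os

# Limits quoted above. Interpreting "500 MB" as 500 * 1024 * 1024 bytes
# is an assumption; adjust if the service counts megabytes differently.
MAX_BYTES = 500 * 1024 * 1024
MAX_PAGES = 3000


def check_textract_limits(size_bytes, page_count):
    """Return a list of human-readable problems (empty if the file fits).

    Hypothetical helper: size_bytes can come from os.path.getsize(path),
    page_count from any PDF library of your choice.
    """
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    if page_count > MAX_PAGES:
        problems.append(f"file has {page_count} pages, limit is {MAX_PAGES}")
    return problems
```

For a real file you would call something like `check_textract_limits(os.path.getsize("./multipage.pdf"), page_count)` and split the PDF if the returned list is not empty.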

Here is a tutorial on using queries with Textract: https://aws-samples.github.io/amazon-textract-textractor/notebooks/using_queries.html. For multi-page documents, you need to use the asynchronous API through .start_document_analysis.

The relevant code is here:

from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractcaller import QueriesConfig, Query

# Uses the credentials of the "default" AWS profile
extractor = Textractor(profile_name="default")

# Asynchronous analysis, required for multi-page documents.
# The local file is uploaded to S3 before Textract processes it.
document1 = extractor.start_document_analysis(
    file_source='./multipage.pdf',
    features=[TextractFeatures.QUERIES],
    s3_upload_path='<YOUR_S3_BUCKET>',
    s3_output_path='<YOUR_S3_BUCKET>',
    save_image=True,
    queries=QueriesConfig([Query("What is the first row value")])
)

# Answer to the first query
print(document1.queries[0].result)

0.129853474
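If you work with the raw Textract response rather than the Textractor wrapper, the answers arrive as `QUERY` blocks linked to `QUERY_RESULT` blocks through an `ANSWER` relationship. The sketch below pairs them up; the function name `pair_queries_with_answers` is hypothetical, and `blocks` is assumed to be the already-collected `Blocks` list from the response (all result pages concatenated).

```python
def pair_queries_with_answers(blocks):
    """Map each query text to a list of (answer_text, confidence) tuples.

    Hypothetical helper operating on the "Blocks" list of a raw Textract
    response. The same query can appear once per page, so answers from
    all pages are accumulated under the same question.
    """
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        found = answers.setdefault(block["Query"]["Text"], [])
        for rel in block.get("Relationships", []):
            if rel.get("Type") != "ANSWER":
                continue
            for result_id in rel.get("Ids", []):
                result = by_id.get(result_id)
                if result and result.get("BlockType") == "QUERY_RESULT":
                    found.append((result.get("Text"), result.get("Confidence")))
    return answers
```

A query with no `ANSWER` relationship simply maps to an empty list, which is a convenient way to spot questions Textract could not answer on any page.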