artificial-intelligence text-extraction pdf-extraction custom-training

Custom training to extract a PDF table into structured data


I have a PDF file that includes a table, and I want to convert it into structured tabular data.

My PDF file includes a pretty complex table, which makes most tools insufficient. For example, I tried the following tools and none of them extracted it well: AWS Textract, Google Document AI, Google Vision, Microsoft Text Recognition. Google Document AI got about 70% right, but that is not good enough.

So I searched for a way to train a custom model so that this table is extracted properly. I tried Power Apps AI Builder and Google AutoML Entity Extraction, but neither helped (BTW, I wasn't sure what AutoML's purpose is: is it only for prediction, or can it also be used to customize table extraction?).

I would like to know which tools are good for my use case, and whether there is any (AI) tool that I can train on this kind of table so that the text extraction is better.



Solution

  • Most text extractors should preserve that structure if the page is rendered crisply enough, but layout recovery can be a fickle mistress.

    Here it correctly picked up the misspelling "reaar" but failed on the first line, at 05.05.1983

    [image: OCR output from the first pass]

    On an identical second pass the failures are different

     3      29.06.1983      Part of Ground Floor of       05.05.1983      GM315727
            2 (part of)     Conavon Court                25 years from
                                                         1.3.1983
     4      31.01.1984      Part of Third Floor Conavon   30.12.1983      GM335793
            4 (part of)     Court                        25 years from
                                                         12.8.1983
     5      19.04.1984      I?art of Basement Floor of     23.01.1984      GM342693
            l (part of), 2  Conavon C:ourt                25 years from
             (part of), 3                                 20.01.1984
             (part Of ) , 4
             (part of)
            NOTE: The Lease also grants a right of way for the purpose only of
            loading and unloading and reserves a right of way in case of emergency
            only from the  boiler house adjacent hereto
     6      14.06.1984      Part of Third Floor Conavon   31.10.1983      GM347623
            3 (part of)     Court                        25 years from
                                                         31.10.1983
     7      14.06.1984      Part of the Third Floor       31.10.1983      GM347623
            3 (part: of}, 4  Conavon Court                25 years from
             (part of)                                    31.10.1983
     8      01.10.1984      "The Italian Stallion''       17.08.1984      GM357142
            4 (part of)     Conavon Court (Basement)      25 years from
                                                         20.1.1984
            NOTE: The Lease also grants a right of way for the purpose only of
            loading and unloading and a right of access through the security door
            at the reaar of the building
     9      06.07.2016      3rd floor 14-16 Blackfriars   28.06.2016
            4 (part of}, 5  Streec                       5 years from
             (part of)                                    25/06/2016
    

    That's the beauty of OCR: every run can have a different success rate per character, so experience says to use a best-of-three estimate. Run the extraction three different ways, compare the results character by character, and keep the characters on which the runs agree.
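The best-of-three comparison can be sketched roughly as below. This is only a minimal illustration, not a specific tool's method: the function names `align_to` and `vote_characters` are made up for this example, the first run is arbitrarily treated as the reference, and the other runs are aligned to it with Python's standard `difflib`. It assumes you already have the same line of text from three OCR runs as plain strings.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List, Optional


def align_to(reference: str, other: str) -> List[Optional[str]]:
    """For each position in `reference`, return the matching character of
    `other`, "" if `other` has nothing there, or None if unclear (abstain)."""
    aligned: List[Optional[str]] = [None] * len(reference)
    matcher = SequenceMatcher(None, reference, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # The reference has extra characters that `other` lacks.
            for i in range(i1, i2):
                aligned[i] = ""
        elif tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            # Same-length spans can be paired position by position.
            for k in range(i2 - i1):
                aligned[i1 + k] = other[j1 + k]
        # Unequal-length replacements and pure insertions are left as None.
    return aligned


def vote_characters(runs: List[str]) -> str:
    """Majority vote per character, using runs[0] as the reference."""
    reference = runs[0]
    columns = [list(reference)] + [align_to(reference, r) for r in runs[1:]]
    result = []
    for i, ref_char in enumerate(reference):
        votes = Counter(col[i] for col in columns if col[i] is not None)
        winner, count = votes.most_common(1)[0]
        # Keep the reference character unless at least two runs agree otherwise.
        result.append(winner if count >= 2 else ref_char)
    return "".join(result)


if __name__ == "__main__":
    print(vote_characters(["Conavon C:ourt", "Conavon Court", "Conavon Court"]))
    print(vote_characters(["Streec", "Street", "Street"]))
```

Note the limits of this sketch: characters that are missing from the reference run but present in the other two are not recovered, and where the runs differ in length over a garbled span the reference is kept unchanged. In practice you would still eyeball the lines where the three runs disagree.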