I have a PDF file that includes a table, and I want to convert it into structured table data.
My PDF file includes a pretty complex table, which makes most tools insufficient. For example, I tried the following tools and they didn't extract it well: AWS Textract, Google AI Document, Google Vision, Microsoft Text Recognition.
Actually, Google AI Document managed to get about 70% correct, but that is not good enough.
So I searched for a way to train a custom model so that it extracts this table properly. I tried Power Apps AI Builder and Google AutoML Entity Extraction, but neither of them helped (BTW, I wasn't sure what AutoML's purpose is: is it only for prediction, or can it also be used to customize table extraction?).
I would like to know which tools are good for my use case, and whether there is any (AI) tool that I can train on this kind of table so that the text extraction will be better.
Most text extractors should hold that structure if it is rendered crisply enough, but layout can be a fickle mistress.
Here it correctly picked up the misspelling of "reaar" but failed in the first line on 05.05.1983.
On an identical second pass the failings are different:
3 29.06.1983 Part of Ground Floor of 05.05.1983 GM315727
2 (part of) Conavon Court 25 years from
1.3.1983
4 31.01.1984 Part of Third Floor Conavon 30.12.1983 GM335793
4 (part of) Court 25 years from
12.8.1983
5 19.04.1984 I?art of Basement Floor of 23.01.1984 GM342693
l (part of), 2 Conavon C:ourt 25 years from
(part of), 3 20.01.1984
(part Of ) , 4
(part of)
NOTE: The Lease also grants a right of way for the purpose only of
loading and unloading and reserves a right of way in case of emergency
only from the boiler house adjacent hereto
6 14.06.1984 Part of Third Floor Conavon 31.10.1983 GM347623
3 (part of) Court 25 years from
31.10.1983
7 14.06.1984 Part of the Third Floor 31.10.1983 GM347623
3 (part: of}, 4 Conavon Court 25 years from
(part of) 31.10.1983
8 01.10.1984 "The Italian Stallion'' 17.08.1984 GM357142
4 (part of) Conavon Court (Basement) 25 years from
20.1.1984
NOTE: The Lease also grants a right of way for the purpose only of
loading and unloading and a right of access through the security door
at the reaar of the building
9 06.07.2016 3rd floor 14-16 Blackfriars 28.06.2016
4 (part of}, 5 Streec 5 years from
(part of) 25/06/2016
That's the beauty of OCR: every run can have a different pass rate per character, so experience says to use a best-of-three estimate. Run the OCR three different ways, compare the results character by character, and keep the characters on which the runs agree.
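A minimal sketch of that best-of-three comparison in Python, as a per-position majority vote. It assumes the three runs of a line are already aligned (same length, same character positions), which real OCR output often is not; misaligned runs would need an alignment step first. The function name `vote_chars` is hypothetical. The first run is a line from the output above, while the second and third runs are made-up examples of what other passes might return.

```python
from collections import Counter
from itertools import zip_longest


def vote_chars(runs, unknown="?"):
    """Majority vote per character position across OCR runs of one line.

    Assumes the runs are position-aligned. A position with no strict
    majority is replaced by the `unknown` marker for manual review.
    """
    out = []
    # zip_longest pads shorter runs with "" so no run is truncated.
    for chars in zip_longest(*runs, fillvalue=""):
        best, count = Counter(chars).most_common(1)[0]
        out.append(best if count > len(runs) / 2 else unknown)
    return "".join(out)


runs = [
    "4 (part of}, 5 Streec",  # line taken from the OCR output above
    "4 (part of), 5 Street",  # hypothetical second pass
    "4 (part of), 5 Strcet",  # hypothetical third pass
]
print(vote_chars(runs))
```

Each pass gets a different character wrong, so the vote recovers the line even though no single run is guaranteed to be clean. Positions where all three disagree come out as `?` and are the ones worth checking by eye.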