[SOLVED] Extract / Identify Tables from PDF python

Extract / Identify Tables from PDF python

Are there any open source libraries that support table identification & extraction?

By this I mean:

Identify a table structure exists
Classify the table from its contents
Extract data from the table in a useful output format e.g. JSON / CSV etc.

I have looked through similar questions on this topic and found the following:

PDFMiner which addresses problem 3, but it seems the user is required to specify to PDFMiner where a table structure exists for each table (correct me if I'm wrong)
pdf-table-extract which attempts to address problem 1 but according to the To-Do list, cannot currently identify tables that are separated by whitespace. This is a problem as all tables in my PDFs are separated by whitespace!

Currently, I am thinking that I would have to spend a lot of time developing a Machine Learning solution to identify table structures from PDFs. Therefore, any alternative approaches would be more than welcome!

Solution

You should definitely have a look at this answer of mine:

Extracting table contents from a collection of PDF files

and also have a look at all the links included therein.

Tabula/TabulaPDF is currently the best table extraction tool that is available for PDF scraping.