
Extract / identify tables from a PDF with Python


Are there any open source libraries that support table identification & extraction?

By this I mean:

  1. Identify that a table structure exists
  2. Classify the table from its contents
  3. Extract data from the table in a useful output format e.g. JSON / CSV etc.
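To make steps 2 and 3 concrete, here is a minimal sketch of what I have in mind. It assumes step 1 is already solved and the table arrives as a list of rows of strings. The keyword sets and function names are hypothetical placeholders for illustration only, not part of any library.

```python
import csv
import io
import json

# Hypothetical keyword sets used to classify a table by its header row.
TABLE_KEYWORDS = {
    "invoice": {"item", "quantity", "price"},
    "schedule": {"date", "time", "event"},
}


def classify_table(rows):
    """Return the label whose keywords overlap most with the header row."""
    header = {cell.strip().lower() for cell in rows[0]}
    best_label, best_score = "unknown", 0
    for label, keywords in TABLE_KEYWORDS.items():
        score = len(header & keywords)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def table_to_json(rows):
    """Serialize the rows as a JSON list of objects keyed by the header row."""
    header, *body = rows
    return json.dumps([dict(zip(header, row)) for row in body])


def table_to_csv(rows):
    """Serialize the rows, header included, as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Example table as it might look after extraction (made-up data).
rows = [["Item", "Quantity", "Price"], ["Widget", "2", "9.99"]]
label = classify_table(rows)
as_json = table_to_json(rows)
as_csv = table_to_csv(rows)
```

The hard part is still step 1: getting `rows` out of the PDF in the first place.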

I have looked through similar questions on this topic and found the following:

My current thinking is that I would have to spend a lot of time developing a machine learning solution to identify table structures in PDFs, so any alternative approaches would be more than welcome!


Solution

  • You should definitely have a look at this answer of mine:

    and also have a look at all the links included therein.

    Tabula/TabulaPDF is currently the best table extraction tool available for PDF scraping.
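
Since the question is tagged Python, here is a minimal sketch of calling Tabula from Python. It assumes the `tabula-py` wrapper package (with `pandas`) is installed and a Java runtime is available, neither of which is mentioned above. The helper names `extract_tables` and `tables_to_csv` are made up for this example.

```python
def extract_tables(pdf_path, pages="all"):
    """Return every table Tabula detects in the PDF as a list of pandas DataFrames."""
    # Imported lazily: tabula-py is a third-party package that needs a Java runtime.
    import tabula

    return tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)


def tables_to_csv(tables):
    """Convert each extracted table (a DataFrame) to CSV text without the index column."""
    return [table.to_csv(index=False) for table in tables]
```

A call such as `tables_to_csv(extract_tables("report.pdf"))` (the file name is a placeholder) would then give one CSV string per detected table. How well the tables are detected depends on the PDF, so check the output against the source document.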