python pdf tabula python-camelot

tabula vs camelot for table extraction from PDF


I need to extract tables from PDFs. The tables can be of any type: multiple headers, vertical headers, horizontal headers, etc.

I have implemented the basic use case for both libraries and found that tabula does a bit better than camelot, but it still does not detect all tables perfectly, and I am not sure whether it will work for every kind of table.

So I am seeking suggestions from experts who have implemented a similar use case.

Example PDFs: PDF1 PDF2 PDF3

Tabula Implementation:

import tabula
tab = tabula.read_pdf('pdfs/PDF1.pdf', pages='all')
for t in tab:
    print(t, "\n=========================\n")

Camelot Implementation:

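import camelot
# split_text=True splits text that spans several cells into those cells
tables = camelot.read_pdf('pdfs/PDF1.pdf', pages='all', split_text=True)
print(tables)  # summary of the returned TableList
for table in tables:
    print(table.df, "\n=================================\n")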

Solution

  • Please read this: https://camelot-py.readthedocs.io/en/master/#why-camelot

    The main advantage of Camelot is that it exposes many parameters through which you can tune the extraction.

    Applying these parameters well takes some study and several attempts on your own PDFs.

    The page linked above also points to a comparison of Camelot with other PDF table extraction libraries.
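As a rough illustration of what "tuning the parameters" can look like, the sketch below runs Camelot with both of its parsing flavors (`lattice` for tables with ruled lines, `stream` for whitespace-separated tables) and keeps whichever run reports the higher average accuracy in `parsing_report`. The helper names `mean_accuracy`, `pick_best`, and `compare_flavors` are hypothetical, the values `line_scale=40` and `row_tol=10` are only starting points, and selecting by mean accuracy is an assumption of this sketch rather than anything Camelot prescribes.

```python
def mean_accuracy(reports):
    """Average the 'accuracy' field over a list of parsing_report dicts."""
    scores = [r["accuracy"] for r in reports]
    return sum(scores) / len(scores) if scores else 0.0


def pick_best(reports_by_label):
    """Return the label whose reports have the highest mean accuracy."""
    return max(reports_by_label, key=lambda label: mean_accuracy(reports_by_label[label]))


def compare_flavors(path, pages="all"):
    """Hypothetical helper: try both Camelot flavors and keep the better run."""
    import camelot  # imported here so the helpers above work without Camelot installed

    runs = {
        # lattice: relies on ruling lines; line_scale helps detect shorter lines
        "lattice": camelot.read_pdf(path, pages=pages, flavor="lattice", line_scale=40),
        # stream: relies on whitespace; row_tol groups text that is vertically close into one row
        "stream": camelot.read_pdf(path, pages=pages, flavor="stream", row_tol=10),
    }
    reports = {label: [t.parsing_report for t in tables] for label, tables in runs.items()}
    best = pick_best(reports)
    return best, runs[best]


# Example call, using the path from the question:
# flavor, tables = compare_flavors("pdfs/PDF1.pdf")
```

The accuracy figure is Camelot's own estimate, so it is worth checking the winning tables by eye. A flavor that finds no tables scores 0 here and can never be selected over one that finds something.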