python pdf ocr python-camelot tabula-py

How to extract single-row table data from a PDF using Python?


I need to extract tabular data from PDFs. Some tables in the PDF consist of only a single row. I have been trying to extract the data using the Camelot library.

Code for extraction using Camelot:

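# Install first (run in a shell, not in Python):
#   pip install "camelot-py[cv]" tabula-py
import camelot

file = 'xyz.pdf'
# Parse every page with Camelot's default settings
tables = camelot.read_pdf(file, pages="all")
tables[6].df  # DataFrame of the seventh detected table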

The code above does not extract single-row tables.

For instance, in this PDF: https://www.nirfindia.org/nirfpdfcdn/2022/pdf/Engineering/IR-E-U-0306.pdf, Camelot does not detect the last table (under the heading Faculty Details), which consists of only one row.

Can someone suggest a workaround?


Solution

  • According to the Camelot docs, if you want to detect smaller lines, you should increase the line_scale parameter (default: 15). The short ruling lines around a single-row table are missed at the default value.

    In your case, this command works fine:

    tables = camelot.read_pdf(file, pages="all", line_scale=80)
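
If 80 does not suit another PDF, one option is to try a few line_scale values and keep the smallest one that finds enough tables, since the Camelot docs also caution that very large values can cause text to be detected as lines. The following is only a rough sketch: the helper names read_lattice and first_scale_with_enough_tables, the candidate scales, and the min_tables threshold are all assumptions made for illustration, not part of Camelot.

```python
def read_lattice(path, line_scale, pages="all"):
    """Read tables with Camelot's lattice flavor at the given line_scale."""
    import camelot  # imported lazily so the search helper has no hard dependency

    return camelot.read_pdf(path, pages=pages, flavor="lattice", line_scale=line_scale)


def first_scale_with_enough_tables(read, scales=(15, 30, 50, 80), min_tables=1):
    """Try each scale in order and return (scale, tables) for the first one
    that yields at least min_tables tables, or (None, None) if none does.

    `read` is any callable that takes a line_scale and returns a sized
    collection of tables (hypothetical helper, not a Camelot function).
    """
    for scale in scales:
        tables = read(scale)
        if len(tables) >= min_tables:
            return scale, tables
    return None, None


# Hypothetical usage; 'xyz.pdf' and min_tables=8 are placeholders:
# scale, tables = first_scale_with_enough_tables(
#     lambda s: read_lattice("xyz.pdf", s), min_tables=8
# )
```

Passing the reader in as a callable keeps the search loop independent of Camelot, so it can be checked with a stub before running it on real PDFs. Whether a given count of tables is the right stopping condition depends on the document; inspecting the extracted DataFrames is still advisable.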