python pandas python-camelot

PDF table to pandas data frame using camelot


I'm trying to find a simple way to get data from a PDF into a pandas DataFrame. Something like this:

import camelot
import pandas as pd

pdf = camelot.read_pdf("file1.pdf")

print(pdf[0].df)

I'm testing this with two different files, File 1 and File 2, but I can't get the data out of the second one. It has more columns, but I don't think that should be a problem.

Also, the only way I could get a table from File 2 was by using flavor="stream".

Result for File 1

Result for File 2


Solution

  • To extract the tables from the second file correctly, camelot has to process background lines. Pass process_background=True, which applies to the lattice flavor (the default), as in the following code:

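    import camelot

    # Lattice is the default flavor; process_background=True makes it also
    # detect table lines that are drawn in the background of the page.
    tables = camelot.read_pdf('file2.pdf', process_background=True)

    # Each extracted table exposes its contents as a pandas DataFrame via .df
    for table in tables:
        print(table.df)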
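
If you want to go one step further and end up with named columns, here is a minimal sketch. It assumes the first row of each extracted table holds the column names, which may not be true for your file. The helper names promote_header and read_lattice_tables are made up for this example and are not part of camelot.

```python
import pandas as pd


def promote_header(df: pd.DataFrame) -> pd.DataFrame:
    """Use the first row as column names and drop it from the data.

    Assumption: the first extracted row is the table header.
    """
    header = df.iloc[0].astype(str).str.strip()
    body = df.iloc[1:].reset_index(drop=True)
    body.columns = header.tolist()
    return body


def read_lattice_tables(path: str, pages: str = "1") -> list:
    """Hypothetical wrapper: extract tables with background lines processed."""
    # Imported here so promote_header can be used without camelot installed.
    import camelot

    tables = camelot.read_pdf(
        path, pages=pages, flavor="lattice", process_background=True
    )
    for table in tables:
        # parsing_report includes accuracy and whitespace metrics per table
        print(table.parsing_report)
    return [promote_header(table.df) for table in tables]
```

Calling read_lattice_tables('file2.pdf') should then give you a list of DataFrames, one per detected table. The parsing report is printed so you can judge whether the extraction looks trustworthy before relying on the promoted header.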