python pandas dataframe pdf tabula

How can I stop Tabula from automatically dropping empty columns?


I am trying to scrape data from a PDF so that I can reformat it and then insert it into a table in Oracle. I am using Tabula to read the PDF and convert it to a list of tables, but Tabula seems to drop a column from a table if that column holds only null values. Normally this wouldn't be an issue (the data is 'None' to begin with, so I don't care about preserving it), but because the null column is dropped in some tables and not in others, my code cannot tell which columns are which. For example, a table might go from:

0   1   2    3
x   x   n/a  x
x   x   n/a  x
x   x   n/a  x

to

0   1   2
x   x   x
x   x   x
x   x   x

There is no way to know at runtime which column has been dropped, so I can't just re-insert it in the right place.

The columns do not have any unique identifiers in the data. I can't just add a null column at the end because it is absolutely necessary that I keep the same ordering in the columns.

I have investigated the Tabula API, and while I found a number of handy guides for how to DROP null columns, I found nothing for ensuring that they stay present.

import tabula

dflist = tabula.read_pdf(path, pages='14-27', multiple_tables=True)
# dflist is a list of DataFrames, one per detected table
# dflist[0] is a single DataFrame

(Apologies for the poor formatting; I am unfamiliar with Stack Overflow spacing.)

Expected results:

0   1    2   3
X   NaN  X   X
X   NaN  X   X
X   NaN  X   NaN

Actual results:

0   1   2
X   X   X
X   X   X
X   X   NaN

Solution

  • UPDATE: The best solution I could find was adjusting the 'lattice' setting, which determines how Tabula reads tables (the documentation is on their site). Unfortunately, this setting also offset some of the rows in my PDF, so I couldn't use it. I had to give up on making the process entirely automated, and I now use a staging table where a human checks which columns have been dropped.
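
For anyone trying the same approach, here is a rough sketch of the two steps described in the update: reading with `lattice=True`, and flagging tables that came back narrower than expected so that only those go to manual review. The helper names `read_tables` and `find_short_tables` are made up for illustration, and the expected width of 4 columns is an assumption taken from the example in the question. The DataFrames at the bottom are stand-in data, not real Tabula output.

```python
import pandas as pd


def read_tables(path, pages, lattice=True):
    """Read tables with tabula-py; lattice mode uses ruling lines as cell borders."""
    import tabula  # imported lazily: tabula-py needs a Java runtime installed
    return tabula.read_pdf(path, pages=pages, multiple_tables=True, lattice=lattice)


def find_short_tables(tables, expected_cols):
    """Return (index, column_count) for each table whose width is not expected_cols."""
    return [
        (i, df.shape[1])
        for i, df in enumerate(tables)
        if df.shape[1] != expected_cols
    ]


# Stand-in data shaped like the question's example: the second table lost a column.
full = pd.DataFrame([["x", None, "x", "x"]] * 3)
short = pd.DataFrame([["x", "x", "x"]] * 3)

flagged = find_short_tables([full, short], expected_cols=4)
print(flagged)  # [(1, 3)]
```

With a real PDF, the call would look like `find_short_tables(read_tables(path, '14-27'), expected_cols=4)`. This only detects that a column is missing, not which one, so the flagged tables still need the manual check. If the column positions are the same on every page, tabula-py's `columns` argument (x-coordinates of column boundaries) may also be worth testing, though I have not verified that it keeps empty columns for this kind of PDF.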