python-3.x pdf tabula-py

Tabula-Py getting confused with column names


I have a PDF with some text at the top of the first page, followed by a table that runs through the rest of the document (156 pages). I want to extract this table to CSV. I have done this successfully with the Tabula web utility. Once the output there looked right (no stray columns or data), I downloaded both the CSV and the script (.sh file). On opening the script, I found the area values I need to use, and I passed the same values in my Python script as follows:

import tabula

firstPage = [317.209, 7.066, 800.647, 589.422]  # area for the first page
areaList = [firstPage]
areaList.extend([[28.634, 8.553, 834.859, 584]] * 155)  # * 155 for the remaining 155 pages
df = tabula.read_pdf(r'input_data/bank_trans.pdf', output_format='dataframe', pages='all',
                     area=areaList, multiple_tables=False, stream=True, guess=False, silent=True)

The error:

    raise CSVParseError(message, e)
tabula.errors.CSVParseError: Error failed to create DataFrame with different column tables.
Try to set `multiple_tables=True`or set `names` option for `pandas_options`. 
, caused by ParserError('Error tokenizing data. C error: Expected 5 fields in line 65087, saw 6\n')

How do I get to line 65087? I checked the CSV in Excel and it has only about 6k rows, but the error points at a row around 65k.
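The line number in the ParserError refers to the CSV text that pandas was asked to parse during this call, which need not be the same as the file exported from the web utility. A minimal sketch for locating rows with an unexpected number of fields, assuming the raw CSV text is available (for example, written to disk with tabula.convert_into using the same options); the helper name find_ragged_rows is hypothetical:

```python
import csv
import io


def find_ragged_rows(csv_text, expected_fields=None):
    """Return (line_number, field_count) for rows whose width differs from the expected one."""
    reader = csv.reader(io.StringIO(csv_text))
    ragged = []
    for row in reader:
        if expected_fields is None:
            # No width given: treat the first row as the reference.
            expected_fields = len(row)
        if len(row) != expected_fields:
            # reader.line_num is the 1-based source line the row ended on.
            ragged.append((reader.line_num, len(row)))
    return ragged


sample = "a,b,c\n1,2,3\n4,5,6,7\n"
print(find_ragged_rows(sample))  # [(3, 4)]
```

Printing the reported lines from the raw text shows which part of the table produced the extra field.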


Solution

  • I solved it by passing the x-coordinates of the column boundaries in the PDF, e.g. columns = [1.0, 3.0, 7.0, 9.2]
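As a rough sketch of how that might look in a full call: the coordinates in COLUMN_EDGES and the helper names read_options and extract are made up for illustration, so measure the real boundaries from your own PDF. The columns keyword is the tabula.read_pdf option for the x-coordinates of column boundaries; the other options are carried over from the question.

```python
# Hypothetical x-coordinates (in PDF points) of the boundaries between columns.
# Replace them with values measured from your own document.
COLUMN_EDGES = [60.0, 250.0, 380.0, 480.0]


def read_options(column_edges):
    """Build keyword arguments for tabula.read_pdf with fixed column boundaries."""
    return dict(
        pages="all",
        stream=True,
        guess=False,            # do not let tabula guess the table layout
        multiple_tables=False,
        columns=sorted(column_edges),  # boundaries in left-to-right order
    )


def extract(pdf_path, column_edges=COLUMN_EDGES):
    """Read the table from pdf_path using explicit column boundaries."""
    import tabula  # imported lazily; needs tabula-py and a Java runtime

    return tabula.read_pdf(pdf_path, **read_options(column_edges))
```

With explicit boundaries, every row is split at the same positions, so rows should come out with a consistent number of fields. A call would look like extract(r'input_data/bank_trans.pdf').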