python python-3.x pdf tabular

How do I prevent losing a row when extracting a table from a PDF that spans multiple pages?


I have a PDF table with a total of 33 rows, although this number can change. The table continues onto a second page, so it is extracted as two separate tables.

My goal is to take all items in columns 0, 2, and 3 and add them to three separate lists. I have been able to get this working, but I noticed that one row is missing from table 2: the very first row on the second page.

My current Python script looks like:

import tabula

file_path = "address.pdf"
tables = tabula.read_pdf(file_path, pages="all", multiple_tables=True)

full_range_index = 0
full_range = []

starting_range_index = 2
starting_range = []

ending_range_index = 3
ending_range = []

table_one_row_count = 27
table_two_row_count = 6

# for i in range(table_one_row_count):
#     extracted_row = tables[0].iloc[i].values.tolist()

#     full_range.append(extracted_row[full_range_index])
#     starting_range.append(extracted_row[starting_range_index])
#     ending_range.append(extracted_row[ending_range_index])

for i in range(table_two_row_count):
    extracted_row = tables[1].iloc[i].values.tolist()

    full_range.append(extracted_row[full_range_index])
    starting_range.append(extracted_row[starting_range_index])
    ending_range.append(extracted_row[ending_range_index])


print(full_range)

For example, full_range should look like ['one', 'two', 'three', 'four', 'five', 'six'], but it actually comes out as [nan, 'two', 'three', 'four', 'five', 'six'].

Is there something I can do to not lose the first row on the second page/table?


Solution

  • I think the problem is that Tabula (via pandas) treats the first row of the table on the second page as a header instead of as data. Let's try telling pandas not to treat any row as a header by passing pandas_options={'header': None}.

    import tabula
    
    file_path = "address.pdf"
    tables = tabula.read_pdf(file_path, pages="all", multiple_tables=True, pandas_options={'header': None})
    
    full_range_index = 0
    full_range = []
    
    starting_range_index = 2
    starting_range = []
    
    ending_range_index = 3
    ending_range = []
    
    # Hard-coded row counts are no longer needed; each table is iterated in full.
    
    for table in tables:
        for i in range(len(table)):
            extracted_row = table.iloc[i].values.tolist()
    
            full_range.append(extracted_row[full_range_index])
            starting_range.append(extracted_row[starting_range_index])
            ending_range.append(extracted_row[ending_range_index])
    
    print(full_range)
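
The loop above could also be factored into a small helper so that neither row counts nor the number of page tables is hard-coded. This is only a minimal sketch: `collect_columns`, `is_missing`, and `skip_header_rows` are hypothetical names, not part of tabula or pandas, and the sample rows are stand-in data. It assumes each table has already been converted to a list of rows, for example with `table.values.tolist()` on each DataFrame returned by `tabula.read_pdf`. Note that with `header=None`, a genuine header row on the first page (if the PDF has one) would show up as a data row, which is what `skip_header_rows` is meant to cover.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (how pandas marks empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def collect_columns(tables, column_indexes, skip_header_rows=0):
    """Gather the chosen columns from every table into one list per column.

    tables: iterable of tables, each a list of rows (lists of cell values).
    column_indexes: positions of the columns to keep, e.g. (0, 2, 3).
    skip_header_rows: rows to drop from the start of the first table only.
    """
    columns = [[] for _ in column_indexes]
    for table_number, rows in enumerate(tables):
        start = skip_header_rows if table_number == 0 else 0
        for row in rows[start:]:
            for out, index in zip(columns, column_indexes):
                out.append(row[index])
    return columns


# Stand-in data shaped like two page tables; real input would come from
# [t.values.tolist() for t in tables] on the tabula result.
page_one = [["one", "x", "1", "10"], ["two", "x", "11", "20"]]
page_two = [["three", "x", "21", "30"], [float("nan"), "x", "31", "40"]]

full_range, starting_range, ending_range = collect_columns(
    [page_one, page_two], (0, 2, 3)
)

# Positions where the first column is empty, useful for spotting lost values.
rows_missing_first_cell = [i for i, v in enumerate(full_range) if is_missing(v)]
```

Checking `rows_missing_first_cell` after extraction is a quick way to confirm whether any value was still lost at a page boundary.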