I have a PDF table with 33 rows in total, although this number can change. The table continues onto a second page, so it looks like two separate tables.
My goal is to take all items in columns 0, 2, and 3 and add them to three separate lists. I have been able to get this working, but I noticed that one row is missing from table 2: the very first row on the second page.
My current Python script looks like this:
import tabula
file_path = "address.pdf"
tables = tabula.read_pdf(file_path, pages="all", multiple_tables=True)
full_range_index = 0
full_range = []
starting_range_index = 2
starting_range = []
ending_range_index = 3
ending_range = []
table_one_row_count = 27
table_two_row_count = 6
# for i in range(table_one_row_count):
#     extracted_row = tables[0].iloc[i].values.tolist()
#     full_range.append(extracted_row[full_range_index])
#     starting_range.append(extracted_row[starting_range_index])
#     ending_range.append(extracted_row[ending_range_index])
for i in range(table_two_row_count):
    extracted_row = tables[1].iloc[i].values.tolist()
    full_range.append(extracted_row[full_range_index])
    starting_range.append(extracted_row[starting_range_index])
    ending_range.append(extracted_row[ending_range_index])
print(full_range)
An example of what full_range should look like is ['one', 'two', 'three', 'four', 'five', 'six'], however it looks like [nan, 'two', 'three', 'four', 'five', 'six'].
Is there something I can do to not lose the first row on the second page/table?
I think the problem is that Tabula treats the first row of the table on the second page as a header rather than as data.
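The header behaviour can be seen in miniature with pandas alone. This is only an analogy using pandas.read_csv on made-up data, not a reproduction of what Tabula does internally: by default the first row is taken as column names and drops out of the data, while header=None keeps it as an ordinary row.

```python
import io

import pandas as pd

# Hypothetical two-row table with no header line, like a continuation page.
raw = "one,a,b,c\ntwo,d,e,f\n"

# Default: the first row is consumed as the column names.
default = pd.read_csv(io.StringIO(raw))

# header=None: every row, including the first, is kept as data.
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(len(default), list(default.columns))
print(len(no_header), no_header.iloc[0].tolist())
```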
Let's try a solution that uses pandas options to ignore headers.
import tabula
file_path = "address.pdf"
tables = tabula.read_pdf(file_path, pages="all", multiple_tables=True, pandas_options={'header': None})
full_range_index = 0
full_range = []
starting_range_index = 2
starting_range = []
ending_range_index = 3
ending_range = []
table_one_row_count = 27
table_two_row_count = 6
for table in tables:
    for i in range(len(table)):
        extracted_row = table.iloc[i].values.tolist()
        full_range.append(extracted_row[full_range_index])
        starting_range.append(extracted_row[starting_range_index])
        ending_range.append(extracted_row[ending_range_index])
print(full_range)
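One side effect to watch for: with header=None, a genuine header row on the first page is also returned as data. The sketch below shows one way to handle that while avoiding hard-coded row counts. The helper collect_columns and its skip_rows parameter are hypothetical names introduced here, and the sample rows are made-up stand-ins for what Tabula would return.

```python
def collect_columns(tables_rows, indices, skip_rows=0):
    """Gather the requested column positions from every row of every table.

    tables_rows: one list of rows per extracted table, e.g.
                 [t.values.tolist() for t in tables] for Tabula DataFrames.
    skip_rows:   number of leading rows to drop from the first table only
                 (for example a real header row on page one).
    """
    columns = [[] for _ in indices]
    for table_number, rows in enumerate(tables_rows):
        start = skip_rows if table_number == 0 else 0
        for row in rows[start:]:
            for column, index in zip(columns, indices):
                column.append(row[index])
    return columns


# Hypothetical stand-in data for two pages; with a real PDF this would be
# [t.values.tolist() for t in tables] from the header=None call above.
page_one = [["Range", "Note", "Start", "End"],
            ["alpha", "-", "a1", "a9"]]
page_two = [["one", "-", "o1", "o9"],
            ["two", "-", "t1", "t9"]]

full_range, starting_range, ending_range = collect_columns(
    [page_one, page_two], indices=(0, 2, 3), skip_rows=1
)
print(full_range)  # ['alpha', 'one', 'two']
```

Because the loop runs over however many rows each table has, the lists stay correct if the total row count changes. Set skip_rows=0 if the first page has no header row.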