Can someone help me to extract multiples tables from ONE pdf file. I have 5 pages, every page have a table with same header column exp:
Table exp in every page
student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4
I want to extract all this tables in one dataframe, First i did
df = tabula.read_pdf(file_path,pages='all',multiple_tables=True)
But i got a messy output so i try this lines of code that looks like this :
[student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4 ,student Score Rang
Maxim 43 34
Nourah 93 5]
so i edited my code like this import pandas as pd import tabula
file_path = "filePath.pdf"
# read my file
df1 = tabula.read_pdf(file_path,pages=1,multiple_tables=True)
df2 = tabula.read_pdf(file_path,pages=2,multiple_tables=True)
df3 = tabula.read_pdf(file_path,pages=3,multiple_tables=True)
df4 = tabula.read_pdf(file_path,pages=3,multiple_tables=True)
df5 = tabula.read_pdf(file_path,pages=5,multiple_tables=True)
It give me a dataframe for each table but i don't how to regroup it into one single dataframe and any other solution to avoid repeating the line of code.
According to the documentation of tabula, read_pdf
returns a list when passed the multiple_table=True
option.
Thus, you can use pandas.concat on its output to concatenate the dataframes:
df = pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))