python pandas dataframe pdf tabula

How to extract multiple tables from one PDF file using Pandas and tabula-py


Can someone help me extract multiple tables from ONE PDF file? I have 5 pages, and every page has a table with the same column headers. For example:

Example of the table on every page:

student  Score Rang
Alex     50     23
Julia    80     12
Mariana  94     4

I want to extract all these tables into one dataframe. First I did:

df = tabula.read_pdf(file_path,pages='all',multiple_tables=True)

But I got a messy output that looks like this:

[student  Score Rang
Alex     50     23
Julia    80     12
Mariana  94     4 ,student  Score Rang
Maxim    43     34
Nourah   93     5]

So I edited my code like this:

    import pandas as pd
    import tabula

    file_path = "filePath.pdf"

    # read each page of the file separately
    df1 = tabula.read_pdf(file_path,pages=1,multiple_tables=True)
    df2 = tabula.read_pdf(file_path,pages=2,multiple_tables=True)
    df3 = tabula.read_pdf(file_path,pages=3,multiple_tables=True)
    df4 = tabula.read_pdf(file_path,pages=4,multiple_tables=True)
    df5 = tabula.read_pdf(file_path,pages=5,multiple_tables=True)

It gives me a dataframe for each table, but I don't know how to regroup them into one single dataframe. Is there another solution that avoids repeating the same line of code?


Solution

  • According to the tabula-py documentation, read_pdf returns a list of DataFrames when passed the multiple_tables=True option.

    Thus, you can use pandas.concat on its output to concatenate the dataframes:

    df = pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))
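
To see what the concatenation does without needing the PDF, here is a minimal sketch. It assumes every page's table has the same header, and the two frames in `tables` are hand-built stand-ins for the list that read_pdf would return (they are not real tabula output). Passing `ignore_index=True` is an optional addition on my part: it renumbers the rows so each page's own 0-based index is not repeated in the result.

```python
import pandas as pd

# Hypothetical stand-ins for the list returned by
# tabula.read_pdf(file_path, pages='all', multiple_tables=True):
# one DataFrame per page, all sharing the same columns.
tables = [
    pd.DataFrame({"student": ["Alex", "Julia", "Mariana"],
                  "Score": [50, 80, 94],
                  "Rang": [23, 12, 4]}),
    pd.DataFrame({"student": ["Maxim", "Nourah"],
                  "Score": [43, 93],
                  "Rang": [34, 5]}),
]

# Stack the per-page tables vertically; ignore_index=True gives the
# combined frame a fresh 0..n-1 index instead of 0,1,2,0,1.
df = pd.concat(tables, ignore_index=True)
print(df)
```

With the real file, `tables` would simply be replaced by the read_pdf call from the answer above.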