I have a PDF with several pages, and I want to extract the table from every page and concatenate them all into one dataframe. Digging through Stack Overflow and other resources, I put together the code below, which successfully extracts the table from every page and prints each one as a dataframe. The next step is to concatenate these individual dataframes row-wise, so that I end up with one dataframe instead of several separate ones.
import pdfplumber
import pandas as pd

pdf_file = "df.pdf"
tables = []
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for i, pg in enumerate(pages):
        tbl = pages[i].extract_table()
        df = pd.DataFrame(tbl)
        print(f'{df}')
I'm stuck trying to figure out how to concatenate each of the dataframes in this loop instead of just printing them out, and would love any help. Thanks!
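Figured out how to do this. I was almost there and just needed to look through Stack Overflow to figure out how to append within a for loop. The code below builds a dataframe from the table on the first page, then merges the table from each remaining page into it on the shared 'category' column. Thanks.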
import pdfplumber
import pandas as pd

pdf_file = "data.pdf"

# Open the PDF once; the context manager closes it when the block ends.
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    # Create df from table on first page to act as the first df:
    tbl = pages[0].extract_table()
    original_df = pd.DataFrame(tbl, columns=["category", 0])
    # Merge in data from the remaining tables/pages, matching rows on 'category'.
    # Start at the second page so the first page is not merged in twice.
    for i, pg in enumerate(pages[1:], start=1):
        tbl = pg.extract_table()
        df = pd.DataFrame(tbl, columns=["category", i])
        original_df = original_df.merge(df, on='category')
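
Note that `merge` joins on the 'category' column, so each page's values end up side by side as new columns. For the row-wise concatenation the question originally asked about, a minimal sketch with `pd.concat` might look like the following. The helper names `concat_tables` and `extract_all_tables` and the sample data are made up for illustration, and the sketch assumes every page's table has the same number of columns and that `extract_table()` returns `None` for a page without a table.

```python
import pandas as pd


def concat_tables(tables):
    """Stack extracted tables (each a list of rows) into one DataFrame, row-wise."""
    # Skip pages where no table was found (assumed to come back as None).
    frames = [pd.DataFrame(tbl) for tbl in tables if tbl]
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, ... per page.
    return pd.concat(frames, ignore_index=True)


def extract_all_tables(pdf_file):
    """Hypothetical helper: return one extracted table per page of the PDF."""
    import pdfplumber  # imported here so concat_tables can be tried without a PDF

    with pdfplumber.open(pdf_file) as pdf:
        return [page.extract_table() for page in pdf.pages]


# Stand-in data shaped like the output of extract_table(): one list of rows per page.
sample_tables = [
    [["a", "1"], ["b", "2"]],
    None,  # a page with no table
    [["c", "3"]],
]
combined = concat_tables(sample_tables)
print(combined)
```

With a real file this would be called as `concat_tables(extract_all_tables("data.pdf"))`. Collecting the frames in a list and calling `pd.concat` once is generally cheaper than growing a dataframe inside the loop. `pd.concat` raises a `ValueError` if the list is empty, so a PDF with no tables at all would need a separate check.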