I'm writing a pdf to xlsx converter for my work documents with tabula package.
I need to extract tables from all the pages from PDF and insert them into xlsx file.
i use read_pdf with "all" pages and get 1 page table with column names:
and all other pages have 1,2,3,4 column names:
So my problem is that when i try to concat all the pages, they turn into a stairs from the 2nd page, like this:
So how can i fix this, so every page will concat one by one, without moving to the right?
The the concatenation would work as you expect, if both dataframes would have the same column names. So the first thought is to rename the second dataframe like df2.columns = df1.columns
. Then the concatenation should work. The problem here is that the columns of the second dataframe looks like valid data to me and this would be lost.
Here is a trick to save this line.
df2.loc[-1] = df2.columns
df2 = df2.sort_index().reset_index(drop=True)
df2.columns = df1.columns
pd.concat([df1, df2], ignore_index=True)
This should do it.
Example
Here is a small example. I think this matches your problem.
df1 = pd.DataFrame({'a':[1,2,3,4], 'b':[1,2,3,4]})
>>> df1
a b
0 1 1
1 2 2
2 3 3
3 4 4
df2 = pd.DataFrame({'1':[1,2,3,4], '2':[1,2,3,4]})
>>> df1
1 2 # << valid data here
0 1 1
1 2 2
2 3 3
3 4 4
df2.loc[-1] = df2.columns
df2 = df2.sort_index().reset_index(drop=True)
df2.columns = df1.columns
df = pd.concat([df1, df2], ignore_index=True)
>>> df
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 1 2 # this data was saved
5 1 1
6 2 2
7 3 3
8 4 4
Edit
If you have a list of DataFrames, because of multiple pages, you can loop over this list:
# df_list = [df1, ... dfn]
for i, item in enumerate(df_list[1:], start=1):
df_temp = item
df_temp.loc[-1] = df_temp.columns
df_temp = df_temp.sort_index().reset_index(drop=True)
df_temp.columns = df_list[0].columns
df_list[i] = df_temp
final_df = pd.concat(df_list, ignore_index=True)