pandasconcatenationtabula

Python, Pandas concat ladder


I'm writing a pdf to xlsx converter for my work documents with tabula package.

I need to extract tables from all the pages from PDF and insert them into xlsx file.

i use read_pdf with "all" pages and get 1 page table with column names:

enter image description here

and all other pages have 1,2,3,4 column names:

enter image description here

So my problem is that when i try to concat all the pages, they turn into a stairs from the 2nd page, like this:

enter image description here

So how can i fix this, so every page will concat one by one, without moving to the right?


Solution

  • The the concatenation would work as you expect, if both dataframes would have the same column names. So the first thought is to rename the second dataframe like df2.columns = df1.columns. Then the concatenation should work. The problem here is that the columns of the second dataframe looks like valid data to me and this would be lost.

    Here is a trick to save this line.

    df2.loc[-1] = df2.columns
    df2 = df2.sort_index().reset_index(drop=True)
    df2.columns = df1.columns
    pd.concat([df1, df2], ignore_index=True)
    

    This should do it.

    Example

    Here is a small example. I think this matches your problem.

    df1 = pd.DataFrame({'a':[1,2,3,4], 'b':[1,2,3,4]})
    >>> df1
       a  b
    0  1  1
    1  2  2
    2  3  3
    3  4  4
    df2 = pd.DataFrame({'1':[1,2,3,4], '2':[1,2,3,4]})
    >>> df1
       1  2 # << valid data here
    0  1  1
    1  2  2
    2  3  3
    3  4  4
    df2.loc[-1] = df2.columns
    df2 = df2.sort_index().reset_index(drop=True)
    df2.columns = df1.columns
    df = pd.concat([df1, df2], ignore_index=True)
    >>> df
       a  b
    0  1  1
    1  2  2
    2  3  3
    3  4  4
    4  1  2 # this data was saved
    5  1  1
    6  2  2
    7  3  3
    8  4  4
    

    Edit

    If you have a list of DataFrames, because of multiple pages, you can loop over this list:

    # df_list = [df1, ... dfn]
    for i, item in enumerate(df_list[1:], start=1):
        df_temp = item
        df_temp.loc[-1] = df_temp.columns
        df_temp = df_temp.sort_index().reset_index(drop=True)
        df_temp.columns = df_list[0].columns
        df_list[i] = df_temp
    
    final_df = pd.concat(df_list, ignore_index=True)