pythonpandasdataframe

Pandas Dataframe align/merge/join in similar way to text based file comparison


I want to create a dataframe that aligns the rows of two input dataframes in a similar manner to how a text comparison tool would align text when comparing two files.

For instance, consider the image below, where df_A and df_B are the two input dataframes. They both have triple index year, pos, and score. In this case there is an intersection between the two indexes, but that won't necessarily be the case.

Image of sample input and required output

I want to create a dataframe (or dataframes) that aligns the rows of df_A and df_B giving the output shown. The empty cells can contain NaN, or be empty. Note that this is not quite the same as the result that is generated by any of the standard Pandas merge, join, concat or align methods (as far as I can tell.)

For instance join (see image below) doesn't give the correct NaN/empty values for pos values 1 or 8. And it duplicates the values for pos 4 too many times. (Note that I understand what join is doing, and why it does it - it just isn't what I am needing to do here.)

Image of output from Pandas.DataFrame.join

The required output is more akin to what would be generated by a text comparison tool if each row of the dataframes was considered a line of text. That is, rows of each dataframe are "shuffled" down until the next matching line is found.

I can write a custom function to do this, but am wondering if I'm missing something that may be available in Pandas already.

import pandas as pd

data_A = {"year":[2023]*7,"pos":[1,2,4,4,4,8,8],"score":[15,20,30,30,30,60,60],"value":["a","b","c","c","c","d","d"]}
df_A = pd.DataFrame(data_A)
df_A = df_A.set_index(["year","pos","score"])

data_B = {"year":[2023]*9,"pos":[1,1,1,3,3,4,4,8,10],"score":[15,15,15,25,25,30,30,60,80],"value":["v","v","v","w","w","x","x","y","z"]}
df_B = pd.DataFrame(data_B)
df_B = df_B.set_index(["year","pos","score"])

df = pd.merge(df_A,df_B,on=["year","pos","score"],how="outer")
print(df)

df = df_A.join(df_B,how="outer",lsuffix='_left', rsuffix='_right')
print(df)

llll, rrrr = df_A.align(df_B)
print(llll)
print(rrrr)

Solution

  • No, I don't think you're missing anything in the pandas library. pandas likes it when dataframes have unique indexing per row. My first thought was to use pd.DataFrame.compare, but this doesn't work because of duplicate indexes a dataframe.

    With that in mind, I created a new index level making each row unique and doing the following:

    df_A1 = df_A.set_index(df_A.groupby(level=[0,1,2]).cumcount(), append=True)
    
    df_B1 = df_B.set_index(df_B.groupby(level=[0,1,2]).cumcount(), append=True)
    
    pd.concat([df_A1, df_B1], axis=1, join='outer').sort_index()
    

    Output:

    
                     value value
    year pos score              
    2023 1   15    0     a     v
                   1   NaN     v
                   2   NaN     v
         2   20    0     b   NaN
         3   25    0   NaN     w
                   1   NaN     w
         4   30    0     c     x
                   1     c     x
                   2     c   NaN
         8   60    0     d     y
                   1     d   NaN
         10  80    0   NaN     z
    

    Or using df.compare with reindexing each dataframe alike,

    indx = df_A1.index.union(df_B1.index)
    print(df_A1.reindex(indx).compare(df_B1.reindex(indx)))
    

    Output:

                     value      
                      self other
    year pos score              
    2023 1   15    0     a     v
                   1   NaN     v
                   2   NaN     v
         2   20    0     b   NaN
         3   25    0   NaN     w
                   1   NaN     w
         4   30    0     c     x
                   1     c     x
                   2     c   NaN
         8   60    0     d     y
                   1     d   NaN
         10  80    0   NaN     z