pythonpandasinterleave

Interleave 2 Dataframes on certain columns


I have 2 dataframes

df1:

StartLocation,StartDevice,StartPort,EndLocation,EndDevice,EndPort,LinkType,Speed 
DD1,Switch1,P1,AD1,Switch2,P2,MTP,1000
DD2,Switch2,P3,AD2,Switch3,P2,MTP,1000
DD3,Switch3,P5,AD3,Switch4,P6,MTP,1000

df2:

StartLocation,StartDevice,StartPort,EndLocation,EndDevice,EndPort
AB11,RU15,P1,AJ11,RU25,P2
AB12,RU18,P2,AB11,RU35,P2
AB13,RU19,P3,AB11,RU40,P4

I want to interleave the two dataframes and I've tried a few options but can't seem to get it to work. I'm close to the functionality with the below code but it doesn't join on the appropriate columns

import pandas as pd
from toolz import interleave

df3 = pd.DataFrame(interleave([df1.values, df2.values]), columns=df1)

Expected Output would look like

StartLocation,StartDevice,StartPort,EndLocation,EndDevice,EndPort,LinkType,Speed 
DD1,Switch1,P1,AD1,Switch2,P2,MTP,1000
AB11,RU15,P1,AJ11,RU25,P2,nan,nan
DD2,Switch2,P3,AD2,Switch3,P2,MTP,1000
AB12,RU18,P2,AB11,RU35,P2,nan,nan
DD3,Switch3,P5,AD3,Switch4,P6,MTP,1000
AB13,RU19,P3,AB11,RU40,P4,nan,nan

I think it should be pretty simple but I can't find the appropriate syntax. Can anyone give any ideas?

Thanks in advance for the help!


Solution

  • If columns names are same, only difference is some new columns names in one of DataFrame is possible use:

    df3 = pd.DataFrame(interleave([df1.values, df2.values]), columns=df1.columns)
    print (df3)
      StartLocation StartDevice StartPort EndLocation EndDevice EndPort LinkType  \
    0           DD1     Switch1        P1         AD1   Switch2      P2      MTP   
    1          AB11        RU15        P1        AJ11      RU25      P2     None   
    2           DD2     Switch2        P3         AD2   Switch3      P2      MTP   
    3          AB12        RU18        P2        AB11      RU35      P2     None   
    4           DD3     Switch3        P5         AD3   Switch4      P6      MTP   
    5          AB13        RU19        P3        AB11      RU40      P4     None   
    
        Speed  
    0  1000.0  
    1     NaN  
    2  1000.0  
    3     NaN  
    4  1000.0  
    5     NaN  
    

    More general solution working for any columns names is use DataFrame.align before for prevent correct align columns for each DataFrame:

    print (df1)
      EndDevice EndLocation EndPort LinkType  Speed StartDevice StartLocation  \
    0   Switch2         AD1      P2      MTP   1000     Switch1           DD1   
    1   Switch3         AD2      P2      MTP   1000     Switch2           DD2   
    2   Switch4         AD3      P6      MTP   1000     Switch3           DD3   
    
      StartPort  
    0        P1  
    1        P3  
    2        P5  
    
    print (df2)
      EndDevice EndLocation EndPort  LinkType  Speed StartDevice StartLocation  \
    0      RU25        AJ11      P2       NaN    NaN        RU15          AB11   
    1      RU35        AB11      P2       NaN    NaN        RU18          AB12   
    2      RU40        AB11      P4       NaN    NaN        RU19          AB13   
    
      StartPort  
    0        P1  
    1        P2  
    2        P3  
    

    df3 = pd.DataFrame(interleave([df1.values, df2.values]), columns=df1.columns)
    print (df3)
      EndDevice EndLocation EndPort LinkType   Speed StartDevice StartLocation  \
    0   Switch2         AD1      P2      MTP  1000.0     Switch1           DD1   
    1      RU25        AJ11      P2      NaN     NaN        RU15          AB11   
    2   Switch3         AD2      P2      MTP  1000.0     Switch2           DD2   
    3      RU35        AB11      P2      NaN     NaN        RU18          AB12   
    4   Switch4         AD3      P6      MTP  1000.0     Switch3           DD3   
    5      RU40        AB11      P4      NaN     NaN        RU19          AB13   
    
      StartPort  
    0        P1  
    1        P1  
    2        P3  
    3        P2  
    4        P5  
    5        P3  
    

    Another idea with Index.union and DataFrame.reindex:

    cols = df1.columns.union(df2.columns, sort=False)
    
    df1 = df1.reindex(cols, axis=1)
    df2 = df2.reindex(cols, axis=1)
    print (df1)
      StartLocation StartDevice StartPort EndLocation EndDevice EndPort LinkType  \
    0           DD1     Switch1        P1         AD1   Switch2      P2      MTP   
    1           DD2     Switch2        P3         AD2   Switch3      P2      MTP   
    2           DD3     Switch3        P5         AD3   Switch4      P6      MTP   
    
       Speed  
    0   1000  
    1   1000  
    2   1000  
    
    print (df2)
      StartLocation StartDevice StartPort EndLocation EndDevice EndPort  LinkType  \
    0          AB11        RU15        P1        AJ11      RU25      P2       NaN   
    1          AB12        RU18        P2        AB11      RU35      P2       NaN   
    2          AB13        RU19        P3        AB11      RU40      P4       NaN   
    
       Speed  
    0    NaN  
    1    NaN  
    2    NaN  
    

    df3 = pd.DataFrame(interleave([df1.values, df2.values]), columns=cols)
    print (df3)
      StartLocation StartDevice StartPort EndLocation EndDevice EndPort LinkType  \
    0           DD1     Switch1        P1         AD1   Switch2      P2      MTP   
    1          AB11        RU15        P1        AJ11      RU25      P2      NaN   
    2           DD2     Switch2        P3         AD2   Switch3      P2      MTP   
    3          AB12        RU18        P2        AB11      RU35      P2      NaN   
    4           DD3     Switch3        P5         AD3   Switch4      P6      MTP   
    5          AB13        RU19        P3        AB11      RU40      P4      NaN   
    
        Speed  
    0  1000.0  
    1     NaN  
    2  1000.0  
    3     NaN  
    4  1000.0  
    5     NaN