pythonpandasindexingcopy

Assigning column from different data frame - role of index


import pandas as pd

df_1 = pd.DataFrame({'col1': ['a', 'a', 'a']})
df_2 = pd.DataFrame({'col1': ['b', 'b', 'b']})
df_2.index = [4,5,6]

df_1['col2'] = df_2.col1

I expect a simple copy in the above example, but 'col2' in df_1 is all NAs. I find it strange. What is the rational for this choice? Similar example works differently in R.


Solution

  • The rationale is to keep the data aligned. The same way columns are aligned when you perform an operation.

    This enables to perform complex operations without having to worry about the order of the rows/columns like you would with numpy:

    A = pd.DataFrame({'blue': [1, 2], 'green': [3, 4]}, index=['circle', 'square'])
    B = pd.DataFrame({'green': [10, 20], 'blue': [30, 40]}, index=['square', 'circle'])
    
    A+B
    
            blue  green
    circle    41     23
    square    32     14
    

    Or to handle missing data:

    df = pd.DataFrame({'col1': ['a1', 'a2', 'a3']})
    df['col2'] = pd.Series(['b1', 'b3'], index=[0, 2])
    
    print(df)
    
      col1 col2
    0   a1   b1
    1   a2  NaN
    2   a3   b3
    

    Or to perform broadcasting:

    idx1 = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
                                      names=['letter', 'number'])
    df = pd.DataFrame({'col1': [1, 2, 3, 4]}, index=idx1)
    
    idx2 = pd.Index([1, 2], name='number')
    df['col2'] = df['col1'] * pd.Series([1, 10], index=idx2)
    
    print(df)
    
                   col1  col2
    letter number            
    a      1          1     1
           2          2    20
    b      1          3     3
           2          4    40
    

    If you want to avoid this you can convert to array:

    import pandas as pd
    
    df_1 = pd.DataFrame({'col1': ['a1', 'a2', 'a3']})
    df_2 = pd.DataFrame({'col1': ['b1', 'b2', 'b3']})
    df_2.index = [2,3,1]
    
    df_1['col2'] = df_2['col1']
    df_1['col3'] = df_2['col1'].values # or df_2['col1'].to_numpy()
    

    Output:

      col1 col2 col3
    0   a1  NaN   b1
    1   a2   b3   b2
    2   a3   b1   b3