pythonpandasassign

Copying a column from one DataFrame to another gives NaN values?


I have two DataFrames with the same number of rows - df1 like so:

        date     hour      var1
a   2017-05-01  00:00:00   456585
b   2017-05-01  01:00:00   899875
c   2017-05-01  02:00:00   569566
d   2017-05-01  03:00:00   458756
e   2017-05-01  04:00:00   231458
f   2017-05-01  05:00:00   986545

and df2 like so:

      MyVar1     MyVar2 
 0  6169.719338 3688.045368
 1  5861.148007 3152.238704
 2  5797.053347 2700.469871
 3  5779.102340 2730.471948
 4  6708.219647 3181.298291
 5  8550.380343 3793.580394

I want to merge the data from the date and hour columns of df1 into df2, to get a result like:

       MyVar1    MyVar2        date        hour
 0  6169.719338 3688.045368  2017-05-01  00:00:00
 1  5861.148007 3152.238704  2017-05-01  01:00:00
 2  5797.053347 2700.469871  2017-05-01  02:00:00
 3  5779.102340 2730.471948  2017-05-01  03:00:00
 4  6708.219647 3181.298291  2017-05-01  04:00:00
 5  8550.380343 3793.580394  2017-05-01  05:00:00

I tried simply assigning the columns like so:

df2['date'] = df1['date']
df2['hour'] = df1['hour']

but I get a result with NaN values in the date and hour columns instead:

       MyVar1    MyVar2      date       hour
 0  6169.719338 3688.045368  NaN        NaN
 1  5861.148007 3152.238704  NaN        NaN
 2  5797.053347 2700.469871  NaN        NaN

Why does this happen? How can I simply assign the values such that the data from the first row of df1 is shown in the first row of df2, etc.?


Solution

  • The culprit is unalignable indexes

    Your DataFrames' indexes are different (and correspondingly, the indexes for each columns), so when trying to assign a column of one DataFrame to another, pandas will try to align the indexes, and failing to do so, insert NaNs.

    Consider the following examples to understand what this means:

    # Setup
    A = pd.DataFrame(index=['a', 'b', 'c']) 
    B = pd.DataFrame(index=['b', 'c', 'd', 'f'])                                  
    C = pd.DataFrame(index=[1, 2, 3])
    
    # Example of alignable indexes - A & B (complete or partial overlap of indexes)
    A.index B.index
          a        
          b       b   (overlap)
          c       c   (overlap)
                  d
                  f
    
    # Example of unalignable indexes - A & C (no overlap at all)
    A.index C.index
          a        
          b        
          c        
                  1
                  2
                  3
    

    When there are no overlaps, pandas cannot match even a single value between the two DataFrames to put in the result of the assignment, so the output is a column full of NaNs.

    If you're working on an IPython notebook, you can check that this is indeed the root cause using,

    df1.index.equals(df2.index)
    # False
    df1.index.intersection(df2.index).empty
    # True
    

    You can use any of the following solutions to fix this issue.

    Solution 1: Assign the underlying NumPy arrays (bypassing pandas index alignment)

    This solution will only work if the lengths of the two DataFrames match (i.e. have same number of rows).

    df2['date'] = df1['date'].to_numpy()
    
    # (In pandas < 0.24, do df2['date'] = df1['date'].values )
    

    To assign multiple columns easily, use,

    df2[['date', 'hour']] = df1[['date', 'hour']].to_numpy()
    

    Solution 2: Reset both DataFrames' indexes

    For when the two DataFrames have different lengths. Also maybe preferable if you didn't mean to have different indices in the first place, or if you don't particularly care about preserving the index.

    # Optional, if you want a RangeIndex => [0, 1, 2, ...]
    # df1.index = pd.RangeIndex(len(df))
    # Homogenize the index values,
    df2.index = df1.index
    # Assign the columns.
    df2[['date', 'hour']] = df1[['date', 'hour']]
    

    If you want to keep the existing index, but as a column, you may use reset_index() instead.