python pandas merge eda

Combining two columns from two different data frames to remove missing values in Pandas


I am working on the Titanic dataset as my first project. To impute the missing values of the variable 'Age', I ran a linear regression model. I now have two dataframes, as follows -

train_data.tail()

          Survived  Pclass     Sex   Age  SibSp  Parch   Fare Embarked
    886         0       2    male  27.0      0      0  13.00        S
    887         1       1  female  19.0      0      0  30.00        S
    888         0       3  female   NaN      1      2  23.45        S
    889         1       1    male  26.0      0      0  30.00        C
    890         0       3    male  32.0      0      0   7.75        Q

imp_age.head()

          Age
    859  27.0
    863  -8.0
    868  27.0
    878  27.0
    888  23.0

The second dataframe above holds the age values that I want to use in place of the 'NaN' values in the first dataframe. In both dataframes this data sits in a column named 'Age', and the row index identifies which passenger each value belongs to.

I tried running the following code to get the merged df -

merged_df = train_data.merge(imp_age,how='outer',left_index=True,right_index=True)

But the output creates an additional 'Age_y' column instead of merging it with the old column -

     Survived  Pclass     Sex  Age_x  SibSp  Parch   Fare Embarked  Age_y
886         0       2    male   27.0      0      0  13.00        S    NaN
887         1       1  female   19.0      0      0  30.00        S    NaN
888         0       3  female    NaN      1      2  23.45        S   23.0
889         1       1    male   26.0      0      0  30.00        C    NaN
890         0       3    male   32.0      0      0   7.75        Q    NaN
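
For context, the 'Age_x' / 'Age_y' split is how merge handles a column name that exists on both sides: it keeps both and appends its default suffixes rather than combining them. If one wanted to continue from the merged frame, a minimal sketch might look like the following. The two small frames are made-up stand-ins that reuse a few values from the rows shown above, and how='left' is an assumption chosen here so that no extra rows can appear.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the question's frames (only a few columns and rows).
train_data = pd.DataFrame(
    {"Pclass": [2, 1, 3], "Age": [27.0, 19.0, np.nan]},
    index=[886, 887, 888],
)
imp_age = pd.DataFrame({"Age": [23.0]}, index=[888])

# Both frames have an 'Age' column, so merge keeps both as Age_x and Age_y.
merged_df = train_data.merge(imp_age, how="left", left_index=True, right_index=True)

# Collapse the pair: keep the original age, fall back to the imputed one.
merged_df["Age"] = merged_df["Age_x"].fillna(merged_df["Age_y"])
merged_df = merged_df.drop(columns=["Age_x", "Age_y"])
```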

Can someone please help me get the desired output shown below? I have gone back and forth on this a lot, but since I am new to Python, I am struggling a little -

     Survived  Pclass     Sex   Age  SibSp  Parch   Fare Embarked
886         0       2    male  27.0      0      0  13.00        S
887         1       1  female  19.0      0      0  30.00        S
888         0       3  female  23.0      1      2  23.45        S
889         1       1    male  26.0      0      0  30.00        C
890         0       3    male  32.0      0      0   7.75        Q

Solution

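  • Try fillna. When it is given a Series, it aligns on the index, so each NaN in train_data['Age'] is filled with the imp_age value that has the same row label, and no merge is needed: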

    train_data['Age'] = train_data['Age'].fillna(imp_age['Age'])
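
To see the index alignment at work, here is a small self-contained sketch. The two frames are toy stand-ins built from a few of the values shown in the question (the real frames are assumed to have the same shape of index and an 'Age' column). Note that the imputed value at label 878 has no matching row in this toy train_data, so it is simply ignored.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the question's frames (values copied from the rows above).
train_data = pd.DataFrame(
    {"Survived": [0, 1, 0, 1, 0], "Age": [27.0, 19.0, np.nan, 26.0, 32.0]},
    index=[886, 887, 888, 889, 890],
)
imp_age = pd.DataFrame({"Age": [27.0, 23.0]}, index=[878, 888])

# fillna with a Series matches rows by index label, not by position,
# and only touches entries that are currently NaN.
train_data["Age"] = train_data["Age"].fillna(imp_age["Age"])

print(train_data)
```

Existing ages are never overwritten, so the call is safe to repeat. Any NaN whose label is missing from imp_age stays NaN, which makes a quick train_data['Age'].isna().sum() afterwards a reasonable sanity check.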