pythonpandasdataframeinterpolationpandas-merge

After merging, how to interpolate only non-null intervals?


I have two sample dataframes:

df1 = pd.DataFrame({'Depth':[1100, 1110, 1120, 1130, 1140], 'GR':[40, 50, 60, np.nan, 70]})
df2 = pd.DataFrame({'Depth':[1100, 1112, 1118, 1128, 1138], 'VSH':[60, 70, np.nan, 40, 70]})
   Depth    GR
0   1100  40.0
1   1110  50.0
2   1120  60.0
3   1130   NaN
4   1140  70.0
   Depth   VSH
0   1100  60.0
1   1112  70.0
2   1118   NaN
3   1128  40.0
4   1138  70.0

The first data shows that 'GR' value is null from Depth 1120 till 1140. Likewise, in the second dataframe 'VSH' is null from Depth from 1112 till 1128. I want to join these datasets by outer join on 'Depth' and fill null values by interpolation where actually 'GR' is known. What I mean is that in the final joined dataset 'GR' values should not be interpolated (kept null) where Depth of 'GR' is between 1120 and 1140 because at this interval 'GR' is unknown as in the first dataset. Likewise, 'VSH' values between Depth of 1112 and 1128 should be null (not interpolated). The output after interpolation in specific intervals should be like:

merged_df = pd.DataFrame({'Depth':[1100, 1110, 1112, 1118, 1120, 1128, 1130, 1138, 1140], 'GR':[40, 50, 53.3, 56.6, 60, np.nan, np.nan, np.nan, 70], 'VSH':[60, 65, 70, np.nan, np.nan, 40, 55, 70, np.nan]})
   Depth    GR   VSH
0   1100  40.0  60.0
1   1110  50.0  65.0
2   1112  53.3  70.0
3   1118  56.6   NaN
4   1120  60.0   NaN
5   1128   NaN  40.0
6   1130   NaN  55.0
7   1138   NaN  70.0
8   1140  70.0  NaN

Note: I just approximately interpolated numbers for GR and VSH where they don't have corresponding Depth in df2 and df1 respectively.

How could you do it in pandas?


Solution

  • Another proposition with a real Depth interpolation:

    import scipy as sp
    
    # Create interpolation function
    gr = sp.interpolate.interp1d(df1['Depth'], df1['GR'], kind='linear', bounds_error=False)
    vsh = sp.interpolate.interp1d(df2['Depth'], df2['VSH'], kind='linear', bounds_error=False)
    
    # Compute interpolation for other depths
    df1a = pd.DataFrame({'Depth': df2['Depth'], 'GR': gr(df2['Depth'])})
    df2a = pd.DataFrame({'Depth': df1['Depth'], 'VSH': vsh(df1['Depth'])})
    
    # Final dataframe
    out = (pd.merge(pd.concat([df1, df1a]),
                    pd.concat([df2, df2a]))
             .drop_duplicates('Depth')
             .sort_values('Depth', ignore_index=True))
    

    Output:

    >>> out
       Depth    GR        VSH
    0   1100  40.0  60.000000
    1   1110  50.0  68.333333
    2   1112  52.0  70.000000
    3   1118  58.0        NaN
    4   1120  60.0        NaN
    5   1128   NaN  40.000000
    6   1130   NaN  46.000000
    7   1138   NaN  70.000000
    8   1140  70.0        NaN
    

    For GR-1112 and GR-1118, you expect 53.3 and 56.6 because there are two empty values between GR-1110 (50) and GR-1120 (60) but this doesn't take into account the actual value of depth. It makes more sense to me to interpolate based on Depth than the number of missing values, right?