I have two sample dataframes:
df1 = pd.DataFrame({'Depth':[1100, 1110, 1120, 1130, 1140], 'GR':[40, 50, 60, np.nan, 70]})
df2 = pd.DataFrame({'Depth':[1100, 1112, 1118, 1128, 1138], 'VSH':[60, 70, np.nan, 40, 70]})
Depth GR
0 1100 40.0
1 1110 50.0
2 1120 60.0
3 1130 NaN
4 1140 70.0
Depth VSH
0 1100 60.0
1 1112 70.0
2 1118 NaN
3 1128 40.0
4 1138 70.0
The first data shows that 'GR' value is null from Depth 1120 till 1140. Likewise, in the second dataframe 'VSH' is null from Depth from 1112 till 1128. I want to join these datasets by outer join on 'Depth' and fill null values by interpolation where actually 'GR' is known. What I mean is that in the final joined dataset 'GR' values should not be interpolated (kept null) where Depth of 'GR' is between 1120 and 1140 because at this interval 'GR' is unknown as in the first dataset. Likewise, 'VSH' values between Depth of 1112 and 1128 should be null (not interpolated). The output after interpolation in specific intervals should be like:
merged_df = pd.DataFrame({'Depth':[1100, 1110, 1112, 1118, 1120, 1128, 1130, 1138, 1140], 'GR':[40, 50, 53.3, 56.6, 60, np.nan, np.nan, np.nan, 70], 'VSH':[60, 65, 70, np.nan, np.nan, 40, 55, 70, np.nan]})
Depth GR VSH
0 1100 40.0 60.0
1 1110 50.0 65.0
2 1112 53.3 70.0
3 1118 56.6 NaN
4 1120 60.0 NaN
5 1128 NaN 40.0
6 1130 NaN 55.0
7 1138 NaN 70.0
8 1140 70.0 NaN
Note: I just approximately interpolated numbers for GR and VSH where they don't have corresponding Depth in df2 and df1 respectively.
How could you do it in pandas?
Another proposition with a real Depth interpolation:
import scipy as sp
# Create interpolation function
gr = sp.interpolate.interp1d(df1['Depth'], df1['GR'], kind='linear', bounds_error=False)
vsh = sp.interpolate.interp1d(df2['Depth'], df2['VSH'], kind='linear', bounds_error=False)
# Compute interpolation for other depths
df1a = pd.DataFrame({'Depth': df2['Depth'], 'GR': gr(df2['Depth'])})
df2a = pd.DataFrame({'Depth': df1['Depth'], 'VSH': vsh(df1['Depth'])})
# Final dataframe
out = (pd.merge(pd.concat([df1, df1a]),
pd.concat([df2, df2a]))
.drop_duplicates('Depth')
.sort_values('Depth', ignore_index=True))
Output:
>>> out
Depth GR VSH
0 1100 40.0 60.000000
1 1110 50.0 68.333333
2 1112 52.0 70.000000
3 1118 58.0 NaN
4 1120 60.0 NaN
5 1128 NaN 40.000000
6 1130 NaN 46.000000
7 1138 NaN 70.000000
8 1140 70.0 NaN
For GR-1112 and GR-1118, you expect 53.3 and 56.6 because there are two empty values between GR-1110 (50) and GR-1120 (60) but this doesn't take into account the actual value of depth. It makes more sense to me to interpolate based on Depth than the number of missing values, right?