I have two dataframes: one with a "full" time series and another with far fewer rows representing special points in time for an experiment with many trials. The time variable runs from 0 to some maximum within a given trial and then resets to 0 for the next trial. The data look like this: the full time series has a label for each trial, and t_coords (time) increases until trial changes to 2, at which point t_coords resets for the new trial and increases again until trial 3, and so on:
coords_df: full time series per trial
subject trial t_coords x_coords y_coords
0 1010 1 1239 6.50 0.50
1 1010 1 1351 6.50 0.50
2 1010 1 1918 6.38 0.52
3 1010 1 2031 6.28 0.53
4 1010 1 2143 6.16 0.55
5 1010 1 2256 6.06 0.57
6 1010 1 2370 5.95 0.58
7 1010 1 2486 5.83 0.59
8 1010 1 2600 5.73 0.57
9 1010 1 2709 5.61 0.54
full_navs1_df: special time points; the columns I need to use are t_coords (the object appear time) and trial
subject x_coords y_coords t_coords trial object_num
0 1010 10.50 3.50 8023 1 1
1 1010 3.50 9.50 68156 1 2
2 1010 8.50 11.50 82894 1 3
3 1010 4.50 2.50 116226 1 4
4 1010 10.50 6.50 131060 1 5
5 1010 5.50 5.50 161710 1 6
6 1010 3.50 10.50 8018 2 1
7 1010 9.50 10.50 51981 2 2
8 1010 2.50 6.50 106287 2 3
9 1010 7.50 6.50 123758 2 4
I need to find the index of the closest value to each of the special time points for each trial, so the answer also has to take trial into account. The values in the special time points df won't have exact matches in the full time series df, hence the need to find the closest value. I have been unsuccessful in figuring out how to perform a nearest-value lookup while also looping through each trial. By way of example, the first appear time in full_navs1_df['t_coords'] is 8023. The closest value in the coords_df['t_coords'] column for that trial appears at index 62 (not shown above, but the output for the 8023 value would be 62). That process would repeat for each appear time AND trial.
I've seen solutions and variants such as:
df.iloc[(df['num']-input).abs().argsort()[:2]]
but this doesn't handle iterating over multiple rows and matching on trial across two dataframes. Other solutions that use merge wouldn't work because there are multiple time series in the same column (the time resets every trial).
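For a single target value the same absolute-difference idea does work once the full series is restricted to one trial; a minimal sketch, assuming coords_df is already loaded as shown above and using the 8023 / index 62 example:

import pandas as pd

# Keep only the rows for the subject/trial the target time belongs to, then
# take the index of the row with the smallest absolute time difference.
trial_slice = coords_df[(coords_df['subject'] == 1010) & (coords_df['trial'] == 1)]
nearest_idx = (trial_slice['t_coords'] - 8023).abs().idxmin()
print(nearest_idx)  # expected: 62

The open question is how to do this for every row of the special time points df without an explicit loop over trials.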
Edit with solution attempts:
Reproducible Example:
df1 = pd.DataFrame({'subject': [1]*14 + [2]*7,
                    'trial': [1]*7 + [2]*7 + [1]*7,
                    't_coords': [1304, 1603, 3000, 3658, 4763, 5364, 6129, 1298, 1874, 3328,
                                 4192, 4783, 5439, 6193, 1307, 1787, 2599, 3675, 4783, 5362, 6126]})
subject trial t_coords
0 1 1 1304
1 1 1 1603
2 1 1 3000
3 1 1 3658
4 1 1 4763
5 1 1 5364
6 1 1 6129
7 1 2 1298
8 1 2 1874
9 1 2 3328
10 1 2 4192
11 1 2 4783
12 1 2 5439
13 1 2 6193
14 2 1 1307
15 2 1 1787
16 2 1 2599
17 2 1 3675
18 2 1 4783
19 2 1 5362
20 2 1 6126
df2 = pd.DataFrame({'subject': [1, 1, 1, 1, 2, 2],
                    'trial': [1, 1, 2, 2, 1, 1],
                    't_coords': [3230, 6090, 1909, 4801, 2499, 5400]})
subject trial t_coords
0 1 1 3230
1 1 1 6090
2 1 2 1909
3 1 2 4801
4 2 1 2499
5 2 1 5400
Expected output: the df1 indices whose t_coords are closest to each t_coords in df2, within the matching subject and trial:
index
0 2
1 6
2 8
3 11
4 16
5 19
Attempted the following:
df1_sorted = df1.sort_values(['subject', 'trial'])
df2_sorted = df2.sort_values(['subject', 'trial'])
time_indxs = pd.merge_asof(df2_sorted, df1_sorted, on='t_coords', direction = 'nearest').sort_values(['subject','trial'])
I get the error:
ValueError: left keys must be sorted
After looking into your example, I also couldn't figure it out with merge_asof.
Here is what I would do.
You basically go through each row of your second df, filter the first df down to the current group (subject and trial), and then take the absolute difference between the current t_coords and all times of the filtered first df. The row with the smallest difference is the one you are searching for, which is what idxmin() gives you.
find_index = (
    df2
    .apply(lambda x:
           # filter df1 to this row's subject/trial, then take the index of the
           # smallest absolute time difference
           (
               df1.loc[(df1['subject'] == x['subject']) & (df1['trial'] == x['trial']), 't_coords']
               - x['t_coords']
           )
           .abs()
           .idxmin(),
           axis=1)
)
print(find_index)
Output:
0 2
1 6
2 8
3 11
4 16
5 19
dtype: int64
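For completeness, merge_asof might also be made to work here; this is a sketch, not tested against your real data. The ValueError above comes from sorting on subject/trial rather than on the merge key: merge_asof expects both frames to be sorted by the on column, and its by argument keeps the matching within each subject/trial group. The column name df1_index below is just a hypothetical name for df1's preserved index.

# Keep df1's original index as a column so it can be read off after the merge,
# then sort both frames by the merge key.
df1_keyed = df1.rename_axis('df1_index').reset_index().sort_values('t_coords')
df2_keyed = df2.sort_values('t_coords')

# 'by' restricts matches to rows with the same subject and trial;
# direction='nearest' picks the closest t_coords on either side.
nearest = pd.merge_asof(df2_keyed, df1_keyed,
                        on='t_coords', by=['subject', 'trial'],
                        direction='nearest')
print(nearest.sort_values(['subject', 'trial', 't_coords'])['df1_index'].tolist())
# expected: [2, 6, 8, 11, 16, 19], matching the apply/idxmin result above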