I have the following dataframe:
data = [
[ID, '2022-04-23T03:36:26Z', 60, 10, 83],
[ID, '2022-04-23T03:37:30Z', Nan, Nan, Nan],
[ID, '2022-04-23T03:37:48Z', Nan, Nan, Nan],
[ID, '2022-04-23T03:38:24Z', 61, 11, 72],
[ID, '2022-04-23T03:44:20Z', 63, 13, 75],
[ID, '2022-04-23T03:45:02Z', Nan, Nan, Nan],
[ID, '2022-04-23T03:45:06Z', Nan, Nan, Nan],
[ID, '2022-04-23T03:45:08Z', Nan, Nan, Nan],
[ID, '2022-04-23T03:45:12Z', Nan, Nan, Nan],
[ID, '2022-04-23T03:45:48Z', 69, 15, 61]
]
df = pd.DataFrame(data=data,
columns=['ID', 'time', 'latitude', 'longitude', 'speed')
The problem is that for some rows I have only the time value e.g. row 2 and 3. For these rows, I want to calculate the average speed based on time, latitude and longitude of the row preceding (row 1) and following (row 4) the Nan speed rows.
For example, the speed value in row 2 and 3 should be an average speed value which is based on the travelled distance (maybe using Haversine formula) divided by the total amount of time ('2022-04-23T03:38:24Z' - '2022-04-23T03:36:26Z').
How can I write this in Python?
pandas.DataFrame.interpolate
may be what you're looking for if you're looking for a naive approach (there's other options if you're looking for something more specific just see the docs):
df[["latitude", "longitude", "speed"]] = df.interpolate() \
[["latitude", "longitude", "speed"]].round().astype(int)
Result:
ID time latitude longitude speed
0 ID 2022-04-23T03:36:26Z 60 10 83
1 ID 2022-04-23T03:37:30Z 60 10 79
2 ID 2022-04-23T03:37:48Z 60 10 75
3 ID 2022-04-23T03:38:24Z 61 11 72
4 ID 2022-04-23T03:44:20Z 63 13 75
5 ID 2022-04-23T03:45:02Z 64 13 72
6 ID 2022-04-23T03:45:06Z 65 13 69
7 ID 2022-04-23T03:45:08Z 66 14 66
8 ID 2022-04-23T03:45:12Z 67 14 63
9 ID 2022-04-23T03:45:48Z 69 15 61