How to fill missing values of a pandas dataframe based on the relationship between an existing preceeding row (predictions for a commodity), and an associated existing value in another column (actual values for a commodity).
I have a pandas dataframe with 10 columns and 40 rows. The columns are Date, Actual, time_from_actual_1, time_from_actual_2, time_from_actual_3...
up to time_from_actual_8
.
The Actual
column contains actual values for a commodity with hourly timestamps in the Date
column. The time_from_actual
columns are predictions forward in time for the same commodity. These are produced once per day, hence the pre-existing values at Index 0
and Index 1
. And so, there are 23 missing observations per day.
I would like to fill those missing values in a very specific way. I want the values for "time_from_actual" at index 1 up to 24 to follow the same pattern as the first column with regards to the differences between the different timesteps and the actual values.
I've succeeded doing this with a nested for loop
, but I would very much like to see suggestions for more elegant approaches. Below you will find a complete attempt with sample data, code and output. Thank you for any suggestions!
# imports
import pandas as pd
import numpy as np
# Random seed
np.random.seed(42)
# Sample data
data = {
'Date': pd.date_range(start='2023-01-01', periods=40, freq='H'),
'Actual': [100, 99.72, 101.02, 104.06, 103.60, 103.13, 106.29, 107.82, 106.88, 107.97,
107.04, 106.11, 106.59, 102.77, 99.32, 98.19, 96.17, 96.80, 94.98, 92.15,
95.09, 94.63, 94.77, 91.92, 90.83, 91.05, 88.75, 89.50, 88.30, 87.72,
86.51, 90.22, 90.19, 88.08, 89.72, 87.28, 87.70, 83.78, 81.12, 131.52],
'time_from_actual_1': [97] + [np.nan]*23 + [90] + [np.nan]*15,
'time_from_actual_2': [99] + [np.nan]*23 + [89] + [np.nan]*15,
'time_from_actual_3': [98] + [np.nan]*23 + [88] + [np.nan]*15,
'time_from_actual_4': [97] + [np.nan]*23 + [87] + [np.nan]*15,
'time_from_actual_5': [96] + [np.nan]*23 + [86] + [np.nan]*15,
'time_from_actual_6': [95] + [np.nan]*23 + [85] + [np.nan]*15,
'time_from_actual_7': [94] + [np.nan]*23 + [84] + [np.nan]*15,
'time_from_actual_8': [93] + [np.nan]*23 + [83] + [np.nan]*15,
}
# dataframe
df = pd.DataFrame(data)
# copy of the dataframe to reference original values only
original_df = df.copy()
# Fill missing values for columns starting with "time_from_actual"
time_cols = [col for col in df.columns if col.startswith('time_from_actual')]
for col in time_cols:
for i in range(1, len(df)):
if pd.isnull(df.loc[i, col]):
j = i
while j < len(df) and pd.isnull(original_df.loc[j, col]):
previous_actual = df.loc[j - 1, 'Actual']
previous_time = df.loc[j - 1, col]
current_actual = df.loc[j, 'Actual']
difference = previous_time - previous_actual
df.loc[j, col] = current_actual + difference
j += 1
For each column containing 'time_from_actual'
:
Create a group for each part of the dataframe where the first row is not null.
Fill the null values with df["Actual"].diff()
.
Finally, get the cumsum
for each group (that's summing the existing value and df["Actual"].diff()
for the other rows).
for col in df.filter(like="time_from_actual"):
g = df[col].notna().cumsum()
df[col] = df[col].fillna(df["Actual"].diff())
df[col] = df.groupby(g).cumsum()[col]
Date Actual time_from_actual_1 ... time_from_actual_6 time_from_actual_7 time_from_actual_8
0 2023-01-01 00:00:00 100.00 97.00 ... 95.00 94.00 93.00
1 2023-01-01 01:00:00 99.72 96.72 ... 94.72 93.72 92.72
2 2023-01-01 02:00:00 101.02 98.02 ... 96.02 95.02 94.02
3 2023-01-01 03:00:00 104.06 101.06 ... 99.06 98.06 97.06
4 2023-01-01 04:00:00 103.60 100.60 ... 98.60 97.60 96.60
5 2023-01-01 05:00:00 103.13 100.13 ... 98.13 97.13 96.13
6 2023-01-01 06:00:00 106.29 103.29 ... 101.29 100.29 99.29
7 2023-01-01 07:00:00 107.82 104.82 ... 102.82 101.82 100.82
8 2023-01-01 08:00:00 106.88 103.88 ... 101.88 100.88 99.88
9 2023-01-01 09:00:00 107.97 104.97 ... 102.97 101.97 100.97
10 2023-01-01 10:00:00 107.04 104.04 ... 102.04 101.04 100.04
11 2023-01-01 11:00:00 106.11 103.11 ... 101.11 100.11 99.11
12 2023-01-01 12:00:00 106.59 103.59 ... 101.59 100.59 99.59
13 2023-01-01 13:00:00 102.77 99.77 ... 97.77 96.77 95.77
14 2023-01-01 14:00:00 99.32 96.32 ... 94.32 93.32 92.32
15 2023-01-01 15:00:00 98.19 95.19 ... 93.19 92.19 91.19
16 2023-01-01 16:00:00 96.17 93.17 ... 91.17 90.17 89.17
17 2023-01-01 17:00:00 96.80 93.80 ... 91.80 90.80 89.80
18 2023-01-01 18:00:00 94.98 91.98 ... 89.98 88.98 87.98
19 2023-01-01 19:00:00 92.15 89.15 ... 87.15 86.15 85.15
20 2023-01-01 20:00:00 95.09 92.09 ... 90.09 89.09 88.09
21 2023-01-01 21:00:00 94.63 91.63 ... 89.63 88.63 87.63
22 2023-01-01 22:00:00 94.77 91.77 ... 89.77 88.77 87.77
23 2023-01-01 23:00:00 91.92 88.92 ... 86.92 85.92 84.92
24 2023-01-02 00:00:00 90.83 90.00 ... 85.00 84.00 83.00
25 2023-01-02 01:00:00 91.05 90.22 ... 85.22 84.22 83.22
26 2023-01-02 02:00:00 88.75 87.92 ... 82.92 81.92 80.92
27 2023-01-02 03:00:00 89.50 88.67 ... 83.67 82.67 81.67
28 2023-01-02 04:00:00 88.30 87.47 ... 82.47 81.47 80.47
29 2023-01-02 05:00:00 87.72 86.89 ... 81.89 80.89 79.89
30 2023-01-02 06:00:00 86.51 85.68 ... 80.68 79.68 78.68
31 2023-01-02 07:00:00 90.22 89.39 ... 84.39 83.39 82.39
32 2023-01-02 08:00:00 90.19 89.36 ... 84.36 83.36 82.36
33 2023-01-02 09:00:00 88.08 87.25 ... 82.25 81.25 80.25
34 2023-01-02 10:00:00 89.72 88.89 ... 83.89 82.89 81.89
35 2023-01-02 11:00:00 87.28 86.45 ... 81.45 80.45 79.45
36 2023-01-02 12:00:00 87.70 86.87 ... 81.87 80.87 79.87
37 2023-01-02 13:00:00 83.78 82.95 ... 77.95 76.95 75.95
38 2023-01-02 14:00:00 81.12 80.29 ... 75.29 74.29 73.29
39 2023-01-02 15:00:00 131.52 130.69 ... 125.69 124.69 123.69