pythonpandasmissing-data

How to fill missing values based on relationships between existing data


Question

How to fill missing values of a pandas dataframe based on the relationship between an existing preceeding row (predictions for a commodity), and an associated existing value in another column (actual values for a commodity).

Details

I have a pandas dataframe with 10 columns and 40 rows. The columns are Date, Actual, time_from_actual_1, time_from_actual_2, time_from_actual_3... up to time_from_actual_8.

The Actual column contains actual values for a commodity with hourly timestamps in the Date column. The time_from_actual columns are predictions forward in time for the same commodity. These are produced once per day, hence the pre-existing values at Index 0 and Index 1. And so, there are 23 missing observations per day.

Input dataframe

inptu_df_vscode

I would like to fill those missing values in a very specific way. I want the values for "time_from_actual" at index 1 up to 24 to follow the same pattern as the first column with regards to the differences between the different timesteps and the actual values.

Output dataframe

output_df_vscode

I've succeeded doing this with a nested for loop, but I would very much like to see suggestions for more elegant approaches. Below you will find a complete attempt with sample data, code and output. Thank you for any suggestions!

Code

# imports
import pandas as pd
import numpy as np

# Random seed
np.random.seed(42)

# Sample data
data = {
    'Date': pd.date_range(start='2023-01-01', periods=40, freq='H'),
    'Actual': [100, 99.72, 101.02, 104.06, 103.60, 103.13, 106.29, 107.82, 106.88, 107.97,
               107.04, 106.11, 106.59, 102.77, 99.32, 98.19, 96.17, 96.80, 94.98, 92.15,
               95.09, 94.63, 94.77, 91.92, 90.83, 91.05, 88.75, 89.50, 88.30, 87.72,
               86.51, 90.22, 90.19, 88.08, 89.72, 87.28, 87.70, 83.78, 81.12, 131.52],
    'time_from_actual_1': [97] + [np.nan]*23 + [90] + [np.nan]*15,
    'time_from_actual_2': [99] + [np.nan]*23 + [89] + [np.nan]*15,
    'time_from_actual_3': [98] + [np.nan]*23 + [88] + [np.nan]*15,
    'time_from_actual_4': [97] + [np.nan]*23 + [87] + [np.nan]*15,
    'time_from_actual_5': [96] + [np.nan]*23 + [86] + [np.nan]*15,
    'time_from_actual_6': [95] + [np.nan]*23 + [85] + [np.nan]*15,
    'time_from_actual_7': [94] + [np.nan]*23 + [84] + [np.nan]*15,
    'time_from_actual_8': [93] + [np.nan]*23 + [83] + [np.nan]*15,
}

# dataframe
df = pd.DataFrame(data)

# copy of the dataframe to reference original values only
original_df = df.copy()

# Fill missing values for columns starting with "time_from_actual"
time_cols = [col for col in df.columns if col.startswith('time_from_actual')]

for col in time_cols:
    for i in range(1, len(df)):
        if pd.isnull(df.loc[i, col]):
            j = i
            while j < len(df) and pd.isnull(original_df.loc[j, col]):
                previous_actual = df.loc[j - 1, 'Actual']
                previous_time = df.loc[j - 1, col]
                current_actual = df.loc[j, 'Actual']
                difference = previous_time - previous_actual
                df.loc[j, col] = current_actual + difference
                j += 1

Solution

  • For each column containing 'time_from_actual':

    1. Create a group for each part of the dataframe where the first row is not null.

    2. Fill the null values with df["Actual"].diff().

    3. Finally, get the cumsum for each group (that's summing the existing value and df["Actual"].diff() for the other rows).

    for col in df.filter(like="time_from_actual"):
        g = df[col].notna().cumsum()
        df[col] = df[col].fillna(df["Actual"].diff())
        df[col] = df.groupby(g).cumsum()[col]
    
                      Date  Actual  time_from_actual_1  ...  time_from_actual_6  time_from_actual_7  time_from_actual_8
    0  2023-01-01 00:00:00  100.00               97.00  ...               95.00               94.00               93.00
    1  2023-01-01 01:00:00   99.72               96.72  ...               94.72               93.72               92.72
    2  2023-01-01 02:00:00  101.02               98.02  ...               96.02               95.02               94.02
    3  2023-01-01 03:00:00  104.06              101.06  ...               99.06               98.06               97.06
    4  2023-01-01 04:00:00  103.60              100.60  ...               98.60               97.60               96.60
    5  2023-01-01 05:00:00  103.13              100.13  ...               98.13               97.13               96.13
    6  2023-01-01 06:00:00  106.29              103.29  ...              101.29              100.29               99.29
    7  2023-01-01 07:00:00  107.82              104.82  ...              102.82              101.82              100.82
    8  2023-01-01 08:00:00  106.88              103.88  ...              101.88              100.88               99.88
    9  2023-01-01 09:00:00  107.97              104.97  ...              102.97              101.97              100.97
    10 2023-01-01 10:00:00  107.04              104.04  ...              102.04              101.04              100.04
    11 2023-01-01 11:00:00  106.11              103.11  ...              101.11              100.11               99.11
    12 2023-01-01 12:00:00  106.59              103.59  ...              101.59              100.59               99.59
    13 2023-01-01 13:00:00  102.77               99.77  ...               97.77               96.77               95.77
    14 2023-01-01 14:00:00   99.32               96.32  ...               94.32               93.32               92.32
    15 2023-01-01 15:00:00   98.19               95.19  ...               93.19               92.19               91.19
    16 2023-01-01 16:00:00   96.17               93.17  ...               91.17               90.17               89.17
    17 2023-01-01 17:00:00   96.80               93.80  ...               91.80               90.80               89.80
    18 2023-01-01 18:00:00   94.98               91.98  ...               89.98               88.98               87.98
    19 2023-01-01 19:00:00   92.15               89.15  ...               87.15               86.15               85.15
    20 2023-01-01 20:00:00   95.09               92.09  ...               90.09               89.09               88.09
    21 2023-01-01 21:00:00   94.63               91.63  ...               89.63               88.63               87.63
    22 2023-01-01 22:00:00   94.77               91.77  ...               89.77               88.77               87.77
    23 2023-01-01 23:00:00   91.92               88.92  ...               86.92               85.92               84.92
    24 2023-01-02 00:00:00   90.83               90.00  ...               85.00               84.00               83.00
    25 2023-01-02 01:00:00   91.05               90.22  ...               85.22               84.22               83.22
    26 2023-01-02 02:00:00   88.75               87.92  ...               82.92               81.92               80.92
    27 2023-01-02 03:00:00   89.50               88.67  ...               83.67               82.67               81.67
    28 2023-01-02 04:00:00   88.30               87.47  ...               82.47               81.47               80.47
    29 2023-01-02 05:00:00   87.72               86.89  ...               81.89               80.89               79.89
    30 2023-01-02 06:00:00   86.51               85.68  ...               80.68               79.68               78.68
    31 2023-01-02 07:00:00   90.22               89.39  ...               84.39               83.39               82.39
    32 2023-01-02 08:00:00   90.19               89.36  ...               84.36               83.36               82.36
    33 2023-01-02 09:00:00   88.08               87.25  ...               82.25               81.25               80.25
    34 2023-01-02 10:00:00   89.72               88.89  ...               83.89               82.89               81.89
    35 2023-01-02 11:00:00   87.28               86.45  ...               81.45               80.45               79.45
    36 2023-01-02 12:00:00   87.70               86.87  ...               81.87               80.87               79.87
    37 2023-01-02 13:00:00   83.78               82.95  ...               77.95               76.95               75.95
    38 2023-01-02 14:00:00   81.12               80.29  ...               75.29               74.29               73.29
    39 2023-01-02 15:00:00  131.52              130.69  ...              125.69              124.69              123.69