pandasmachine-learningdata-cleaningdata-processing

How can I process data with a lot of missing values because of different sources?


I have data that looks like this:

Timestamp Variable A Variable B Variable C
2023-01-01 00:00:00.000 Value Nan Nan
2023-01-01 00:00:00.050 Value Nan Nan
2023-01-01 00:00:00.150 Nan Value Value
2023-01-01 05:00:00.450 Nan Value Value
2023-01-01 05:00:00.500 Value Nan Nan

Variable A comes from one source and Variable B and C from a second, so the timestamps are different which gives me a lot of missing data (the data comes from a dozen sources, this is just a simplified example).

I tried to use the resample method of pandas. But this method creates too many lines for me, in the example above there would be new lines between 2023-01-01 00:00:00.150 and 2023-01-01 05:00:00.450, which I do not want.

Can you help me? Thank you


Solution

  • One option could be to use the successive time difference and a threshold to group your data with groupby.first (or groupby.mean):

    out = df.groupby(df['Timestamp'].diff().ge('1s').cumsum(), as_index=False).first()
    

    Output:

                    Timestamp  Variable A  Variable B  Variable C
    0 2023-01-01 00:00:00.000         1.0         3.0         4.0
    1 2023-01-01 05:00:00.450         7.0         5.0         6.0
    

    With resample:

    out = df.resample('5H', on='Timestamp').mean().reset_index()
    

    Output:

                Timestamp  Variable A  Variable B  Variable C
    0 2023-01-01 00:00:00         1.5         3.0         4.0
    1 2023-01-01 05:00:00         7.0         5.0         6.0
    

    Used input:

                    Timestamp  Variable A  Variable B  Variable C
    0 2023-01-01 00:00:00.000         1.0         NaN         NaN
    1 2023-01-01 00:00:00.050         2.0         NaN         NaN
    2 2023-01-01 00:00:00.150         NaN         3.0         4.0
    3 2023-01-01 05:00:00.450         NaN         5.0         6.0
    4 2023-01-01 05:00:00.500         7.0         NaN         NaN