Tags: python, pandas, dataframe, wide-format-data

Convert pandas dataframe hourly values in column names (H1, H2, ...) to a series in a separate column


I am trying to convert a dataframe in which hourly data appears in distinct columns (H1, H2, ...), like here:

[screenshot of the wide-format dataframe]

... to a dataframe that only contains two columns ['datetime', 'value'].

For example:

    Datetime             value
    2020-01-01 01:00:00      0
    2020-01-01 02:00:00      0
    ...                    ...
    2020-01-01 09:00:00    106
    2020-01-01 10:00:00   2852

Any solution without using a for-loop?


Solution

  • You can do it by applying several functions to the DataFrame:

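    from datetime import datetime

    import pandas as pd

    # Example DataFrame
    df = pd.DataFrame({'date': ['1/1/2020', '1/2/2020', '1/3/2020'],
                       'h1': [0, 222, 333],
                       'h2': [44, 0, 0],
                       'h3': [1, 2, 3]})

    # Exclusive upper bound for range(): this example only has h1..h3, hence 4.
    # For h1..h24 set it to 25, but note that "%H" below only parses hours 0-23,
    # so an h24 column needs separate handling.
    HOURS_COUNT = 4

    # Per row: the list of hour numbers, and a dict mapping hour -> value
    df["hours"] = df.apply(lambda row: [h for h in range(1, HOURS_COUNT)], axis=1)
    df["hour_values"] = df.apply(lambda row: {h: row[f"h{h}"] for h in range(1, HOURS_COUNT)}, axis=1)

    # One row per (date, hour) pair
    df = df.explode("hours")

    # Look up the value for each row's hour and build the full timestamp
    df["value"] = df.apply(lambda row: row["hour_values"][row["hours"]], axis=1)
    df["date_full"] = df.apply(lambda row: datetime.strptime(f"{row['date']} {row['hours']}", "%m/%d/%Y %H"), axis=1)

    df = df[["date_full", "value"]]
    # Optional: drop zero readings; remove this line to keep every hour
    df = df.loc[df["value"] > 0]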
    

    So the initial DataFrame is:

           date   h1  h2  h3
    0  1/1/2020    0  44   1
    1  1/2/2020  222   0   2
    2  1/3/2020  333   0   3
    

    And the resulting DataFrame is:

                date_full  value
    0 2020-01-01 02:00:00     44
    0 2020-01-01 03:00:00      1
    1 2020-01-02 01:00:00    222
    1 2020-01-02 03:00:00      2
    2 2020-01-03 01:00:00    333
    2 2020-01-03 03:00:00      3
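
Since the question asks for something loop-free, a vectorized route with `melt` may also be worth considering, because `apply(..., axis=1)` still visits the rows one at a time. The following is only a minimal sketch under a few assumptions: the helper name `wide_hours_to_long` and its `date_col` / `prefix` parameters are made up for illustration, every column other than the date column is assumed to be an hour column, and the dates are assumed to be in `%m/%d/%Y` format as in the example above. The hour is added as a timedelta, so under this convention an `h24` column would land on 00:00 of the following day.

```python
import pandas as pd

# Same toy frame as in the example above
df = pd.DataFrame({'date': ['1/1/2020', '1/2/2020', '1/3/2020'],
                   'h1': [0, 222, 333],
                   'h2': [44, 0, 0],
                   'h3': [1, 2, 3]})


def wide_hours_to_long(wide, date_col="date", prefix="h"):
    """Hypothetical helper: reshape h1..hN columns into datetime/value rows."""
    # Assumption: everything except the date column holds hourly values
    hour_cols = [c for c in wide.columns if c != date_col]

    # Wide -> long: one row per (date, hour column) pair
    long = wide.melt(id_vars=date_col, value_vars=hour_cols,
                     var_name="hour", value_name="value")

    # "h3" -> 3: strip the prefix and convert to an integer hour
    hours = long["hour"].str[len(prefix):].astype(int)

    # Date at midnight plus the hour offset gives the full timestamp
    long["datetime"] = (pd.to_datetime(long[date_col], format="%m/%d/%Y")
                        + pd.to_timedelta(hours, unit="h"))

    return (long[["datetime", "value"]]
            .sort_values("datetime")
            .reset_index(drop=True))


result = wide_hours_to_long(df)
print(result)
```

Unlike the answer above, this keeps the zero readings, which matches the expected output in the question. To drop them, filter afterwards with `result[result["value"] > 0]`.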