python pandas time-series feature-extraction tsfresh

TSFRESH - features extracted by a symmetric sliding window


As raw data we have measurements m_{i,j}, taken every 30 seconds (i = 0, 30, 60, 90, ..., 720, ...) for every subject j in the dataset.

I wish to use TSFRESH (package) to extract time-series features such that, for a point of interest at time i, the features are calculated on a symmetric rolling window.

We wish to calculate the feature vector for time point (i, j) from 3 hours of measurements before i and 3 hours after i. Each point of interest is thus surrounded by 6 hours of "context", i.e. 360 measurements before and 360 measurements after it, so its features should be extracted from a window of 721 measurements of m_{i,j}.
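As a quick check of the window arithmetic above, here is a minimal sketch. The constant names are hypothetical; only the 30-second sampling period and the 3 hours of context per side come from the description.

```python
# Window arithmetic for a symmetric context around one point of interest.
SAMPLING_PERIOD_S = 30   # one measurement every 30 seconds
CONTEXT_HOURS = 3        # hours of context on each side of the point

# Number of measurements on each side of the point of interest.
half_width = CONTEXT_HOURS * 3600 // SAMPLING_PERIOD_S

# Both sides plus the point of interest itself.
window_size = 2 * half_width + 1

print(half_width, window_size)  # 360 721
```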

I've tried the rolling_direction parameter of roll_time_series(), but it only rolls either backward or forward in "time". I'm looking for a way to include both "past" and "future" data in the feature calculation.


Solution

  • A "workaround" solution:

    Call roll_time_series() twice, once for "backward" rolling (rolling_direction=1) and once for "forward" rolling (rolling_direction=-1), and then combine the two results into one dataframe.

    This provides, for each time point in the original dataset m_{i,j}, a rolled time series with up to 360 values "from the past" and up to 360 values "from the future" (i.e., the time point is at the center of the window and max_timeshift=360).

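    Note the use of the pandas functions concat(), sort_values() and drop_duplicates() below. They are mandatory for this solution to work: each time point appears in both its backward and its forward window, so the duplicated row has to be dropped.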

    import numpy as np
    import pandas as pd
    from tsfresh.utilities.dataframe_functions import roll_time_series
    from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters

    # activity_data is the input dataframe; id_column and sort_column hold the
    # names of its subject-id and time columns, and 'activity' is the value column.

    # backward rolling: each window reaches up to 360 steps into the past
    rolled_backward = roll_time_series(activity_data,
                                       column_id=id_column,
                                       column_sort=sort_column,
                                       column_kind=None,
                                       rolling_direction=1,
                                       max_timeshift=360)

    # forward rolling: each window reaches up to 360 steps into the future
    rolled_forward = roll_time_series(activity_data,
                                      column_id=id_column,
                                      column_sort=sort_column,
                                      column_kind=None,
                                      rolling_direction=-1,
                                      max_timeshift=360)

    # merge into one dataframe, with the backward and forward window for every time point (sample)
    df = pd.concat([rolled_backward, rolled_forward])

    # important! - sort and drop duplicates
    df.sort_values(by=[id_column, sort_column], inplace=True)
    df.drop_duplicates(subset=[id_column, sort_column, 'activity'], inplace=True, keep='first')
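
To see what the combined window should contain for a single time point, here is a minimal pure-Python sketch that does not use tsfresh. The function symmetric_window_indices and the toy sizes are hypothetical, for illustration only. It mirrors the concat / sort / drop-duplicates idea on plain indices, assuming that windows near the start or end of a series are simply shorter.

```python
def symmetric_window_indices(n_points, center, half_width):
    """Merge the backward and forward windows around `center`.

    The center index belongs to both one-sided windows, so the union
    keeps it only once (the role drop_duplicates plays above).
    """
    # Past values plus the center, clipped at the start of the series.
    backward = range(max(center - half_width, 0), center + 1)
    # The center plus future values, clipped at the end of the series.
    forward = range(center, min(center + half_width, n_points - 1) + 1)
    return sorted(set(backward) | set(forward))


# Toy series of 10 points with a half-width of 2.
print(symmetric_window_indices(10, 5, 2))  # [3, 4, 5, 6, 7]
print(symmetric_window_indices(10, 0, 2))  # [0, 1, 2]
```

With half_width=360, a time point far enough from both ends of its series gets 721 indices, matching the window size asked for in the question.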