I need some help with feature extraction from time series, possibly using the TSFRESH package.
I have about 5000 CSV files, each containing a single time series (they may differ in length). The CSV time series format is pretty straightforward:
Example of a CSV-Time-Series file:

| Date | Value |
| ------ | ----- |
| 1/1/1904 01:00:00,000000 | 1,464844E-3 |
| 1/1/1904 01:00:01,000000 | 1,953125E-3 |
| 1/1/1904 01:00:02,000000 | 4,882813E-4 |
| 1/1/1904 01:00:03,000000 | -2,441406E-3 |
| 1/1/1904 01:00:04,000000 | -9,765625E-4 |
| ... | ... |
Along with these CSV files, I also have a metadata file (also in CSV format), where each row refers to one of those 5000 CSV time series and reports more general information about it, such as the energy.
Example of the metadata-CSV file:

| Path of the CSV-timeseries | Label | Energy | Penetration | Porosity |
| ------ | ----- | ------ | ----- | ----- |
| ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
The most important column is "Label", since it reports whether a CSV time series was labeled "Good" or "Bad".
I should also consider the energy, penetration, and porosity columns, since those values play a big role in the labeling of the time series. (I already tried a decision tree using only those metadata columns; now I would like to analyze the time series themselves to extract knowledge.)
I intend to extract features from the time series so that I can understand which features cause a time series to be labeled "Good" or "Bad".
How can I do this with TSFRESH? Are there other ways to do this?
Could you show me how to do it? Thank you :)
I'm currently doing something similar, and this example Jupyter notebook from GitHub helped me.
In short, the basic process is to extract the features with `X = extract_features(...)` and then keep only the relevant ones with `X_filtered = select_features(X, y)`, where `y` is your label, with "Good" and "Bad" encoded as, e.g., 1 and 0.
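To make that a bit more concrete for your files, here is a minimal sketch of how it could look. It is not taken from the notebook, and it rests on several assumptions: the helper names (`load_series`, `encode_labels`, `extract_and_select`) are made up for illustration, the CSV delimiter is guessed to be `;` (your post does not say), and the dates are assumed to be day-first. The decimal commas in your values are handled by `decimal=","`.

```python
import pandas as pd


def load_series(source, series_id, sep=";"):
    """Read one time-series CSV into the long format tsfresh expects.

    Assumptions: `sep` is a guess at the delimiter, and the date format
    is assumed to be day/month/year with a comma before the microseconds.
    """
    # decimal="," turns values such as 1,464844E-3 into proper floats.
    df = pd.read_csv(source, sep=sep, decimal=",")
    df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y %H:%M:%S,%f")
    df["id"] = series_id  # tells tsfresh which rows belong to which series
    return df[["id", "Date", "Value"]]


def encode_labels(metadata, path_col, label_col="Label"):
    """Map "Good" to 1 and everything else to 0, indexed by series path."""
    labels = metadata.set_index(path_col)[label_col]
    return (labels == "Good").astype(int)


def extract_and_select(long_df, y):
    """Extract tsfresh features, then keep those relevant for the label y."""
    # Imported here so the loading helpers above also work without tsfresh.
    from tsfresh import extract_features, select_features
    from tsfresh.utilities.dataframe_functions import impute

    X = extract_features(
        long_df, column_id="id", column_sort="Date", column_value="Value"
    )
    impute(X)  # replaces NaN/inf in place; select_features cannot handle them
    # Align the labels with the feature matrix, which is indexed by series id.
    return select_features(X, y.loc[X.index])
```

With `metadata` read from your metadata CSV and `path_col` set to the name of its path column, you would build the long table with something like `long_df = pd.concat([load_series(p, p) for p in metadata[path_col]], ignore_index=True)` and then call `extract_and_select(long_df, encode_labels(metadata, path_col))`. The energy, penetration, and porosity columns could afterwards be joined onto the selected features by path if you want to train on both together.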