I have multiple time-series datasets. Each dataset represents a manufacturing process, has 36000 rows and 4 columns, is labeled, and some of them contain anomalies. There is a timestamp, two measured process variables (Flow and Pressure), and a binary anomaly label for each variable.
I want to train a machine learning model to identify anomalies in other time-series data of the same kind; I'd also like to predict anomalies before they occur. I'm considering an Isolation Forest, a KNN, or a neural network, among others.
But I am having trouble handling multiple multivariate, multi-label time series at once.
I tried a Python library called Darts, which is built for this kind of problem, but I don't know how to train an Isolation Forest on multivariate time series with it, and I can't find anything about it in the documentation.
My data is stored in CSV files that I import as pandas dataframes. I use Python 3.11.2.
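For reference, each file is loaded with something like this (the file and column names here are just placeholders):

import pandas as pd

# parse the timestamp column while reading, then use it as the index
df = pd.read_csv('run_01.csv', parse_dates=['Timestamp'])
df = df.set_index('Timestamp')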
This is a fairly general question, more about researching what the best approach is and how to implement it.
First of all, Darts is great for time-series tasks, but it doesn't include an Isolation Forest model. scikit-learn, on the other hand, does, so you need to use both.
Below is a toy example to illustrate this:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# synthetic stand-in data: three runs of the same process
np.random.seed(0)
# one reading per second, 36000 rows per run
timestamps = pd.date_range(start='2023-08-01', periods=36000, freq='1s')
# Flow/Pressure readings plus random 0/1 anomaly labels
data1 = pd.DataFrame({
    'Timestamp': timestamps,
    'Flow': np.random.normal(100, 10, 36000),
    'Pressure': np.random.normal(50, 5, 36000),
    'Flow anomaly': np.random.randint(0, 2, 36000),
    'Pressure anomaly': np.random.randint(0, 2, 36000),
})
data2 = pd.DataFrame({
    'Timestamp': timestamps,
    'Flow': np.random.normal(90, 8, 36000),
    'Pressure': np.random.normal(55, 6, 36000),
    'Flow anomaly': np.random.randint(0, 2, 36000),
    'Pressure anomaly': np.random.randint(0, 2, 36000),
})
data3 = pd.DataFrame({
    'Timestamp': timestamps,
    'Flow': np.random.normal(110, 12, 36000),
    'Pressure': np.random.normal(45, 4, 36000),
    'Flow anomaly': np.random.randint(0, 2, 36000),
    'Pressure anomaly': np.random.randint(0, 2, 36000),
})
# concatenate all runs; make sure Timestamp is a real datetime (matters when loading from CSV)
combined_data = pd.concat([data1, data2, data3], ignore_index=True)
combined_data['Timestamp'] = pd.to_datetime(combined_data['Timestamp'])
combined_data.set_index('Timestamp', inplace=True)
# split features and targets
features = combined_data[['Flow', 'Pressure']]
labels = combined_data[['Flow anomaly', 'Pressure anomaly']]
# random split for simplicity; for time series a chronological split is safer (see the note below)
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=0)
# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# fit one (unsupervised) Isolation Forest per target variable, each on its own feature
isolation_forest_flow = IsolationForest(random_state=0)
isolation_forest_pressure = IsolationForest(random_state=0)
isolation_forest_flow.fit(X_train_scaled[:, [0]])      # Flow column
isolation_forest_pressure.fit(X_train_scaled[:, [1]])  # Pressure column
# predict the test set (-1 == anomaly, 1 == inlier)
pred_flow = isolation_forest_flow.predict(X_test_scaled[:, [0]])
pred_pressure = isolation_forest_pressure.predict(X_test_scaled[:, [1]])
# map sklearn's convention (-1 == anomaly, 1 == inlier) to the label convention (1 == anomaly, 0 == normal)
pred_flow_labels = np.where(pred_flow == -1, 1, 0)
pred_pressure_labels = np.where(pred_pressure == -1, 1, 0)
# accuracy
accuracy_flow = accuracy_score(y_test['Flow anomaly'], pred_flow_labels)
accuracy_pressure = accuracy_score(y_test['Pressure anomaly'], pred_pressure_labels)
print(f"Flow Accuracy: {accuracy_flow}")
print(f"Pressure Accuracy: {accuracy_pressure}")
The above prints something like:
Flow Accuracy: 0.4999537037037037
Pressure Accuracy: 0.4988888888888889
Both accuracies sit at chance level, which is expected here: the toy labels are random coin flips, so no model can do better than ~0.5 on them. With real labeled data the scores become meaningful, and for rare anomalies a metric like F1 is more informative than raw accuracy.
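One caveat on the split above: train_test_split shuffles rows at random, so future samples leak into the training set. For time series it is safer to evaluate chronologically, e.g. with scikit-learn's TimeSeriesSplit. A minimal sketch on the same features/labels from the example:

from sklearn.model_selection import TimeSeriesSplit

# each fold trains on earlier rows and tests on the rows that follow them
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(features):
    X_tr, X_te = features.iloc[train_idx], features.iloc[test_idx]
    y_tr, y_te = labels.iloc[train_idx], labels.iloc[test_idx]
    # fit, predict and score an IsolationForest per fold here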
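As for Darts itself: you can still use it for the time-series plumbing (loading, resampling, slicing, plotting) and hand the raw values to scikit-learn. A sketch of that round trip, assuming a dataframe shaped like the ones above:

from darts import TimeSeries
from sklearn.ensemble import IsolationForest

# wrap the dataframe in a multivariate Darts TimeSeries (Flow + Pressure)
series = TimeSeries.from_dataframe(data1, time_col='Timestamp', value_cols=['Flow', 'Pressure'])
# extract the underlying (n_samples, n_components) numpy array for scikit-learn
X = series.values()
preds = IsolationForest(random_state=0).fit_predict(X)  # -1 == anomaly, 1 == inlier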