I'm working on a project where I'm trying to detect anomalies and relate them to a particular phenomenon. I know that pandas has built-in rolling statistics, i.e. df.rolling(window=frequency).statistic_of_my_choice(), but for some reason I'm not getting the desired results. I have calculated the rolling mean, the rolling median, and upper and lower bounds as rolling mean ± 1.6 × rolling std.
But when I plot them, the upper and lower bounds are always above the data. I don't know what's happening here; it doesn't make sense. Please take a look at the figures for a better understanding.
Here's what I am getting:
and here's what I want to achieve:
Here's the paper that I am trying to implement: https://www.researchgate.net/publication/374567172_Analysis_of_Ionospheric_Anomalies_before_the_Tonga_Volcanic_Eruption_on_15_January_2022/figures
Here's my code snippet
def gen_features(df):
    df["ma"] = df.TEC.rolling(window="h").mean()
    df["mstd"] = df.TEC.rolling(window="h").std()
    df["upper"] = df["ma"] + (1.6 * df.mstd)
    df["lower"] = df["ma"] - (1.6 * df.mstd)
    return df
From the publication:
"Since the solar activity cycle is 27 days, this paper uses 27 days as the sliding window to detect the ionospheric TEC perturbation condition before the volcanic eruption. The upper bound of TEC anomaly is represented as UB =Q2+ 1.5 IQR and the lower bound as LB =Q2−1.5IQR"
Implementing this in pandas:
# no seed for the random generator, so every run gives different data
dataLength = 1000  # number of samples
data = np.random.randint(1, 100, dataLength)  # random integers in [1, 100)
outlierPercentage = 1  # percentage of samples turned into outliers
outlierCount = int(dataLength / 100 * outlierPercentage)  # number of outliers
outlierIdx = np.random.choice(dataLength, outlierCount, replace=False)  # random, non-repeating positions for the outliers
data[outlierIdx] = np.random.randint(-300, 300, outlierCount)  # overwrite them with random integers in [-300, 300)
df = pd.DataFrame({'Data': data})  # build the DataFrame
winSize = 5  # rolling window size (number of samples)
# the statistics calculations...
Mean = df["Data"].rolling(window=winSize).mean()
Q1 = df["Data"].rolling(window=winSize).quantile(0.25)
Q3 = df["Data"].rolling(window=winSize).quantile(0.75)
IQR = Q3 - Q1
# assigning the upper limit and lower limit
df["UL"] = Mean + 1.5 * IQR
df["LL"] = Mean - 1.5 * IQR
# detect the outliers
outliersAboveUL = df[(df['Data'] > df['UL'])].index
outliersBelowLL = df[(df['Data'] < df['LL'])].index
Plotting gives you this:
Imported packages:
import pandas as pd
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
As you can see, this is a very basic example. I mainly added the correct calculation of the IQR. If you want a more detailed answer, I would need a sample of your data...
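One thing to be aware of: the quoted passage centres the bounds on Q2 (the median), while the snippet above centres them on the rolling mean. The two can behave quite differently right after a spike, because a single large value drags the mean of a short window but barely moves the median. Here is a minimal sketch of that difference on a tiny hand-made series; the series and the helper name `flagged` are made up for illustration and are not from the paper or from your data.

```python
import pandas as pd

# Tiny hand-made series with a single obvious spike at position 6.
s = pd.Series([10, 11, 10, 12, 11, 10, 100, 11, 10, 12], dtype=float)
win = 5  # fixed-size window; the first win - 1 results are NaN

roll = s.rolling(window=win)
iqr = roll.quantile(0.75) - roll.quantile(0.25)


def flagged(center):
    """Return the positions where the value leaves center +/- 1.5 * IQR."""
    upper = center + 1.5 * iqr
    lower = center - 1.5 * iqr
    mask = (s > upper) | (s < lower)  # comparisons against NaN are False
    return s[mask].index.tolist()


mean_flags = flagged(roll.mean())      # bounds centred on the rolling mean
median_flags = flagged(roll.median())  # bounds centred on Q2, as in the quote

print("mean-centred:  ", mean_flags)    # [6, 7, 8, 9]
print("median-centred:", median_flags)  # [6]
```

With the mean as centre, the three ordinary values after the spike fall below the lower limit, because the spike is still inside their window and pulls the mean up to about 28. With the median as centre, only the spike itself is flagged. Whether that matters for your data depends on the window length, so treat this as an illustration rather than a recommendation.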
V2.0: with data from OP
This is currently what I have with the same approach:
df = pd.read_csv("airaStation.csv", index_col=0, parse_dates=True)
winSize = "29D" # define size of window
# the statistics calculations...
Mean = df["TEC"].rolling(window=winSize).mean()
Q1 = df["TEC"].rolling(window=winSize).quantile(0.25)
Q3 = df["TEC"].rolling(window=winSize).quantile(0.75)
IQR = Q3 - Q1
# assigning the upper limit and lower limit
df["UL"] = Mean + 1.5 * IQR
df["LL"] = Mean - 1.5 * IQR
# detect the outliers
outliersAboveUL = df[(df['TEC'] > df['UL'])].index
outliersBelowLL = df[(df['TEC'] < df['LL'])].index
The plot:
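In case it helps to reproduce a figure like this without the CSV, here is a rough plotting sketch. It assumes a DataFrame with a DatetimeIndex and a "TEC" column; the data below is synthetic (random noise plus one injected spike), and the "27D" window is taken from the 27 days mentioned in the quoted passage, so both are assumptions and not your actual setup.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the TEC data (hypothetical values, hourly sampling).
rng = np.random.default_rng(1)
idx = pd.date_range("2021-11-01", periods=24 * 45, freq="60min")
df = pd.DataFrame({"TEC": 20 + rng.normal(0, 2, len(idx))}, index=idx)
df.iloc[500, 0] = 60.0  # one injected spike so that something gets flagged

# Time-based rolling window; this needs a sorted DatetimeIndex.
roll = df["TEC"].rolling(window="27D")
iqr = roll.quantile(0.75) - roll.quantile(0.25)
df["UL"] = roll.mean() + 1.5 * iqr  # swap roll.mean() for roll.median() to centre on Q2
df["LL"] = roll.mean() - 1.5 * iqr

outliers = df[(df["TEC"] > df["UL"]) | (df["TEC"] < df["LL"])]

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(df.index, df["TEC"], lw=0.8, label="TEC")
ax.plot(df.index, df["UL"], "--", label="UL")
ax.plot(df.index, df["LL"], "--", label="LL")
ax.scatter(outliers.index, outliers["TEC"], color="red", zorder=3, label="outliers")
ax.set_xlabel("time")
ax.set_ylabel("TEC")
ax.legend()
fig.tight_layout()
# plt.show()  # uncomment when running outside a notebook
```

Note that a time-based window such as "27D" uses min_periods=1 by default, so the first few bounds are computed from very few samples and hug the data; that part of the curve should not be over-interpreted.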