Hello everyone
For a school project, I am stuck with the duration of an operation with Pandas Dataframe.
I have one dataframe df which shape is (250 000 000, 200). This dataframe contains values of variable describing the behaviours of sensors on a machine. They are organized by 'Cycle' (everytime the machine begins a new cycle, this variable is incremented by one). And in this cycle, 'CycleTime' describes the position of the row within the 'Cycle'.
In the 'mean' DataFrame, I compute the mean of each variables group by the 'CycleTime'
The 'anomaly_matrix' DataFrame represents the global anomaly of each cycle which is the sum of the square difference of each rows belonging to the Cycle with the mean of corresponding CycleTime.
An example of my code is below
df = pd.DataFrame({'Cycle': [0, 0, 0, 1, 1, 1, 2, 2], 'CycleTime': [0, 1, 2, 0, 1, 2, 0, 1], 'variable1': [0, 0.5, 0.25, 0.3, 0.4, 0.1, 0.2, 0.25], 'variable2':[1, 2, 1, 1, 2, 2, 1, 2], 'variable3': [100, 5000, 200, 900, 100, 2000, 300, 300]})
mean = df.drop(['Cycle'], axis = 1).groupby("CycleTime").agg('mean')
anomali_matrix = df.drop(['CycleTime'], axis = 1).groupby("Cycle").agg('mean')
anomaly_matrix = anomali_matrix - anomali_matrix
for index, row in df.iterrows():
cycle = row["Cycle"]
time = row["CycleTime"]
anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2
>>>anomaly_matrix
variable1 variable2 variable3
Cycle
0 0.047014 0.25 1.116111e+07
1 0.023681 0.25 3.917778e+06
2 0.018889 0.00 2.267778e+06
This calculation for my (250 000 000, 200) DataFrame last 6 hours, it is due to anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2
I tried to improve by using an apply function but I do not succeed in adding other DataFrame in that apply function. Same thing trying to vectorize pandas.
Do you have any idea how to accelerate this process ? Thanks
You can use:
df1 = df.set_index(['Cycle', 'CycleTime'])
mean = df1.sub(df1.groupby('CycleTime').transform('mean'))**2
df2 = mean.groupby('Cycle').sum()
print (df2)
variable1 variable2 variable3
Cycle
0 0.047014 0.25 1.116111e+07
1 0.023681 0.25 3.917778e+06
2 0.018889 0.00 2.267778e+06