I have data in a CSV file that go from 1/1/2015 00:00 to 1/1/2021 00:00 hour by hour and IDs go to ID15000.
My goal is to flag best (flag 1) and worst (flag 2) 4 prices in each day for each ID throughout all time. Thus, each day there will be 4x flag 1 and 4x flag 2, the rest (16x flag0)
I thought I should firstly melt this table for database format, and apply some pandas scripts as below. Here is my small CSV file also.
https://github.com/mocukutli/analytics/blob/master/file.csv
import pandas as pd
df = pd.read_csv("file.csv",header=0,delimiter=",")
df=df.melt(
id_vars='Date',
var_name='ids',
value_name='price')
df['large']=df.index.isin(df.groupby('Date').price.nlargest(4).index.get_level_values(1)).astype(int)
df['small']=df.index.isin(df.groupby('Date').price.nsmallest(4).index.get_level_values(1)).astype(int)
df['flag']=df.small*2+df.large
print(df.head(24))
Output
Date ids price large small flag
0 1.01.2015 00:00 ID1 28.03 1 1 3
1 1.01.2015 01:00 ID1 28.02 1 1 3
2 1.01.2015 02:00 ID1 28.09 1 1 3
3 1.01.2015 03:00 ID1 27.41 1 1 3
4 1.01.2015 04:00 ID1 26.22 1 1 3
5 1.01.2015 05:00 ID1 27.33 1 1 3
6 1.01.2015 06:00 ID1 28.75 1 1 3
7 1.01.2015 07:00 ID1 28.78 1 1 3
8 1.01.2015 08:00 ID1 29.55 1 1 3
9 1.01.2015 09:00 ID1 28.32 1 1 3
10 1.01.2015 10:00 ID1 27.67 1 1 3
11 1.01.2015 11:00 ID1 26.79 1 1 3
12 1.01.2015 12:00 ID1 25.71 1 1 3
13 1.01.2015 13:00 ID1 24.42 1 1 3
14 1.01.2015 14:00 ID1 25.01 1 1 3
15 1.01.2015 15:00 ID1 24.55 1 1 3
16 1.01.2015 16:00 ID1 28.66 1 1 3
17 1.01.2015 17:00 ID1 29.47 1 1 3
18 1.01.2015 18:00 ID1 29.10 1 1 3
19 1.01.2015 19:00 ID1 28.64 1 1 3
20 1.01.2015 20:00 ID1 27.27 1 1 3
21 1.01.2015 21:00 ID1 26.32 1 1 3
22 1.01.2015 22:00 ID1 25.70 1 1 3
23 1.01.2015 23:00 ID1 24.60 1 1 3
As seen, flags are not properly assigned.
Some thoughts and question;
Thank you guys.
Pandas can convert strings to dates when you load the csv file or you can do the conversion after loading
df = pd.read_csv("file.csv", header=0, delimiter=",", parse_dates=['Date'], infer_datetime_format=True)
You can then use the resample
method, which works as a specialized groupby
for datetime64ns
columns or indexes.