pandascsvgroup-bypandas-groupbytop-n

Adding flags column top/bottom N values with groupby


I have data in a CSV file that go from 1/1/2015 00:00 to 1/1/2021 00:00 hour by hour and IDs go to ID15000.

My goal is to flag best (flag 1) and worst (flag 2) 4 prices in each day for each ID throughout all time. Thus, each day there will be 4x flag 1 and 4x flag 2, the rest (16x flag0)

I thought I should firstly melt this table for database format, and apply some pandas scripts as below. Here is my small CSV file also.

https://github.com/mocukutli/analytics/blob/master/file.csv

enter image description here

import pandas as pd
df = pd.read_csv("file.csv",header=0,delimiter=",")

df=df.melt(
    id_vars='Date',
   var_name='ids',
  value_name='price')

df['large']=df.index.isin(df.groupby('Date').price.nlargest(4).index.get_level_values(1)).astype(int)
df['small']=df.index.isin(df.groupby('Date').price.nsmallest(4).index.get_level_values(1)).astype(int)
df['flag']=df.small*2+df.large
print(df.head(24))

Output

              Date  ids  price  large  small  flag
0   1.01.2015 00:00  ID1  28.03      1      1     3
1   1.01.2015 01:00  ID1  28.02      1      1     3
2   1.01.2015 02:00  ID1  28.09      1      1     3
3   1.01.2015 03:00  ID1  27.41      1      1     3
4   1.01.2015 04:00  ID1  26.22      1      1     3
5   1.01.2015 05:00  ID1  27.33      1      1     3
6   1.01.2015 06:00  ID1  28.75      1      1     3
7   1.01.2015 07:00  ID1  28.78      1      1     3
8   1.01.2015 08:00  ID1  29.55      1      1     3
9   1.01.2015 09:00  ID1  28.32      1      1     3
10  1.01.2015 10:00  ID1  27.67      1      1     3
11  1.01.2015 11:00  ID1  26.79      1      1     3
12  1.01.2015 12:00  ID1  25.71      1      1     3
13  1.01.2015 13:00  ID1  24.42      1      1     3
14  1.01.2015 14:00  ID1  25.01      1      1     3
15  1.01.2015 15:00  ID1  24.55      1      1     3
16  1.01.2015 16:00  ID1  28.66      1      1     3
17  1.01.2015 17:00  ID1  29.47      1      1     3
18  1.01.2015 18:00  ID1  29.10      1      1     3
19  1.01.2015 19:00  ID1  28.64      1      1     3
20  1.01.2015 20:00  ID1  27.27      1      1     3
21  1.01.2015 21:00  ID1  26.32      1      1     3
22  1.01.2015 22:00  ID1  25.70      1      1     3
23  1.01.2015 23:00  ID1  24.60      1      1     3

As seen, flags are not properly assigned.

Some thoughts and question;

Thank you guys.


Solution

  • Pandas can convert strings to dates when you load the csv file or you can do the conversion after loading

    df = pd.read_csv("file.csv", header=0, delimiter=",", parse_dates=['Date'], infer_datetime_format=True)
    

    You can then use the resample method, which works as a specialized groupby for datetime64ns columns or indexes.