pythonpandascounter

python pandas simple row numbering on monthly basis


I am failing to find a solution for trading research.

I have a large dataset of market data running from 1993-2024.

Since calendar days do not match the trading days. It might be implemented by "business days", but who knows if holidays/business days match the trading days. I need to get a trading day of month if certain conditions are met.

                  Day        Open        High         Low       Close     Volume  trading_day
Date
2010-01-04     Monday  112.370003  113.389999  111.510002  113.330002  118944600            1
2010-01-05    Tuesday  113.260002  113.680000  112.849998  113.629997  111579900            2
2010-01-06  Wednesday  113.519997  113.989998  113.430000  113.709999  116074400            3
2010-01-07   Thursday  113.500000  114.330002  113.180000  114.190002  131091100            4
2010-01-08     Friday  113.889999  114.620003  113.660004  114.570000  126402800            5
2010-01-11     Monday  115.080002  115.129997  114.239998  114.730003  106375700            6
2010-01-12    Tuesday  113.970001  114.209999  113.220001  113.660004  163333500            7
2010-01-13  Wednesday  113.949997  114.940002  113.370003  114.620003  161822000            8
2010-01-14   Thursday  114.489998  115.139999  114.419998  114.930000  115718800            9
2010-01-15     Friday  114.730003  114.839996  113.199997  113.639999  212283100           10
2010-01-19    Tuesday  113.620003  115.129997  113.589996  115.059998  139172700           11
2010-01-20  Wednesday  114.279999  114.449997  112.980003  113.889999  216490200           12
2010-01-21   Thursday  113.919998  114.269997  111.559998  111.699997  344859600           13
2010-01-22     Friday  111.199997  111.739998  109.089996  109.209999  345942400           14
2010-01-25     Monday  110.209999  110.410004  109.410004  109.769997  186937500           15
2010-01-26    Tuesday  109.339996  110.470001  109.040001  109.309998  211168800           16
2010-01-27  Wednesday  109.169998  110.080002  108.330002  109.830002  271863600           17
2010-01-28   Thursday  110.190002  110.250000  107.910004  108.570000  316104000           18
2010-01-29     Friday  109.040001  109.800003  107.220001  107.389999  310677600           19

Above by is an example what I am trying to achieve. I managed that by selecting one month for testing purpose with following code.

df_test['trading_day'] = range(1, len(df_test) + 1)

Of course, there are about 300+ months since 1993 so it would be hell to do it all manually.

I managed to df.groupby by month with the following code:

df_grouped_monthly = df.groupby(pd.Grouper(freq='M')) and tried to apply same counter as above by. Sadly, it doesn't work on grouped df. I tried .transform(add(1))

Possible solutions are either get grouped row position - no idea whether row number is inherited from initial df, or it is counted based on groupby.

Or add above by mentioned counter to grouped df. I failed to add extra column to grouped df.

Any suggestions how to achieve above by df on whole dataset based on monthly basis, please?


Solution

  • You can convert the 'Date' column to datetime and create new column before grouping.

    df["Month"] = pd.to_datetime(df.Date, format='%b', errors='coerce').dt.month
    

    Output DateFrame:

     Date   Day         Open    High    Low Close   Volume  Month
    2010-01-04  Monday  112.370003  113.389999  111.510002  113.330002  118944600   1
    2010-01-05  Tuesday 113.260002  113.68  112.849998  113.629997  111579900   1
    2010-01-06  Wednesday   113.519997  113.989998  113.43  113.709999  116074400   1
    2010-01-07  Thursday    113.5   114.330002  113.18  114.190002  131091100   1
    2010-01-08  Friday  113.889999  114.620003  113.660004  114.57  126402800   1
    2010-01-11  Monday  115.080002  115.129997  114.239998  114.730003  106375700   1
    2010-01-12  Tuesday 113.970001  114.209999  113.220001  113.660004  163333500   1
    2010-01-13  Wednesday   113.949997  114.940002  113.370003  114.620003  161822000   1
    2010-01-14  Thursday    114.489998  115.139999  114.419998  114.93  115718800   1
    2010-01-15  Friday  114.730003  114.839996  113.199997  113.639999  212283100   1
    2010-01-19  Tuesday 113.620003  115.129997  113.589996  115.059998  139172700   1
    2010-01-20  Wednesday   114.279999  114.449997  112.980003  113.889999  216490200   1
    2010-01-21  Thursday    113.919998  114.269997  111.559998  111.699997  344859600   1
    2010-01-22  Friday  111.199997  111.739998  109.089996  109.209999  345942400   1
    2010-01-25  Monday  110.209999  110.410004  109.410004  109.769997  186937500   1
    2010-01-26  Tuesday 109.339996  110.470001  109.040001  109.309998  211168800   1
    2010-01-27  Wednesday   109.169998  110.080002  108.330002  109.830002  271863600   1
    2010-01-28  Thursday    110.190002  110.25  107.910004  108.57  316104000   1
    2010-01-29  Friday  109.040001  109.800003  107.220001  107.389999  310677600   1
    2010-02-04  Saturday    109.040001  109.800003  107.220001  107.389999  310677600   2
    2010-02-05  Sunday  109.040001  109.800003  107.220001  107.389999  310677600   2
    2010-02-06  Monday  109.040001  109.800003  107.220001  107.389999  310677600   2
    2010-02-07  Tuesday 109.040001  109.800003  107.220001  107.389999  310677600   2
    2010-02-08  Wednesday   109.040001  109.800003  107.220001  107.389999  310677600   2
    2010-02-09  Thursday    109.040001  109.800003  107.220001  107.389999  310677600   2
    2010-02-10  Friday  109.040001  109.800003  107.220001  107.389999  310677600   2
    2010-02-11  Saturday    109.040001  109.800003  107.220001  107.389999  310677600   2
    2010-02-12  Sunday  109.040001  109.800003  107.220001  107.389999  310677600   2
    2010-02-13  Monday  109.040001  109.800003  107.220001  107.389999  310677600   2
    2010-02-14  Tuesday 109.040001  109.800003  107.220001  107.389999  310677600   2
    2010-02-15  Wednesday   109.040001  109.800003  107.220001  107.389999  310677600   2
    2010-02-16  Thursday    109.040001  109.800003  107.220001  107.389999  310677600   2
    2010-02-17  Friday  109.040001  109.800003  107.220001  107.389999  310677600   2
    

    You can use this modified dataframe to find the information you are looking for.

     #For instance, adding 'trading_Day' column
     df['Trading_day']= df.groupby(['Month'], sort=True).cumcount()+1
    

    Output:

           Date  Day        Month     Trading_day
    0  2010-01-04     Monday      1            1
    1  2010-01-05    Tuesday      1            2
    2  2010-01-06  Wednesday      1            3
    3  2010-01-07   Thursday      1            4
    4  2010-01-08     Friday      1            5
    5  2010-01-11     Monday      1            6
    6  2010-01-12    Tuesday      1            7
    7  2010-01-13  Wednesday      1            8
    8  2010-01-14   Thursday      1            9
    9  2010-01-15     Friday      1           10
    10 2010-01-19    Tuesday      1           11
    11 2010-01-20  Wednesday      1           12
    12 2010-01-21   Thursday      1           13
    13 2010-01-22     Friday      1           14
    14 2010-01-25     Monday      1           15
    15 2010-01-26    Tuesday      1           16
    16 2010-01-27  Wednesday      1           17
    17 2010-01-28   Thursday      1           18
    18 2010-01-29     Friday      1           19
    19 2010-02-04   Saturday      2            1
    20 2010-02-05     Sunday      2            2
    21 2010-02-06     Monday      2            3
    22 2010-02-07    Tuesday      2            4
    23 2010-02-08  Wednesday      2            5
    24 2010-02-09   Thursday      2            6
    25 2010-02-10     Friday      2            7
    26 2010-02-11   Saturday      2            8
    27 2010-02-12     Sunday      2            9
    28 2010-02-13     Monday      2           10
    29 2010-02-14    Tuesday      2           11
    30 2010-02-15  Wednesday      2           12
    31 2010-02-16   Thursday      2           13
    32 2010-02-17     Friday      2           14
    

    You can then perform your analysis.

    # When is Closing value > Opening Value
    grouped_df = df.query("Close >= Open")[['Month', 'trading_day']]
    print(grouped_df)
    

    Output:

            Month  trading_day
    0       1            1
    1       1            2
    2       1            3
    3       1            4
    4       1            5
    7       1            8
    8       1            9
    10      1           11
    16      1           17