pandasdataframeresample

Pandas resample stock 5min data not align


I have some stock 5min data, like as:

                  Date   Open   High    Low  Close    Volume
0  2024-11-19 09:35:00  11.75  11.79  11.55  11.78  32673600
1  2024-11-19 09:40:00  11.78  11.81  11.73  11.79  14802700
2  2024-11-19 09:45:00  11.79  11.84  11.79  11.82  13837400
3  2024-11-19 09:50:00  11.81  11.83  11.76  11.82   8534200
4  2024-11-19 09:55:00  11.82  11.87  11.80  11.87   8540500
5  2024-11-19 10:00:00  11.87  11.96  11.87  11.90  20659800
6  2024-11-19 10:05:00  11.89  11.90  11.82  11.82  11691000
7  2024-11-19 10:10:00  11.82  11.82  11.73  11.74   8762900
8  2024-11-19 10:15:00  11.74  11.74  11.71  11.73   6870500
9  2024-11-19 10:20:00  11.73  11.73  11.68  11.70   6244800
10 2024-11-19 10:25:00  11.70  11.70  11.66  11.69   5083000
11 2024-11-19 10:30:00  11.70  11.73  11.69  11.71   5342400
12 2024-11-19 10:35:00  11.72  11.74  11.71  11.73   3311800
13 2024-11-19 10:40:00  11.73  11.74  11.71  11.72   2331900
14 2024-11-19 10:45:00  11.72  11.72  11.70  11.72   3024100
15 2024-11-19 10:50:00  11.71  11.74  11.70  11.71   2774200
16 2024-11-19 10:55:00  11.70  11.72  11.70  11.71   1313000
17 2024-11-19 11:00:00  11.72  11.75  11.71  11.74   1737400
18 2024-11-19 11:05:00  11.75  11.75  11.73  11.75   1690600
19 2024-11-19 11:10:00  11.74  11.76  11.73  11.76   1751800
20 2024-11-19 11:15:00  11.76  11.76  11.72  11.73   2248700
21 2024-11-19 11:20:00  11.73  11.73  11.70  11.71   2464200
22 2024-11-19 11:25:00  11.71  11.71  11.69  11.70   1033600
23 2024-11-19 11:30:00  11.69  11.70  11.67  11.69   2063600

I use df.resample to convert them to 30m data, the code is:

df = df.set_index('Date')
df = df.resample('30T').agg({'Open':'first', 'High':'max', 'Low':'min','Close':'last',
                                 'Volume':'sum'}, closed='right', label = 'right').dropna()

But I got strange result like these:

                      Open   High    Low  Close    Volume
Date                                                     
2024-11-19 09:30:00  11.75  11.87  11.55  11.87  78388400
2024-11-19 10:00:00  11.87  11.96  11.66  11.69  59312000
2024-11-19 10:30:00  11.70  11.74  11.69  11.71  18097400
2024-11-19 11:00:00  11.72  11.76  11.69  11.70  10926300
2024-11-19 11:30:00  11.69  11.70  11.67  11.69   2063600

Here are correct 30m data export from my trading software:

Time    Open    High    Low Close   Volume
 2024/11/19-10:00   11.75   11.96   11.55   11.9    99048200
 2024/11/19-10:30   11.89   11.9    11.66   11.71   43994600
 2024/11/19-11:00   11.72   11.75   11.7    11.74   14492400
 2024/11/19-11:30   11.75   11.76   11.67   11.69   11252500
 

The data at 9:30 is irrelevant, mainly because the following data are not correct. but I did not find more parameters of df.sample. How can I correctly aggregate the data?


Solution

  • By default the reference in resample is the start of the day. It looks like you want the start of the data. You should set origin='start' instead of the default origin='start_day':

    (df.resample('30min', origin='start')
       .agg({'Open':'first', 'High':'max', 'Low':'min','Close':'last',
             'Volume':'sum'}, closed='right', label = 'right')
      .dropna()
    )
    

    Output:

                          Open   High    Low  Close    Volume
    Date                                                     
    2024-11-19 09:35:00  11.75  11.96  11.55  11.90  99048200
    2024-11-19 10:05:00  11.89  11.90  11.66  11.71  43994600
    2024-11-19 10:35:00  11.72  11.75  11.70  11.74  14492400
    2024-11-19 11:05:00  11.75  11.76  11.67  11.69  11252500
    

    And presumably, you want to pass the label parameter to resample:

    (df.resample('30min', origin='start', label='right')
       .agg({'Open':'first', 'High':'max', 'Low':'min','Close':'last',
             'Volume':'sum'})
     .dropna()
    )
    

    Output:

                          Open   High    Low  Close    Volume
    Date                                                     
    2024-11-19 10:05:00  11.75  11.96  11.55  11.90  99048200
    2024-11-19 10:35:00  11.89  11.90  11.66  11.71  43994600
    2024-11-19 11:05:00  11.72  11.75  11.70  11.74  14492400
    2024-11-19 11:35:00  11.75  11.76  11.67  11.69  11252500
    

    Finally, if you want to round to 30 minutes:

    (df.resample('30min', origin='start', label='right')
       .agg({'Open':'first', 'High':'max', 'Low':'min', 
             'Close':'last', 'Volume':'sum'})
       .dropna()
       .pipe(lambda x: x.set_axis(x.index.floor('30min')))
    )
    

    Output:

                          Open   High    Low  Close    Volume
    Date                                                     
    2024-11-19 10:00:00  11.75  11.96  11.55  11.90  99048200
    2024-11-19 10:30:00  11.89  11.90  11.66  11.71  43994600
    2024-11-19 11:00:00  11.72  11.75  11.70  11.74  14492400
    2024-11-19 11:30:00  11.75  11.76  11.67  11.69  11252500