python-3.xpandasdataframepandas-resample

Pandas .resample with .agg function failing on "IndexError: single positional indexer is out-of-bounds


I am attempting to resample a set of stock market data from a database to different time variances. I am using a pandas dataframe with the data already read in from my database with no issues. That code is below for reference.

vals = dbconn.dbread(checkquery,val)
vals = pd.DataFrame(vals)
#vals['t']=pd.to_datetime(vals['t'])
vals = vals.set_index('t')
print(vals)

with the below output.

                           o        c        h        l       v   vw     n
t                                                                         
2024-09-11 08:30:00  548.700  547.720  548.760  547.557  566501  548  5731
2024-09-11 09:01:00  543.850  544.630  544.780  543.780  394648  546  2891
2024-09-11 09:02:00  544.590  544.225  544.670  544.120  231857  547  1503
2024-09-11 09:03:00  544.250  544.200  544.349  544.110   65242  544  1297
2024-09-11 09:04:00  544.210  544.170  544.440  543.950  124656  544  1724
...                      ...      ...      ...      ...     ...  ...   ...
2024-09-11 13:45:00  552.285  551.910  552.332  551.900   93449  552   987
2024-09-11 13:46:00  551.910  551.715  552.020  551.710  169873  552  1512
2024-09-11 13:47:00  551.710  551.840  551.910  551.580  198264  552  1540
2024-09-11 13:48:00  551.845  552.250  552.280  551.840  124869  552  1222
2024-09-11 13:49:00  552.240  552.100  552.250  552.020  275982  551  1117

[290 rows x 7 columns]

I was attempting to follow this same example with the specific answer here. Although there were no comments or explanation it did appear that it does work from reading the other comments.

Attempt

After creating my DataFrame with the correct data I attempted to run the resample and it fails with the below output.

code:

vals = vals.resample('5min').agg({
    'o': lambda vals: vals.iloc[0],
    'h': lambda vals: vals.max(),
    'l': lambda vals: vals.min(),
    'c': lambda vals: vals.iloc[-1],
    'v': lambda vals: vals.sum(),
    'n': lambda vals: vals.sum()
})
print(vals)

output:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[18], line 11
      1 vals.agg({
      2     'o': lambda vals: vals.iloc[0],
      3     'h': lambda vals: vals.max(),
   (...)
      7     'n': lambda vals: vals.sum()
      8 })
      9 print(vals)
---> 11 vals = vals.resample('5min').agg({
     12     'o': lambda vals: vals.iloc[0],
     13     'h': lambda vals: vals.max(),
     14     'l': lambda vals: vals.min(),
     15     'c': lambda vals: vals.iloc[-1],
     16     'v': lambda vals: vals.sum(),
     17     'n': lambda vals: vals.sum()
     18 })
     19 print(vals)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pandas\core\resample.py:352, in Resampler.aggregate(self, func, *args, **kwargs)
    343 @final
    344 @doc(
    345     _shared_docs["aggregate"],
   (...)
...
   1683 len_axis = len(self.obj._get_axis(axis))
   1684 if key >= len_axis or key < -len_axis:
-> 1685     raise IndexError("single positional indexer is out-of-bounds")

IndexError: single positional indexer is out-of-bounds
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...

Troubleshooting

I first thought I had something wrong with my .agg function so I tested that separate as shown below.

vals.agg({
    'o': lambda vals: vals.iloc[0],
    'h': lambda vals: vals.max(),
    'l': lambda vals: vals.min(),
    'c': lambda vals: vals.iloc[-1],
    'v': lambda vals: vals.sum(),
    'n': lambda vals: vals.sum()
})
print(vals)

Where I get the below output:

                           o        c        h        l       v   vw     n
t                                                                         
2024-09-11 08:30:00  548.700  547.720  548.760  547.557  566501  548  5731
2024-09-11 09:01:00  543.850  544.630  544.780  543.780  394648  546  2891
2024-09-11 09:02:00  544.590  544.225  544.670  544.120  231857  547  1503
2024-09-11 09:03:00  544.250  544.200  544.349  544.110   65242  544  1297
2024-09-11 09:04:00  544.210  544.170  544.440  543.950  124656  544  1724
...                      ...      ...      ...      ...     ...  ...   ...
2024-09-11 13:45:00  552.285  551.910  552.332  551.900   93449  552   987
2024-09-11 13:46:00  551.910  551.715  552.020  551.710  169873  552  1512
2024-09-11 13:47:00  551.710  551.840  551.910  551.580  198264  552  1540
2024-09-11 13:48:00  551.845  552.250  552.280  551.840  124869  552  1222
2024-09-11 13:49:00  552.240  552.100  552.250  552.020  275982  551  1117

[290 rows x 7 columns]

Conclusion


Solution

  • The issue is that resample will create empty Series for the missing periods (see below for more details), and iloc[0]/iloc[-1] will fail on those.

    What you could do is to check the length/validity of the Series and return None otherwise. For example with v.iloc[0] if len(v) else None or None if v.empty else v.iloc[0]:

    vals.resample('5min').agg({
        'o': lambda v: None if v.empty else v.iloc[0],
        'h': lambda v: v.max(),
        'l': lambda v: v.min(),
        'c': lambda v: v.iloc[-1] if len(v) else None,
        'v': lambda v: v.sum(),
        'n': lambda v: v.sum()
    })
    

    Note that in this particular case, you should rather go with the builtin 'first'/'last', and similarly for other standard aggregations (no need to use a custom function):

    vals.resample('5min').agg({
        'o': 'first',
        'h': 'max',
        'l': 'min',
        'c': 'last',
        'v': 'sum',
        'n': 'sum',
    })
    

    Output:

                               o        h        l       c       v     n
    t                                                                   
    2024-09-11 08:30:00  548.700  548.760  547.557  547.72  566501  5731
    2024-09-11 08:35:00      NaN      NaN      NaN     NaN       0     0
    2024-09-11 08:40:00      NaN      NaN      NaN     NaN       0     0
    2024-09-11 08:45:00      NaN      NaN      NaN     NaN       0     0
    2024-09-11 08:50:00      NaN      NaN      NaN     NaN       0     0
    ...                      ...      ...      ...     ...     ...   ...
    2024-09-11 13:25:00      NaN      NaN      NaN     NaN       0     0
    2024-09-11 13:30:00      NaN      NaN      NaN     NaN       0     0
    2024-09-11 13:35:00      NaN      NaN      NaN     NaN       0     0
    2024-09-11 13:40:00      NaN      NaN      NaN     NaN       0     0
    2024-09-11 13:45:00  552.285  552.332  551.580  552.10  862437  6378
    
    [64 rows x 6 columns]
    

    how resample works

    resample will create a Series for each group or datetimes that match the offset. If there is no datapoint for the period, then an empty Series is created.

    Thus iloc[0] wouldn't fail for example with resample('12h') since you have at least one value for each period of 12 hours:

                            o        h       l       c        v      n
    t                                                                 
    2024-09-11 00:00:00  None  548.760  543.78  544.17  1382904  13146
    2024-09-11 12:00:00  None  552.332  551.58  552.10   862437   6378
    

    However, for '5min' there is no data for 08:35−9:00 and pandas will create an empty Series to pass to the aggregation function.

    To see this for yourself, you can run for example lambda x: print(x) or x.max() as aggregation function, which will print each underlying Series before returning the max:

    # vals.resample('1h').agg({'o': lambda x: print(x, '\n') or x.max()})
    
    t                                         # datapoints for 8:00-9:00
    2024-09-11 08:30:00    548.7
    Name: o, dtype: float64
    
    t                                         # datapoints for 9:00-10:00
    2024-09-11 09:01:00    543.85
    2024-09-11 09:02:00    544.59
    2024-09-11 09:03:00    544.25
    2024-09-11 09:04:00    544.21
    Name: o, dtype: float64
    
    Series([], Name: o, dtype: float64)       # no datapoints for 10:00-11:00
    
    Series([], Name: o, dtype: float64)       # no datapoints for 11:00-12:00
    
    Series([], Name: o, dtype: float64)       # no datapoints for 12:00-13:00
    
    t                                         # datapoints for 12:00-13:00
    2024-09-11 13:45:00    552.285
    2024-09-11 13:46:00    551.910
    2024-09-11 13:47:00    551.710
    2024-09-11 13:48:00    551.845
    2024-09-11 13:49:00    552.240
    Name: o, dtype: float64
    
                               o             # this is the final output
    t                           
    2024-09-11 08:00:00  548.700
    2024-09-11 09:00:00  544.590
    2024-09-11 10:00:00      NaN             # missing periods are created
    2024-09-11 11:00:00      NaN             #
    2024-09-11 12:00:00      NaN             #
    2024-09-11 13:00:00  552.285
    

    You can observe that empty Series are created for hours 10/11/12 that are missing.