python-3.x pandas dataframe pandas-resample

Pandas .resample with .agg function failing on "IndexError: single positional indexer is out-of-bounds

I am attempting to resample a set of stock market data from a database to different time variances. I am using a pandas dataframe with the data already read in from my database with no issues. That code is below for reference.

vals = dbconn.dbread(checkquery,val)
vals = pd.DataFrame(vals)
#vals['t']=pd.to_datetime(vals['t'])
vals = vals.set_index('t')
print(vals)

with the below output.

                           o        c        h        l       v   vw     n
t                                                                         
2024-09-11 08:30:00  548.700  547.720  548.760  547.557  566501  548  5731
2024-09-11 09:01:00  543.850  544.630  544.780  543.780  394648  546  2891
2024-09-11 09:02:00  544.590  544.225  544.670  544.120  231857  547  1503
2024-09-11 09:03:00  544.250  544.200  544.349  544.110   65242  544  1297
2024-09-11 09:04:00  544.210  544.170  544.440  543.950  124656  544  1724
...                      ...      ...      ...      ...     ...  ...   ...
2024-09-11 13:45:00  552.285  551.910  552.332  551.900   93449  552   987
2024-09-11 13:46:00  551.910  551.715  552.020  551.710  169873  552  1512
2024-09-11 13:47:00  551.710  551.840  551.910  551.580  198264  552  1540
2024-09-11 13:48:00  551.845  552.250  552.280  551.840  124869  552  1222
2024-09-11 13:49:00  552.240  552.100  552.250  552.020  275982  551  1117

[290 rows x 7 columns]

I was attempting to follow this same example with the specific answer here. Although there were no comments or explanation it did appear that it does work from reading the other comments.

Attempt

After creating my DataFrame with the correct data I attempted to run the resample and it fails with the below output.

code:

vals = vals.resample('5min').agg({
    'o': lambda vals: vals.iloc[0],
    'h': lambda vals: vals.max(),
    'l': lambda vals: vals.min(),
    'c': lambda vals: vals.iloc[-1],
    'v': lambda vals: vals.sum(),
    'n': lambda vals: vals.sum()
})
print(vals)

output:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[18], line 11
      1 vals.agg({
      2     'o': lambda vals: vals.iloc[0],
      3     'h': lambda vals: vals.max(),
   (...)
      7     'n': lambda vals: vals.sum()
      8 })
      9 print(vals)
---> 11 vals = vals.resample('5min').agg({
     12     'o': lambda vals: vals.iloc[0],
     13     'h': lambda vals: vals.max(),
     14     'l': lambda vals: vals.min(),
     15     'c': lambda vals: vals.iloc[-1],
     16     'v': lambda vals: vals.sum(),
     17     'n': lambda vals: vals.sum()
     18 })
     19 print(vals)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pandas\core\resample.py:352, in Resampler.aggregate(self, func, *args, **kwargs)
    343 @final
    344 @doc(
    345     _shared_docs["aggregate"],
   (...)
...
   1683 len_axis = len(self.obj._get_axis(axis))
   1684 if key >= len_axis or key < -len_axis:
-> 1685     raise IndexError("single positional indexer is out-of-bounds")

IndexError: single positional indexer is out-of-bounds
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...

Troubleshooting

I first thought I had something wrong with my .agg function so I tested that separate as shown below.

vals.agg({
    'o': lambda vals: vals.iloc[0],
    'h': lambda vals: vals.max(),
    'l': lambda vals: vals.min(),
    'c': lambda vals: vals.iloc[-1],
    'v': lambda vals: vals.sum(),
    'n': lambda vals: vals.sum()
})
print(vals)

Where I get the below output:

                           o        c        h        l       v   vw     n
t                                                                         
2024-09-11 08:30:00  548.700  547.720  548.760  547.557  566501  548  5731
2024-09-11 09:01:00  543.850  544.630  544.780  543.780  394648  546  2891
2024-09-11 09:02:00  544.590  544.225  544.670  544.120  231857  547  1503
2024-09-11 09:03:00  544.250  544.200  544.349  544.110   65242  544  1297
2024-09-11 09:04:00  544.210  544.170  544.440  543.950  124656  544  1724
...                      ...      ...      ...      ...     ...  ...   ...
2024-09-11 13:45:00  552.285  551.910  552.332  551.900   93449  552   987
2024-09-11 13:46:00  551.910  551.715  552.020  551.710  169873  552  1512
2024-09-11 13:47:00  551.710  551.840  551.910  551.580  198264  552  1540
2024-09-11 13:48:00  551.845  552.250  552.280  551.840  124869  552  1222
2024-09-11 13:49:00  552.240  552.100  552.250  552.020  275982  551  1117

[290 rows x 7 columns]

Conclusion

Why am I getting an index error when I have 290 rows in the dataframe and I am asking for a .iloc[0] as the first item in the list or a .iloc[-1] for the last item in the list.
I also am fairly new to using a lambda function so in this case I simply copied and pasted that part so if there is an issue there some understanding would help.
It did properly function one time when I had a time frame of .resample('3min') the first time then changed to .resample('5min') and it would not work again even after changing back to 3min.

Solution

The issue is that resample will create empty Series for the missing periods (see below for more details), and iloc[0]/iloc[-1] will fail on those.

What you could do is to check the length/validity of the Series and return None otherwise. For example with v.iloc[0] if len(v) else None or None if v.empty else v.iloc[0]:

vals.resample('5min').agg({
    'o': lambda v: None if v.empty else v.iloc[0],
    'h': lambda v: v.max(),
    'l': lambda v: v.min(),
    'c': lambda v: v.iloc[-1] if len(v) else None,
    'v': lambda v: v.sum(),
    'n': lambda v: v.sum()
})

Note that in this particular case, you should rather go with the builtin 'first'/'last', and similarly for other standard aggregations (no need to use a custom function):

vals.resample('5min').agg({
    'o': 'first',
    'h': 'max',
    'l': 'min',
    'c': 'last',
    'v': 'sum',
    'n': 'sum',
})

Output:

                           o        h        l       c       v     n
t                                                                   
2024-09-11 08:30:00  548.700  548.760  547.557  547.72  566501  5731
2024-09-11 08:35:00      NaN      NaN      NaN     NaN       0     0
2024-09-11 08:40:00      NaN      NaN      NaN     NaN       0     0
2024-09-11 08:45:00      NaN      NaN      NaN     NaN       0     0
2024-09-11 08:50:00      NaN      NaN      NaN     NaN       0     0
...                      ...      ...      ...     ...     ...   ...
2024-09-11 13:25:00      NaN      NaN      NaN     NaN       0     0
2024-09-11 13:30:00      NaN      NaN      NaN     NaN       0     0
2024-09-11 13:35:00      NaN      NaN      NaN     NaN       0     0
2024-09-11 13:40:00      NaN      NaN      NaN     NaN       0     0
2024-09-11 13:45:00  552.285  552.332  551.580  552.10  862437  6378

[64 rows x 6 columns]

how `resample` works

resample will create a Series for each group or datetimes that match the offset. If there is no datapoint for the period, then an empty Series is created.

Thus iloc[0] wouldn't fail for example with resample('12h') since you have at least one value for each period of 12 hours:

                        o        h       l       c        v      n
t                                                                 
2024-09-11 00:00:00  None  548.760  543.78  544.17  1382904  13146
2024-09-11 12:00:00  None  552.332  551.58  552.10   862437   6378

However, for '5min' there is no data for 08:35−9:00 and pandas will create an empty Series to pass to the aggregation function.

To see this for yourself, you can run for example lambda x: print(x) or x.max() as aggregation function, which will print each underlying Series before returning the max:

# vals.resample('1h').agg({'o': lambda x: print(x, '\n') or x.max()})

t                                         # datapoints for 8:00-9:00
2024-09-11 08:30:00    548.7
Name: o, dtype: float64

t                                         # datapoints for 9:00-10:00
2024-09-11 09:01:00    543.85
2024-09-11 09:02:00    544.59
2024-09-11 09:03:00    544.25
2024-09-11 09:04:00    544.21
Name: o, dtype: float64

Series([], Name: o, dtype: float64)       # no datapoints for 10:00-11:00

Series([], Name: o, dtype: float64)       # no datapoints for 11:00-12:00

Series([], Name: o, dtype: float64)       # no datapoints for 12:00-13:00

t                                         # datapoints for 12:00-13:00
2024-09-11 13:45:00    552.285
2024-09-11 13:46:00    551.910
2024-09-11 13:47:00    551.710
2024-09-11 13:48:00    551.845
2024-09-11 13:49:00    552.240
Name: o, dtype: float64

                           o             # this is the final output
t                           
2024-09-11 08:00:00  548.700
2024-09-11 09:00:00  544.590
2024-09-11 10:00:00      NaN             # missing periods are created
2024-09-11 11:00:00      NaN             #
2024-09-11 12:00:00      NaN             #
2024-09-11 13:00:00  552.285

You can observe that empty Series are created for hours 10/11/12 that are missing.

Pandas .resample with .agg function failing on "IndexError: single positional indexer is out-of-bounds

Attempt

Troubleshooting

Conclusion

how resample works

how `resample` works