I am attempting to resample a set of stock market data from a database to different time variances. I am using a pandas dataframe with the data already read in from my database with no issues. That code is below for reference.
vals = dbconn.dbread(checkquery,val)
vals = pd.DataFrame(vals)
#vals['t']=pd.to_datetime(vals['t'])
vals = vals.set_index('t')
print(vals)
with the below output.
o c h l v vw n
t
2024-09-11 08:30:00 548.700 547.720 548.760 547.557 566501 548 5731
2024-09-11 09:01:00 543.850 544.630 544.780 543.780 394648 546 2891
2024-09-11 09:02:00 544.590 544.225 544.670 544.120 231857 547 1503
2024-09-11 09:03:00 544.250 544.200 544.349 544.110 65242 544 1297
2024-09-11 09:04:00 544.210 544.170 544.440 543.950 124656 544 1724
... ... ... ... ... ... ... ...
2024-09-11 13:45:00 552.285 551.910 552.332 551.900 93449 552 987
2024-09-11 13:46:00 551.910 551.715 552.020 551.710 169873 552 1512
2024-09-11 13:47:00 551.710 551.840 551.910 551.580 198264 552 1540
2024-09-11 13:48:00 551.845 552.250 552.280 551.840 124869 552 1222
2024-09-11 13:49:00 552.240 552.100 552.250 552.020 275982 551 1117
[290 rows x 7 columns]
I was attempting to follow this same example with the specific answer here. Although there were no comments or explanation it did appear that it does work from reading the other comments.
After creating my DataFrame with the correct data I attempted to run the resample and it fails with the below output.
code:
vals = vals.resample('5min').agg({
'o': lambda vals: vals.iloc[0],
'h': lambda vals: vals.max(),
'l': lambda vals: vals.min(),
'c': lambda vals: vals.iloc[-1],
'v': lambda vals: vals.sum(),
'n': lambda vals: vals.sum()
})
print(vals)
output:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[18], line 11
1 vals.agg({
2 'o': lambda vals: vals.iloc[0],
3 'h': lambda vals: vals.max(),
(...)
7 'n': lambda vals: vals.sum()
8 })
9 print(vals)
---> 11 vals = vals.resample('5min').agg({
12 'o': lambda vals: vals.iloc[0],
13 'h': lambda vals: vals.max(),
14 'l': lambda vals: vals.min(),
15 'c': lambda vals: vals.iloc[-1],
16 'v': lambda vals: vals.sum(),
17 'n': lambda vals: vals.sum()
18 })
19 print(vals)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pandas\core\resample.py:352, in Resampler.aggregate(self, func, *args, **kwargs)
343 @final
344 @doc(
345 _shared_docs["aggregate"],
(...)
...
1683 len_axis = len(self.obj._get_axis(axis))
1684 if key >= len_axis or key < -len_axis:
-> 1685 raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...
I first thought I had something wrong with my .agg
function so I tested that separate as shown below.
vals.agg({
'o': lambda vals: vals.iloc[0],
'h': lambda vals: vals.max(),
'l': lambda vals: vals.min(),
'c': lambda vals: vals.iloc[-1],
'v': lambda vals: vals.sum(),
'n': lambda vals: vals.sum()
})
print(vals)
Where I get the below output:
o c h l v vw n
t
2024-09-11 08:30:00 548.700 547.720 548.760 547.557 566501 548 5731
2024-09-11 09:01:00 543.850 544.630 544.780 543.780 394648 546 2891
2024-09-11 09:02:00 544.590 544.225 544.670 544.120 231857 547 1503
2024-09-11 09:03:00 544.250 544.200 544.349 544.110 65242 544 1297
2024-09-11 09:04:00 544.210 544.170 544.440 543.950 124656 544 1724
... ... ... ... ... ... ... ...
2024-09-11 13:45:00 552.285 551.910 552.332 551.900 93449 552 987
2024-09-11 13:46:00 551.910 551.715 552.020 551.710 169873 552 1512
2024-09-11 13:47:00 551.710 551.840 551.910 551.580 198264 552 1540
2024-09-11 13:48:00 551.845 552.250 552.280 551.840 124869 552 1222
2024-09-11 13:49:00 552.240 552.100 552.250 552.020 275982 551 1117
[290 rows x 7 columns]
.iloc[0]
as the first item in the list or a .iloc[-1]
for the last item in the list..resample('3min')
the first time then changed to .resample('5min')
and it would not work again even after changing back to 3min.The issue is that resample
will create empty Series for the missing periods (see below for more details), and iloc[0]
/iloc[-1]
will fail on those.
What you could do is to check the length/validity of the Series and return None otherwise. For example with v.iloc[0] if len(v) else None
or None if v.empty else v.iloc[0]
:
vals.resample('5min').agg({
'o': lambda v: None if v.empty else v.iloc[0],
'h': lambda v: v.max(),
'l': lambda v: v.min(),
'c': lambda v: v.iloc[-1] if len(v) else None,
'v': lambda v: v.sum(),
'n': lambda v: v.sum()
})
Note that in this particular case, you should rather go with the builtin 'first'
/'last'
, and similarly for other standard aggregations (no need to use a custom function):
vals.resample('5min').agg({
'o': 'first',
'h': 'max',
'l': 'min',
'c': 'last',
'v': 'sum',
'n': 'sum',
})
Output:
o h l c v n
t
2024-09-11 08:30:00 548.700 548.760 547.557 547.72 566501 5731
2024-09-11 08:35:00 NaN NaN NaN NaN 0 0
2024-09-11 08:40:00 NaN NaN NaN NaN 0 0
2024-09-11 08:45:00 NaN NaN NaN NaN 0 0
2024-09-11 08:50:00 NaN NaN NaN NaN 0 0
... ... ... ... ... ... ...
2024-09-11 13:25:00 NaN NaN NaN NaN 0 0
2024-09-11 13:30:00 NaN NaN NaN NaN 0 0
2024-09-11 13:35:00 NaN NaN NaN NaN 0 0
2024-09-11 13:40:00 NaN NaN NaN NaN 0 0
2024-09-11 13:45:00 552.285 552.332 551.580 552.10 862437 6378
[64 rows x 6 columns]
resample
worksresample
will create a Series for each group or datetimes that match the offset. If there is no datapoint for the period, then an empty Series is created.
Thus iloc[0]
wouldn't fail for example with resample('12h')
since you have at least one value for each period of 12 hours:
o h l c v n
t
2024-09-11 00:00:00 None 548.760 543.78 544.17 1382904 13146
2024-09-11 12:00:00 None 552.332 551.58 552.10 862437 6378
However, for '5min'
there is no data for 08:35−9:00 and pandas will create an empty Series to pass to the aggregation function.
To see this for yourself, you can run for example lambda x: print(x) or x.max()
as aggregation function, which will print each underlying Series before returning the max:
# vals.resample('1h').agg({'o': lambda x: print(x, '\n') or x.max()})
t # datapoints for 8:00-9:00
2024-09-11 08:30:00 548.7
Name: o, dtype: float64
t # datapoints for 9:00-10:00
2024-09-11 09:01:00 543.85
2024-09-11 09:02:00 544.59
2024-09-11 09:03:00 544.25
2024-09-11 09:04:00 544.21
Name: o, dtype: float64
Series([], Name: o, dtype: float64) # no datapoints for 10:00-11:00
Series([], Name: o, dtype: float64) # no datapoints for 11:00-12:00
Series([], Name: o, dtype: float64) # no datapoints for 12:00-13:00
t # datapoints for 12:00-13:00
2024-09-11 13:45:00 552.285
2024-09-11 13:46:00 551.910
2024-09-11 13:47:00 551.710
2024-09-11 13:48:00 551.845
2024-09-11 13:49:00 552.240
Name: o, dtype: float64
o # this is the final output
t
2024-09-11 08:00:00 548.700
2024-09-11 09:00:00 544.590
2024-09-11 10:00:00 NaN # missing periods are created
2024-09-11 11:00:00 NaN #
2024-09-11 12:00:00 NaN #
2024-09-11 13:00:00 552.285
You can observe that empty Series are created for hours 10/11/12 that are missing.