The following works as expected
from datetime import datetime
from hypothesis.extra.pandas import columns, data_frames, indexes
import hypothesis.strategies as st
def boundarize(d: datetime):
return d.replace(minute=15 * (d.minute // 15), second=0, microsecond=0)
min_date = datetime(2022, 4, 1, 22, 22, 22)
max_date = datetime(2022, 5, 1, 22, 22, 22)
dfs = data_frames(
index=indexes(
elements=st.datetimes(min_value=min_date, max_value=max_date).map(boundarize),
min_size=3,
max_size=5,
).map(lambda idx: idx.sort_values()),
columns=columns("A B C".split(), dtype=int),
)
dfs.example()
with an output similar to
A B C
2022-04-06 12:45:00 -11482 1588438979 -1994987295
2022-04-08 15:45:00 -833447611 3 -51
2022-04-24 06:15:00 -465371373 990274387 -14969
2022-05-01 01:15:00 1750446827 1214440777 116
2022-05-01 06:15:00 -44089 30508 58737
now when I try to generate a similar DataFrame with evenly spaced DatetimeIndex values via
from datetime import datetime
from hypothesis.extra.pandas import columns, data_frames, indexes
import hypothesis.strategies as st
def boundarize(d: datetime):
return d.replace(minute=15 * (d.minute // 15), second=0, microsecond=0)
min_date_start = datetime(2022, 4, 1, 11, 11, 11)
max_date_start = datetime(2022, 4, 2, 11, 11, 11)
min_date_end = datetime(2022, 5, 1, 22, 22, 22)
max_date_end = datetime(2022, 5, 2, 22, 22, 22)
dfs = data_frames(
index=st.builds(pd.date_range,
start=st.datetimes(min_value=min_date_start, max_value=max_date_start).map(boundarize),
end=st.datetimes(min_value=min_date_end, max_value=max_date_end).map(boundarize),
freq=st.just("15T"),
),
columns=columns("A B C".split(), dtype=int),
)
dfs.example()
The output is the following, note that the integer columns are always zero when they were not in the first example:
A B C
2022-04-01 15:45:00 0 0 0
2022-04-01 16:00:00 0 0 0
2022-04-01 16:15:00 0 0 0
2022-04-01 16:30:00 0 0 0
2022-04-01 16:45:00 0 0 0
... .. .. ..
2022-05-01 21:15:00 0 0 0
2022-05-01 21:30:00 0 0 0
2022-05-01 21:45:00 0 0 0
2022-05-01 22:00:00 0 0 0
2022-05-01 22:15:00 0 0 0
[2907 rows x 3 columns]
is this expected behavior or am I missing something?
Edit:
Sidestepping the approach of "random consecutive subsets" (see my comments below), I also tried with a pre-defined index
from datetime import datetime
from hypothesis.extra.pandas import columns, data_frames
import hypothesis.strategies as st
min_date_start = datetime(2022, 4, 1, 8, 0, 0)
dfs = data_frames(
index=st.just(pd.date_range(start=min_date_start, periods=10, freq="15T")),
columns=columns("A B C".split(), dtype=int),
)
dfs.example()
which gives all zero columns as well
A B C
2022-04-01 08:00:00 0 0 0
2022-04-01 08:15:00 0 0 0
2022-04-01 08:30:00 0 0 0
2022-04-01 08:45:00 0 0 0
2022-04-01 09:00:00 0 0 0
2022-04-01 09:15:00 0 0 0
2022-04-01 09:30:00 0 0 0
2022-04-01 09:45:00 0 0 0
2022-04-01 10:00:00 0 0 0
2022-04-01 10:15:00 0 0 0
Edit 2:
I tried to come up with a handmade version of consecutive subsets which should reduce the space of values to leave enough entropy for the column values as per @zac-hatfield-dodds answer, but empirically it still generates mostly all zero column values
from datetime import datetime
import math
import hypothesis.strategies as st
from hypothesis.extra.pandas import columns, data_frames
import pandas as pd
time_start = datetime(2022, 4, 1, 8, 0, 0)
time_stop = datetime(2022, 4, 2, 8, 0, 0)
r = pd.date_range(start=time_start, end=time_stop, freq="15T")
def build_indices(sequence):
first = 0
if len(sequence) % 2 == 0:
mid_ceiling = len(sequence) // 2
mid_floor = mid_ceiling - 1
else:
mid_floor = math.floor(len(sequence) / 2)
mid_ceiling = mid_floor + 1
second = len(sequence) - 1
return first, mid_floor, mid_ceiling, second
first, mid_floor, mid_ceiling, second = build_indices(r)
a = st.integers(min_value=first, max_value=mid_floor)
b = st.integers(min_value=mid_ceiling, max_value=second)
def indexer(sequence, lower, upper):
return sequence[lower:upper]
dfs = data_frames(
index=st.builds(lambda lower, upper: indexer(r, lower, upper), lower=a, upper=b),
columns=columns("A B C".split(), dtype=int),
)
dfs.example()
Your problem is that the latter indexes are way way larger, and Hypothesis is running out of entropy to generate column contents. If you limit the index to at most a few dozen entries, everything should work fine.
We have this soft-cap in order to limit otherwise unbounded recursive structures, so the overall design is working as intended though I acknowledge that in this case it's neither necessary nor desirable.