I would like to construct a Pandas series whose dtype can be any of several dtypes.
I was hoping to do something like this:
```python
from hypothesis import given
import hypothesis.strategies as hs
import hypothesis.extra.numpy as hs_np
import hypothesis.extra.pandas as hs_pd
import numpy as np
import pandas as pd
import pandera as pda
import pytest

data_schema = pda.DataFrameSchema(...)


def dtype_not_float64() -> hs.SearchStrategy[np.dtype]:
    return hs.one_of(
        hs_np.integer_dtypes(),
        hs_np.complex_number_dtypes(),
        hs_np.datetime64_dtypes(),
        hs_np.timedelta64_dtypes(),
    )


@given(
    hs_pd.data_frames([
        hs_pd.column("x", dtype=dtype_not_float64()),
        hs_pd.column("y", dtype=dtype_not_float64()),
        hs_pd.column("z", dtype=dtype_not_float64()),
    ])
)
def test_invalid(df: pd.DataFrame) -> None:
    r"""Test that the schema does not pass invalid data."""
    with pytest.raises(pda.errors.SchemaError):
        _ = data_schema(df)
```
Arguably this is a silly test, but I hope it serves to illustrate what I am trying to achieve.
However, I got this error:
```
E   hypothesis.errors.InvalidArgument: Cannot convert dtype=one_of(integer_dtypes(), complex_number_dtypes(), datetime64_dtypes(), timedelta64_dtypes()) of type OneOfStrategy to type dtype
```
Apparently `one_of()` won't work with the `dtype=` parameter here.
Is there a straightforward way to generate a column with multiple possible dtypes?
This code is failing because the `dtype=` argument to `column` must actually be a dtype, not a strategy that generates dtypes (docs). Unfortunately, `column` objects are special placeholder objects rather than strategies, so you can't `st.one_of()` those either...
Solution: build up a strategy for each series, put those in a list, and `pd.concat()` them into a dataframe:
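```python
import hypothesis.strategies as st  # the question imports this module as hs

names = ["x", "y", "z"]
# column() is not a strategy, so draw a dtype and build a series strategy from it.
df = st.tuples(*[
    dtype_not_float64().flatmap(lambda dt: hs_pd.series(dtype=dt))
    for _ in names
]).map(lambda ss: pd.concat(list(ss), axis=1, keys=names))
```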
...although this is fiddly enough that I'd suggest using an explicit `@st.composite` function to make the logic more obvious:
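```python
import hypothesis.strategies as st  # the question imports this module as hs


@st.composite
def dataframes_with_names_and_dtypes(draw, names, dtype_strategy):
    # Draw a concrete dtype for each column before building the dataframe strategy.
    cols = [hs_pd.column(name, dtype=draw(dtype_strategy)) for name in names]
    return draw(hs_pd.data_frames(cols))


df = dataframes_with_names_and_dtypes(["x", "y", "z"], dtype_not_float64())
```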
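For completeness, here is a rough end-to-end sketch of how the composite strategy might be plugged into a test. The names `native_dtypes` and `check_columns` are hypothetical, and the dtype strategy is deliberately narrowed to native-endian integer and complex dtypes, on the assumption that this keeps the example robust across pandas versions. Swap in your own dtype strategy and schema check as needed.

```python
import hypothesis.extra.numpy as hs_np
import hypothesis.extra.pandas as hs_pd
import hypothesis.strategies as st
import numpy as np
from hypothesis import HealthCheck, given, settings


def native_dtypes():
    # Hypothetical, narrowed stand-in for dtype_not_float64() (an assumption).
    return st.one_of(
        hs_np.integer_dtypes(endianness="="),
        hs_np.complex_number_dtypes(endianness="="),
    )


@st.composite
def dataframes_with_names_and_dtypes(draw, names, dtype_strategy):
    # Draw one concrete dtype per column, then draw the dataframe itself.
    cols = [hs_pd.column(name, dtype=draw(dtype_strategy)) for name in names]
    return draw(hs_pd.data_frames(cols))


@settings(
    max_examples=25,
    deadline=None,
    database=None,
    suppress_health_check=[HealthCheck.too_slow],
)
@given(dataframes_with_names_and_dtypes(["x", "y", "z"], native_dtypes()))
def check_columns(df):
    # Every generated frame has the requested columns and no float64 column.
    assert list(df.columns) == ["x", "y", "z"]
    assert all(dt != np.dtype("float64") for dt in df.dtypes)


# Calling a @given-decorated function with no arguments runs the property check.
check_columns()
```

Because each dtype is drawn inside the composite function, the three columns can end up with different dtypes within the same generated dataframe.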