I have a Categorical column named decile in my polars DataFrame df, with its values ranging from "01" to "10". When attempting to convert that column into a numerical representation via:
df.with_columns(pl.col('decile').cast(pl.Int8))
the cast values are not mapped as expected (i.e., "01" is not mapped to 1, and so on), and the range is now 0 to 9 rather than 1 to 10.
The weird thing is that no matter what the original values of the column decile were, they always get mapped unexpectedly, to the range [0, 9], when the column is cast to an integer datatype.
I am trying to cast the values to an integer datatype for plotting purposes.
Here is a minimal reproducible example:
import numpy as np
import polars as pl

size = 1_000  # number of synthetic rows
df = pl.DataFrame({
    "id": np.random.randint(50, size=size, dtype=np.uint16),
    "amount": np.round(np.random.uniform(10, 100000, size).astype(np.float32), 2),
    "quantity": np.random.randint(1, 7, size=size, dtype=np.uint16),
})
# Aggregate revenue and quantity per id
df = (
    df
    .group_by("id")
    .agg(revenue=pl.sum("amount"), tot_quantity=pl.sum("quantity"))
)
# Bin revenue into deciles labelled q10 (lowest) to q01 (highest)
df = df.with_columns(
    pl.col("revenue")
    .qcut(10, labels=[f"q{i:02}" for i in range(10, 0, -1)])
    .alias("decile")
)
How can I make the cast map the values as one would expect, and keep them in the same range as the original values?
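A direct cast from pl.Categorical to an integer type returns the underlying physical codes of the categories rather than parsing the labels, which is why you get 0 to 9 regardless of the label text. The first cast on a pl.Categorical should always be to string (pl.String), and the conversion from string to int happens from there (in your example, a bit more than a straight cast is required to strip the leading q):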
pl.col('decile').cast(pl.String).str.slice(1).str.to_integer()