datepython-polars

Why is my value returned as a pl.Float32 and not a pl.Date?


Update: This issue has since been resolved. The output is a date type as expected.

shape: (2, 3)
┌──────────┬────────────┬───────────────────────────┐
│ customer ┆ order_date ┆ startdate_before_override │
│ ---      ┆ ---        ┆ ---                       │
│ str      ┆ null       ┆ date                      │
╞══════════╪════════════╪═══════════════════════════╡
│ Mike     ┆ null       ┆ 2023-01-01                │
│ Jan      ┆ null       ┆ 2023-01-01                │
└──────────┴────────────┴───────────────────────────┘

Original question

I have a dataframe with customers and order dates. Order dates can be either a pl.Date or a null, but currently this column contains only null values. I want to create a new column, "startdate_before_override", which is set to either the "order_date" value of the same row or the "startdate" value (a constant of dtype pl.Date, declared at the third line).

I have tried doing this with the when/then/otherwise pattern, adding a check for nulls as the first "when" (edit: I see now that this check is actually not needed, but that's beside the point of my issue).

The problem I am trying to solve is that the values returned to startdate_before_override are pl.Float32 and not pl.Date

import polars as pl

startdate = pl.date(year=2023, month=1, day=1)

data = {"customer": ["Mike", "Jan"], "order_date": [None, None]}

df = pl.DataFrame(data)
(df
.with_columns(
    pl
    .when(pl.col("order_date").is_null())
    .then(startdate)
    .when(pl.col("order_date")>startdate)
    .then(pl.col("order_date"))
    .otherwise(startdate)
    .alias("startdate_before_override"),
)
)

The output is not as I expect:

shape: (2, 3)
┌──────────┬────────────┬───────────────────────────┐
│ customer ┆ order_date ┆ startdate_before_override │
│ ---      ┆ ---        ┆ ---                       │
│ str      ┆ f32        ┆ f32                       │ # <- f32, why?
╞══════════╪════════════╪═══════════════════════════╡
│ Mike     ┆ null       ┆ 19358.0                   │
│ Jan      ┆ null       ┆ 19358.0                   │
└──────────┴────────────┴───────────────────────────┘

Since both rows have null values for the order_date col, I would expect the result to be 2023-01-01 in a pl.Date format.


Solution

  • In your example the order_date column is all nulls and defaulting to the f32 dtype.

    As your when/then can produce startdate or order_date, i.e. either an f32 or pl.Date value, Polars calculates the "supertype", which in this case is f32.

    startdate is being cast to a float, which is where 19358.0 is coming from.

    >>> pl.select(startdate.cast(float))
    shape: (1, 1)
    ┌─────────┐
    │ date    │
    │ ---     │
    │ f64     │
    ╞═════════╡
    │ 19358.0 │
    └─────────┘
    

    You can either explicitly set the type of the column to date on creation, e.g. with schema= / schema_overrides=

    df = pl.DataFrame(data, schema_overrides={"order_date": pl.Date})
    
    df.with_columns(
       pl.when(pl.col("order_date") > startdate)
         .then(pl.col("order_date"))
         .otherwise(startdate)
         .alias("startdate_before_override"),
    )
    

    Or you can explicitly .cast() the result of the .when().then()

    df.with_columns(
       pl.when(pl.col("order_date") > startdate)
         .then(pl.col("order_date"))
         .otherwise(startdate)
         .cast(pl.Date)
         .alias("startdate_before_override"),
    )
    
    shape: (2, 3)
    ┌──────────┬────────────┬───────────────────────────┐
    │ customer ┆ order_date ┆ startdate_before_override │
    │ ---      ┆ ---        ┆ ---                       │
    │ str      ┆ date       ┆ date                      │
    ╞══════════╪════════════╪═══════════════════════════╡
    │ Mike     ┆ null       ┆ 2023-01-01                │
    │ Jan      ┆ null       ┆ 2023-01-01                │
    └──────────┴────────────┴───────────────────────────┘