pythonpython-polarsrust-polarspolars

polars date quarter parsing using strptime returns null


Using the documentation here (which also points to here) I would expect the following use of the Polars strptime function to produce a pl.Date value:

import polars as pl

date_format = "%Y-Q%q-%d"
df = pl.DataFrame({
    "quarter_str": ["2024-Q1-01", "2023-Q3-01", "2025-Q2-01"]
})

## another approach that does not work
#date_format = "%Y-Q%q"
#df = pl.DataFrame({
#    "quarter_str": ["2024-Q1", "2023-Q3", "2025-Q2"]
#})

result = df.with_columns(
    pl.col("quarter_str").str.strptime(pl.Date, format=date_format, strict=False).alias("parsed_date")
)

print(result)

I'm not sure if this is my issue, a Polars issue, or an issue with the Rust library. But it seems like the parsing of the quarter is not performing as expected. Note that neither of the above approaches work (see commented out value). At first I thought it would not assume the first day of the quarter, but passing a day %d value didn't help either.

Is there a python polars approach to convert a string only containing a year/quarter into a pl.Date value?

The expected output for the string '2023-Q3', for example, would be the date July 1st, 2023. Since that date is the first day of the 3rd quarter. The expected output for the string '2023-Q3-27' would be July 27th, 2023 (the 27th day of the quarter). The expected output for the string '2023-Q3-45' would be the 45th day of the 3rd quarter of 2023 - sometime in mid-August, for example.


Solution

  • As far as I know, there is no direct way to parse those formats using pl.Expr.str.strptime. An alternative approach using column expressions is:

    1. Extract the year, quarter and optionally the day since the start of the quarter from the quarter_str

    2. Determine the starting month of the quarter

    3. Construct the quarter start date

    4. If quarter_str contains days at the end of the string, add them to the quarter start date

    import polars as pl
    
    df = pl.DataFrame(
        {
            "quarter_str": ["2023-Q3-01", "2023-Q3-27", "2025-Q4", "2025-Q3-45"],
        }
    )
    
    quarter = pl.col("quarter_str").str.extract("Q(\d)").cast(pl.Int8)
    year = pl.col("quarter_str").str.extract("(\d{4})").cast(pl.Int16)
    month = (quarter - 1) * 3 + 1
    quarter_start = pl.date(year, month, 1)
    
    has_day_of_quarter = pl.col("quarter_str").str.contains("\d{4}-Q\d-\d+")
    days_since_quarter_start = (
        pl.when(has_day_of_quarter)
        .then(pl.col("quarter_str").str.extract("-(\d+)$").cast(pl.Int16) - 1)
        .otherwise(0)
    )
    
    result = df.with_columns(
        (quarter_start + pl.duration(days=days_since_quarter_start)).alias("parsed_date")
    )
    

    Output :

    >>> result
    
    shape: (4, 2)
    ┌─────────────┬─────────────┐
    │ quarter_str ┆ parsed_date │
    │ ---         ┆ ---         │
    │ str         ┆ date        │
    ╞═════════════╪═════════════╡
    │ 2023-Q3-01  ┆ 2023-07-01  │
    │ 2023-Q3-27  ┆ 2023-07-27  │
    │ 2025-Q4     ┆ 2025-10-01  │
    │ 2025-Q3-45  ┆ 2025-08-14  │
    └─────────────┴─────────────┘