pythonpandaspython-polars

How to make a new date column off of a integer representation using python polars?


I have a python polars dataframe that is quite large where Pandas runs into memory errors. I want to use python polars but am running into an issue of taking a integer representation of date to make two new columns: PeriodDate, and LagDate. I can do this on a sample in Pandas using the following:

df['PeriodDate'] = pd.to_datetime(df['IntegerDate'],format='%Y%m')
df['LaggedDate'] = df['PeriodDate'] - pd.DateOffset(months=1)

I have tried to do the following:

df.with_columns(
  pl.col('IntegerDate').str.strptime(pl.Datetime,"%Y%m")
)

SchemaError: Series of dtype: Int64 != Utf8.

For reference the 'IntegerDate' column is of the format: 202005, 202006, ...etc

I haven't been able to find good examples of how to do this in polars so any help would be greatly appreciated.

Thanks!


Solution

  • Here is Polars implementation

    import polars as pl
    from datetime import datetime as dt
    
    df = pl.DataFrame({'IntegerDate': [202005, 202006, 202207, 202303, 202109]})
    
    df = df.with_columns(pl.col('IntegerDate').cast(pl.String).str.to_date('%Y%m').alias('PeriodDate'))
    
    df
    

    Output

    ┌─────────────┬────────────┐
    │ IntegerDate ┆ PeriodDate │
    │ ---         ┆ ---        │
    │ i64         ┆ date       │
    ╞═════════════╪════════════╡
    │ 202005      ┆ 2020-05-01 │
    │ 202006      ┆ 2020-06-01 │
    │ 202207      ┆ 2022-07-01 │
    │ 202303      ┆ 2023-03-01 │
    │ 202109      ┆ 2021-09-01 │
    └─────────────┴────────────┘
    
    # add `LaggedDate`
    df = df.with_columns(pl.col('PeriodDate').dt.offset_by('-1mo').alias('LaggedDate'))
    
    df
    

    Final Output

    ┌─────────────┬────────────┬────────────┐
    │ IntegerDate ┆ PeriodDate ┆ LaggedDate │
    │ ---         ┆ ---        ┆ ---        │
    │ i64         ┆ date       ┆ date       │
    ╞═════════════╪════════════╪════════════╡
    │ 202005      ┆ 2020-05-01 ┆ 2020-04-01 │
    │ 202006      ┆ 2020-06-01 ┆ 2020-05-01 │
    │ 202207      ┆ 2022-07-01 ┆ 2022-06-01 │
    │ 202303      ┆ 2023-03-01 ┆ 2023-02-01 │
    │ 202109      ┆ 2021-09-01 ┆ 2021-08-01 │
    └─────────────┴────────────┴────────────┘
    

    more on date offsets - .dt.offset_by()