I have a python polars dataframe that is quite large where Pandas runs into memory errors. I want to use python polars but am running into an issue of taking a integer representation of date to make two new columns: PeriodDate, and LagDate. I can do this on a sample in Pandas using the following:
df['PeriodDate'] = pd.to_datetime(df['IntegerDate'],format='%Y%m')
df['LaggedDate'] = df['PeriodDate'] - pd.DateOffset(months=1)
I have tried to do the following:
df.with_columns(
pl.col('IntegerDate').str.strptime(pl.Datetime,"%Y%m")
)
SchemaError: Series of dtype: Int64 != Utf8.
For reference the 'IntegerDate' column is of the format: 202005, 202006, ...etc
I haven't been able to find good examples of how to do this in polars so any help would be greatly appreciated.
Thanks!
Here is Polars implementation
import polars as pl
from datetime import datetime as dt
df = pl.DataFrame({'IntegerDate': [202005, 202006, 202207, 202303, 202109]})
df = df.with_columns(pl.col('IntegerDate').cast(pl.String).str.to_date('%Y%m').alias('PeriodDate'))
df
Output
┌─────────────┬────────────┐
│ IntegerDate ┆ PeriodDate │
│ --- ┆ --- │
│ i64 ┆ date │
╞═════════════╪════════════╡
│ 202005 ┆ 2020-05-01 │
│ 202006 ┆ 2020-06-01 │
│ 202207 ┆ 2022-07-01 │
│ 202303 ┆ 2023-03-01 │
│ 202109 ┆ 2021-09-01 │
└─────────────┴────────────┘
# add `LaggedDate`
df = df.with_columns(pl.col('PeriodDate').dt.offset_by('-1mo').alias('LaggedDate'))
df
Final Output
┌─────────────┬────────────┬────────────┐
│ IntegerDate ┆ PeriodDate ┆ LaggedDate │
│ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ date │
╞═════════════╪════════════╪════════════╡
│ 202005 ┆ 2020-05-01 ┆ 2020-04-01 │
│ 202006 ┆ 2020-06-01 ┆ 2020-05-01 │
│ 202207 ┆ 2022-07-01 ┆ 2022-06-01 │
│ 202303 ┆ 2023-03-01 ┆ 2023-02-01 │
│ 202109 ┆ 2021-09-01 ┆ 2021-08-01 │
└─────────────┴────────────┴────────────┘
more on date offsets - .dt.offset_by()