I have a Polars LazyFrame that, after applying several functions, looks like this:
┌───────────────┬──────────────┬─────────────────────────┬──────────────────────────┐
│ citing_patent ┆ cited_patent ┆ cited_patent_issue_date ┆ citing_patent_issue_date │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ date ┆ date │
╞═══════════════╪══════════════╪═════════════════════════╪══════════════════════════╡
│ X ┆ A ┆ 2000-10-20 ┆ 2001-02-08 │
│ X ┆ B ┆ 1999-08-04 ┆ 2001-02-08 │
│ Y ┆ B ┆ 1999-08-04 ┆ 2004-06-04 │
│ N ┆ A ┆ 2000-10-20 ┆ 2021-12-20 │
└───────────────┴──────────────┴─────────────────────────┴──────────────────────────┘
I would like to group it by cited_patent
, and have a column for the number of citing_patent
s within three years of cited_patent_issue_date
.
After reading 15741618, I tried using dateutil.relativedelta
.
Here is the code I have so far:
.group_by("cited_patent")
.agg(
pl.col("cited_patent_issue_date").first(),
(pl.col("citing_patent_issue_date") <= pl.col("cited_patent_issue_date").first() + relativedelta(years=3)).sum()
)
However, this doesn't work, as I get an error:
pyo3_runtime.PanicException: could not convert value relativedelta(years=+3) as a Literal
I can't seem to find anything else on this, so I'm a bit stuck.
What's the recommended way to add years to dates in Polars?
You can use the .dt.offset_by
function. Starting with this data:
import polars as pl
df = pl.DataFrame(
{
"citing_patent": ["X", "X", "Y", "N"],
"cited_patent": ["A", "B", "B", "A"],
"cited_patent_issue_date": [
"2000-10-20",
"1999-08-04",
"1999-08-04",
"2000-10-20",
],
"citing_patent_issue_date": [
"2001-02-08",
"2001-02-08",
"2004-06-04",
"2021-12-20",
],
}
).with_columns(pl.col("^.*_date$").str.to_date())
df
shape: (4, 4)
┌───────────────┬──────────────┬─────────────────────────┬──────────────────────────┐
│ citing_patent ┆ cited_patent ┆ cited_patent_issue_date ┆ citing_patent_issue_date │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ date ┆ date │
╞═══════════════╪══════════════╪═════════════════════════╪══════════════════════════╡
│ X ┆ A ┆ 2000-10-20 ┆ 2001-02-08 │
│ X ┆ B ┆ 1999-08-04 ┆ 2001-02-08 │
│ Y ┆ B ┆ 1999-08-04 ┆ 2004-06-04 │
│ N ┆ A ┆ 2000-10-20 ┆ 2021-12-20 │
└───────────────┴──────────────┴─────────────────────────┴──────────────────────────┘
We can use the offset_by
with 3y
. Note that it's under the dt
namespace of datetime expressions in Polars.
(
df.group_by("cited_patent").agg(
pl.col("cited_patent_issue_date").first(),
(
pl.col("citing_patent_issue_date")
<= pl.col("cited_patent_issue_date").first().dt.offset_by('3y')
).sum(),
)
)
shape: (2, 3)
┌──────────────┬─────────────────────────┬──────────────────────────┐
│ cited_patent ┆ cited_patent_issue_date ┆ citing_patent_issue_date │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
╞══════════════╪═════════════════════════╪══════════════════════════╡
│ B ┆ 1999-08-04 ┆ 1 │
│ A ┆ 2000-10-20 ┆ 1 │
└──────────────┴─────────────────────────┴──────────────────────────┘