python-polars

Polars: how to add years as literals?


I have a Polars LazyFrame that, after applying several functions, looks like this:

┌───────────────┬──────────────┬─────────────────────────┬──────────────────────────┐
│ citing_patent ┆ cited_patent ┆ cited_patent_issue_date ┆ citing_patent_issue_date │
│ ---           ┆ ---          ┆ ---                     ┆ ---                      │
│ str           ┆ str          ┆ date                    ┆ date                     │
╞═══════════════╪══════════════╪═════════════════════════╪══════════════════════════╡
│ X             ┆ A            ┆ 2000-10-20              ┆ 2001-02-08               │
│ X             ┆ B            ┆ 1999-08-04              ┆ 2001-02-08               │
│ Y             ┆ B            ┆ 1999-08-04              ┆ 2004-06-04               │
│ N             ┆ A            ┆ 2000-10-20              ┆ 2021-12-20               │
└───────────────┴──────────────┴─────────────────────────┴──────────────────────────┘

I would like to group it by cited_patent, and have a column for the number of citing_patents within three years of cited_patent_issue_date.

After reading 15741618, I tried using dateutil.relativedelta.

Here is the code I have so far:

.group_by("cited_patent")
.agg(
    pl.col("cited_patent_issue_date").first(),
    (pl.col("citing_patent_issue_date") <= pl.col("cited_patent_issue_date").first() + relativedelta(years=3)).sum()
)

However, this doesn't work, as I get an error:

pyo3_runtime.PanicException: could not convert value relativedelta(years=+3) as a Literal

I can't seem to find anything else on this, so I'm a bit stuck.

What's the recommended way to add years to dates in Polars?


Solution

  • You can use the .dt.offset_by function. Starting with this data:

    import polars as pl
    
    df = pl.DataFrame(
        {
            "citing_patent": ["X", "X", "Y", "N"],
            "cited_patent": ["A", "B", "B", "A"],
            "cited_patent_issue_date": [
                "2000-10-20",
                "1999-08-04",
                "1999-08-04",
                "2000-10-20",
            ],
            "citing_patent_issue_date": [
                "2001-02-08",
                "2001-02-08",
                "2004-06-04",
                "2021-12-20",
            ],
        }
    ).with_columns(pl.col("^.*_date$").str.to_date())
    df
    
    shape: (4, 4)
    ┌───────────────┬──────────────┬─────────────────────────┬──────────────────────────┐
    │ citing_patent ┆ cited_patent ┆ cited_patent_issue_date ┆ citing_patent_issue_date │
    │ ---           ┆ ---          ┆ ---                     ┆ ---                      │
    │ str           ┆ str          ┆ date                    ┆ date                     │
    ╞═══════════════╪══════════════╪═════════════════════════╪══════════════════════════╡
    │ X             ┆ A            ┆ 2000-10-20              ┆ 2001-02-08               │
    │ X             ┆ B            ┆ 1999-08-04              ┆ 2001-02-08               │
    │ Y             ┆ B            ┆ 1999-08-04              ┆ 2004-06-04               │
    │ N             ┆ A            ┆ 2000-10-20              ┆ 2021-12-20               │
    └───────────────┴──────────────┴─────────────────────────┴──────────────────────────┘
    

    We can use the offset_by with 3y. Note that it's under the dt namespace of datetime expressions in Polars.

    (
        df.group_by("cited_patent").agg(
            pl.col("cited_patent_issue_date").first(),
            (
                pl.col("citing_patent_issue_date")
                <= pl.col("cited_patent_issue_date").first().dt.offset_by('3y')
            ).sum(),
        )
    )
    
    shape: (2, 3)
    ┌──────────────┬─────────────────────────┬──────────────────────────┐
    │ cited_patent ┆ cited_patent_issue_date ┆ citing_patent_issue_date │
    │ ---          ┆ ---                     ┆ ---                      │
    │ str          ┆ date                    ┆ u32                      │
    ╞══════════════╪═════════════════════════╪══════════════════════════╡
    │ B            ┆ 1999-08-04              ┆ 1                        │
    │ A            ┆ 2000-10-20              ┆ 1                        │
    └──────────────┴─────────────────────────┴──────────────────────────┘