pythondataframepython-polarsrust-polars

Repeating a date in polars and exploding it


I have a polars dataframe with two date columns that represent a start and end date and then a value that I want to repeat for all dates in between those two dates so that I can join those on other tables.

Example input is

df = pl.from_repr("""
┌─────┬─────────────────────┬─────────────────────┬───────┐
│ id  ┆ start               ┆ end                 ┆ value │
│ --- ┆ ---                 ┆ ---                 ┆ ---   │
│ str ┆ datetime[ns]        ┆ datetime[ns]        ┆ i64   │
╞═════╪═════════════════════╪═════════════════════╪═══════╡
│ 123 ┆ 2022-01-01 00:00:00 ┆ 2022-01-04 00:00:00 ┆ 10    │
│ abc ┆ 2022-03-04 00:00:00 ┆ 2022-03-04 00:00:00 ┆ 3     │
│ 456 ┆ 2022-05-11 00:00:00 ┆ 2022-05-16 00:00:00 ┆ 4     │
└─────┴─────────────────────┴─────────────────────┴───────┘
""")

and expected output is

shape: (11, 3)
┌─────┬─────────────────────┬───────┐
│ id  ┆ date                ┆ value │
│ --- ┆ ---                 ┆ ---   │
│ str ┆ datetime[ns]        ┆ i64   │
╞═════╪═════════════════════╪═══════╡
│ 123 ┆ 2022-01-01 00:00:00 ┆ 10    │
│ 123 ┆ 2022-01-02 00:00:00 ┆ 10    │
│ 123 ┆ 2022-01-03 00:00:00 ┆ 10    │
│ 123 ┆ 2022-01-04 00:00:00 ┆ 10    │
│ abc ┆ 2022-03-04 00:00:00 ┆ 3     │
│ 456 ┆ 2022-05-11 00:00:00 ┆ 4     │
│ 456 ┆ 2022-05-12 00:00:00 ┆ 4     │
│ 456 ┆ 2022-05-13 00:00:00 ┆ 4     │
│ 456 ┆ 2022-05-14 00:00:00 ┆ 4     │
│ 456 ┆ 2022-05-15 00:00:00 ┆ 4     │
│ 456 ┆ 2022-05-16 00:00:00 ┆ 4     │
└─────┴─────────────────────┴───────┘

Solution

  • I struggled today with the same problem and I thought I share my solution.

    As cbilot already mentions pl.date_range doesn't take expressions as low and high value. So I worked around by using apply.

    Data:

    import polars as pl
    from datetime import date
    
    df = pl.DataFrame(
        {
            "id": ["123", "abc", "456"],
            "start": [date(2022, 1, 1), date(2022, 3, 4), date(2022, 5, 11)],
            "end": [date(2022, 1, 4), date(2022, 3, 4), date(2022, 5, 16)],
            "value": [10, 3, 4],
        }
    )
    

    Solution:

    (
        df.with_columns(
            pl.struct("start", "end")
              .map_elements(lambda x: pl.date_range(x["start"], x["end"], "1d", eager=True))
              .alias("date"))
          .explode("date")
          .select("id", "date", "value")
    )
    
    shape: (11, 3)
    ┌─────┬────────────┬───────┐
    │ id  ┆ date       ┆ value │
    │ --- ┆ ---        ┆ ---   │
    │ str ┆ date       ┆ i64   │
    ╞═════╪════════════╪═══════╡
    │ 123 ┆ 2022-01-01 ┆ 10    │
    │ 123 ┆ 2022-01-02 ┆ 10    │
    │ 123 ┆ 2022-01-03 ┆ 10    │
    │ 123 ┆ 2022-01-04 ┆ 10    │
    │ abc ┆ 2022-03-04 ┆ 3     │
    │ 456 ┆ 2022-05-11 ┆ 4     │
    │ 456 ┆ 2022-05-12 ┆ 4     │
    │ 456 ┆ 2022-05-13 ┆ 4     │
    │ 456 ┆ 2022-05-14 ┆ 4     │
    │ 456 ┆ 2022-05-15 ┆ 4     │
    │ 456 ┆ 2022-05-16 ┆ 4     │
    └─────┴────────────┴───────┘