python, python-3.x, mapping, python-polars

Is there a way to utilize polars map_batches to make this code more efficient?


I have some polars code that functionally can do what I want, but I feel it is an inefficient implementation at best. I feel that there must be some way to achieve the same result with .map_batches(), but I can't figure out how. Any thoughts or suggestions?

Specifically, my data are organized as follows: each column is a location, and each row is a datetime. What I'm trying to do is calculate the maximum count of consecutive non-zero values (which I converted to Booleans because I don't need the magnitude of the value; I just need to know whether the value is zero or not). Example data and expected output are below:

Example Dummy Data

import polars as pl

pivoted_df = pl.from_repr("""
┌─────────────────────┬────────────┬────────────┐
│ Date                ┆ Location 1 ┆ Location 2 │
│ ---                 ┆ ---        ┆ ---        │
│ datetime[ns]        ┆ i64        ┆ i64        │
╞═════════════════════╪════════════╪════════════╡
│ 2023-01-01 00:00:00 ┆ 0          ┆ 1          │
│ 2023-01-01 01:00:00 ┆ 1          ┆ 1          │
│ 2023-01-01 02:00:00 ┆ 1          ┆ 1          │
│ 2023-01-01 03:00:00 ┆ 0          ┆ 1          │
│ 2023-01-01 04:00:00 ┆ 1          ┆ 1          │
│ 2023-01-01 05:00:00 ┆ 1          ┆ 0          │
│ 2023-01-01 06:00:00 ┆ 1          ┆ 0          │
└─────────────────────┴────────────┴────────────┘
""")

Expected Output:

┌────────────┬───────┐
│ Location   ┆ Value │
│ ---        ┆ ---   │
│ str        ┆ i32   │
╞════════════╪═══════╡
│ Location 1 ┆ 3     │
│ Location 2 ┆ 5     │
└────────────┴───────┘

Below is the code I have that is functional, but it feels like it could be improved by someone smarter and more well-versed in polars than I am.

for col in pivoted_df.drop("Date").columns:
    xy_cont_df_a = (
        pivoted_df.select(pl.col(col))
        .with_columns(
            pl.when(
                pl.col(col).cast(pl.Boolean)
                & pl.col(col)
                .cast(pl.Boolean)
                .shift(-1, fill_value=False)
                .not_()
            ).then(
                pl.count().over(
                    (
                        pl.col(col).cast(pl.Boolean)
                        != pl.col(col).cast(pl.Boolean).shift()
                    ).cum_sum()
                )
            )
        )
        .max()
    )

Solution

  • Update: Polars 0.18.7 added .rle() and .rle_id(), which simplify this type of task.

    pivoted_df.select(
       pl.col('Location 1', 'Location 2').rle()
         .struct['len'].max().name.keep()
    )
    
    shape: (1, 2)
    ┌────────────┬────────────┐
    │ Location 1 ┆ Location 2 │
    │ ---        ┆ ---        │
    │ i64        ┆ i64        │
    ╞════════════╪════════════╡
    │ 3          ┆ 5          │
    └────────────┴────────────┘
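
    If you want the result in the Location/Value layout from the expected output, one possible follow-up (not shown in the original answer) is to unpivot the wide result. A sketch, assuming a recent Polars version where DataFrame.unpivot() is available (older releases call it melt()):

    (
        pivoted_df.select(
            # longest run per location (same expression as above)
            pl.col('Location 1', 'Location 2').rle()
              .struct['len'].max().name.keep()
        )
        # reshape the 1-row wide result into Location/Value rows
        .unpivot(variable_name='Location', value_name='Value')
    )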
    

    With unpivoted data (a long frame with Location and Value columns; one way to build it is sketched after the output below):

    df.group_by('Location').agg(
       pl.col('Value').rle().struct['len'].max()
    )
    
    shape: (2, 2)
    ┌────────────┬───────┐
    │ Location   ┆ Value │
    │ ---        ┆ ---   │
    │ str        ┆ i32   │
    ╞════════════╪═══════╡
    │ Location 2 ┆ 5     │
    │ Location 1 ┆ 3     │
    └────────────┴───────┘
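
    A couple of notes on this version (assumptions on my part, not stated in the original answer): df here is a long-format version of pivoted_df with Location and Value columns, and .rle() encodes runs of zeros as well, so taking the overall maximum run length is only safe when no zero-run is longer than the longest non-zero run. A sketch that builds the long frame and keeps only the non-zero runs, assuming the struct fields are named len and value as in recent Polars releases:

    # Hypothetical reshaping of pivoted_df into a long Location/Value frame
    # (unpivot() replaced melt() in newer Polars versions).
    df = pivoted_df.unpivot(
        index='Date', variable_name='Location', value_name='Value'
    )

    df.group_by('Location').agg(
        # Keep run lengths only where the run's value is non-zero,
        # then take the longest such run per location.
        pl.when(pl.col('Value').rle().struct['value'] != 0)
          .then(pl.col('Value').rle().struct['len'])
          .otherwise(None)
          .max()
          .alias('Value')
    )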