python-3.xpython-polars

How to create a rank or an index column based on more than one column?


I have a a problem (a simple one i believe) which i can't get my head around. I'm looking for a way to create a rank or an index column based on two columns in ascending order and partitioning by a specific column.

see example below

data = {
    'id' : [1,2,3,4,5,1,2,3,4,5] ,
    'month' : [1,1,1,1,1,2,2,2,2,2] ,
    'sales' : [20.1, 34.7, 89.4, 932.65, 12.7, 65.2, 893.2, 89.2, 12.0, 12.25],         
    'sector' : ['sec1', 'sec1', 'sec1', 'sec1', 'sec1', 'sec2', 'sec2', 'sec2', 'sec2', 'sec2'],      
}

df = pl.DataFrame(data)

I want to create a rank for every sector sorted by the ['id','month'] ascending for both. so ideally, something among the lines of row_number() in sql.

I tried to first sort the dataframe by ascending order, i.e.,

.sort(by=['month', 'id'], descending=[False, False])

and then something like this:

.with_columns(pl.lit(1).cum_sum().over('sector').alias('order_trade'))

but to no avail. I also attempted some bunch of group_by expressions, and using the rank method but couldn't figure it out.

the result I'm looking for is a 'rank' column which is based off of on the month and id column, where both are in ascending order partitioned by the sector.

Expected Output

i've edited the data input to make it a bit simpler to get my point across.

it sql I'd do something like this:

ROW_NUMBER() OVER(PARTITION BY Sector ORDER BY month_number ASC, id ASC) AS 'rank'

thus, something like this.

shape: (10, 5)
┌──────┬─────┬──────────────┬────────┬────────┐
│ rank ┆ id  ┆ month_number ┆ sector ┆ sales  │
│ ---  ┆ --- ┆ ---          ┆ ---    ┆ ---    │
│ i64  ┆ i64 ┆ i64          ┆ str    ┆ f64    │
╞══════╪═════╪══════════════╪════════╪════════╡
│ 1    ┆ 1   ┆ 1            ┆ sec1   ┆ 20.1   │
│ 2    ┆ 2   ┆ 1            ┆ sec1   ┆ 34.7   │
│ 3    ┆ 3   ┆ 1            ┆ sec1   ┆ 89.4   │
│ 4    ┆ 4   ┆ 1            ┆ sec1   ┆ 932.65 │
│ 5    ┆ 5   ┆ 1            ┆ sec1   ┆ 12.7   │
│ 1    ┆ 1   ┆ 2            ┆ sec2   ┆ 65.2   │
│ 2    ┆ 2   ┆ 2            ┆ sec2   ┆ 893.2  │
│ 3    ┆ 3   ┆ 2            ┆ sec2   ┆ 89.2   │
│ 4    ┆ 4   ┆ 2            ┆ sec2   ┆ 12.0   │
│ 5    ┆ 5   ┆ 2            ┆ sec2   ┆ 12.25  │
└──────┴─────┴──────────────┴────────┴────────┘

Thanks in advance for any input!


Solution

  • I've shuffled the order of your example.

    df = pl.from_repr("""
    ┌─────┬───────┬────────┬────────┐
    │ id  ┆ month ┆ sales  ┆ sector │
    │ --- ┆ ---   ┆ ---    ┆ ---    │
    │ i64 ┆ i64   ┆ f64    ┆ str    │
    ╞═════╪═══════╪════════╪════════╡
    │ 2   ┆ 1     ┆ 34.7   ┆ sec1   │
    │ 1   ┆ 1     ┆ 20.1   ┆ sec1   │
    │ 3   ┆ 1     ┆ 89.4   ┆ sec1   │
    │ 10  ┆ 2     ┆ 12.25  ┆ sec2   │
    │ 5   ┆ 1     ┆ 12.7   ┆ sec1   │
    │ 6   ┆ 2     ┆ 65.2   ┆ sec2   │
    │ 4   ┆ 1     ┆ 932.65 ┆ sec1   │
    │ 8   ┆ 2     ┆ 89.2   ┆ sec2   │
    │ 9   ┆ 2     ┆ 12.0   ┆ sec2   │
    │ 7   ┆ 2     ┆ 893.2  ┆ sec2   │
    └─────┴───────┴────────┴────────┘
    """)
    

    It looks like you're after a "dense rank":

    df.with_columns(rank = pl.col("id").rank("dense").over("sector"))
    
    shape: (10, 5)
    ┌─────┬───────┬────────┬────────┬──────┐
    │ id  ┆ month ┆ sales  ┆ sector ┆ rank │
    │ --- ┆ ---   ┆ ---    ┆ ---    ┆ ---  │
    │ i64 ┆ i64   ┆ f64    ┆ str    ┆ u32  │
    ╞═════╪═══════╪════════╪════════╪══════╡
    │ 2   ┆ 1     ┆ 34.7   ┆ sec1   ┆ 2    │
    │ 1   ┆ 1     ┆ 20.1   ┆ sec1   ┆ 1    │
    │ 3   ┆ 1     ┆ 89.4   ┆ sec1   ┆ 3    │
    │ 10  ┆ 2     ┆ 12.25  ┆ sec2   ┆ 5    │
    │ 5   ┆ 1     ┆ 12.7   ┆ sec1   ┆ 5    │
    │ 6   ┆ 2     ┆ 65.2   ┆ sec2   ┆ 1    │
    │ 4   ┆ 1     ┆ 932.65 ┆ sec1   ┆ 4    │
    │ 8   ┆ 2     ┆ 89.2   ┆ sec2   ┆ 3    │
    │ 9   ┆ 2     ┆ 12.0   ┆ sec2   ┆ 4    │
    │ 7   ┆ 2     ┆ 893.2  ┆ sec2   ┆ 2    │
    └─────┴───────┴────────┴────────┴──────┘