python optimization python-polars ranking

Polars NDCG optimized calculation

The problem here is to implement NDCG calculation on Polars that would be efficient for huge datasets.

Main idea of NDCG is to calculate DCG and IDCG, let's skip the gain part and only think about discount part, which depends on ranks from ideal and proposed orderings.

So the tricky part for me here is to properly and effectively calculate the positions of similar items from ideal and proposed parts, i.e.:

ideal:   a b c d e f
propsed: d b g e h

so we have intersection of {b, d, e} items with idx_ideal=[2,4,5] (starting from 1) and idx_propsed=[2,1,4]

So what I want is to calculate those idx_proposed and idx_ideal for dataframe with columns (user, ideal, proposed), so the resulting DF would have columns: (user, ideal, proposed, idx_ideal, idx_propsed)

# so its only ideal idx, then I create extra for proposed idx and join them
(
    df
        .explode('ideal')
        .with_columns(idx=pl.int_range(pl.len()).over('user'))
        .filter(pl.col('ideal').is_in(pl.col('proposed')))
        .group_by('user', maintain_order=True)
        .agg(pl.col('idx'))
)

I explode ideal over user and find position by adding extra idx column and leaving only ideal=proposed rows, but this results in extra DF with a subset of rows, which I would have to join back, which is probably not very optimal. Then I have to calculate it once again for proposed side.

Moreover, I will have to explode over (idx_ideal, idx_proposed) on the next step to calculate user's IDCG, DCG and NDCG. Could you help me optimize those calculations?

I think I should use that users do not interact with each-other and separate rows could be processed separately.

Here is the random data generator

import polars as pl
import random

num_users = 100_000
min_len = 10
max_len = 200
item_range = 10_000

def generate_user_data():
    length = random.randint(min_len, max_len)
    ideal = random.sample(range(item_range), length)

    length = random.randint(min_len, max_len)
    predicted = random.sample(range(item_range), length)
    
    return ideal, predicted

data = []
for user_id in range(num_users):
    ideal, predicted = generate_user_data()
    data.append({
        'user': user_id,
        'ideal': ideal,
        'proposed': predicted
    })

df = pl.DataFrame(data)
print(df.head())

shape: (5, 3)
┌──────┬──────────────────────┬──────────────────────┐
│ user ┆ ideal                ┆ proposed             │
│ ---  ┆ ---                  ┆ ---                  │
│ i64  ┆ list[i64]            ┆ list[i64]            │
╞══════╪══════════════════════╪══════════════════════╡
│ 0    ┆ [9973, 313, … 5733]  ┆ [8153, 3461, … 4602] │
│ 1    ┆ [3756, 9053, … 1014] ┆ [435, 9407, … 6159]  │
│ 2    ┆ [8152, 1615, … 2873] ┆ [5078, 9006, … 8157] │
│ 3    ┆ [6104, 2929, … 2606] ┆ [5110, 790, … 363]   │
│ 4    ┆ [1863, 6801, … 271]  ┆ [5571, 5555, … 5591] │
└──────┴──────────────────────┴──────────────────────┘

Solution

In my testing, I got the fastest results by first getting the intersection:

.list.set_intersection()

There is currently no List API method to get the index positions, but it can be emulated if we .flatten() the lists .over() each "row". (we use with_row_index as the group id)

.arg_true() is used to get the indexes where the .is_in() check is True
mapping_strategy="join" will give us back a list column.

pl.Config(fmt_str_lengths=100, fmt_table_cell_list_len=10)

df = pl.DataFrame({
    "user": [1, 2], 
    "ideal": [["a", "b", "c", "d", "a"], ["x", "y", "z"]], 
    "proposed": [["f", "a", "e", "r", "d"], ["y", "e", "s"]]
})

(df
  .with_row_index()
  .with_columns(
     pl.col("ideal").list.set_intersection("proposed")
       .alias("intersection")
  )
  .with_columns(
     pl.col("ideal").flatten().is_in(pl.col("intersection").flatten())
       .arg_true()
       .over("index", mapping_strategy="join")
       .alias("idx_ideal"),
     pl.col("proposed").flatten().is_in(pl.col("intersection").flatten())
       .arg_true()
       .over("index", mapping_strategy="join")
       .alias("idx_proposed")
  )
)

shape: (2, 7)
┌───────┬──────┬───────────────────────────┬───────────────────────────┬──────────────┬───────────┬──────────────┐
│ index ┆ user ┆ ideal                     ┆ proposed                  ┆ intersection ┆ idx_ideal ┆ idx_proposed │
│ ---   ┆ ---  ┆ ---                       ┆ ---                       ┆ ---          ┆ ---       ┆ ---          │
│ u32   ┆ i64  ┆ list[str]                 ┆ list[str]                 ┆ list[str]    ┆ list[u32] ┆ list[u32]    │
╞═══════╪══════╪═══════════════════════════╪═══════════════════════════╪══════════════╪═══════════╪══════════════╡
│ 0     ┆ 1    ┆ ["a", "b", "c", "d", "a"] ┆ ["f", "a", "e", "r", "d"] ┆ ["a", "d"]   ┆ [0, 3, 4] ┆ [1, 4]       │
│ 1     ┆ 2    ┆ ["x", "y", "z"]           ┆ ["y", "e", "s"]           ┆ ["y"]        ┆ [1]       ┆ [0]          │
└───────┴──────┴───────────────────────────┴───────────────────────────┴──────────────┴───────────┴──────────────┘

First appearance only

As per the updated requirements, you could add .is_first_distinct() to pick the first appearance in the case of duplicate values.

(df.with_row_index()
   .with_columns(pl.col("ideal").list.set_intersection("proposed").alias("intersection"))
   .with_columns(
      (
         pl.col("ideal").flatten().is_first_distinct()
         &
         pl.col("ideal").flatten().is_in(pl.col("intersection").flatten())
      )
      .arg_true()
      .over("index", mapping_strategy="join")
      .alias("idx_ideal")
   )
)

shape: (2, 6)
┌───────┬──────┬───────────────────────────┬───────────────────────────┬──────────────┬───────────┐
│ index ┆ user ┆ ideal                     ┆ proposed                  ┆ intersection ┆ idx_ideal │
│ ---   ┆ ---  ┆ ---                       ┆ ---                       ┆ ---          ┆ ---       │
│ u32   ┆ i64  ┆ list[str]                 ┆ list[str]                 ┆ list[str]    ┆ list[u32] │
╞═══════╪══════╪═══════════════════════════╪═══════════════════════════╪══════════════╪═══════════╡
│ 0     ┆ 1    ┆ ["a", "b", "c", "d", "a"] ┆ ["f", "a", "e", "r", "d"] ┆ ["a", "d"]   ┆ [0, 3]    │
│ 1     ┆ 2    ┆ ["x", "y", "z"]           ┆ ["y", "e", "s"]           ┆ ["y"]        ┆ [1]       │
└───────┴──────┴───────────────────────────┴───────────────────────────┴──────────────┴───────────┘

Build expressions with a function

As the expression has become rather complex, it may be desirable to put it inside a helper function.

def index_of_intersection(expr):
    return (
        (
            expr.flatten().is_first_distinct() 
            & 
            expr.flatten().is_in(pl.col("intersection").flatten())
        )
        .arg_true()
        .over("index", mapping_strategy="join")
        .name.prefix("idx_")
    )

And use .pipe() to call it.

(df
  .with_row_index()
  .with_columns(
     pl.col("ideal").list.set_intersection("proposed")
       .alias("intersection")
  )
  .with_columns(
     pl.col("ideal", "proposed").pipe(index_of_intersection)
  )
)

shape: (2, 7)
┌───────┬──────┬───────────────────────────┬───────────────────────────┬──────────────┬───────────┬──────────────┐
│ index ┆ user ┆ ideal                     ┆ proposed                  ┆ intersection ┆ idx_ideal ┆ idx_proposed │
│ ---   ┆ ---  ┆ ---                       ┆ ---                       ┆ ---          ┆ ---       ┆ ---          │
│ u32   ┆ i64  ┆ list[str]                 ┆ list[str]                 ┆ list[str]    ┆ list[u32] ┆ list[u32]    │
╞═══════╪══════╪═══════════════════════════╪═══════════════════════════╪══════════════╪═══════════╪══════════════╡
│ 0     ┆ 1    ┆ ["a", "b", "c", "d", "a"] ┆ ["f", "a", "e", "r", "d"] ┆ ["a", "d"]   ┆ [0, 3]    ┆ [1, 4]       │
│ 1     ┆ 2    ┆ ["x", "y", "z"]           ┆ ["y", "e", "s"]           ┆ ["y"]        ┆ [1]       ┆ [0]          │
└───────┴──────┴───────────────────────────┴───────────────────────────┴──────────────┴───────────┴──────────────┘