I have a Polars DataFrame that can be generated like so:
import polars as pl
import numpy as np
num_points = 10
group_count = 3
df = pl.DataFrame(
    {
        "group_id": np.concatenate([[i]*num_points for i in range(group_count)]),
        "data": np.concatenate([np.random.rand(num_points) for _ in range(group_count)]),
    }
)
and I want to apply some operation that takes in two groups and outputs a single number, over every pair of groups. The operation itself is not a native Polars expression (it is shapely.frechet_distance), but as a starting point I would like to see whether this is possible to do fast using a Polars expression, like taking the dot product.
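For context, the real operation takes two curves and returns a single number. A minimal sketch of what I mean, with made-up coordinates and assuming shapely >= 2.0:
import shapely

line_a = shapely.LineString([(0, 0.1), (1, 0.5), (2, 0.2)])
line_b = shapely.LineString([(0, 0.3), (1, 0.4), (2, 0.9)])
shapely.frechet_distance(line_a, line_b)  # a single float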
An easy solution is to loop over all the groups like so:
def distance_between_groups(df: pl.DataFrame):
    """
    For each pair of groups g1, g2, returns some aggregate function of their data,
    such that we get a G x G matrix of the resulting aggregations.
    """
    groups = df['group_id'].unique(maintain_order=True)
    res = np.zeros((df['group_id'].n_unique(), df['group_id'].n_unique()))
    for i, g1 in enumerate(groups):
        g1_df = df.filter(pl.col('group_id') == g1)
        for j, g2 in enumerate(groups):
            g2_df = df.filter(pl.col('group_id') == g2)
            res[i, j] = (g1_df['data'] * g2_df['data']).sum()
    return res
but I am looking for something more efficient that uses the parallelism of Polars.
In general, doing something "for each pair of things" can be thought of as a Cartesian product. In Polars, this can be achieved via a cross join.
For example, to produce each pair of "group_id" values:
df = pl.DataFrame({"group_id": range(3)})
df.join(df, how="cross")
shape: (9, 2)
┌──────────┬────────────────┐
│ group_id ┆ group_id_right │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════════╪════════════════╡
│ 0 ┆ 0 │
│ 0 ┆ 1 │
│ 0 ┆ 2 │
│ 1 ┆ 0 │
│ 1 ┆ 1 │
│ 1 ┆ 2 │
│ 2 ┆ 0 │
│ 2 ┆ 1 │
│ 2 ┆ 2 │
└──────────┴────────────────┘
In the case of a dot product, you only want to join the data elementwise, not every combination of the underlying records. In other words, for the 30 input rows in your example, the result of the join should have 90 rows: 9 group pairs, as above, with 10 rows each. It should not have 900 rows (each row paired with every other row).
To achieve this, give each row an index within the group and use that as the join key.
# same setup and `df` declaration from question
# give each row an index within the group
df = df.with_columns(index=pl.int_range(pl.len()).over("group_id"))
(
    df.join(df, on="index")
    # group by the pair of groups
    # maintain_order to keep the same order as the numpy array (optional)
    .group_by("group_id", "group_id_right", maintain_order=True)
    .agg(pl.col("data").dot("data_right"))
)
shape: (9, 3)
┌──────────┬────────────────┬──────────┐
│ group_id ┆ group_id_right ┆ data │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞══════════╪════════════════╪══════════╡
│ 0 ┆ 0 ┆ 3.303398 │
│ 1 ┆ 0 ┆ 1.535815 │
│ 2 ┆ 0 ┆ 3.058639 │
│ 0 ┆ 1 ┆ 1.535815 │
│ 1 ┆ 1 ┆ 1.971855 │
│ 2 ┆ 1 ┆ 2.581091 │
│ 0 ┆ 2 ┆ 3.058639 │
│ 1 ┆ 2 ┆ 2.581091 │
│ 2 ┆ 2 ┆ 5.248869 │
└──────────┴────────────────┴──────────┘
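If you want the same G x G NumPy matrix that the loop produces, one option (a sketch, assuming the group_by/agg frame above is bound to a variable named result) is to sort by the pair of ids, since the maintain_order output above is not in sorted order, and then reshape:
# `result` is assumed to hold the aggregated frame from above
matrix = (
    result.sort("group_id", "group_id_right")
    .get_column("data")
    .to_numpy()
    .reshape(group_count, group_count)
)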
And for completeness, the loop version from the question produces the same values:
distance_between_groups(df)
array([[3.30339833, 1.53581549, 3.05863929],
[1.53581549, 1.97185485, 2.58109071],
[3.05863929, 2.58109071, 5.24886858]])
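As for the non-native operation itself: since shapely.frechet_distance is a Python-level function, the same pairwise pattern can still be reused by aggregating each group's data into a list, cross joining the groups, and applying the function per row with map_elements. Below is a minimal sketch, assuming shapely >= 2.0 and, purely for illustration, that each group's curve uses the within-group position as the x-coordinate. Note that the Python callable runs once per row, so this step does not benefit from Polars' native parallelism.
import shapely

# one row per group, with that group's "data" gathered into a list
grouped = df.group_by("group_id", maintain_order=True).agg("data")
pairs = grouped.join(grouped, how="cross")

def frechet(row: dict) -> float:
    # build a 2D curve per group; using the position within the group as
    # the x-coordinate is an assumption made for this illustration
    line_a = shapely.LineString(list(enumerate(row["data"])))
    line_b = shapely.LineString(list(enumerate(row["data_right"])))
    return shapely.frechet_distance(line_a, line_b)

pairs.with_columns(
    pl.struct("data", "data_right")
    .map_elements(frechet, return_dtype=pl.Float64)
    .alias("frechet")
)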