I have a simple dataframe as follows:
import polars as pl

df = pl.DataFrame(
    {
        "group": [1, 1, 1, 1, 2, 2, 2, 2],
        "a": [1, 2, 3, 4, 1, 2, 3, 4],
        "b": [5, 1, 7, 9, 2, 4, 9, 7],
        "c": [2, 6, 3, 9, 1, 5, 3, 6],
    }
)
I want a correlation 'matrix' that resides in a Polars DataFrame, structured like the one below. How can I do that?
┌───────┬──────┬──────────┬──────────┬──────────┐
│ group ┆ name ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪══════╪══════════╪══════════╪══════════╡
│ 1 ┆ a ┆ 1.0 ┆ 0.680336 ┆ 0.734847 │
│ 1 ┆ b ┆ 0.680336 ┆ 1.0 ┆ 0.246885 │
│ 1 ┆ c ┆ 0.734847 ┆ 0.246885 ┆ 1.0 │
│ 2 ┆ a ┆ 1.0 ┆ 0.830455 ┆ 0.756889 │
│ 2 ┆ b ┆ 0.830455 ┆ 1.0 ┆ 0.410983 │
│ 2 ┆ c ┆ 0.756889 ┆ 0.410983 ┆ 1.0 │
└───────┴──────┴──────────┴──────────┴──────────┘
Here is what I have tried so far:
df.group_by("group").agg(
    pl.corr(col1, col2).alias(f"{col1}_{col2}")
    for col1 in ["a", "b", "c"]
    for col2 in ["a", "b", "c"]
)
shape: (2, 10)
┌───────┬─────┬──────────┬──────────┬───┬──────────┬──────────┬──────────┬─────┐
│ group ┆ a_a ┆ a_b ┆ a_c ┆ … ┆ b_c ┆ c_a ┆ c_b ┆ c_c │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪═════╪══════════╪══════════╪═══╪══════════╪══════════╪══════════╪═════╡
│ 2 ┆ 1.0 ┆ 0.830455 ┆ 0.756889 ┆ … ┆ 0.410983 ┆ 0.756889 ┆ 0.410983 ┆ 1.0 │
│ 1 ┆ 1.0 ┆ 0.680336 ┆ 0.734847 ┆ … ┆ 0.246885 ┆ 0.734847 ┆ 0.246885 ┆ 1.0 │
└───────┴─────┴──────────┴──────────┴───┴──────────┴──────────┴──────────┴─────┘
So I'm not sure how to transform this into the shape/structure I want. Or are there other (potentially better) ways to generate the results I want directly?
@jqurious, here's an approach that uses the corr function on the DataFrame itself.
(
    pl.concat(
        next_df
        .select(pl.exclude('group'))   # drop the group key before correlating
        .corr()                        # full correlation matrix for this partition
        .select([
            pl.lit(next_group).alias('group'),             # re-attach the group key
            pl.Series(next_df.columns[1:]).alias('name'),  # row labels: a, b, c
            pl.all(),
        ])
        for next_group, next_df
        in df.partition_by('group', as_dict=True).items()
    )
)
shape: (6, 5)
┌───────┬──────┬──────────┬──────────┬──────────┐
│ group ┆ name ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪══════╪══════════╪══════════╪══════════╡
│ 1 ┆ a ┆ 1.0 ┆ 0.680336 ┆ 0.734847 │
│ 1 ┆ b ┆ 0.680336 ┆ 1.0 ┆ 0.246885 │
│ 1 ┆ c ┆ 0.734847 ┆ 0.246885 ┆ 1.0 │
│ 2 ┆ a ┆ 1.0 ┆ 0.830455 ┆ 0.756889 │
│ 2 ┆ b ┆ 0.830455 ┆ 1.0 ┆ 0.410983 │
│ 2 ┆ c ┆ 0.756889 ┆ 0.410983 ┆ 1.0 │
└───────┴──────┴──────────┴──────────┴──────────┘
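One version note: on newer Polars releases, partition_by(..., as_dict=True) keys the dictionary by tuples rather than scalars, so the group key needs unwrapping before it is passed to pl.lit. A quick way to check which behaviour your version has:

parts = df.partition_by('group', as_dict=True)
print(list(parts.keys()))  # [1, 2] on older versions, [(1,), (2,)] on newer ones

# if the keys are tuples, unpack the single element while iterating:
for (next_group,), next_df in parts.items():
    print(next_group, next_df.shape)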
How does it perform? Let's take 1,000 groups and 100,000 observations per group:
import numpy as np
import time

nbr_groups = 1_000
nbr_obs_per_group = 100_000

rng = np.random.default_rng(1)
df = pl.DataFrame({
    'group': list(range(0, nbr_groups)) * nbr_obs_per_group,
    **{col_nm: rng.normal(0, 1, nbr_obs_per_group * nbr_groups)
       for col_nm in ['a', 'b', 'c']},
})
df
shape: (100000000, 4)
┌───────┬───────────┬───────────┬───────────┐
│ group ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ 0.345584 ┆ -0.858613 ┆ 1.382227 │
│ 1 ┆ 0.821618 ┆ 0.965737 ┆ 1.086405 │
│ 2 ┆ 0.330437 ┆ -0.567488 ┆ 0.57299 │
│ 3 ┆ -1.303157 ┆ 1.070117 ┆ 0.147326 │
│ … ┆ … ┆ … ┆ … │
│ 996 ┆ 0.842205 ┆ 0.515653 ┆ 0.88825 │
│ 997 ┆ -0.133607 ┆ -1.1532 ┆ -1.041619 │
│ 998 ┆ -0.256237 ┆ 0.654807 ┆ -0.852552 │
│ 999 ┆ -0.627053 ┆ -0.133583 ┆ 0.531616 │
└───────┴───────────┴───────────┴───────────┘
start = time.perf_counter()
(
    pl.concat(
        next_df
        .select(pl.exclude('group'))
        .corr()
        .select([
            pl.lit(next_group).alias('group'),
            pl.Series(next_df.columns[1:]).alias('name'),
            pl.all(),
        ])
        for next_group, next_df
        in df.partition_by('group', as_dict=True).items()
    )
)
print(time.perf_counter() - start)
shape: (3000, 5)
┌───────┬──────┬───────────┬───────────┬───────────┐
│ group ┆ name ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪══════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ a ┆ 1.0 ┆ 0.002105 ┆ 0.006153 │
│ 0 ┆ b ┆ 0.002105 ┆ 1.0 ┆ -0.001446 │
│ 0 ┆ c ┆ 0.006153 ┆ -0.001446 ┆ 1.0 │
│ 1 ┆ a ┆ 1.0 ┆ -0.00137 ┆ -0.003253 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 998 ┆ c ┆ 0.002755 ┆ 0.001199 ┆ 1.0 │
│ 999 ┆ a ┆ 1.0 ┆ -0.001362 ┆ -0.000156 │
│ 999 ┆ b ┆ -0.001362 ┆ 1.0 ┆ -0.00049 │
│ 999 ┆ c ┆ -0.000156 ┆ -0.00049 ┆ 1.0 │
└───────┴──────┴───────────┴───────────┴───────────┘
6.240678512999693
Roughly 6 seconds on my 32-core system for a DataFrame of 100 million records. There may be faster algorithms, but the code is relatively straightforward.
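As a footnote on the original question: if you prefer to stay with the expression-based group_by from the question, its wide output can be reshaped into the same long format. A minimal sketch, assuming a recent Polars version (where melt was renamed to unpivot; on older releases, substitute melt and the older pivot argument names):

wide = df.group_by("group").agg(
    pl.corr(col1, col2).alias(f"{col1}_{col2}")
    for col1 in ["a", "b", "c"]
    for col2 in ["a", "b", "c"]
)

long = (
    wide
    .unpivot(index="group")  # one row per (group, "x_y") pair
    .with_columns(
        pl.col("variable")
        .str.split_exact("_", 1)                  # "a_b" -> struct of ("a", "b")
        .struct.rename_fields(["name", "other"])
    )
    .unnest("variable")                           # struct -> 'name' and 'other' columns
    .pivot(on="other", index=["group", "name"], values="value")
    .sort("group", "name")
)

This keeps everything in the expression API; the trade-off is an extra unpivot/pivot pass over the wide result.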