I have a simple dataframe as follows:
import polars as pl

df = pl.DataFrame(
    {
        "group": [1, 1, 1, 1, 2, 2, 2, 2],
        "a": [1, 2, 3, 4, 1, 2, 3, 4],
        "b": [5, 1, 7, 9, 2, 4, 9, 7],
        "c": [2, 6, 3, 9, 1, 5, 3, 6],
    }
)
I want a correlation 'matrix' that resides in a Polars DataFrame, structured like the one below. How can I do that?
┌───────┬──────┬──────────┬──────────┬──────────┐
│ group ┆ name ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪══════╪══════════╪══════════╪══════════╡
│ 1 ┆ a ┆ 1.0 ┆ 0.680336 ┆ 0.734847 │
│ 1 ┆ b ┆ 0.680336 ┆ 1.0 ┆ 0.246885 │
│ 1 ┆ c ┆ 0.734847 ┆ 0.246885 ┆ 1.0 │
│ 2 ┆ a ┆ 1.0 ┆ 0.830455 ┆ 0.756889 │
│ 2 ┆ b ┆ 0.830455 ┆ 1.0 ┆ 0.410983 │
│ 2 ┆ c ┆ 0.756889 ┆ 0.410983 ┆ 1.0 │
└───────┴──────┴──────────┴──────────┴──────────┘
Here is what I have tried so far:
df.group_by("group").agg(
    pl.corr(col1, col2).alias(f"{col1}_{col2}")
    for col1 in ["a", "b", "c"]
    for col2 in ["a", "b", "c"]
)
shape: (2, 10)
┌───────┬─────┬──────────┬──────────┬───┬──────────┬──────────┬──────────┬─────┐
│ group ┆ a_a ┆ a_b ┆ a_c ┆ … ┆ b_c ┆ c_a ┆ c_b ┆ c_c │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪═════╪══════════╪══════════╪═══╪══════════╪══════════╪══════════╪═════╡
│ 2 ┆ 1.0 ┆ 0.830455 ┆ 0.756889 ┆ … ┆ 0.410983 ┆ 0.756889 ┆ 0.410983 ┆ 1.0 │
│ 1 ┆ 1.0 ┆ 0.680336 ┆ 0.734847 ┆ … ┆ 0.246885 ┆ 0.734847 ┆ 0.246885 ┆ 1.0 │
└───────┴─────┴──────────┴──────────┴───┴──────────┴──────────┴──────────┴─────┘
So I'm not sure how to transform this into the shape/structure I want. Or are there other (potentially better) ways to generate the results I want directly?
@jqurious, here's an approach that uses the corr function on the DataFrame itself.
(
    pl.concat(
        next_df
        .select(pl.exclude('group'))   # drop the group key before correlating
        .corr()                        # full correlation matrix for this partition
        .select([
            pl.lit(next_group).alias('group'),             # re-attach the group key
            pl.Series(next_df.columns[1:]).alias('name'),  # row labels: a, b, c
            pl.all(),
        ])
        for next_group, next_df
        in df.partition_by('group', as_dict=True).items()
    )
)
shape: (6, 5)
┌───────┬──────┬──────────┬──────────┬──────────┐
│ group ┆ name ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪══════╪══════════╪══════════╪══════════╡
│ 1 ┆ a ┆ 1.0 ┆ 0.680336 ┆ 0.734847 │
│ 1 ┆ b ┆ 0.680336 ┆ 1.0 ┆ 0.246885 │
│ 1 ┆ c ┆ 0.734847 ┆ 0.246885 ┆ 1.0 │
│ 2 ┆ a ┆ 1.0 ┆ 0.830455 ┆ 0.756889 │
│ 2 ┆ b ┆ 0.830455 ┆ 1.0 ┆ 0.410983 │
│ 2 ┆ c ┆ 0.756889 ┆ 0.410983 ┆ 1.0 │
└───────┴──────┴──────────┴──────────┴──────────┘
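One version note: on newer Polars releases, partition_by(..., as_dict=True) keys the dictionary by tuples rather than scalars, so the group key needs unwrapping before it is passed to pl.lit. A quick way to check which behaviour your version has:

parts = df.partition_by('group', as_dict=True)
print(list(parts.keys()))  # [1, 2] on older versions, [(1,), (2,)] on newer ones

# if the keys are tuples, unpack the single element while iterating:
for (next_group,), next_df in parts.items():
    print(next_group, next_df.shape)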
How does it perform? Let's take 1,000 groups and 100,000 observations per group:
import numpy as np
import time

nbr_groups = 1_000
nbr_obs_per_group = 100_000

rng = np.random.default_rng(1)
df = pl.DataFrame({
    'group': list(range(0, nbr_groups)) * nbr_obs_per_group,
    **{col_nm: rng.normal(0, 1, nbr_obs_per_group * nbr_groups)
       for col_nm in ['a', 'b', 'c']},
})
df
shape: (100000000, 4)
┌───────┬───────────┬───────────┬───────────┐
│ group ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ 0.345584 ┆ -0.858613 ┆ 1.382227 │
│ 1 ┆ 0.821618 ┆ 0.965737 ┆ 1.086405 │
│ 2 ┆ 0.330437 ┆ -0.567488 ┆ 0.57299 │
│ 3 ┆ -1.303157 ┆ 1.070117 ┆ 0.147326 │
│ … ┆ … ┆ … ┆ … │
│ 996 ┆ 0.842205 ┆ 0.515653 ┆ 0.88825 │
│ 997 ┆ -0.133607 ┆ -1.1532 ┆ -1.041619 │
│ 998 ┆ -0.256237 ┆ 0.654807 ┆ -0.852552 │
│ 999 ┆ -0.627053 ┆ -0.133583 ┆ 0.531616 │
└───────┴───────────┴───────────┴───────────┘
start = time.perf_counter()
(
    pl.concat(
        next_df
        .select(pl.exclude('group'))
        .corr()
        .select([
            pl.lit(next_group).alias('group'),
            pl.Series(next_df.columns[1:]).alias('name'),
            pl.all(),
        ])
        for next_group, next_df
        in df.partition_by('group', as_dict=True).items()
    )
)
print(time.perf_counter() - start)
shape: (3000, 5)
┌───────┬──────┬───────────┬───────────┬───────────┐
│ group ┆ name ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪══════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ a ┆ 1.0 ┆ 0.002105 ┆ 0.006153 │
│ 0 ┆ b ┆ 0.002105 ┆ 1.0 ┆ -0.001446 │
│ 0 ┆ c ┆ 0.006153 ┆ -0.001446 ┆ 1.0 │
│ 1 ┆ a ┆ 1.0 ┆ -0.00137 ┆ -0.003253 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 998 ┆ c ┆ 0.002755 ┆ 0.001199 ┆ 1.0 │
│ 999 ┆ a ┆ 1.0 ┆ -0.001362 ┆ -0.000156 │
│ 999 ┆ b ┆ -0.001362 ┆ 1.0 ┆ -0.00049 │
│ 999 ┆ c ┆ -0.000156 ┆ -0.00049 ┆ 1.0 │
└───────┴──────┴───────────┴───────────┴───────────┘
6.240678512999693
Roughly 6 seconds on my 32-core system for a DataFrame of 100 million records. There may be faster algorithms, but the code is relatively straightforward.
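As a footnote on the original question: if you prefer to stay with the expression-based group_by from the question, its wide output can be reshaped into the same long format. A minimal sketch, assuming a recent Polars version (where melt was renamed to unpivot; on older releases, substitute melt and the older pivot argument names):

wide = df.group_by("group").agg(
    pl.corr(col1, col2).alias(f"{col1}_{col2}")
    for col1 in ["a", "b", "c"]
    for col2 in ["a", "b", "c"]
)

long = (
    wide
    .unpivot(index="group")  # one row per (group, "x_y") pair
    .with_columns(
        pl.col("variable")
        .str.split_exact("_", 1)                  # "a_b" -> struct of ("a", "b")
        .struct.rename_fields(["name", "other"])
    )
    .unnest("variable")                           # struct -> 'name' and 'other' columns
    .pivot(on="other", index=["group", "name"], values="value")
    .sort("group", "name")
)

This keeps everything in the expression API; the trade-off is an extra unpivot/pivot pass over the wide result.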