python, python-polars

Converting pl.corr results into a correlation DataFrame


I have a simple dataframe as follows:

import polars as pl

df = pl.DataFrame(
    {
        "group": [1, 1, 1, 1, 2, 2, 2, 2],
        "a": [1, 2, 3, 4, 1, 2, 3, 4],
        "b": [5, 1, 7, 9, 2, 4, 9, 7],
        "c": [2, 6, 3, 9, 1, 5, 3, 6],
    }
)

I want a correlation 'matrix' that resides in a Polars DataFrame, structured like the one below. How can I do that?

┌───────┬──────┬──────────┬──────────┬──────────┐
│ group ┆ name ┆ a        ┆ b        ┆ c        │
│ ---   ┆ ---  ┆ ---      ┆ ---      ┆ ---      │
│ i64   ┆ str  ┆ f64      ┆ f64      ┆ f64      │
╞═══════╪══════╪══════════╪══════════╪══════════╡
│ 1     ┆ a    ┆ 1.0      ┆ 0.680336 ┆ 0.734847 │
│ 1     ┆ b    ┆ 0.680336 ┆ 1.0      ┆ 0.246885 │
│ 1     ┆ c    ┆ 0.734847 ┆ 0.246885 ┆ 1.0      │
│ 2     ┆ a    ┆ 1.0      ┆ 0.830455 ┆ 0.756889 │
│ 2     ┆ b    ┆ 0.830455 ┆ 1.0      ┆ 0.410983 │
│ 2     ┆ c    ┆ 0.756889 ┆ 0.410983 ┆ 1.0      │
└───────┴──────┴──────────┴──────────┴──────────┘

Currently, this is what I tried:

# compute every pairwise correlation, one column per (col1, col2) pair
df.group_by("group").agg(
    pl.corr(col1, col2).alias(f"{col1}_{col2}")
    for col1 in ["a", "b", "c"]
    for col2 in ["a", "b", "c"]
)

shape: (2, 10)
┌───────┬─────┬──────────┬──────────┬───┬──────────┬──────────┬──────────┬─────┐
│ group ┆ a_a ┆ a_b      ┆ a_c      ┆ … ┆ b_c      ┆ c_a      ┆ c_b      ┆ c_c │
│ ---   ┆ --- ┆ ---      ┆ ---      ┆   ┆ ---      ┆ ---      ┆ ---      ┆ --- │
│ i64   ┆ f64 ┆ f64      ┆ f64      ┆   ┆ f64      ┆ f64      ┆ f64      ┆ f64 │
╞═══════╪═════╪══════════╪══════════╪═══╪══════════╪══════════╪══════════╪═════╡
│ 2     ┆ 1.0 ┆ 0.830455 ┆ 0.756889 ┆ … ┆ 0.410983 ┆ 0.756889 ┆ 0.410983 ┆ 1.0 │
│ 1     ┆ 1.0 ┆ 0.680336 ┆ 0.734847 ┆ … ┆ 0.246885 ┆ 0.734847 ┆ 0.246885 ┆ 1.0 │
└───────┴─────┴──────────┴──────────┴───┴──────────┴──────────┴──────────┴─────┘

So I'm not sure how to transform this into the shape/structure I want. Or are there other (potentially better) ways to generate the desired result directly?


Solution

  • @jqurious, here's an approach that uses the corr function on the DataFrame itself.

    (
        pl.concat(
            next_df
            .select(pl.exclude('group'))  # drop the group key before correlating
            .corr()                       # full correlation matrix for this partition
            .select([
                pl.lit(next_group).alias('group'),
                pl.Series(next_df.columns[1:]).alias('name'),
                pl.all()
            ])
            for next_group, next_df
            in df.partition_by('group', as_dict=True).items()
        )
    )
    
    shape: (6, 5)
    ┌───────┬──────┬──────────┬──────────┬──────────┐
    │ group ┆ name ┆ a        ┆ b        ┆ c        │
    │ ---   ┆ ---  ┆ ---      ┆ ---      ┆ ---      │
    │ i32   ┆ str  ┆ f64      ┆ f64      ┆ f64      │
    ╞═══════╪══════╪══════════╪══════════╪══════════╡
    │ 1     ┆ a    ┆ 1.0      ┆ 0.680336 ┆ 0.734847 │
    │ 1     ┆ b    ┆ 0.680336 ┆ 1.0      ┆ 0.246885 │
    │ 1     ┆ c    ┆ 0.734847 ┆ 0.246885 ┆ 1.0      │
    │ 2     ┆ a    ┆ 1.0      ┆ 0.830455 ┆ 0.756889 │
    │ 2     ┆ b    ┆ 0.830455 ┆ 1.0      ┆ 0.410983 │
    │ 2     ┆ c    ┆ 0.756889 ┆ 0.410983 ┆ 1.0      │
    └───────┴──────┴──────────┴──────────┴──────────┘
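
    One caveat: in more recent Polars releases, partition_by(..., as_dict=True) keys the resulting dict by tuples (one element per partition column) rather than by scalar values, so next_group needs unpacking before it is passed to pl.lit. A minimal sketch, assuming that tuple-key behavior:

    (
        pl.concat(
            next_df
            .select(pl.exclude('group'))
            .corr()
            .select([
                # next_group is unpacked from the one-element tuple key below
                pl.lit(next_group).alias('group'),
                pl.Series(next_df.columns[1:]).alias('name'),
                pl.all()
            ])
            for (next_group,), next_df
            in df.partition_by('group', as_dict=True).items()
        )
    )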
    

    Performance

    How does it perform? Let's take 1,000 groups and 100,000 observations per group:

    import numpy as np
    import time
    
    nbr_groups = 1_000
    nbr_obs_per_group = 100_000
    
    rng = np.random.default_rng(1)
    df = pl.DataFrame({
        'group': list(range(0, nbr_groups)) * nbr_obs_per_group,
        **{col_nm: rng.normal(0, 1, nbr_obs_per_group * nbr_groups)
           for col_nm in ['a', 'b', 'c']
           }
    })
    df
    
    shape: (100000000, 4)
    ┌───────┬───────────┬───────────┬───────────┐
    │ group ┆ a         ┆ b         ┆ c         │
    │ ---   ┆ ---       ┆ ---       ┆ ---       │
    │ i64   ┆ f64       ┆ f64       ┆ f64       │
    ╞═══════╪═══════════╪═══════════╪═══════════╡
    │ 0     ┆ 0.345584  ┆ -0.858613 ┆ 1.382227  │
    │ 1     ┆ 0.821618  ┆ 0.965737  ┆ 1.086405  │
    │ 2     ┆ 0.330437  ┆ -0.567488 ┆ 0.57299   │
    │ 3     ┆ -1.303157 ┆ 1.070117  ┆ 0.147326  │
    │ …     ┆ …         ┆ …         ┆ …         │
    │ 996   ┆ 0.842205  ┆ 0.515653  ┆ 0.88825   │
    │ 997   ┆ -0.133607 ┆ -1.1532   ┆ -1.041619 │
    │ 998   ┆ -0.256237 ┆ 0.654807  ┆ -0.852552 │
    │ 999   ┆ -0.627053 ┆ -0.133583 ┆ 0.531616  │
    └───────┴───────────┴───────────┴───────────┘
    
    start = time.perf_counter()
    (
        pl.concat(
            next_df
            .select(pl.exclude('group'))
            .corr()
            .select([
                pl.lit(next_group).alias('group'),
                pl.Series(next_df.columns[1:]).alias('name'),
                pl.all()
            ])
            for next_group, next_df
            in df.partition_by('group', as_dict=True).items()
        )
    )
    print(time.perf_counter() - start)
    
    shape: (3000, 5)
    ┌───────┬──────┬───────────┬───────────┬───────────┐
    │ group ┆ name ┆ a         ┆ b         ┆ c         │
    │ ---   ┆ ---  ┆ ---       ┆ ---       ┆ ---       │
    │ i32   ┆ str  ┆ f64       ┆ f64       ┆ f64       │
    ╞═══════╪══════╪═══════════╪═══════════╪═══════════╡
    │ 0     ┆ a    ┆ 1.0       ┆ 0.002105  ┆ 0.006153  │
    │ 0     ┆ b    ┆ 0.002105  ┆ 1.0       ┆ -0.001446 │
    │ 0     ┆ c    ┆ 0.006153  ┆ -0.001446 ┆ 1.0       │
    │ 1     ┆ a    ┆ 1.0       ┆ -0.00137  ┆ -0.003253 │
    │ …     ┆ …    ┆ …         ┆ …         ┆ …         │
    │ 998   ┆ c    ┆ 0.002755  ┆ 0.001199  ┆ 1.0       │
    │ 999   ┆ a    ┆ 1.0       ┆ -0.001362 ┆ -0.000156 │
    │ 999   ┆ b    ┆ -0.001362 ┆ 1.0       ┆ -0.00049  │
    │ 999   ┆ c    ┆ -0.000156 ┆ -0.00049  ┆ 1.0       │
    └───────┴──────┴───────────┴───────────┴───────────┘
    6.240678512999693
    

    Roughly 6 seconds on my 32-core system for a DataFrame of 100 million records. There may be faster algorithms, but the code is relatively straightforward.
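
    For completeness, the wide result from the question's own group_by can also be reshaped into the same layout. A sketch, assuming the unpivot/pivot method names from recent Polars releases (older versions exposed this as melt, with slightly different arguments):

    cols = ['a', 'b', 'c']

    # wide frame from the question: one column per (col1, col2) pair
    wide = df.group_by('group').agg(
        pl.corr(c1, c2).alias(f'{c1}_{c2}')
        for c1 in cols
        for c2 in cols
    )

    (
        wide
        # long format: one row per (group, pair) correlation
        .unpivot(index='group', variable_name='pair', value_name='corr')
        # split 'a_b' back into a row label and a column label
        .with_columns(
            pl.col('pair').str.split('_').list.first().alias('name'),
            pl.col('pair').str.split('_').list.last().alias('other'),
        )
        # spread the column labels back out into matrix form
        .pivot(on='other', index=['group', 'name'], values='corr')
        .sort('group', 'name')
    )

    This stays entirely inside Polars expressions, at the cost of computing each off-diagonal pair twice, just as the question's attempt does.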