python pandas dataframe python-polars

How to calculate the correlation between ten columns with Polars


I have a large DataFrame in pandas and I am having a hard time migrating to Polars.

I used to use the code below to calculate the correlation between columns:

print(df.corr(numeric_only=True).stack().sort_values(ascending=False).loc[lambda x: x < 1])

The result is a list of column pairs with their correlation values, sorted in descending order, with the self-correlations of 1 filtered out.

How am I supposed to achieve the same result with Polars?

Many thanks.


Solution

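  • You can do it using corr() and unpivot(). corr() returns the correlation matrix without row labels, so add the column names as an "index" column, unpivot to long format, and filter out each column's correlation with itself.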

    (df.corr()
       .with_columns(index = pl.lit(pl.Series(df.columns)))
       .unpivot(index = "index")
       .filter(pl.col("index") != pl.col("variable"))
    )
    
    # Output
    ┌───────┬──────────┬───────────┐
    │ index ┆ variable ┆ value     │
    │ ---   ┆ ---      ┆ ---       │
    │ str   ┆ str      ┆ f64       │
    ╞═══════╪══════════╪═══════════╡
    │ B     ┆ A        ┆ 0.493197  │
    │ C     ┆ A        ┆ -0.866325 │
    │ D     ┆ A        ┆ -0.493197 │
    │ A     ┆ B        ┆ 0.493197  │
    │ …     ┆ …        ┆ …         │
    │ D     ┆ C        ┆ 0.416025  │
    │ A     ┆ D        ┆ -0.493197 │
    │ B     ┆ D        ┆ -1.0      │
    │ C     ┆ D        ┆ 0.416025  │
    └───────┴──────────┴───────────┘