I have a pandas DataFrame df:
import pandas as pd
d = {'era': ["a", "a", "b", "b", "c", "c"], 'feature1': [3, 4, 5, 6, 7, 8], 'feature2': [7, 8, 9, 10, 11, 12], 'target': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data=d)
I want to compute the correlation between each column in feature_cols = ['feature1', 'feature2'] and TARGET_COL = 'target', separately for each era:
corrs_split = (
    df
    .groupby("era")
    .apply(lambda d: d[feature_cols].corrwith(d[TARGET_COL]))
)
I've been trying to do the same thing with Polars, but I can't get a Polars DataFrame with one row per era and one correlation column per feature. The best I've managed is a single column containing all the correlations, without the era as an index and without a breakdown by feature.
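Here's the Polars equivalent of that code. Combine group_by() and agg(): build one pl.corr expression per feature and pass the list to agg(), which evaluates each expression within each era group.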
import polars as pl
d = {'era': ["a", "a", "b", "b", "c", "c"], 'feature1': [3, 4, 5, 6, 7, 8], 'feature2': [7, 8, 9, 10, 11, 12], 'target': [1, 2, 3, 4, 5, 6]}
df = pl.DataFrame(d)
feature_cols = ['feature1', 'feature2']
TARGET_COL = 'target'
# One correlation expression per feature column
agg_cols = []
for feature_col in feature_cols:
    agg_cols.append(pl.corr(feature_col, TARGET_COL))
# Each expression is evaluated per era group
print(df.group_by("era").agg(agg_cols))
Output:
shape: (3, 3)
┌─────┬──────────┬──────────┐
│ era ┆ feature1 ┆ feature2 │
│ --- ┆ ---      ┆ ---      │
│ str ┆ f64      ┆ f64      │
╞═════╪══════════╪══════════╡
│ c   ┆ 1.0      ┆ 1.0      │
│ b   ┆ 1.0      ┆ 1.0      │
│ a   ┆ 1.0      ┆ 1.0      │
└─────┴──────────┴──────────┘
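(The row order may be different for you, because group_by() does not preserve the order of groups by default. Pass maintain_order=True if you need a stable order.)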