I am trying calculate correlation matrix with groupby and sort. I have 100 companies from 11 industries. I would like to group by industry and sort by their total assets (atq), and then calculate the correlation of data.pr_multi with this order. however, when I do sort and groupby, it reverses back and calculates by alphabetical order.
The code I use:
index | datafqtr | tic | pr_multi | atq | industry |
---|---|---|---|---|---|
0 | 2018Q1 | A | NaN | 8698.0 | 4 |
1 | 2018Q2 | A | -0.0856845728151735 | 8784.0 | 4 |
2 | 2018Q3 | A | 0.0035103320774146 | 8349.0 | 4 |
3 | 2018Q4 | A | -0.0157732687260246 | 8541.0 | 4 |
4 | 2018Q1 | AAL | NaN | 53280.0 | 5 |
5 | 2018Q2 | AAL | -0.2694380292532717 | 52622.0 | 5 |
the code I use:
data1=data18.sort_values(['atq'],ascending=False).groupby('industry').head()
df = data1.pivot_table('pr_multi', ['datafqtr'], 'tic')
# calculate correlation matrix using inbuilt pandas function
correlation_matrix = df.corr()
correlation_matrix.head()
IIUC, you want to calculate the correlation between the order based on the groupby
and the pr_multi
column. use:
data1=data18.groupby('industry')['atq'].apply(lambda x: x.sort_values(ascending=False))
np.corrcoef(data1.reset_index()['level_1'], data18['pr_multi'].astype(float).fillna(0))
Output:
array([[ 1. , -0.44754795],
[-0.44754795, 1. ]])