Here is a simplified version of my problem:
import pandas as pd
v1 = [1, 5, 0]
v2, v2_tocomb = [2, 4, 9], [0, 1, 0]
v3, v3_tocomb = [3, 3, 8], [2, 0, 1]
v4, v4_tocomb = [0, 7, 6], [1, 1, 1]
v5 = [2, 2, 6]
v1_1 = [3, 4, 1]
v2_1, v2_tocomb_1 = [4, 2, 8], [2, 1, 0]
v3_1, v3_tocomb_1 = [1, 1, 7], [1, 1, 0]
v4_1 = [0, 0, 5]
v5_1 = [9, 5, 5]
data = pd.DataFrame({'v1': v1, 'v2': v2, 'v2_tocomb': v2_tocomb, 'v3': v3, 'v3_tocomb': v3_tocomb, 'v4': v4, 'v4_tocomb': v4_tocomb, 'v5': v5,
'v1_1': v1_1, 'v2_1': v2_1, 'v2_tocomb_1': v2_tocomb_1, 'v3_1': v3_1, 'v3_tocomb_1': v3_tocomb_1, 'v4_1': v4_1, 'v5_1': v5_1})
# what to do? #
data_final = pd.DataFrame({'v1': v1, 'v2_comb': [sum(x) for x in zip(v2, v2_tocomb)], 'v3_comb': [sum(x) for x in zip(v3, v3_tocomb)], 'v4_comb': [sum(x) for x in zip(v4, v4_tocomb)], 'v5': v5,
'v1_1': v1_1, 'v2_comb_1': [sum(x) for x in zip(v2_1, v2_tocomb_1)], 'v3_comb_1': [sum(x) for x in zip(v3_1, v3_tocomb_1)], 'v4_1': v4_1, 'v5_1': v5_1})
This is the starting dataset (data):
v1 | v2 | v2_tocomb | v3 | v3_tocomb | v4 | v4_tocomb | v5 | v1_1 | v2_1 | v2_tocomb_1 | v3_1 | v3_tocomb_1 | v4_1 | v5_1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 0 | 3 | 2 | 0 | 1 | 2 | 3 | 4 | 2 | 1 | 1 | 0 | 9 |
5 | 4 | 1 | 3 | 0 | 7 | 1 | 2 | 4 | 2 | 1 | 1 | 1 | 0 | 5 |
0 | 9 | 0 | 8 | 1 | 6 | 1 | 6 | 1 | 8 | 0 | 7 | 0 | 5 | 5 |
And this is the desired output (data_final):
v1 | v2_comb | v3_comb | v4_comb | v5 | v1_1 | v2_comb_1 | v3_comb_1 | v4_1 | v5_1 |
---|---|---|---|---|---|---|---|---|---|
1 | 2 | 5 | 1 | 2 | 3 | 6 | 2 | 0 | 9 |
5 | 5 | 3 | 8 | 2 | 4 | 3 | 2 | 0 | 5 |
0 | 9 | 9 | 7 | 6 | 1 | 8 | 7 | 5 | 5 |
The main task is to sum, pairwise, each column that has _tocomb in its name with its corresponding base column. For example, columns v2 and v2_tocomb need to be summed into the new column v2_comb (then v2 and v2_tocomb are dropped). The same applies to columns ending with _1: for example, v3_1 and v3_tocomb_1 are combined into the new column v3_comb_1.
I can get data_final for this simplified version of the dataset by doing the described process "by hand", but the original problem has 200+ columns, so I need a way to automate the process. Furthermore, the original feature names aren't as simple as v1, v2, and so on, but that shouldn't be a problem.
In my opinion, what makes this problem tricky is that the columns to combine can end with either _tocomb or _tocomb_1, and that there is no symmetry: a column doesn't necessarily have a partner to be summed with, and the columns ending with _1 don't follow the same pattern as the "original" ones.
Currently I am using only the pandas library, but I could use any other Python library to achieve the desired output. Can anyone help me?
You can remove the _tocomb part from the column names and use the result as the grouping key for groupby.sum:
grp = data.columns.str.replace('_tocomb', '', regex=True)
data_final = data.groupby(grp, axis=1, sort=False).sum()
Output:
v1 v2 v3 v4 v5 v1_1 v2_1 v3_1 v4_1 v5_1
0 1 2 5 1 2 3 6 2 0 9
1 5 5 3 8 2 4 3 2 0 5
2 0 9 9 7 6 1 8 7 5 5
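Note that the axis keyword of DataFrame.groupby is deprecated in recent pandas releases (as far as I know, since 2.1). A minimal sketch of an equivalent that avoids axis=1 by transposing, grouping the rows, and transposing back is shown below. It uses a reduced copy of the question's data and assumes all columns share one numeric dtype, since transposing a mixed-dtype frame would produce object columns:

```python
import pandas as pd

# Reduced copy of the question's data (a subset of its columns).
data = pd.DataFrame({
    'v1': [1, 5, 0],
    'v2': [2, 4, 9], 'v2_tocomb': [0, 1, 0],
    'v3_1': [1, 1, 7], 'v3_tocomb_1': [1, 1, 0],
    'v4_1': [0, 0, 5],
})

# Same grouping key as above: column names with '_tocomb' stripped.
grp = data.columns.str.replace('_tocomb', '', regex=False)

# Transpose so the columns become rows, group those rows by the key,
# sum each group, then transpose back to the original orientation.
data_final = data.T.groupby(grp, sort=False).sum().T

print(data_final)
```

With sort=False the groups keep the order in which they first appear, so the column order of the original frame is preserved.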
Then rename the combined columns, if needed, with rename:
cols = {c.replace('_tocomb', ''): c.replace('_tocomb', '_comb')
        for c in data.columns.difference(data_final.columns)}
data_final = data_final.rename(columns=cols)
Output:
v1 v2_comb v3_comb v4_comb v5 v1_1 v2_comb_1 v3_comb_1 v4_1 v5_1
0 1 2 5 1 2 3 6 2 0 9
1 5 5 3 8 2 4 3 2 0 5
2 0 9 9 7 6 1 8 7 5 5
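For a dataset with 200+ columns it may be convenient to wrap both steps (summing and renaming) in one function. The following is only a sketch: combine_tocomb, marker and new_marker are hypothetical names introduced here, not part of pandas. It builds the final column labels first and groups on them directly, using a transposed groupby rather than axis=1, and it assumes all columns are numeric with a common dtype:

```python
import pandas as pd

def combine_tocomb(df, marker='_tocomb', new_marker='_comb'):
    """Hypothetical helper: sum every '<name>_tocomb<suffix>' column with
    '<name><suffix>' and label the result '<name>_comb<suffix>'."""
    # Map each base name to its combined name, e.g. 'v3_1' -> 'v3_comb_1'.
    targets = {c.replace(marker, ''): c.replace(marker, new_marker)
               for c in df.columns if marker in c}
    # One label per column: the combined name for paired columns,
    # the unchanged name for columns that have no partner.
    labels = pd.Index([targets.get(c.replace(marker, ''), c)
                       for c in df.columns])
    # Group the transposed rows by label, sum, and transpose back.
    return df.T.groupby(labels, sort=False).sum().T

# Reduced copy of the question's data (a subset of its columns).
data = pd.DataFrame({
    'v1': [1, 5, 0],
    'v2': [2, 4, 9], 'v2_tocomb': [0, 1, 0],
    'v3_1': [1, 1, 7], 'v3_tocomb_1': [1, 1, 0],
    'v4_1': [0, 0, 5],
})

result = combine_tocomb(data)
print(result)
```

Columns without a _tocomb partner (here v1 and v4_1) pass through unchanged, which covers the lack of symmetry described in the question.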