I am trying to organise DataFrame columns based on the specific rules, but I don't know the way.
For example, I have a DataFrame related to chemistry as shown below. Each row shows the number of chemical bonds in a chemical compound.
OH HO CaO OCa OO NaMg MgNa
0 2 3 2 0 1 1 1
1 0 2 3 4 5 2 0
2 1 2 3 0 0 0 0
In chemistry, OH (Oxygen-Hydrogen) bond is equal to HO (Hydrogen-Oxygen) bond and CaO (Calcium-Oxygen) bond is equal to OCa (Oxygen-Calcium) bond in the meaning. Thus, I'd like to organise the DataFrame as shown below.
OH CaO OO NaMg
0 5 2 1 2
1 2 7 9 2
2 3 3 0 0
I’m struggling because:
I looked for the same question online and wrote codes by myself, but I was not able to find the way. I would like to know the codes which solve my problem.
You can use str.findall
to extract individual element and use and sort individual elements to reorganize the pairs. Using frozenset
frozenset
is not a good solution because for OO
, the second will be lost.
Now you can group by this sets and apply sum:
# Modified from https://www.johndcook.com/blog/2016/02/04/regular-expression-to-match-a-chemical-element/
pat = r'(A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|U(?:u[opst])?|V|W|Xe|Yb?|Z[nr])'
grp = df.columns.str.findall(pat).map(lambda x: tuple(sorted(x))))
out = df.groupby(grp, axis=1).sum().rename(columns=''.join)
Output:
>>> out
CaO HO MgNa OO
0 2 5 2 1
1 7 2 2 5
2 3 3 0 0