pythonpandaschemistry

How to organise DataFrame columns


I am trying to organise DataFrame columns based on the specific rules, but I don't know the way.

For example, I have a DataFrame related to chemistry as shown below. Each row shows the number of chemical bonds in a chemical compound.

   OH  HO  CaO  OCa  OO  NaMg  MgNa
0   2   3    2    0   1     1     1
1   0   2    3    4   5     2     0
2   1   2    3    0   0     0     0

In chemistry, OH (Oxygen-Hydrogen) bond is equal to HO (Hydrogen-Oxygen) bond and CaO (Calcium-Oxygen) bond is equal to OCa (Oxygen-Calcium) bond in the meaning. Thus, I'd like to organise the DataFrame as shown below.

   OH  CaO  OO  NaMg 
0   5    2   1     2
1   2    7   9     2
2   3    3   0     0

I’m struggling because:

I looked for the same question online and wrote codes by myself, but I was not able to find the way. I would like to know the codes which solve my problem.


Solution

  • You can use str.findall to extract individual element and use frozenset and sort individual elements to reorganize the pairs. Using frozenset is not a good solution because for OO, the second will be lost.

    Now you can group by this sets and apply sum:

    # Modified from https://www.johndcook.com/blog/2016/02/04/regular-expression-to-match-a-chemical-element/
    pat = r'(A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|U(?:u[opst])?|V|W|Xe|Yb?|Z[nr])'
    
    grp = df.columns.str.findall(pat).map(lambda x: tuple(sorted(x))))
    out = df.groupby(grp, axis=1).sum().rename(columns=''.join)
    

    Output:

    >>> out
       CaO  HO  MgNa  OO
    0    2   5     2   1
    1    7   2     2   5
    2    3   3     0   0