python pandas rdkit

Indexing of strings


I am trying to produce an output from the following procedure, which is easiest to explain with an example.

For the SMILES string below, the list after it is the output I am trying to get:

C(N)(N)CC(N)C, [0, 1, 2, 0, 0, 1, 0]

The procedure counts branches, which SMILES represents with brackets. In the example above, the first (N) is counted as 1 and the second (N) as 2. The count resets once it reaches an atom that is not in a branch (not bracketed), so unbranched atoms get 0, and counting starts again at the next branch. The problem is that I am not getting the expected output. My actual output, the expected output, and my code are below. Thanks.

I also need to make sure that cases such as CC(CC(C)) are not indexed incorrectly. A nested bracket should neither increase the count nor reset it. That SMILES should give [0 0 1 1 1].

Another example: CC(CCC)CCCC should give [0 0 1 1 1 0 0 0 0].

For nested brackets I will rerun this process and just start counting from 1.

This is what I am getting:

          SMILES                             branch_count
0  C(N)(N)CC(N)C  [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
1            CCC                                [0, 0, 0]
2          C1CC1                          [0, 0, 0, 0, 0]
3      C1CC1(C)C              [0, 0, 0, 0, 0, 0, 1, 0, 0]
4         CC(C)C                       [0, 0, 0, 1, 0, 0]

when it should be this:

          SMILES        branch_count
0  C(N)(N)CC(N)C  [0, 1, 2, 0, 0, 1, 0]
1            CCC           [0, 0, 0]
2          C1CC1           [0, 0, 0]
3      C1CC1(C)C        [0, 0, 0, 1, 0]
4         CC(C)C           [0, 0, 1, 0]


import pandas as pd
import numpy as np
from rdkit import Chem

def get_branch_count(smile):
    # Initialize variables
    branch_count = [0] * len(smile)
    bracket_count = 0
    current_count = 0
    
    # Loop through each character in the smile
    for i, c in enumerate(smile):
        # If the character is an open bracket, increment bracket count
        if c == "(":
            bracket_count += 1
        # If the character is a close bracket, decrement bracket count
        elif c == ")":
            bracket_count -= 1
            # If there are no more open brackets after this one, reset current count
            if bracket_count == 0:
                current_count = 0
        # If the character is not a bracket, update the current count
        else:
            if bracket_count > 0:
                # If the previous character was also a bracket, don't increment the count
                if smile[i-1] != ")":
                    current_count += 1
            else:
                current_count = 0
            branch_count[i] = current_count
            
    return branch_count

def collect_branch_count(smile_list):
    rows = []

    for smile in smile_list:
        branch_count = get_branch_count(smile)
        data = {"branch_count": branch_count}

        row = {"SMILES": smile}
        for key, value in data.items():
            row[key] = value
        rows.append(row)

    df = pd.DataFrame(rows)
    return df

smile_list = ["C(N)(N)CC(N)C", "CCC", "C1CC1", "C1CC1(C)C", "CC(C)C"]
df = collect_branch_count(smile_list)
print(df)


Solution

  • This is my solution.

    First I replace every C1 with C, so that each remaining character is either a single-letter atom or a bracket. Then I count the open brackets. If exactly one bracket is open, a new group has started. Whenever no bracket is open, I check whether the next character is an opening bracket, which means a consecutive group follows. If it is not, I reset the counter to 0.

    import pandas as pd
    
    def smile_grouping(s):
        s = s.replace('C1', 'C')
        open_brackets = 0
        group_counter = 0
    
        res = []
        for i, letter in enumerate(s):
            if letter == '(':
                open_brackets += 1
                if open_brackets == 1:
                    group_counter += 1
            elif letter == ')':
                open_brackets -= 1
            else:
                res.append(group_counter)
    
            if open_brackets == 0:
                if i+1<len(s) and s[i+1] != '(':
                    group_counter = 0
        return res
    

    This is the result:

    df = pd.DataFrame(
        {'smile':[
            "C(N)(N)CC(N)C",
            "CCC",
            "C1CC1",
            "C1CC1(C)C",
            "CC(C)C",
            "C(N)(N)(N)CC(N)C",
            "C((N)(N)N)CC(N)C",
            "CC(CCC)CCCC",
            "CC(CC(C))"
        ]})
    df['branch_count'] = df['smile'].apply(smile_grouping)
    >>> df
                  smile                 branch_count
    0     C(N)(N)CC(N)C        [0, 1, 2, 0, 0, 1, 0]
    1               CCC                    [0, 0, 0]
    2             C1CC1                    [0, 0, 0]
    3         C1CC1(C)C              [0, 0, 0, 1, 0]
    4            CC(C)C                 [0, 0, 1, 0]
    5  C(N)(N)(N)CC(N)C     [0, 1, 2, 3, 0, 0, 1, 0]
    6  C((N)(N)N)CC(N)C     [0, 1, 1, 1, 0, 0, 1, 0]
    7       CC(CCC)CCCC  [0, 0, 1, 1, 1, 0, 0, 0, 0]
    8         CC(CC(C))              [0, 0, 1, 1, 1]
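
The `replace('C1', 'C')` step above only covers ring label 1 on a carbon, and the character loop would count each letter of a two-letter atom such as Cl separately. Below is a minimal sketch of one possible generalization, assuming it is acceptable to tokenize the SMILES with a regular expression first. The names `ATOM_OR_BRACKET` and `smile_grouping_tokens` are hypothetical, not part of the answer, and the pattern is a simplification rather than a full SMILES grammar. Ring-closure digits and bond symbols are not matched at all, so they never reach the counting loop, which is otherwise the same as in `smile_grouping`.

```python
import re

# Assumed, simplified token pattern (not a full SMILES grammar):
# bracket atoms like [NH3+], the two-letter atoms Br and Cl,
# any other single-letter atom (or *), and the branch brackets.
# Ring-closure digits and bond symbols are deliberately not matched.
ATOM_OR_BRACKET = re.compile(r"\[[^\]]*\]|Br|Cl|[A-Za-z*]|\(|\)")


def smile_grouping_tokens(s):
    """Same counting rule as smile_grouping, applied to tokens."""
    tokens = ATOM_OR_BRACKET.findall(s)
    open_brackets = 0
    group_counter = 0

    res = []
    for i, tok in enumerate(tokens):
        if tok == "(":
            open_brackets += 1
            # Only a top-level bracket starts a new group.
            if open_brackets == 1:
                group_counter += 1
        elif tok == ")":
            open_brackets -= 1
        else:
            # Any other token is an atom: record its group number.
            res.append(group_counter)

        # Outside all brackets, reset unless another group follows directly.
        if open_brackets == 0:
            if i + 1 < len(tokens) and tokens[i + 1] != "(":
                group_counter = 0
    return res


print(smile_grouping_tokens("C(N)(N)CC(N)C"))  # [0, 1, 2, 0, 0, 1, 0]
print(smile_grouping_tokens("C2CC2(C)C"))      # [0, 0, 0, 1, 0]
print(smile_grouping_tokens("CC(Cl)C"))        # [0, 0, 1, 0]
```

With this variant, each atom yields exactly one entry, whichever ring label is used and whether or not the atom name has two letters.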