I am trying to produce an output based on the following procedure, which is easiest to explain with an example.
For example, for the SMILES string
C(N)(N)CC(N)C, [0, 1, 2, 0, 0, 1, 0]
is the output I am trying to get.
It counts the branches (which are represented by brackets). So for the above example, it counts the first (N) as 1, then the second (N) as 2. The count is reset once it reaches an atom that is not branched (not bracketed): those atoms get 0, and the count starts again at the next branch. The problem is that I am not getting the expected output. Below are my outputs, the expected outputs, and my code. Thanks.
Also, I need to ensure that cases like CC(CC(C)) are not indexed incorrectly. A nested bracket should neither add to the count nor reset it. That SMILES string should give [0, 0, 1, 1, 1].
Another example: CC(CCC)CCCC should give [0, 0, 1, 1, 1, 0, 0, 0, 0].
For nested brackets I will rerun this process and just start counting from 1.
I am getting this:
          SMILES                             branch_count
0  C(N)(N)CC(N)C  [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
1            CCC                                [0, 0, 0]
2          C1CC1                          [0, 0, 0, 0, 0]
3      C1CC1(C)C              [0, 0, 0, 0, 0, 0, 1, 0, 0]
4         CC(C)C                       [0, 0, 0, 1, 0, 0]
when it should be this:
          SMILES           branch_count
0  C(N)(N)CC(N)C  [0, 1, 2, 0, 0, 1, 0]
1            CCC              [0, 0, 0]
2          C1CC1              [0, 0, 0]
3      C1CC1(C)C        [0, 0, 0, 1, 0]
4         CC(C)C           [0, 0, 1, 0]
import pandas as pd
import numpy as np
from rdkit import Chem

def get_branch_count(smile):
    # Initialize variables
    branch_count = [0] * len(smile)
    bracket_count = 0
    current_count = 0
    # Loop through each character in the smile
    for i, c in enumerate(smile):
        # If the character is an open bracket, increment bracket count
        if c == "(":
            bracket_count += 1
        # If the character is a close bracket, decrement bracket count
        elif c == ")":
            bracket_count -= 1
            # If there are no more open brackets after this one, reset current count
            if bracket_count == 0:
                current_count = 0
        # If the character is not a bracket, update the current count
        else:
            if bracket_count > 0:
                # If the previous character was also a bracket, don't increment the count
                if smile[i-1] != ")":
                    current_count += 1
            else:
                current_count = 0
        branch_count[i] = current_count
    return branch_count

def collect_branch_count(smile_list):
    rows = []
    for smile in smile_list:
        branch_count = get_branch_count(smile)
        data = {"branch_count": branch_count}
        row = {"SMILES": smile}
        for key, value in data.items():
            row[key] = value
        rows.append(row)
    df = pd.DataFrame(rows)
    return df

smile_list = ["C(N)(N)CC(N)C", "CCC", "C1CC1", "C1CC1(C)C", "CC(C)C"]
df = collect_branch_count(smile_list)
print(df)
This is my solution.
First I replace every C1 with C, so that each ring atom is evaluated as a single letter. Then I count the open brackets. If exactly one bracket is open after an opening bracket, a new group has started. After a closing bracket, I check whether the next character is an opening bracket, which would mean a consecutive group. If it is not, I reset the counter to 0.
import pandas as pd

def smile_grouping(s):
    s = s.replace('C1', 'C')
    open_brackets = 0
    group_counter = 0
    res = []
    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            if open_brackets == 1:
                group_counter += 1
        elif letter == ')':
            open_brackets -= 1
        else:
            res.append(group_counter)
        if open_brackets == 0:
            if i+1 < len(s) and s[i+1] != '(':
                group_counter = 0
    return res
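The replacement of C1 above only covers ring closure 1 on a carbon, and every other letter is treated as its own atom. As a rough sketch of one way to relax that, the variant below tokenizes the string first. The name smile_grouping_tokens and the TOKEN pattern are hypothetical, and the sketch assumes only Cl and Br occur as two-letter atoms, with no square-bracket atoms such as [NH4+] and no %nn ring closures.

```python
import re

# Hypothetical tokenizer: two-letter halogens first, then any single letter,
# then round brackets. Ring-closure digits and bond symbols are not matched,
# so findall simply drops them.
TOKEN = re.compile(r"Cl|Br|[A-Za-z]|[()]")

def smile_grouping_tokens(s):
    tokens = TOKEN.findall(s)
    open_brackets = 0
    group_counter = 0
    res = []
    for i, tok in enumerate(tokens):
        if tok == "(":
            open_brackets += 1
            # Only a top-level bracket starts a new group
            if open_brackets == 1:
                group_counter += 1
        elif tok == ")":
            open_brackets -= 1
        else:
            res.append(group_counter)
        # Back on the main chain: reset unless another branch opens next
        if open_brackets == 0 and i + 1 < len(tokens) and tokens[i + 1] != "(":
            group_counter = 0
    return res

print(smile_grouping_tokens("C1CC1(C)C"))   # [0, 0, 0, 1, 0]
print(smile_grouping_tokens("C2CC2(Cl)C"))  # [0, 0, 0, 1, 0]
```

The counting logic is unchanged; only the way the string is split into atoms differs.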
This is the result:
df = pd.DataFrame(
    {'smile': [
        "C(N)(N)CC(N)C",
        "CCC",
        "C1CC1",
        "C1CC1(C)C",
        "CC(C)C",
        "C(N)(N)(N)CC(N)C",
        "C((N)(N)N)CC(N)C",
        "CC(CCC)CCCC",
        "CC(CC(C))"
    ]})
df['branch_count'] = df['smile'].apply(smile_grouping)
>>> df
              smile                 branch_count
0     C(N)(N)CC(N)C        [0, 1, 2, 0, 0, 1, 0]
1               CCC                    [0, 0, 0]
2             C1CC1                    [0, 0, 0]
3         C1CC1(C)C              [0, 0, 0, 1, 0]
4            CC(C)C                 [0, 0, 1, 0]
5  C(N)(N)(N)CC(N)C     [0, 1, 2, 3, 0, 0, 1, 0]
6  C((N)(N)N)CC(N)C     [0, 1, 1, 1, 0, 0, 1, 0]
7       CC(CCC)CCCC  [0, 0, 1, 1, 1, 0, 0, 0, 0]
8         CC(CC(C))              [0, 0, 1, 1, 1]
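The question mentions rerunning the process for nested brackets. One possible first step, shown here only as a sketch, is to pull out the text inside each outermost bracket so the same counting can be applied to it again. The helper name top_level_branches is hypothetical, and the code assumes the brackets in the string are balanced.

```python
def top_level_branches(s):
    """Return the text inside each outermost pair of brackets, in order."""
    branches = []
    depth = 0
    start = None
    for i, ch in enumerate(s):
        if ch == "(":
            if depth == 0:
                start = i + 1  # first character inside the outer bracket
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                # Closing the outermost bracket: keep everything inside it
                branches.append(s[start:i])
    return branches

print(top_level_branches("C((N)(N)N)CC(N)C"))  # ['(N)(N)N', 'N']
print(top_level_branches("CC(CC(C))"))         # ['CC(C)']
```

Each returned string can then be passed through the grouping function again; how the second-level counts are numbered is left as described in the question.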