Edit2: So for this smile ('CC(C)(C)C1=C(C(=C(C=C1)O)O)O'), the distance metric would look like this.
[1, 2.0, 3.0, 3.0, 3.0, 4.5, 6.0, 7.5, 6.0, 4.5, 8.5, 7.0, 5.5].
This vector is in the order of the letters in the smile.
| SMILES | C-C | C-C-C-C-C | C-C-C@C@ | C-C@C@O- | C-C@C@ |O-C |
| ------------------------------- | -------------- | -------------- | -------------- | -------------- | -------------- |--------------|
| CC(C)(C)C1=C(C(=C(C=C1)O)O)O | 3 | 1 | 1 | 3 | 2 | 3 |
Now what I wanted was this.
| SMILES | C-C | C-C-C-C-C | C-C-C@C@ | C-C@C@O- | C-C@C@ |O-C |
| ------------------------------- | -------------- | -------------- | -------------- | -------------- | -------------- |--------------|
| CC(C)(C)C1=C(C(=C(C=C1)O)O)O | [1, 3, 3] | [2] | [3] | [4.5, 6, 7.5] | [6, 4] | [8.5, 7, 5.5] |
As you can see from the table above, the number of columns in the vector represents the frequency of the group, with the value corresponding to the distance metric of each of the occurrences of the group. I think that it requires a sort of indexing to associate each group defined with the distance metric.
Edit1:I have posted my attempts below.
I am using rdkit to classify groups and a distance metric from a string (which represents a molecule). I am trying to create a vector which will be headed by the functional group, and the number of rows equals the number of occurrences of that group in a smile (the string which represents the molecule). The value inside is a topological distance metric.
C-C-C | C-C- |
---|---|
1.5 | 1 |
3 | 5 |
4 |
I have 2 codes, 1 to get the groups and 1 to get the distances. I have been struggling on combining them to get the output above. I have made attempts on this, but have been struggling. I can send what I tried if anyone wants to see. These are the two codes I have made for each job.
#Group code
import pandas as pd
from rdkit import Chem
def collect_bonded_atoms(smile):
mol = Chem.MolFromSmiles(smile)
atom_counts = {}
for atom in mol.GetAtoms():
charge = '+' if atom.GetFormalCharge() > 0 else '-' if atom.GetFormalCharge() < 0 else ''
neighbors = [(neighbor.GetSymbol(), bond.GetBondType())
for neighbor, bond in zip(atom.GetNeighbors(), atom.GetBonds())]
neighbors.sort()
key = "{}{}-{}".format(charge, atom.GetSymbol(), "".join(f"{symbol}{'-' if bond_order == 1 else '=' if bond_order == 2 else '#' if bond_order == 3 else '@'}" for symbol, bond_order in neighbors))
atom_counts[key] = atom_counts.get(key, 0) + 1
return atom_counts
smile_list = ['CC(C)(C)C1=C(C(=C(C=C1)O)O)O']
# Create a list of dictionaries for each SMILES
data = []
for smile in smile_list:
atom_counts = collect_bonded_atoms(smile)
data.append(atom_counts)
# Create a pandas DataFrame from the list of dictionaries
df = pd.DataFrame(data)
# Fill missing values with zeros
df = df.fillna(0)
# Add a column for the SMILES
df.insert(0, 'SMILES', smile_list)
# Display the DataFrame
print(df)
#df.to_excel("TOTALSMILESGROUPSwhole_updated.xlsx")
from rdkit import Chem
from queue import PriorityQueue
def dijkstra(mol, start_atom_index):
"""Computes the shortest path between a start atom and all other atoms in the molecule using Dijkstra's algorithm.
Args:
mol (RDKit Mol object): The input molecule.
start_atom_index (int): The index of the start atom.
Returns:
A list of distances from the start atom to each other atom in the molecule, where the i-th element is the distance from the start atom to the i-th atom in the molecule with an offset of 1 added to each distance.
"""
# Initialize distances to infinity
distances = [float('inf')] * mol.GetNumAtoms()
distances[start_atom_index] = 0
# Initialize priority queue with start atom
pq = PriorityQueue()
pq.put((0, start_atom_index))
# Loop until all atoms have been visited
while not pq.empty():
# Get the next atom with the shortest distance
dist, atom_index = pq.get()
# Update the distances of its neighbors
atom = mol.GetAtomWithIdx(atom_index)
for neighbor in atom.GetNeighbors():
neighbor_index = neighbor.GetIdx()
bond = mol.GetBondBetweenAtoms(atom_index, neighbor_index)
bond_length = bond.GetBondTypeAsDouble()
new_distance = dist + bond_length
# Update the distance if it is shorter
if new_distance < distances[neighbor_index]:
distances[neighbor_index] = new_distance
pq.put((new_distance, neighbor_index))
# Add an offset of 1 to each distance and return the list
return [x+1 for x in distances]
# Example usage
smiles = 'CC(C)(C)C1=C(C(=C(C=C1)O)O)O'
mol = Chem.MolFromSmiles(smiles)
distances = dijkstra(mol, 0)
print(distances)
Can anyone guide me on how to do this, or even better show me with the code? Many Thanks
#attempt 1
import pandas as pd
from rdkit import Chem
from queue import PriorityQueue
def collect_bonded_atoms(smile):
mol = Chem.MolFromSmiles(smile)
atom_counts = {}
distances = {}
for atom in mol.GetAtoms():
charge = '+' if atom.GetFormalCharge() > 0 else '-' if atom.GetFormalCharge() < 0 else ''
neighbors = [(neighbor.GetSymbol(), bond.GetBondType())
for neighbor, bond in zip(atom.GetNeighbors(), atom.GetBonds())]
neighbors.sort()
key = "{}{}-{}".format(charge, atom.GetSymbol(), "".join(f"{symbol}{'-' if bond_order == 1 else '=' if bond_order == 2 else '#' if bond_order == 3 else '@'}" for symbol, bond_order in neighbors))
atom_counts[key] = atom_counts.get(key, 0) + 1
distances[key] = {}
dijkstra_distances = dijkstra(mol, atom.GetIdx())
for neighbor in atom.GetNeighbors():
neighbor_key = "{}{}-{}".format('',
neighbor.GetSymbol(),
"".join(f"{bond.GetBondTypeAsDouble()}" for bond in mol.GetBonds() if (bond.GetBeginAtomIdx() == atom.GetIdx() and bond.GetEndAtomIdx() == neighbor.GetIdx()) or (bond.GetBeginAtomIdx() == neighbor.GetIdx() and bond.GetEndAtomIdx() == atom.GetIdx())))
neighbor_index = neighbor.GetIdx()
distances[key][neighbor_key] = dijkstra_distances[neighbor_index]
return atom_counts, distances
def dijkstra(mol, start_atom_index):
"""Computes the shortest path between a start atom and all other atoms in the molecule using Dijkstra's algorithm.
Args:
mol (RDKit Mol object): The input molecule.
start_atom_index (int): The index of the start atom.
Returns:
A list of distances from the start atom to each other atom in the molecule, where the i-th element is the distance from the start atom to the i-th atom in the molecule.
"""
# Initialize distances to infinity
distances = [float('inf')] * mol.GetNumAtoms()
distances[start_atom_index] = 0
# Initialize priority queue with start atom
pq = PriorityQueue()
pq.put((0, start_atom_index))
# Loop until all atoms have been visited
while not pq.empty():
# Get the next atom with the shortest distance
dist, atom_index = pq.get()
# Update the distances of its neighbors
atom = mol.GetAtomWithIdx(atom_index)
for neighbor in atom.GetNeighbors():
neighbor_index = neighbor.GetIdx()
bond = mol.GetBondBetweenAtoms(atom_index, neighbor_index)
bond_length = bond.GetBondTypeAsDouble()
new_distance = dist + bond_length
# Update the distance if it is shorter
if new_distance < distances[neighbor_index]:
distances[neighbor_index] = new_distance
pq.put((new_distance, neighbor_index))
return distances
smile_list = ['C1=CC=CC=C1', 'CC(C)C', 'CCO', 'C1=CC=C2C(=C1)C(=CN2)CC=C']
data = []
for smile in smile_list:
atom_counts, distances = collect_bonded_atoms(smile)
for key, value in atom_counts.items():
row = {}
# Add the SMILES to the row
row['SMILES'] = smile
# Add the presence of each group to the row
for group, count in atom_counts.items():
row[group] = 1 if group == key else 0
# Add the distance of each group to the row
for group, dist_dict in distances.items():
if group == key:
continue
dist = dist_dict.get(key, float('inf'))
row[group] = (1, dist) if dist != float('inf') else (0, 0)
data.append(row)
# Create a pandas DataFrame from the list of dictionaries
df = pd.DataFrame(data)
# Reorder the columns so that SMILES comes first
cols = list(df.columns)
cols.remove('SMILES')
cols = ['SMILES'] + cols
df = df[cols]
# Fill missing values with zeros
df = df.fillna(0)
# Display the DataFrame
print(df)
#attempt 2
from rdkit import Chem
from queue import PriorityQueue
def collect_bonded_atoms(smile):
mol = Chem.MolFromSmiles(smile)
Chem.Kekulize(mol)
atom_counts = {}
for atom in mol.GetAtoms():
symbol = atom.GetSymbol()
charge = '+' if atom.GetFormalCharge() > 0 else '-' if atom.GetFormalCharge() < 0 else ''
neighbors = [(neighbor.GetSymbol(), bond.GetBondType())
for neighbor, bond in zip(atom.GetNeighbors(), atom.GetBonds())]
neighbors.sort()
key = "{}{}-{}".format(charge, symbol, "".join(f"{symbol}{'-' if bond_order == 1 else '=' if bond_order == 2 else '#' if bond_order == 3 else '@'}" for symbol, bond_order in neighbors))
atom_counts[key] = atom_counts.get(key, 0) + 1
return atom_counts
def dijkstra(mol, start_atom_index):
"""Computes the shortest path between a start atom and all other atoms in the molecule using Dijkstra's algorithm.
Args:
mol (RDKit Mol object): The input molecule.
start_atom_index (int): The index of the start atom.
Returns:
A list of distances from the start atom to each other atom in the molecule, where the i-th element is the distance from the start atom to the i-th atom in the molecule.
"""
# Initialize distances to infinity
distances = [float('inf')] * mol.GetNumAtoms()
distances[start_atom_index] = 0
# Initialize priority queue with start atom
pq = PriorityQueue()
pq.put((0, start_atom_index))
# Loop until all atoms have been visited
while not pq.empty():
# Get the next atom with the shortest distance
dist, atom_index = pq.get()
# Update the distances of its neighbors
atom = mol.GetAtomWithIdx(atom_index)
for neighbor in atom.GetNeighbors():
neighbor_index = neighbor.GetIdx()
bond = mol.GetBondBetweenAtoms(atom_index, neighbor_index)
bond_length = bond.GetBondTypeAsDouble()
new_distance = dist + bond_length
# Update the distance if it is shorter
if new_distance < distances[neighbor_index]:
distances[neighbor_index] = new_distance
pq.put((new_distance, neighbor_index))
return distances
# Define a list of SMILES
smile_list = ['CC(C)(C)C1=C(C(=C(C=C1)O)O)O', 'C1CCCCC1']
# Create a list to hold the tuples for each SMILES
smile_tuples = []
# Loop through each SMILES and create a tuple for each group and its location
for smile in smile_list:
# Create a dictionary of group frequencies for this SMILES
atom_counts = collect_bonded_atoms(smile)
# Compute the topical distance matrix for this SMILES
mol = Chem.MolFromSmiles(smile)
distances = dijkstra(mol, 0)
# Create a list of tuples for this SMILES, where each tuple is (group, distance)
tuples = []
for group, frequency in atom_counts.items():
symbol = group.split('-', 1)[1].split('-')[0]
for i in range(frequency):
atom_index = Chem.MolFromSmiles(smile).GetSubstructMatches(Chem.MolFromSmiles(symbol))[i][0]
distance = distances[atom_index]
tuples.append((group, distance))
print(tuples)
print(distances)
print(group)
print(symbol)
#attempt 3
import pandas as pd
from rdkit import Chem
from queue import PriorityQueue
def dijkstra(mol, start_atom_index):
"""Computes the shortest path between a start atom and all other atoms in the molecule using Dijkstra's algorithm.
Args:
mol (RDKit Mol object): The input molecule.
start_atom_index (int): The index of the start atom.
Returns:
A list of distances from the start atom to each other atom in the molecule, where the i-th element is the distance from the start atom to the i-th atom in the molecule.
"""
# Initialize distances to infinity
distances = [float('inf')] * mol.GetNumAtoms()
distances[start_atom_index] = 0
# Initialize priority queue with start atom
pq = PriorityQueue()
pq.put((0, start_atom_index))
# Loop until all atoms have been visited
while not pq.empty():
# Get the next atom with the shortest distance
dist, atom_index = pq.get()
# Update the distances of its neighbors
atom = mol.GetAtomWithIdx(atom_index)
for neighbor in atom.GetNeighbors():
neighbor_index = neighbor.GetIdx()
bond = mol.GetBondBetweenAtoms(atom_index, neighbor_index)
bond_length = bond.GetBondTypeAsDouble()
new_distance = dist + bond_length
# Update the distance if it is shorter
if new_distance < distances[neighbor_index]:
distances[neighbor_index] = new_distance
pq.put((new_distance, neighbor_index))
return distances
def collect_bonded_atoms_and_distances(smile):
mol = Chem.MolFromSmiles(smile)
atom_counts = {}
distances = {}
for atom in mol.GetAtoms():
charge = '+' if atom.GetFormalCharge() > 0 else '-' if atom.GetFormalCharge() < 0 else ''
neighbors = [(neighbor.GetSymbol(), bond.GetBondType())
for neighbor, bond in zip(atom.GetNeighbors(), atom.GetBonds())]
neighbors.sort()
key = "{}{}-{}".format(charge, atom.GetSymbol(), "".join(f"{symbol}{'-' if bond_order == 1 else '=' if bond_order == 2 else '#' if bond_order == 3 else '@'}" for symbol, bond_order in neighbors))
if key not in atom_counts:
atom_counts[key] = 0
distances[key] = []
atom_counts[key] += 1
distances[key].extend(dijkstra(mol, atom.GetIdx()))
return {**atom_counts, **distances}
smile_list = ['CC(C)(C)C1=C(C(=C(C=C1)O)O)O']
# Create a list of dictionaries for each SMILES
data = []
for smile in smile_list:
counts_and_distances = collect_bonded_atoms_and_distances(smile)
data.append(counts_and_distances)
# Create a pandas DataFrame from the list of dictionaries
df = pd.DataFrame(data)
# Fill missing values with zeros
df = df.fillna(0)
# Transpose the DataFrame to match the desired output format
df = df.T
# Rename the columns and index
df.columns = [f"Atom {i}" for i in range(1, len(df.columns) + 1)]
df.index.name = 'Functional group'
# Display the final DataFrame
print(df)
#attempt 4
import pandas as pd
from rdkit import Chem
from queue import PriorityQueue
def dijkstra(mol, start_atom_index):
"""Computes the shortest path between a start atom and all other atoms in the molecule using Dijkstra's algorithm.
Args:
mol (RDKit Mol object): The input molecule.
start_atom_index (int): The index of the start atom.
Returns:
A list of distances from the start atom to each other atom in the molecule, where the i-th element is the distance from the start atom to the i-th atom in the molecule.
"""
# Initialize distances to infinity
distances = [float('inf')] * mol.GetNumAtoms()
distances[start_atom_index] = 0
# Initialize priority queue with start atom
pq = PriorityQueue()
pq.put((0, start_atom_index))
# Loop until all atoms have been visited
while not pq.empty():
# Get the next atom with the shortest distance
dist, atom_index = pq.get()
# Update the distances of its neighbors
atom = mol.GetAtomWithIdx(atom_index)
for neighbor in atom.GetNeighbors():
neighbor_index = neighbor.GetIdx()
bond = mol.GetBondBetweenAtoms(atom_index, neighbor_index)
bond_length = bond.GetBondTypeAsDouble()
new_distance = dist + bond_length
# Update the distance if it is shorter
if new_distance < distances[neighbor_index]:
distances[neighbor_index] = new_distance
pq.put((new_distance, neighbor_index))
return distances
def collect_bonded_atoms_and_distances(smile):
mol = Chem.MolFromSmiles(smile)
atom_counts = {}
for atom in mol.GetAtoms():
charge = '+' if atom.GetFormalCharge() > 0 else '-' if atom.GetFormalCharge() < 0 else ''
neighbors = [(neighbor.GetSymbol(), bond.GetBondType())
for neighbor, bond in zip(atom.GetNeighbors(), atom.GetBonds())]
neighbors.sort()
key = "{}{}-{}".format(charge, atom.GetSymbol(), "".join(f"{symbol}{'-' if bond_order == 1 else '=' if bond_order == 2 else '#' if bond_order == 3 else '@'}" for symbol, bond_order in neighbors))
atom_counts[key] = atom_counts.get(key, 0) + 1
distances = {}
for key in atom_counts:
distances[key] = []
indices = [atom.GetIdx() for atom in mol.GetAtoms() if atom_counts[key] > 0 and atom.GetSmarts() == key]
for index in indices:
dists = dijkstra(mol, index)
for i, d in enumerate(dists):
if i not in distances[key]:
distances[key].append(d)
atom_counts[key] -= 1
return {**atom_counts, **distances}
smile_list = ['CC(C)(C)C1=C(C(=C(C=C1)O)O)O']
# Create a list of dictionaries for each SMILES
data = []
for smile in smile_list:
counts_and_distances = collect_bonded_atoms_and_distances(smile)
data.append(counts_and_distances)
# Create a pandas DataFrame from the list of dictionaries
df = pd.DataFrame(data)
# Fill missing values with zeros
df = df.fillna(0)
# Add a column for the SMILES
df.insert(0, 'SMILES', smile_list)
# Display the DataFrame
print(df)
Change function collect_bonded_atoms
like this:
def collect_bonded_atoms(smile, array):
mol = Chem.MolFromSmiles(smile)
atom_counts = {}
for idx, atom in enumerate(mol.GetAtoms()):
charge = '+' if atom.GetFormalCharge() > 0 else '-' if atom.GetFormalCharge() < 0 else ''
neighbors = [(neighbor.GetSymbol(), bond.GetBondType())
for neighbor, bond in zip(atom.GetNeighbors(), atom.GetBonds())]
neighbors.sort()
key = "{}{}-{}".format(charge, atom.GetSymbol(), "".join(f"{symbol}{'-' if bond_order == 1 else '=' if bond_order == 2 else '#' if bond_order == 3 else '@'}" for symbol, bond_order in neighbors))
if key in atom_counts:
atom_counts[key][0].append(array[idx])
else:
atom_counts[key] = [[array[idx]]]
return atom_counts
Don't change function dijkstra
at all.
Utilize functions like this:
smile_list = ['CC(C)(C)C1=C(C(=C(C=C1)O)O)O']
mol = Chem.MolFromSmiles(smile_list[0])
distances = dijkstra(mol, 0)
data = collect_bonded_atoms(smile_list[0], distances)
data['SMILES'] = [smile_list[0]]
df = pd.DataFrame(data)
print(df)
To be honest: I don't know whether this is what you logically want to achieve but its output is what you were asking for.
Every time your collect_bonded_atoms
now finds a "group" the index - I guess of the first atom of the group - is taken and mapped to the distances-array.
I have not yet changed the code's functionality to create a huge dataframe for multiple SMILES in smile_list
but I guess that was not the main focus of your question, was it?
Let me know how it works out for you.