I have seen various questions on here about combining linked lists with recursion, but none of them seem to quite fit to my problem.
I have a list with elements that combine to make a single string. The elements of the string have 'attachment point' identifiers, of the format "[*:x]" which denote which element to attach where. For example, ["[*:2]DEF","ABC[*:1]GH"]
will combine to ABCDEFGH
. The process to combine two elements is replacing the attachment points with a character, for example "Z" and then calling a function which combines the two strings at the correct point, using the "Z" character as a locator (so I can't just regex it)
Easy enough when you have 2 elements, but when you have multiple elements that is much more difficult. If I want to combine the elements of the list lib = ["[*:4]F[*:2]","[*:1]GH[*:5]","AB[*:4]","[*:3]C[*:6]E[*:1]","[*:2]IJ","[*:4]D"]
then I need a different approach. My predecessor wrote a function to recursively combine the elements as pure strings, which I have been using but now I need to add in the function to combine the elements as pairs.
The old recursive function is here:
lib = ["[*:4]F[*:2]","[*:1]GH[*:5]","AB[*:4]","[*:3]C[*:6]E[*:1]","[*:2]IJ","[*:4]D"]
def combine(r, element, lib):
if "[*:" in element:
root = f"[*:{r}]"
pos = element.index("[*:")
pos1 = element.index("]", pos)
next_r = int(element[pos + 3:pos1])
con = conf[next_r - 1]
if root in con:
con = con.replace(root, "")
return element[0:pos] + combine(next_r, con, lib) + combine(r, element[pos1 + 1:], lib)
return element
result = combine(1,lib[0],lib)
My semi-functional attempt to modify the above function to be able to use the pairwise combination function pairstitch
is here:
import re
lib = ["[*:4]F[*:2]","[*:1]GH[*:5]","AB[*:4]","[*:3]C[*:6]E[*:1]","[*:2]IJ","[*:4]D"]
#Example to make MWE work - actual function is much more complicated
def pairstitch(frag1, frag2, identifier):
return frag1.replace(identifier, frag2.replace(identifier,""))
def combine(r, smiles, lib):
if "[*:" in element:
root = f"[*:{r}]"
pos = element.index("[*:")
pos1 = element.index("]", pos)
next_r = int(element[pos + 3:pos1])
con = lib[next_r - 1]
attach_lst_base = re.findall("\[\*\:\d\]", element)
attach_lst_con = re.findall("\[\*\:\d\]", con)
if len(attach_lst_con) == 1:
if root in con:
next_r_block = f"[*:{next_r}]"
frag1 = element.replace(next_r_block, "Z")
frag2 = con.replace(root, "Z")
else:
pos = con.index("[*:")
pos1 = con.index("]", pos)
next_r = int(con[pos + 3:pos1])
next_r_block = con[pos:pos1 + 1]
frag1 = smiles.replace(root, "Z")
frag2 = con.replace(next_r_block, "Z")
element = pairstitch(frag1,frag2,"Z")
else:
next_frag = con
for att_pt in attach_lst_base:
if len(attach_lst_base) > len(attach_lst_con):
next_frag = combine(r, next_frag, conf)
else:
next_frag = combine(next_r, next_frag, lib)
next_r_block = f"[*:{next_r}]"
frag1 = smiles.replace(next_r_block, "Z")
frag2 = next_frag.replace(root, "Z")
element = pairstitch(frag1, frag2, "Z")
return element
result = combine(1,lib[0],lib)
However this only goes down one branch of the tree, and gives me ABCDEF[*:2]
when what I would expect is ABCDEFGHIJ
.
For context, the strings being combined represent chemical fragments which might contain numbers, certain special characters like ":,/,,#,=,@", or sections like "[nH2]", hence the format of the attachment points. As such, splitting the elements up into smaller parts will not work because the function which combines pairs of fragments together does so by various chemical rules to make sure a valid molecule comes out the other end.
Came back to this after some time working on other things, and I figured it out, so I'm posting the solution here in case it helps anyone else.
The solution lay in adding a parameter to the function which gave the location of the parent to that fragment. This means that if the list to be combined was
lib = ["[*:4]F[*:2]","[*:1]GH[*:5]","AB[*:4]","[*:3]C[*:6]E[*:1]","[*:2]IJ","[*:4]D"]
then starting with element 0 you would be able to examine element 3 next, while knowing that you got there from element 0.
The code itself is here:
import re
lib = ["[*:4]F[*:2]","[*:1]GH[*:5]","A[*:7][*:4]","[*:3]C[*:6]E[*:1]","[*:2]IJ","[*:4]D", "[*:3]B"]
#Example to make MWE work - actual function is much more complicated
def pairstitch(frag1, frag2, identifier):
return frag1.replace(identifier, frag2.replace(identifier,""))
def combine(r0, r, element, lib):
if "[*:" in element:
root = f"[*:{r}]"
parent = f"[*:{r0}]"
attach_lst_base = re.findall("\[\*\:\d\]", element)
for att_pt in attach_lst_base:
if att_pt != parent:
next_r = int(att_pt[3:-1])
con = lib[next_r - 1]
attach_lst_con = re.findall("\[\*\:\d\]", con)
if len(attach_lst_con) != 1:
next_frag = combine(r, next_r, con, lib)
else:
next_frag = con
if root in next_frag:
next_r_block = f"[*:{next_r}]"
frag1 = element.replace(next_r_block, "Z")
frag2 = next_frag.replace(root, "Z")
element = pairstitch(frag1, frag2, "Z")
return element
return element
test = combine(0,1,lib[0],lib)
print(test)
This outputs "ABCDEFGHIJ" as expected, and so gives an example of how a list of elements with defined attachment points can be combined when under the restriction that the elements have to be combined in a chemically aware manner.