python numpy indexing alignment sequence-alignment

Indices of alignment for a list of strings within a larger string


I need a function that returns the indices at which each string in a list aligns with a larger string.

For example:

Given the string:

text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'

and the list of strings:

tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.']

Can a function be created to yield the start index of every token after the first:

indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]


Here is a script I created to illustrate the point:

import numpy as np

# I need a function that takes a string and its tokenized list
# and returns the indices at which the tokens were split
def index_of_split(text_str, list_of_strings):
    #?????
    return indices

# The text string, string token list, and character binary annotations 
# are all given
text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.']
# (This binary array labels the following terms ['Kir4.3', 'Dextran-sulfate', 'glucose'])
bin_ann = [1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

# Here we would apply our function
indices = index_of_split(text, tok)
# This list is the desired output
#indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]

# We could now split the binary array based on these indices
bin_ann_toked = np.split(bin_ann, indices)
# and combine with the tokenized list. An object array is used because
# the pieces have different lengths, and recent NumPy versions refuse to
# build a regular array from ragged sequences.
tokenized_strings = np.empty((len(tok), 2), dtype=object)
for i, (token, ann) in enumerate(zip(tok, bin_ann_toked)):
    tokenized_strings[i, 0] = token
    # Remove the trailing zeros, which are likely caused by spaces
    # or other non-tokenized text
    tokenized_strings[i, 1] = ann[:len(token)]
print(tokenized_strings)

This would provide the following output, given that the function worked as described:

[['Kir4.3' array([1, 1, 1, 1, 1, 1])]
 ['is' array([0, 0])]
 ['a' array([0])]
 ['inwardly-rectifying'
  array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]
 ['potassium' array([0, 0, 0, 0, 0, 0, 0, 0, 0])]
 ['channel' array([0, 0, 0, 0, 0, 0, 0])]
 ['.' array([0])]
 ['Dextran-sulfate' array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])]
 ['is' array([0, 0])]
 ['useful' array([0, 0, 0, 0, 0, 0])]
 ['in' array([0, 0])]
 ['glucose' array([1, 1, 1, 1, 1, 1, 1])]
 ['-' array([0])]
 ['mediated' array([0, 0, 0, 0, 0, 0, 0, 0])]
 ['channels' array([0, 0, 0, 0, 0, 0, 0, 0])]
 ['.' array([0])]]

Solution

  • text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
    
    tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.']
    
    
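    # ind[0] is a seed so that the first search starts at position 0;
    # each later search starts at the previous token's start index
    ind = [0]
    for i, substring in enumerate(tok):
        ind.append(text.find(substring, ind[i], len(text)))
    
    # Drop the seed and the first token's index (0)
    print(ind[2:])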
    

    results in

    [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
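
The loop above starts each search at the previous token's start index, so a token that is identical to the one before it, or is contained in it, would match the same position again. As a minimal sketch (not the answerer's code), the search can instead resume where the previous token ended. The function name `index_of_split` is taken from the question; the internal names `starts` and `pos`, and the choice to raise `ValueError` for a missing token, are assumptions made for illustration.

```python
def index_of_split(text_str, list_of_strings):
    """Return the start index of every token after the first.

    Each search begins where the previous token ended, so repeated
    tokens (such as the two 'is' entries) cannot match the same spot twice.
    """
    starts = []
    pos = 0
    for token in list_of_strings:
        start = text_str.find(token, pos)
        if start == -1:
            # Assumption: a token that cannot be aligned is treated as an error
            raise ValueError(f"token {token!r} not found at or after index {pos}")
        starts.append(start)
        pos = start + len(token)
    # Drop the first token's start so the result can be passed to np.split
    return starts[1:]


text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel', '.', 'Dextran-sulfate', 'is', 'useful', 'in', 'glucose', '-', 'mediated', 'channels', '.']

print(index_of_split(text, tok))
# [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
```

For this example the result is the same as the list above; the two versions differ only on inputs such as `index_of_split('a a', ['a', 'a'])`, where this sketch returns `[2]`.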