I need a function that returns the indices at which each string in a list of tokens aligns to a larger string.
For example:
Given the string:
text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
and the list of strings:
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel', '.', 'Dextran-sulfate', 'is', 'useful', 'in', 'glucose', '-', 'mediated', 'channels', '.']
Can a function be created to yield:
indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
Here is a script I created to illustrate the point:
import numpy as np
# I need a function that takes the text string and the tokenized list
# and returns the indices at which each token starts in the text
def index_of_split(text_str, list_of_strings):
    # ?????
    return indices
# The text string, string token list, and character binary annotations
# are all given
text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel', '.', 'Dextran-sulfate', 'is', 'useful', 'in', 'glucose', '-', 'mediated', 'channels', '.']
# (This binary array labels the following terms ['Kir4.3', 'Dextran-sulfate', 'glucose'])
bin_ann = [1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
# Here we would apply our function
indices = index_of_split(text, tok)
# This list is the desired output
#indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
# We could now split the binary array based on these indices
bin_ann_toked = np.split(bin_ann, indices)
# and combine with the tokenized list
tokenized_strings = np.vstack((tok, bin_ann_toked)).T
# Then we can remove the trailing elements (zeros here),
# which likely come from the spaces
# and other non-tokenized text between tokens
for i, el in enumerate(tokenized_strings):
    tokenized_strings[i][1] = el[1][:len(el[0])]
print(tokenized_strings)
Given a working function, this would produce the following output:
[['Kir4.3' array([1, 1, 1, 1, 1, 1])]
['is' array([0, 0])]
['a' array([0])]
['inwardly-rectifying'
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]
['potassium' array([0, 0, 0, 0, 0, 0, 0, 0, 0])]
['channel' array([0, 0, 0, 0, 0, 0, 0])]
['.' array([0])]
['Dextran-sulfate' array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])]
['is' array([0, 0])]
['useful' array([0, 0, 0, 0, 0, 0])]
['in' array([0, 0])]
['glucose' array([1, 1, 1, 1, 1, 1, 1])]
['-' array([0])]
['mediated' array([0, 0, 0, 0, 0, 0, 0, 0])]
['channels' array([0, 0, 0, 0, 0, 0, 0, 0])]
['.' array([0])]]
text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel', '.', 'Dextran-sulfate', 'is', 'useful', 'in', 'glucose', '-', 'mediated', 'channels', '.']
ind = []
pos = 0
for substring in tok:
    # search from just past the previous token, so that repeated
    # adjacent tokens (e.g. 'is' ... 'is') are not matched twice
    pos = text.find(substring, pos)
    ind.append(pos)
    pos += len(substring)
print(ind[1:])
results in
[7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
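For completeness, the same scan-and-advance logic can be wrapped into the `index_of_split` signature from the question. This is a sketch; it assumes every token actually occurs in the text, in left-to-right order:

```python
def index_of_split(text_str, list_of_strings):
    indices = []
    pos = 0
    for substring in list_of_strings:
        pos = text_str.find(substring, pos)  # first match at or after pos
        indices.append(pos)
        pos += len(substring)  # advance past this token before the next search
    # drop the leading index (0 here) so the result matches the
    # split points expected by np.split
    return indices[1:]

text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel', '.', 'Dextran-sulfate', 'is', 'useful', 'in', 'glucose', '-', 'mediated', 'channels', '.']
indices = index_of_split(text, tok)
```

Advancing `pos` by the length of each matched token is what makes this robust to duplicate tokens, which a search that restarts from the previous match's start index would handle incorrectly.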