I'm having trouble finding a convenient way to split text by a list of predefined sentences. The sentences can contain any special characters and are completely arbitrary.
Example:
text = "My name. is A. His name is B. Her name is C. That's why..."
delims = ["My name. is", "His name is", "Her name is"]
I want something like:
def custom_sentence_split(text, delims):
    # stuff
    return result
custom_sentence_split(text, delims)
# ["My name. is", " A. ", "His name is", " B. ", "Her name is", " C. That's why..."]
UPD. There is a clumsy solution like the one below, but I'd prefer something more convenient:
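import re

def collect_output(text, finds):
    # Walk through the matches in order, cutting the text at each one.
    text_copy = text[:]
    retn = []
    for found in finds:
        part1, part2 = text_copy.split(found, 1)
        retn += [part1, found]
        text_copy = part2
    retn.append(text_copy)  # keep whatever follows the last delimiter
    return retn

def custom_sentence_split(text, splitters):
    # Escape each delimiter so characters like "." are matched literally.
    pattern = "(" + "|".join(map(re.escape, splitters)) + ")"
    finds = re.findall(pattern, text)
    output = collect_output(text, finds)
    return output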
UPD2: It seems a working solution has been found:
pattern = "(" + "|".join(map(re.escape, delims)) + ")"
re.split(pattern, text)
You want to use the re.split function.
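You will need a regex string like (My\sname\sis|His\sname\sis|Her\sname\sis). The capturing parentheses make re.split keep the delimiters in the result instead of discarding them.
You could construct your regex string like "("+"|".join(map(re.escape, delims))+")"; re.escape makes any special characters in the delimiters match literally.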
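To see what the parentheses change, here is a small sketch. The sample string and the variable names are made up for illustration and are not from the question. It runs the same alternation through re.split once without and once with a capturing group:

```python
import re

sample = "His name is B. Her name is C."  # illustrative input only
delims = ["His name is", "Her name is"]

# Escape each delimiter and join them into one alternation.
alternation = "|".join(map(re.escape, delims))

# No capturing group: the delimiters are discarded.
without_group = re.split(alternation, sample)
print(without_group)  # ['', ' B. ', ' C.']

# Capturing group: the delimiters are kept as separate items.
with_group = re.split("(" + alternation + ")", sample)
print(with_group)  # ['', 'His name is', ' B. ', 'Her name is', ' C.']
```

Both results start with an empty string because the sample begins with a delimiter.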
Edit: You could do something like this:
text = "My name is A. His name is B. Her name is C. That's why..."
delims = ["My name is", "His name is", "Her name is"]
import re
def custom_sentence_split(text, delims):
    pattern = "(" + "|".join(map(re.escape, delims)) + ")"
    return re.split(pattern, text)
print(custom_sentence_split(text,delims))
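Two details may be worth guarding against, depending on your data. When the text starts with a delimiter, re.split returns an empty string as the first item. And if one delimiter is a prefix of another, the alternation takes the first alternative that matches, so the longer ones should come first. A minimal sketch handling both follows; the name split_keep_delims is hypothetical, and dropping empty pieces is an assumption about what you want:

```python
import re

def split_keep_delims(text, delims):
    """Split text on literal delimiters, keeping them and dropping empty pieces."""
    # Try longer delimiters first so "Her name is" wins over "Her name".
    ordered = sorted(delims, key=len, reverse=True)
    pattern = "(" + "|".join(map(re.escape, ordered)) + ")"
    # re.split yields '' when the text starts or ends with a delimiter,
    # or when two delimiters are adjacent; filter those out.
    return [piece for piece in re.split(pattern, text) if piece]

text = "My name. is A. His name is B. Her name is C. That's why..."
delims = ["My name. is", "His name is", "Her name is"]
result = split_keep_delims(text, delims)
print(result)
# ['My name. is', ' A. ', 'His name is', ' B. ', 'Her name is', " C. That's why..."]
```

With the question's original input this gives exactly the list asked for, without the leading empty string.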