python, regex, python-re, sentence

Split text by sentences


I'm having trouble finding a convenient way to split text by a list of predefined sentences. The sentences can contain any special characters and are completely arbitrary.

Example:

text = "My name. is A. His name is B. Her name is C. That's why..."
delims = ["My name. is", "His name is", "Her name is"]

I want something like:

def custom_sentence_split(text, delims):
    # stuff
    return result

custom_sentence_split(text, delims)
# ["My name. is", " A. ", "His name is", " B. ", "Her name is", " C. That's why..."]

UPD: There is a clumsy solution like the one below, but I'd prefer a more convenient one.


import re


def collect_output(text, finds):
    # Walk through the matches in order, cutting the text at each one.
    rest = text
    retn = []
    for found in finds:
        part1, rest = rest.split(found, 1)
        retn += [part1, found]
    retn.append(rest)  # keep the text after the last match
    return retn


def custom_sentence_split(text, splitters):
    # Escape each splitter so characters like "." are matched literally.
    pattern = "|".join(map(re.escape, splitters))
    finds = re.findall(pattern, text)
    return collect_output(text, finds)

UPD2: It seems a working solution has been found.

pattern = "(" + "|".join(map(re.escape, delims)) + ")"
re.split(pattern, text)
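
To illustrate why the re.escape call matters when the sentences can contain special characters, here is a minimal sketch. The sample text and delimiter are hypothetical and chosen only because "?" is a regex quantifier.

```python
import re

# Hypothetical sample: the delimiter ends with "?", a regex quantifier.
text = "Is it 5? Yes."
delims = ["Is it 5?"]

# Without escaping, "5?" means "an optional 5", so the "?" itself is never matched.
raw_pattern = "(" + "|".join(delims) + ")"
raw_parts = re.split(raw_pattern, text)

# With re.escape, every character of the delimiter is matched literally.
safe_pattern = "(" + "|".join(map(re.escape, delims)) + ")"
safe_parts = re.split(safe_pattern, text)

print(raw_parts)   # ['', 'Is it 5', '? Yes.']
print(safe_parts)  # ['', 'Is it 5?', ' Yes.']
```

In both cases the result starts with an empty string because the text begins with the delimiter.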

Solution

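  • You want to use the re.split function.

    You will need a regex string like (My\sname\sis|His\sname\sis|Her\sname\sis): the delimiters joined with | and wrapped in a capturing group, so that re.split keeps the delimiters in the result instead of discarding them.

    You could construct your regex string like "("+"|".join(map(re.escape, delims))+")". re.escape makes sure any special characters in the delimiters are matched literally.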

    Edit: You could do something like this:

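    import re

    text = "My name is A. His name is B. Her name is C. That's why..."
    delims = ["My name is", "His name is", "Her name is"]

    def custom_sentence_split(text, delims):
        # Escape each delimiter, join them with "|", and wrap them in a
        # capturing group so re.split keeps the delimiters in the output.
        pattern = "(" + "|".join(map(re.escape, delims)) + ")"
        return re.split(pattern, text)

    # The first element is '' because the text starts with a delimiter.
    print(custom_sentence_split(text, delims))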
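
Two details may be worth handling on top of this. re.split returns an empty string wherever a delimiter sits at the very start or end of the text, and regex alternation picks the first alternative that matches, so a delimiter that is a prefix of another one can shadow it. Below is a minimal sketch that covers both, assuming that empty pieces are unwanted and that the longer delimiter should win. The helper name split_keep_delims is made up for illustration.

```python
import re


def split_keep_delims(text, delims):
    # Hypothetical helper, not part of the re module.
    if not delims:
        return [text]
    # Longest delimiters first, so "His name is not" wins over "His name is".
    ordered = sorted(delims, key=len, reverse=True)
    pattern = "(" + "|".join(map(re.escape, ordered)) + ")"
    # Drop the empty strings re.split yields at the edges of the text
    # or between two adjacent delimiters.
    return [part for part in re.split(pattern, text) if part]


text = "My name. is A. His name is B. Her name is C. That's why..."
delims = ["My name. is", "His name is", "Her name is"]
result = split_keep_delims(text, delims)
print(result)
# ['My name. is', ' A. ', 'His name is', ' B. ', 'Her name is', " C. That's why..."]
```

If the empty strings carry meaning for you (for example, to tell whether the text started with a delimiter), leave out the final filter.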