python-3.x regex regular-language

Split a string by punctuation marks (.!?;:) while excluding abbreviations


I'd like to create a function that splits a string containing multiple sentences on dots, but also handles abbreviations. For example, it shouldn't split after "Univ." or "Dept.". It's hard to explain in words, so I'll show the test cases. I have seen this post (Split string with "." (dot) while handling abbreviations), but the answer there removed the non-terminal dots (U.S.A. became USA), and I want to keep the dots in place.

This is my function:

import re


def split_string_by_punctuation(line: str) -> list[str]:
    """
    Splits a given string into a list of strings using terminal punctuation marks (., !, ?, ;, or :) as delimiters.

    This function utilizes regular expression patterns to ensure that abbreviations, honorifics,
    and certain special cases are not considered as sentence delimiters.

    Args:
        line (str): The input string to be split into sentences.

    Returns:
        list: A list of strings representing the sentences obtained after splitting the input string.

    Notes:
        - Negative lookbehind is used to exclude abbreviations (e.g., "e.g.", "i.e.", "U.S.A."),
          which might have a period but are not the end of a sentence.
        - Negative lookbehind is also used to exclude honorifics (e.g., "Mr.", "Mrs.", "Dr.")
          that might have a period but are not the end of a sentence.
        - Negative lookbehind is also used to exclude some abbreviations (e.g., "Dept.", "Univ.", "et al.")
          that might have a period but are not the end of a sentence.
        - Positive lookbehind is used to require a terminal punctuation mark (., !, ?, ;, or :)
          immediately before the whitespace character that is matched.
    """
    punct_regex = re.compile(r"(?<=[.!?;:])(?:(?<!Prof\.)|(?<!Dept\.)|(?<!Univ\.)|(?<!et\sal\.))(?<!\w\.\w.)(?<![A-Z][a-z]\.)\s")


    return re.split(punct_regex, line)

And these are my test cases:

class TestSplitStringByPunctuation(object):
    def test_split_string_by_punctuation_1(self):
        # Test case 1
        text1 = "I am studying at Univ. of California, Dept. of Computer Science. The research team includes " \
                "Prof. Smith, Dr. Johnson, and Ms. Adams et al. so we are working on a new project."
        result1 = split_string_by_punctuation(text1)
        assert result1 == ['I am studying at Univ. of California, Dept. of Computer Science.',
                           'The research team includes Prof. Smith, Dr. Johnson, and Ms. Adams et al. '
                           'so we are working on a new project.'], "Test case 1 failed"

    def test_split_string_by_punctuation_2(self):
        # Test case 2
        text2 = "This is a city in U.S.A.. This is i.e. one! What about this e.g. one? " \
                "Finally, here's the last one:"
        result2 = split_string_by_punctuation(text2)
        assert result2 == ['This is a city in U.S.A..', 'This is i.e. one!', 'What about this e.g. one?',
                           "Finally, here's the last one:"], "Test case 2 failed"

    def test_split_string_by_punctuation_3(self):
        # Test case 3
        text3 = "This sentence contains no punctuation marks from Mr. Zhong, Dr. Lu and Mrs. Han It should return as a single element list"
        result3 = split_string_by_punctuation(text3)
        assert result3 == [
            'This sentence contains no punctuation marks from Mr. Zhong, Dr. Lu and Mrs. Han It should return '
            'as a single element list'], "Test case 3 failed"

For example, test case 1 fails. The result is ['I am studying at Univ.', 'of California, Dept.', 'of Computer Science.', 'The research team includes Prof.', 'Smith, Dr. Johnson, and Ms. Adams et al.', 'so we are working on a new project.'], which splits the string after "Univ.", "Dept.", "Prof." and "et al.".


Solution

  • I would suggest using findall to capture sentences instead of split to identify sentence breaks.

    Some other remarks:
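One point worth spelling out: in the question's pattern the abbreviation lookbehinds are joined with |, so the group succeeds whenever at least one of them holds. A position cannot be preceded by both "Prof." and "Dept." at once, so at least one of them always holds and the group never blocks a split. Writing the lookbehinds one after another requires all of them to hold. A minimal sketch, where the shortened patterns and the sample string are assumptions made up for illustration:

```python
import re

# Lookbehinds joined with "|": the group passes if ANY one of them holds,
# which is always the case, so it never prevents a split.
either = re.compile(r"(?<=\.)(?:(?<!Prof\.)|(?<!Dept\.))\s")

# Lookbehinds written in sequence: ALL of them must hold.
both = re.compile(r"(?<=\.)(?<!Prof\.)(?<!Dept\.)\s")

# Hypothetical sample string, not taken from the question.
sample = "Ask Prof. Smith at the Dept. of Physics. He knows."

print(either.split(sample))
print(both.split(sample))
```

This only illustrates the mechanism. It is not claimed to make the question's full pattern pass all three tests.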

    After fixing the test cases, this function passed the tests:

    import re

    def split_string_by_punctuation(line):
        punct_regex = r"(?=\S)(?:[A-Z][a-z]{0,3}\.|[^.?!;:]|\.(?!\s+[A-Z]))*.?"
        return re.findall(punct_regex, line)
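As a quick check, here is a minimal sketch that runs the function on two of the strings from the question's tests. It repeats the definition together with the import it needs. The `short` string at the end is a made-up example, not part of the question.

```python
import re


def split_string_by_punctuation(line):
    # Match whole sentences rather than splitting on the breaks between them.
    punct_regex = r"(?=\S)(?:[A-Z][a-z]{0,3}\.|[^.?!;:]|\.(?!\s+[A-Z]))*.?"
    return re.findall(punct_regex, line)


text1 = ("I am studying at Univ. of California, Dept. of Computer Science. "
         "The research team includes Prof. Smith, Dr. Johnson, and Ms. Adams "
         "et al. so we are working on a new project.")
text2 = ("This is a city in U.S.A.. This is i.e. one! What about this e.g. one? "
         "Finally, here's the last one:")

for sentence in split_string_by_punctuation(text1) + split_string_by_punctuation(text2):
    print(sentence)

# Hypothetical extra input: a sentence ending in a short capitalised word.
short = "I saw Bob. He left."
print(split_string_by_punctuation(short))
```

Note that the abbreviation rule is a heuristic. Any capital letter followed by up to three lowercase letters and a dot is treated as an abbreviation, so "I saw Bob. He left." comes back as a single element.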
    

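    Explanation:

      • (?=\S) requires the match to start at a non-whitespace character, so the whitespace between sentences is skipped.
      • (?: ... )* is a non-capturing group, repeated as often as possible, that consumes one of the following:
          - [A-Z][a-z]{0,3}\. is a capital letter, up to three lowercase letters and a dot, i.e. a short abbreviation such as "Dr.", "Prof." or "Univ.".
          - [^.?!;:] is any character that is not one of the punctuation marks.
          - \.(?!\s+[A-Z]) is a dot that is not followed by whitespace and a capital letter, such as the dots in "e.g." or "et al.".
      • .? optionally matches one more character, which is the terminating punctuation mark when there is one.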

    NB: a non-capturing group still matches text, it just cannot be referenced with a back reference. The word "capture" refers to creating a group for it, not to "matching".
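With re.findall this distinction has a practical consequence: when the pattern contains a capturing group, findall returns the text of that group instead of the whole match. A minimal sketch, using a simplified pattern and a sample string that are assumptions for illustration only:

```python
import re

# Hypothetical sample string and simplified patterns.
sample = "One two. Three four!"

# Non-capturing group: findall returns each whole match.
whole = re.findall(r"(?:\w+\s?)+[.!]", sample)

# Capturing group: findall returns only what the group captured
# in its last repetition.
last = re.findall(r"(\w+\s?)+[.!]", sample)

print(whole)
print(last)
```

This is why the group in the pattern above is written as (?: ... ) rather than ( ... ).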