Tags: python, regex, split, space, positive-lookahead

Splitting sentences on a space that follows a variable-length expression


Given the following text:

text = "Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character. In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2] She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5] It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"

I need:

["Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.",
 "In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2]",
 "She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5]",
 "It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"]

I tried this but it doesn't work:

import re

new_line = re.split(r'(?<=\.) |(([.?!](\[\d+\])+))\s', text)
print(new_line)

The result I am getting is this:

['Van der Weyden was preoccupied by commissioned\xa0portraiture\xa0towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.', None, None, None, "In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers", '.[2]', '.[2]', '[2]', 'She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress', '.[3][4][5]', '.[3][4][5]', '[5]', "It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"]

Solution

  • You can use

    re.findall(r'(?s)(.*?(?:\.|[.?!](?:\[\d+\])+))(?:\s+|\s*\Z)', text)
    

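    Details:

      • (?s) - inline flag equivalent to re.DOTALL, so . also matches line breaks
      • (.*?(?:\.|[.?!](?:\[\d+\])+)) - Group 1, the sentence that re.findall returns:
          • .*? - any zero or more characters, as few as possible
          • (?:\.|[.?!](?:\[\d+\])+) - either a single . or one of ., ? or ! followed by one or more bracketed numbers such as [2] or [3][4][5]
      • (?:\s+|\s*\Z) - one or more whitespace characters, or optional whitespace at the very end of the string. This part is matched but not captured, so it is left out of the results.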
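The re.split attempt from the question returns None entries and repeated fragments because re.split also returns the text of every capturing group in the pattern, and None for any group that did not take part in the match. The sketch below reproduces this on a shortened, made-up sample string (not the original text). It also shows that switching to non-capturing groups is not enough, since re.split then consumes the sentence-ending punctuation and citations. That is why re.findall with a single capturing group is used instead.

```python
import re

# Shortened, hypothetical sample; not the text from the question.
sample = "One. Two.[2] Three![3][4]"

# The pattern from the question: three capturing groups leak into the result.
attempt = re.split(r'(?<=\.) |(([.?!](\[\d+\])+))\s', sample)
print(attempt)
# ['One.', None, None, None, 'Two', '.[2]', '.[2]', '[2]', 'Three![3][4]']

# Same pattern with non-capturing groups: no extra items, but the
# delimiter ".[2] " is consumed, so the sentence loses its ending.
non_capturing = re.split(r'(?<=\.) |(?:[.?!](?:\[\d+\])+)\s', sample)
print(non_capturing)
# ['One.', 'Two', 'Three![3][4]']
```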

    See the Python demo:

    import re
    text = "Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character. In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2] She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5] It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"
    print(re.findall(r'(.*?(?:\.|[.?!](?:\[\d+\])+))(?:\s+|\s*\Z)', text, re.DOTALL))
    

    Output:

    [
      'Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.',
      "In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2]",
      'She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5]',
      "It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"
    ]
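
Note that the pattern above ends a sentence on ? or ! only when at least one [n] marker follows; a bare "?" or "!" before a space is not treated as a sentence end. If that matters for your data, one possible variant (an assumption on my part, not part of the original pattern) is to accept any of . ? ! followed by zero or more markers. The names SENTENCE_RE and split_sentences below are hypothetical helpers used only for illustration:

```python
import re

# Hypothetical variant: any of . ? ! followed by ZERO or more [n] markers.
SENTENCE_RE = re.compile(r'(.*?[.?!](?:\[\d+\])*)(?:\s+|\s*\Z)', re.DOTALL)

def split_sentences(text):
    """Return the sentences in text, keeping trailing [n] citation markers."""
    # findall returns only Group 1, so the separating whitespace is dropped.
    return SENTENCE_RE.findall(text)

# Shortened, made-up sample string.
sample = "Is it signed? No.[1] It is untitled![2][3]"
print(split_sentences(sample))
# ['Is it signed?', 'No.[1]', 'It is untitled![2][3]']

# Caveat: trailing text without a terminator is not returned.
print(split_sentences("Done. trailing"))
# ['Done.']
```

Like the original, this sketch will also split after abbreviations such as "e.g. " or "Dr. ", so it is only suitable when the text does not contain them.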