Given the following text:
text = "Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character. In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2] She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5] It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"
I need:
["Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.",
"In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2]",
"She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5]",
"It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"]
I tried this but it doesn't work:
new_line = re.split('(?<=\.) |(([.?!](\[\d+\])+))\s', text)
print(new_line)
The result I am getting is this:
['Van der Weyden was preoccupied by commissioned\xa0portraiture\xa0towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.', None, None, None, "In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers", '.[2]', '.[2]', '[2]', 'She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress', '.[3][4][5]', '.[3][4][5]', '[5]', "It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"]
You can match each sentence with re.findall instead of splitting:
re.findall(r'(?s)(.*?(?:\.|[.?!](?:\[\d+\])+))(?:\s+|\s*\Z)', text)
See the regex demo. Details:
- (?s) - same as re.S or re.DOTALL, makes . match across lines
- (.*?(?:\.|[.?!](?:\[\d+\])+)) - Group 1:
  - .*? - zero or more chars, as few as possible
  - (?:\.|[.?!](?:\[\d+\])+) - either a dot, or a . / ? / ! followed by one or more occurrences of a [ + digit(s) + ] substring
- (?:\s+|\s*\Z) - either one or more whitespaces, or zero or more whitespaces followed by the end of string.

See the Python demo:
import re
text = "Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character. In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2] She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5] It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"
print(re.findall(r'(.*?(?:\.|[.?!](?:\[\d+\])+))(?:\s+|\s*\Z)', text, re.DOTALL))
Output:
[
'Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.',
"In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2]",
'She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5]',
"It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"
]
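As for why the original re.split attempt returned None and duplicated fragments: when the pattern contains capturing groups, re.split also puts the text of every group into the result, and None for each group that did not take part in the match. Below is a minimal sketch on a shorter, made-up string. The sample string and the function names split_with_groups and split_heuristic are illustrative assumptions, not part of the question. The second function is a heuristic alternative, not the findall approach above: it splits on whitespace that is preceded by . or ] and followed by an uppercase letter, so it assumes every sentence starts with a capital letter.

```python
import re

# Short illustrative string (an assumption, not the text from the question).
sample = "First one. Second one.[1] Third one![2][3]"

def split_with_groups(s):
    # Same pattern as the original attempt. It has three capturing groups,
    # so re.split inserts three extra items after every split point:
    # the group values, or None when the first alternative matched.
    return re.split(r'(?<=\.) |(([.?!](\[\d+\])+))\s', s)

def split_heuristic(s):
    # Zero-width lookarounds only, no capturing groups: nothing but the
    # whitespace is consumed, so punctuation and citations stay attached.
    # Assumes a new sentence always begins with an uppercase ASCII letter.
    return re.split(r'(?<=[.\]])\s+(?=[A-Z])', s)

print(split_with_groups(sample))
# ['First one.', None, None, None, 'Second one', '.[1]', '.[1]', '[1]', 'Third one![2][3]']

print(split_heuristic(sample))
# ['First one.', 'Second one.[1]', 'Third one![2][3]']
```

The lookbehind is restricted to a single character because Python's re module only accepts fixed-width lookbehinds, which is why a variable-length run of citations such as [3][4][5] cannot be placed inside it directly.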