I have a large text file that I am processing in Python. I want to put each sentence on its own line, so that every line contains exactly one sentence.
For example:
Input:
The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci. Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world". Numerous attempts in the 21. century to settle the debate.
Output:
The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci.
Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world".
Numerous attempts in the 21. century to settle the debate.
I tried:
    with open("new_all_data.txt", 'r') as text, open("new_all_data2.txt", "w") as new_text2:
        text_lines = text.readlines()
        for line in text_lines:
            if "." in line:
                new_lines = line.replace(".", ".\n")
                new_text2.write(new_lines)
This puts each sentence on a new line; however, it also breaks the line after every ".", even when the period is not the end of a sentence.
For example:
The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci.
Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world".
Numerous attempts in the 21.
century to settle the debate.
I want to keep "Numerous attempts in the 21. century to settle the debate" on one line.
You only need to replace the spaces that come right after a period and right before a capital letter:
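    import re

    with open("new_all_data.txt", 'r') as text, open("new_all_data2.txt", "w") as new_text2:
        text_lines = text.readlines()
        for line in text_lines:
            # Replace each space that has a period before it and a capital
            # letter after it; lines without such a space are left unchanged.
            new_lines = re.sub(
                r"(?<=\.) (?=[A-Z])",
                "\n",
                line
            )
            # Write every line, so lines without a period are not dropped.
            new_text2.write(new_lines)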
I use the re module, which performs regex-based replacements with the function re.sub. Then, in each line, I search for spaces that match the following regex: (?<=\.) (?=[A-Z])
(?<=xxx) is a positive lookbehind: it makes sure that the match has xxx just before it. \. matches a literal period, so "(?<=\.) " (note the space at the end) matches only spaces that have a period right before them.
(?=xxx) is a positive lookahead: it makes sure that the match has xxx just after it. [A-Z] matches any capital letter, so " (?=[A-Z])" (note the space at the beginning) matches only spaces that have a capital letter right after them.
Combining those two conditions should be enough to replace with a newline only the spaces that sit between two sentences.
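To see the regex at work without touching any files, here is a minimal sketch that applies the same substitution to an in-memory string. The names SENTENCE_GAP and split_sentences are hypothetical helpers introduced only for this illustration; the sample text is shortened from the question.

```python
import re

# Same pattern as above: a single space with a period right before it
# and a capital letter right after it. (SENTENCE_GAP is a made-up name.)
SENTENCE_GAP = re.compile(r"(?<=\.) (?=[A-Z])")


def split_sentences(text):
    """Return text with each sentence-separating space replaced by a newline."""
    return SENTENCE_GAP.sub("\n", text)


sample = (
    "The Mona Lisa is a half-length portrait painting by Italian artist "
    "Leonardo da Vinci. Numerous attempts in the 21. century to settle the debate."
)

# "21. century" stays together because the space is followed by a lowercase "c".
print(split_sentences(sample))
```

Keep in mind that this is only a heuristic: a period followed by a space and a capitalized word inside a sentence (for instance an abbreviation in front of a name) would still be split, and sentences ending in "?" or "!" are not handled by this pattern.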