
How to start a new line after each sentence that ends with a period?


I am processing a large text file in Python and want to put each sentence on its own line, so that every line contains exactly one sentence.

For example:

Input:

The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci. Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world". Numerous attempts in the 21. century to settle the debate.


Output:

The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci. 
Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world".
Numerous attempts in the 21. century to settle the debate.

I tried:

with open("new_all_data.txt", 'r') as text, open("new_all_data2.txt", "w") as new_text2:
    text_lines = text.readlines()

    for line in text_lines:

        if "." in line:

           new_lines = line.replace(".", ".\n")
           new_text2.write(new_lines)

This puts each sentence on its own line, but it also breaks the line after every other ".", not only at the end of a sentence.

For example:

The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci. 
Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world".
Numerous attempts in the 21.
century to settle the debate.

I want to keep "Numerous attempts in the 21. century to settle the debate." on one line.


Solution

  • You only need to replace the spaces that are preceded by a period and followed by a capital letter:

    import re
    
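    with open("new_all_data.txt", "r") as text, open("new_all_data2.txt", "w") as new_text2:
        for line in text:
            # Replace only the spaces that follow a period and precede a
            # capital letter; "21. century" is left untouched.
            new_line = re.sub(
                r"(?<=\.) (?=[A-Z])",
                "\n",
                line
            )
            # Write every line, including those without a period,
            # so that no text is dropped from the output.
            new_text2.write(new_line)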
    

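    This uses re.sub from the re module to perform a regex-based replacement. The pattern (?<=\.) (?=[A-Z]) matches a single space under two conditions: the lookbehind (?<=\.) requires a period immediately before the space, and the lookahead (?=[A-Z]) requires an uppercase ASCII letter immediately after it. Neither assertion consumes characters, so only the space itself is replaced.

    Combining these two conditions should be enough to replace with a newline only the spaces that separate two sentences.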
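
To check the pattern without touching any files, it can be applied to an in-memory string. The following is a minimal sketch, not part of the original answer: the helper name split_sentences and the constant SENTENCE_GAP are made up for illustration, and the sample text is abridged from the example in the question.

```python
import re

# A space that has a period right before it and an uppercase ASCII
# letter right after it. Both assertions are zero-width, so only the
# space itself is matched.
SENTENCE_GAP = re.compile(r"(?<=\.) (?=[A-Z])")


def split_sentences(text):
    """Return text with each sentence-separating space turned into a newline.

    Hypothetical helper for illustration only.
    """
    return SENTENCE_GAP.sub("\n", text)


# Sample abridged from the question's input.
sample = (
    "The Mona Lisa is a half-length portrait painting by Italian artist "
    "Leonardo da Vinci. "
    'It has been described as "the most parodied work of art in the world". '
    "Numerous attempts in the 21. century to settle the debate."
)

result = split_sentences(sample)
print(result)
```

With this sample, "21. century" stays on one line because the letter after the space is lowercase. The same rule has limits: an abbreviation followed by a capitalized word, such as "Dr. Smith", would also be split, and sentences ending in "?" or "!" are not split at all, so the pattern would need adjusting for text that contains those.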