
New line character causing words to break up during text cleaning


I am trying to clean some text for further use using Python. Below is an example input text:

have agreed in-\nsured as\nan per-\nson in\nwriting in a contract or agreement

Now I want the output to be:

have agreed insured as an person in writing in a contract or agreement

But the newlines are causing problems. I tried two different approaches to get the desired output, but each one fixes some words and breaks others.

These are the two approaches I have tried:

Logic 1:

import re

v = "have agreed in-\nsured as\nan per-\nson in\nwriting in a contract or agreement"

# remove everything except word characters, whitespace and . ? !
v = re.sub(r"[^\w\s.?!]", "", v)
# remove newlines
v = v.replace("\n", "")
# collapse runs of whitespace into a single space
v = re.sub(r"\s+", " ", v)

This results in the following output:

have agreed insured asan person inwriting in a contract or agreement

As you can see, the words in-\nsured and per-\nson have been cleaned properly, but as\nan and in\nwriting have been run together. So to solve this I tried the logic below:

Logic 2:

v = re.sub(r"[^\w\s.?!]", "", v)
v = v.replace("\n", " ")  # this line has been changed (" " instead of "")
v = re.sub(r"\s+", " ", v)

This gave the following output:

have agreed in sured as an per son in writing in a contract or agreement

The words as\nan and in\nwriting have been cleaned, but this messes up the words in-\nsured and per-\nson.

How can I solve this issue?

Thanks in advance!


Solution

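  • This is straightforward with plain string replacement. Both attempts strip the hyphens before handling the newlines, which throws away the only marker that distinguishes a split word from an ordinary line break. A hyphen followed by a newline means a single word was split across lines, so remove those two characters first. Any remaining newlines can then be replaced by spaces. Thus: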

    s = 'have agreed in-\nsured as\nan per-\nson in\nwriting in a contract or agreement'
    fixed = s.replace('-\n', '').replace('\n', ' ')
    # 'have agreed insured as an person in writing in a contract or agreement'
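
If the real input is messier than the example (for instance Windows line endings, or stray spaces around the line break), a regex variant of the same idea may be more forgiving. The following is only a sketch under those assumptions; the function name `join_wrapped_lines` is made up for illustration and does not come from the question.

```python
import re

def join_wrapped_lines(text: str) -> str:
    """Rejoin hyphen-split words, then turn remaining line breaks into spaces."""
    # A hyphen at the end of a line (optionally with spaces or tabs around
    # the break, and an optional \r) marks a word split across lines:
    # drop the hyphen together with the line break.
    text = re.sub(r"-[ \t]*\r?\n[ \t]*", "", text)
    # Every remaining run of whitespace, newlines included, becomes one space.
    return re.sub(r"\s+", " ", text).strip()

s = "have agreed in-\nsured as\nan per-\nson in\nwriting in a contract or agreement"
print(join_wrapped_lines(s))
```

Any punctuation stripping, such as the `re.sub(r"[^\w\s.?!]", "", v)` step from the question, should run after this, once the hyphen is no longer needed as a marker. Note that a genuinely hyphenated compound that happens to wrap at its hyphen (for example `well-\nknown`) will also be joined into one word by this approach, just as with the plain `replace` version.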