I am trying to clean some text for further use using Python. Below is an example input text:
have agreed in-\nsured as\nan per-\nson in\nwriting in a contract or agreement
Now I want the output to be:
have agreed insured as an person in writing in a contract or agreement
But the whitespace is causing problems. I tried two different approaches, but each one works for some words and fails for others.
These are the two approaches I have tried:
Logic 1:-
import re

v = "have agreed in-\nsured as\nan per-\nson in\nwriting in a contract or agreement"
# keep only word characters, whitespace, and . ? ! (this also strips the hyphens)
v = re.sub(r"[^\w\s.?!]", "", v)
# remove newlines
v = v.replace("\n", "")
# collapse runs of whitespace into a single space
v = re.sub(r"\s+", " ", v)
This results in the following output:
have agreed insured asan person inwriting in a contract or agreement
As you can see, the words in-\nsured and per-\nson have been cleaned properly, but as\nan and in\nwriting have not. To solve this, I tried the logic below:
Logic 2:-
v = re.sub(r"[^\w\s.?!]", "", v)
v = v.replace("\n", " ")  # this line has been changed (" " instead of "")
v = re.sub(r"\s+", " ", v)
This gave the following output:
have agreed in sured as an per son in writing in a contract or agreement
The words as\nan and in\nwriting have been cleaned, but this breaks in-\nsured and per-\nson.
How can I solve this issue?
Thanks in advance!
This seems pretty easy with simple string replacement. A hyphen followed by a newline marks a single word split across two lines, so just remove those two characters. Any remaining newlines can then be replaced by spaces. Thus:
s = 'have agreed in-\nsured as\nan per-\nson in\nwriting in a contract or agreement'
fixed = s.replace('-\n', '').replace('\n', ' ')
# 'have agreed insured as an person in writing in a contract or agreement'
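If the real text is messier than the sample (trailing spaces after the hyphen, Windows line endings, blank lines), the same two-step idea can be sketched with regular expressions. The function name join_wrapped_lines is made up for this example, and the handling of stray spaces and \r\n is an assumption about what the input might contain, not something shown in the question:

```python
import re

def join_wrapped_lines(text: str) -> str:
    """Undo hard line wrapping: rejoin hyphen-split words, then flatten newlines."""
    # Step 1: a hyphen at the end of a line (optionally followed by stray
    # spaces or tabs) is treated as a word split across lines, so drop the
    # hyphen together with the line break and any leading indentation.
    text = re.sub(r"-[ \t]*\r?\n[ \t]*", "", text)
    # Step 2: every remaining run of whitespace, newlines included,
    # becomes a single space.
    return re.sub(r"\s+", " ", text).strip()

s = "have agreed in-\nsured as\nan per-\nson in\nwriting in a contract or agreement"
print(join_wrapped_lines(s))
# have agreed insured as an person in writing in a contract or agreement
```

Note that either version also removes the hyphen from a genuinely hyphenated word that happens to wrap at its hyphen (for example well-\nknown becomes wellknown), so a dictionary check would be needed if that matters.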