I am trying to write a function that replaces unprintable characters with a space. It works well, except that it also replaces the line break \n with a space, and I cannot figure out why.
Test code:
import re


def replace_unknown_characters_with_space(input_string):
    # Replace non-printable characters (including escape sequences) with spaces
    # According to ChatGPT, \n should not be in this range
    cleaned_string = re.sub(r'[^\x20-\x7E]', ' ', input_string)
    return cleaned_string


def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\x0Ais\x2028a\x2029test."
    print("Original String:")
    print(test_string)
    cleaned_string = replace_unknown_characters_with_space(test_string)
    print("\nCleaned String:")
    print(cleaned_string)


if __name__ == "__main__":
    main()
Output:
Original String:
This is a test string with some unprintable characters:
Hello
Thisd
is 28a 29test.
Cleaned String:
This is a test string with some unprintable characters: Hello World This is 28a 29test.
As you can see, the line break before Hello World is replaced by a space, which is not intended. I tried to get help from ChatGPT, but its regex solutions don't work.
My last resort is to loop over the characters and filter them with Python's built-in str.isprintable() method, but that will be much slower than a regex.
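For comparison, here is a minimal sketch of what that isprintable() fallback could look like. The function name and the `keep` parameter are made up for illustration. Two caveats: str.isprintable() returns False for \n, so the newline has to be allowed explicitly, and it treats printable non-ASCII characters such as é as printable, so the result is not identical to the ASCII-only regex.

```python
def replace_unprintable_with_space(text, keep="\n"):
    """Replace every character that is neither printable nor listed in `keep` with a space.

    Hypothetical helper for illustration; `keep` defaults to preserving newlines.
    """
    # "\n".isprintable() is False, so newline must be whitelisted via `keep`.
    return "".join(ch if ch.isprintable() or ch in keep else " " for ch in text)


if __name__ == "__main__":
    sample = "Hello\x85World\x0DThis\nis"
    print(repr(replace_unprintable_with_space(sample)))
```

Whether this is actually slower than the regex for a given input size is something I would measure (for example with timeit) rather than assume.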
Modified regular expression, inspired by Carlo Arenas' answer.
Code:
import re


def replace_unknown_characters_with_space(input_string):
    # Replace all non printable ascii, excluding \n from the expression
    # Note: re.DOTALL only changes how '.' matches, so it has no effect on this pattern
    cleaned_string = re.sub(r'[^\x20-\x7E\n]', ' ', input_string, flags=re.DOTALL)
    return cleaned_string


def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\nis\x2028a\x2029test."
    print("Original String:")
    print(test_string)
    cleaned_string = replace_unknown_characters_with_space(test_string)
    print("\nCleaned String:")
    print(cleaned_string)


if __name__ == "__main__":
    main()
Output:
Original String:
This is a test string with some unprintable characters:
Hello
Thisd
is 28a 29test.
Cleaned String:
This is a test string with some unprintable characters:
Hello World This
is 28a 29test.
\n is no longer replaced.
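A quick check of why adding \n to the character class makes the difference: \n is the character \x0A, which lies below \x20, so the original negated class [^\x20-\x7E] matches it like any other control character. The names `ORIGINAL` and `MODIFIED` below are just labels for this sketch.

```python
import re

# The two character classes from the code above, compiled for comparison.
ORIGINAL = re.compile(r'[^\x20-\x7E]')
MODIFIED = re.compile(r'[^\x20-\x7E\n]')

# "\n" has code point 0x0A, which is outside the range 0x20-0x7E.
print(hex(ord("\n")))

# The original class matches "\n" (so it gets replaced); the modified one does not.
print(ORIGINAL.fullmatch("\n") is not None)
print(MODIFIED.fullmatch("\n") is not None)

# Other control characters such as "\r" are still matched by the modified class.
print(MODIFIED.fullmatch("\r") is not None)
```

The same approach should extend to other characters worth keeping, for example adding \t inside the brackets to preserve tabs.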