
Python replace unprintable characters except linebreak


I am trying to write a function that replaces unprintable characters with a space. It works, but it also replaces the line break \n with a space, and I cannot figure out why.

Test code:

import re

def replace_unknown_characters_with_space(input_string):
    # Replace non-printable characters (including escape sequences) with spaces
    # According to ChatGPT, \n should not be in this range
    cleaned_string = re.sub(r'[^\x20-\x7E]', ' ', input_string)

    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\x0Ais\x2028a\x2029test."
    
    print("Original String:")
    print(test_string)
    
    cleaned_string = replace_unknown_characters_with_space(test_string)
    
    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()

Output:

Original String:
This is a test string with some unprintable characters:
Hello
Thisd
is 28a 29test.

Cleaned String:
This is a test string with some unprintable characters: Hello World This is 28a 29test.

As you can see, the line break before Hello World is replaced by a space, which is not intended. I tried to get help from ChatGPT, but its regex solutions don't work.

My last resort is to loop over the string and filter the characters with Python's built-in str.isprintable() method, but this would be much slower than a regex.
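For reference, a minimal sketch of that loop-based fallback. The function name and the sample string are made up for illustration. Note that str.isprintable() is Unicode-aware, so it keeps printable non-ASCII characters such as é, which the ASCII-range regex would replace.

```python
def replace_unprintable_with_space(text: str) -> str:
    """Replace every character that str.isprintable() rejects, except "\n"."""
    # Hypothetical helper: keep newlines and printable characters as they are,
    # and turn everything else into a single space.
    return "".join(ch if ch == "\n" or ch.isprintable() else " " for ch in text)


# Illustrative sample (not the test string from the question).
sample = "Hello\x85World\rThis\nis\u2028a test."
print(repr(replace_unprintable_with_space(sample)))
# 'Hello World This\nis a test.'
```

Whether this is actually slower than the regex for a given input is best measured, for example with timeit.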


Solution

  • Modified regular expression, inspired by Carlo Arenas' answer.

    Code:

    import re
    
    def replace_unknown_characters_with_space(input_string):
        # Replace everything outside printable ASCII (\x20-\x7E), except \n, with a space.
        # No re.DOTALL flag is needed: it only changes how '.' matches, and this pattern has no '.'.
        cleaned_string = re.sub(r'[^\x20-\x7E\n]', ' ', input_string)
    
        return cleaned_string
    
    def main():
        test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\nis\x2028a\x2029test."
        
        print("Original String:")
        print(test_string)
        
        cleaned_string = replace_unknown_characters_with_space(test_string)
        
        print("\nCleaned String:")
        print(cleaned_string)
    
    if __name__ == "__main__":
        main()
    

    Output:

    Original String:
    This is a test string with some unprintable characters:
    Hello
    Thisd
    is 28a 29test.
    
    Cleaned String:
    This is a test string with some unprintable characters:
    Hello World This
    is 28a 29test.
    

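    \n is no longer replaced. It is the character \x0A, which lies outside the range \x20-\x7E, so the original negated class matched it. Adding \n inside the brackets excludes it from replacement.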
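The difference between the two patterns can be checked directly. The snippet below is a small sketch; the constant names ORIGINAL and FIXED are only illustrative. It also shows that "\x2028" in the test string is not the Unicode line separator: Python reads it as "\x20" (a space) followed by the digits "28", which is why the output contains " 28a". The line separator itself is written "\u2028".

```python
import re

ORIGINAL = re.compile(r"[^\x20-\x7E]")   # pattern from the question
FIXED = re.compile(r"[^\x20-\x7E\n]")    # pattern from the solution

# "\n" is code point 0x0A, below the 0x20 lower bound of the allowed range.
print(hex(ord("\n")))                     # 0xa
print(bool(ORIGINAL.fullmatch("\n")))     # True: treated as unprintable
print(bool(FIXED.fullmatch("\n")))        # False: explicitly allowed

# "\x2028" is three characters (space, "2", "8"); "\u2028" is one.
print(len("\x2028"), len("\u2028"))       # 3 1

# With a real line separator, the fixed pattern replaces it and keeps "\n".
print(repr(FIXED.sub(" ", "a\u2028b\nc")))  # 'a b\nc'
```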