
Removing binary control characters from a text file


I have a text file that contains binary control characters, such as "^@" and "^M". When I try to perform string operations directly on the text file, the control characters crash the script.

Through trial and error, I discovered that the more command will strip the control characters so that I can process the file properly.

more file_with_control_characters.not_txt > file_without_control_characters.txt

Is this considered a good method, or is there a better way to remove control characters from a text file? Does more have this behavior in OSes earlier than Windows 8?


Solution

  • Certainly you do not want to simply remove all control characters. Newline and Tab characters are control characters as well, and you don't want to remove those.
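
    To make that distinction concrete outside of batch, here is a minimal Python sketch. The function name `strip_control_chars` and the sample bytes are made up for illustration, and it assumes a single-byte encoding such as ASCII or ANSI. It drops control bytes but keeps Tab, Line Feed, and Carriage Return:

```python
# Control bytes that should survive: Tab (0x09), Line Feed (0x0A),
# Carriage Return (0x0D).
KEEP = {0x09, 0x0A, 0x0D}


def strip_control_chars(data: bytes) -> bytes:
    """Drop bytes 0x00-0x1F and 0x7F (DEL), except the ones listed in KEEP."""
    return bytes(b for b in data if b in KEEP or (b >= 0x20 and b != 0x7F))


# Hypothetical sample: a NUL and a BEL byte embedded in otherwise normal text.
sample = b"col1\tcol2\x00\x07\r\nnext line\r\n"
cleaned = strip_control_chars(sample)
print(cleaned)
```

    A byte-level filter like this is only safe when each character is one byte. In a multi-byte encoding the zero bytes can be part of ordinary characters, so the file should be decoded first.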

    I'm assuming your ^M is a carriage return and your ^@ is a NUL byte. The carriage returns are not causing you problems, and MORE does not remove them. But NUL bytes can cause problems if your utility expects ASCII text files.

    Your input file is most likely UTF-16. MORE converts the UTF-16 to ANSI (extended ASCII), which effectively removes the NUL bytes. It also maps non-ASCII characters to extended ASCII byte values in the range 128-255. I believe it uses your active code page (the value reported by CHCP) to decide which characters map where, but I'm not positive.
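
    The effect of such a conversion can be sketched in Python. This is an illustration of the idea, not a description of how MORE is implemented. The function name `utf16_to_ansi`, the default code page `cp1252`, and the choice to replace unmappable characters with `?` are all assumptions of the sketch:

```python
def utf16_to_ansi(data: bytes, codepage: str = "cp1252") -> bytes:
    """Decode UTF-16 bytes (using the BOM) and re-encode in a single-byte code page."""
    text = data.decode("utf-16")
    # Assumption of this sketch: characters missing from the code page become '?'.
    return text.encode(codepage, errors="replace")


# Hypothetical sample: "cafe" with an accented e, followed by CR LF, stored as UTF-16.
sample = "caf\u00e9\r\n".encode("utf-16")
converted = utf16_to_ansi(sample)

print(b"\x00" in sample)     # the UTF-16 data contains NUL bytes
print(b"\x00" in converted)  # the single-byte result does not
print(converted)
```

    In the UTF-16 form, every ASCII character is followed or preceded by a zero byte, which is what shows up as ^@. After conversion each character is a single byte, the accented letter lands in the 128-255 range, and the carriage return is still present.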

    You should be aware of some additional issues.

    If MORE works for you, then by all means use it.

    One other option is to use TYPE, which will also convert UTF-16 to ANSI:

    type "yourFile.txt" >"newFile.txt"
    

    TYPE definitely maps non-ASCII codes based on the active code page.

    There are some differences with how TYPE converts vs. MORE