[SOLVED] Removing binary control characters from a text file

I have a text file that contains binary control characters, such as "^@" and "^M". When I try to perform string operations directly on the text file, the control characters crash the script.

Through trial and error, I discovered that the more command will strip the control characters so that I can process the file properly.

more file_with_control_characters.not_txt > file_without_control_characters.txt

Is this considered a good method, or is there a better way to remove control characters from a text file? Does more behave this way in Windows versions earlier than Windows 8?

Solution

Certainly you do not want to simply remove all control characters. Newline and Tab characters are control characters as well, and you don't want to remove those.

I'm assuming your ^M is a carriage return, and ^@ is a NULL byte. The carriage returns are not causing you problems, and MORE does not remove them. But NULL bytes can cause problems if your utility is expecting ASCII text files.
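To illustrate the distinction, here is a minimal Python sketch that deletes control bytes but keeps Tab, line feed, and carriage return. The function name strip_control_bytes is hypothetical, and the sketch assumes a single-byte (ASCII/ANSI) file. It is not what MORE does.

```python
# Control bytes to delete: everything below 0x20 except Tab (0x09),
# line feed (0x0A) and carriage return (0x0D), plus DEL (0x7F).
KEEP = {0x09, 0x0A, 0x0D}
DELETE = bytes(b for b in list(range(0x20)) + [0x7F] if b not in KEEP)


def strip_control_bytes(data: bytes) -> bytes:
    """Return data with unwanted control bytes removed (hypothetical helper)."""
    # bytes.translate with table=None only deletes the listed bytes.
    return data.translate(None, DELETE)
```

Calling strip_control_bytes(b"a\x00b\tc\r\n") leaves the Tab and the line ending in place and drops only the NULL byte.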

Your input file is most likely UTF-16. MORE is converting the UTF-16 into ANSI (extended ASCII) format, which does effectively remove the NULL bytes. It also converts non-ASCII values into extended ASCII characters in the decimal 128 - 255 byte value range. I believe it uses your active code page (CHCP) value to figure out what characters map where, but I'm not positive.
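The conversion described above can be sketched in Python to show where the NULL bytes come from and why they disappear. This is an illustration only, not MORE's actual implementation. The function name utf16_to_ansi is hypothetical, and cp1252 is an assumed stand-in for whatever your active code page is.

```python
import codecs


def utf16_to_ansi(raw: bytes, codepage: str = "cp1252") -> bytes:
    """Decode UTF-16 bytes and re-encode them in a single-byte code page."""
    # The "utf-16" codec reads the BOM to pick the byte order and drops it.
    # Without a BOM it falls back to the machine's native byte order.
    text = raw.decode("utf-16")
    # Characters the code page cannot represent become "?".
    return text.encode(codepage, errors="replace")


# In UTF-16, every ASCII character carries a 0x00 byte next to it.
sample = codecs.BOM_UTF16_LE + "Hi \u00e9\r\n".encode("utf-16-le")
converted = utf16_to_ansi(sample)
```

Here sample contains a NULL byte after each character, while converted contains none, and the non-ASCII character ends up as a single byte in the 128 - 255 range.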

You should be aware of some additional issues.

MORE will convert all Tab characters into a series of spaces, and you cannot control how many spaces (it varies depending on the current position in the line).
MORE will always terminate each line with \r\n (carriage return and line feed).
MORE also removes the two-byte BOM at the beginning of the file, if it exists. The BOM indicates the UTF-16 format, but MORE does not require it; it will convert the UTF-16 to ANSI regardless.
Lastly, MORE can hang indefinitely if your file exceeds 64K lines.
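To see why the number of spaces replacing a Tab varies, here is a small Python sketch of tab-stop expansion. The 8-column tab stop is my assumption for illustration, and spaces_for_tab is a hypothetical helper name.

```python
TAB_WIDTH = 8  # assumed tab stop spacing, for illustration only


def spaces_for_tab(column: int, tab_width: int = TAB_WIDTH) -> int:
    """Number of spaces needed to reach the next tab stop from a 0-based column."""
    return tab_width - (column % tab_width)


# str.expandtabs applies the same rule to a whole line.
short_line = "a\tb".expandtabs(TAB_WIDTH)    # Tab at column 1 becomes 7 spaces
long_line = "abcd\tb".expandtabs(TAB_WIDTH)  # Tab at column 4 becomes 4 spaces
```

The same Tab character becomes 7 spaces in one line and 4 in the other, so the original Tabs cannot be reliably restored afterwards.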

If MORE works for you, then by all means use it.

One other option is to use TYPE, which will also convert UTF-16 to ANSI:

type "yourFile.txt" >"newFile.txt"

TYPE definitely maps non-ASCII codes based on the active code page.

There are some differences in how TYPE converts compared to MORE:

One advantage of TYPE is it does not convert Tab characters to spaces.
Another advantage is it will not hang with large files.
Another difference (maybe good, maybe bad) is it will not add a line terminator to a line that does not already have one.
A potential disadvantage of TYPE is it will not convert UTF-16 to ANSI if the input is missing the BOM.
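If neither command fits (for example, the file has no BOM and you also need to keep Tabs), a short script is another option. The following is a minimal Python sketch, not a drop-in replacement for either command. The names decode_utf16 and convert_file are hypothetical, and it assumes a BOM-less file is little-endian UTF-16 and that cp1252 matches your active code page.

```python
import codecs


def decode_utf16(raw: bytes, default_order: str = "utf-16-le") -> str:
    """Decode UTF-16 bytes, using the BOM when present."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be")
    # No BOM: fall back to an assumed byte order.
    return raw.decode(default_order)


def convert_file(src: str, dst: str, codepage: str = "cp1252") -> None:
    """Rewrite a UTF-16 file as single-byte text in the given code page."""
    # Binary mode keeps Tabs and line terminators exactly as they are.
    with open(src, "rb") as f:
        text = decode_utf16(f.read())
    with open(dst, "wb") as f:
        # Characters missing from the code page are written as "?".
        f.write(text.encode(codepage, errors="replace"))
```

A call such as convert_file("yourFile.txt", "newFile.txt") reads the whole file into memory, which is fine for typical log or data files but worth keeping in mind for very large ones.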