javalexical

Why is the ASCII SUB character (\u001a) ignored in Java?


In the Java Language Specification, I read that:

As a special concession for compatibility with certain operating systems, the ASCII SUB character (\u001a, or control-Z) is ignored if it is the last character in the escaped input stream.

I don't understand what the SUB character is, or why it should be ignored when it is the last character in the escaped input stream.

Can anyone help me understand? Thank you very much.


Solution

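  • The Ctrl+Z control code is kinda special in Windows, which inherited it from DOS, which in turn inherited it from CP/M. CP/M only tracked file sizes in whole 128-byte records, so text files used Ctrl+Z as an end-of-text marker. That is loosely similar to how Ctrl+D signals end of input on a Unix terminal.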

    It was included as a non-printable character in Unicode to match the existing ASCII character 0x1A.

    Some text editors and tools still support this convention, or can be configured to insert this character at the end of a file when editing. On Windows, for example, the C runtime still treats Ctrl+Z as end-of-file when reading a file in text mode.

    See https://en.wikipedia.org/wiki/Substitute_character

    The "escaped input stream" is the spec's name for the source text after Unicode escapes (\uXXXX) have been translated into the characters they stand for. A SUB character serves no purpose in Java source code, so ignoring it everywhere would arguably be harmless too. The spec keeps the rule narrow, though: SUB is only ignored if it is the very last character of that stream.

    So if you put a Ctrl-Z on its own in the middle of your source code, e.g. between two statements, you will get a compiler error. But if you write your code in some ancient text editor that puts a Ctrl-Z at the end of the file, the compiler will silently ignore it for you.
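
To make the rule concrete, here is a minimal sketch of the "ignore SUB only if it is the last character" behaviour. This is not how javac is actually implemented. The class `TrailingSub` and the helper `stripTrailingSub` are hypothetical names made up for illustration, and the input is assumed to already have its Unicode escapes translated.

```java
public class TrailingSub {
    // ASCII SUB (Ctrl-Z, U+001A), written as a numeric constant
    // so that this source file contains no Unicode escape.
    static final char SUB = 0x1A;

    // Hypothetical helper: drop SUB only when it is the very last character.
    // A SUB anywhere else is left untouched for later stages to reject.
    static String stripTrailingSub(String source) {
        if (!source.isEmpty() && source.charAt(source.length() - 1) == SUB) {
            return source.substring(0, source.length() - 1);
        }
        return source;
    }

    public static void main(String[] args) {
        String atEnd = "class A {}" + SUB;        // as an old editor might save it
        String inMiddle = "class A {" + SUB + "}"; // SUB is not the last character

        // Trailing SUB is removed.
        System.out.println(stripTrailingSub(atEnd).equals("class A {}")); // true
        // SUB in the middle is kept as-is.
        System.out.println(stripTrailingSub(inMiddle).equals(inMiddle));  // true
    }
}
```

Only the final character is ever inspected, which mirrors how narrow the concession in the spec is.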