linux gedit nano

Why does every text editor write an additional byte (UTF-8)?


I'm working on Ubuntu 16.04 (Xenial Xerus). I found that text editors write an additional byte to text files (UTF-8). This caused problems for me when I tried to pass tests.

So we have a string, "Extra byte", which is 10 bytes in UTF-8. When I write it to a file with gedit, for example, I get a file of 11 bytes. nano produces the same size. Even the command echo "Extra byte" > filename produces an 11-byte file.

However, when we try something like this:

#include <fstream>

int main(){
    std::ofstream file("filename");

    file<<"Extra byte";
    return 0;
}

or this:

with open("filename_py",'w+',encoding='UTF-8') as file:
    file.write('Extra byte')

we get a file of 10 bytes. Why?


Solution

  • You are seeing a newline character (often expressed in programming languages as \n, in ASCII it is hex 0a, decimal 10):

    $ echo 'foo' > /tmp/test.txt
    $ xxd /tmp/test.txt
    00000000: 666f 6f0a                                foo.
    

    The hex-dump tool xxd shows that the file consists of 4 bytes: hex 66 (ASCII lowercase f), hex 6f twice (lowercase o), and the newline (hex 0a).

    You can use the -n command-line switch to disable adding the newline:

    $ echo -n 'foo' > /tmp/test.txt
    $ xxd /tmp/test.txt
    00000000: 666f 6f                                  foo
    

    or you can use printf instead (which is more portable, since POSIX leaves the behavior of echo -n implementation-defined):

    $ printf 'foo' > /tmp/test.txt
    $ xxd /tmp/test.txt
    00000000: 666f 6f                                  foo
    

    Also see 'echo' without newline in a shell script.

    Most text editors will also add a newline to the end of a file; how to prevent this depends on the exact editor (often you can just use delete at the end of the file before saving). There are also various command-line options to remove the newline after the fact; see How can I delete a newline if it is the last character in a file?.

    Text editors generally add a newline because they deal with text lines, and the POSIX standard defines that text lines end with a newline:

    3.206 Line
    A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

    Also see Why should text files end with a newline?
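
To tie this back to the sizes in the question, here is a minimal Python sketch that writes the same string with and without a final newline and compares the resulting file sizes. The helper names file_size_after_write and strip_final_newline are made up for this illustration, and the temporary file name is arbitrary; treat it as one way to check and fix the extra byte in a test setup, not the only one.

```python
import os
import tempfile


def file_size_after_write(text: str) -> int:
    """Write text to a temporary UTF-8 file and return its size in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")  # arbitrary name for the demo
        # newline="" disables newline translation, so "\n" is written as one byte
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        return os.path.getsize(path)


def strip_final_newline(data: bytes) -> bytes:
    """Drop a single trailing newline byte (0x0a) if there is one."""
    return data[:-1] if data.endswith(b"\n") else data


# What the C++ and Python programs in the question write: no newline
print(file_size_after_write("Extra byte"))    # 10

# What echo and most editors write: the text plus a terminating newline
print(file_size_after_write("Extra byte\n"))  # 11

# Removing the final newline after the fact gives back the 10 bytes
print(len(strip_final_newline(b"Extra byte\n")))  # 10
```

If a test compares file contents byte for byte, it is usually simpler to decide up front whether the expected output ends with a newline than to strip it afterwards.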