Tags: c, bash, utf-16, hexdump, libiconv

Why is the hexdump of a UTF-16 string different when it is passed as a command-line argument than when it is piped directly to hexdump?


In bash, I can get the hexdump of the string hello as UTF-16 by doing the following:

$  echo -n "hello" | iconv -f ascii -t utf-16 | hexdump
0000000 feff 0068 0065 006c 006c 006f          
000000c

I can also write a short C program like so:

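/* hexDump() comes from the linked post and must be compiled in with this file. */
int main(int argc, char **argv) {
  char *str = argv[1];

  /* 12 is hard-coded: the byte count taken from the hexdump output above. */
  hexDump("The string", str, 12);

  return 0;
}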

This uses the hexDump routine from "how to get hexdump of a structure data". The 12 is the number of bytes I counted in the hexdump output above.

Compile and run:

$ gcc test.c -o test


$ ./test $(echo -n hello | iconv -f ascii -t utf-16)
The string:
  0000  ff fe 68 65 6c 6c 6f 00 53 53 48 5f              ..hello.SSH_

Why is there a difference between the first hexstring feff 0068 0065 006c 006c 006f and the second hexstring ff fe 68 65 6c 6c 6f 00 53 53 48 5f?

I am asking this because I am trying to debug an application that uses libiconv to convert a UTF-16 string to UTF-8, and I keep getting an errno of EILSEQ, which means that libiconv has come across an "invalid multibyte sequence".

UPDATE:

If I run hexdump with -C, I get the following output:

$ echo -n hello | iconv -f ascii -t utf-16 | hexdump -C
00000000  ff fe 68 00 65 00 6c 00  6c 00 6f 00              |..h.e.l.l.o.|
0000000c

This hex string still differs from the one my C program produces: it has \x00 bytes interspersed between the ASCII characters, whereas the output of the C program has none at all. That output has just the ff fe header followed by the plain ASCII characters.


Solution

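  • The command echo -n hello | iconv -f ascii -t utf-16 | hexdump -C just pipes data directly between programs. Whatever bytes come out of iconv are taken directly as input to hexdump. (Without -C, hexdump groups the input into two-byte words and prints each word in host byte order, which is why the same bytes show up as feff 0068 0065 ... on a little-endian machine.)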

    With the command ./test $(echo -n hello | iconv -f ascii -t utf-16), the shell takes the output of iconv, and effectively pastes it into a new command, parses the new command, and then executes it.

    So the bytes coming out of iconv are ff fe 68 00 65 00 6c 00 6c 00 6f 00, and the shell has to turn them into an argument. The shell drops null bytes from the result of a command substitution. Arguments are passed to a program as null-terminated C strings, so an argument could not contain a null byte in any case. The argument your program receives is therefore just the non-null bytes. Since your string is ASCII, the result is a plain ASCII string preceded by a UTF-16 BOM.

    We can demonstrate this using a character like U+3300 (㌀). In little-endian UTF-16 it is encoded as the bytes 00 33, so if the above is correct, dropping the null byte leaves only 0x33 (the digit '3') after the BOM.

    ./test $(echo -n ㌀ | iconv -f utf-8 -t utf-16)
    

    My terminal happens to use UTF-8, which supports the character U+3300, so I have iconv convert from that to UTF-16. I get the output:

    The string:
      0000  ff fe 33                                         ..3
    

    By the way, your program passes a hard-coded size for the string:

    hexDump("The string", str, 12);
    

    You really shouldn't do that. If the argument is shorter than that, reading past its end is undefined behavior, and your post shows some garbage being printed after the real argument (the garbage appears to be the beginning of the environment variable array). There's no reason to hard-code the size. Just use the actual length (strlen is declared in <string.h>):

    hexDump("The string", str, strlen(str));