In bash, I can get the hexdump of the string hello as UTF-16 by doing the following:
$ echo -n "hello" | iconv -f ascii -t utf-16 | hexdump
0000000 feff 0068 0065 006c 006c 006f
000000c
I can also write a short C program like so:
int main(int argc, char **argv) {
    char *str = argv[1];
    hexDump("The string", str, 12);
    return 0;
}
using the hexDump routine from "how to get hexdump of a structure data". 12 is the number of bytes I counted from the hexdump output above.
Compile and run:
$ gcc test.c -o test
$ ./test $(echo -n hello | iconv -f ascii -t utf-16)
The string:
0000 ff fe 68 65 6c 6c 6f 00 53 53 48 5f ..hello.SSH_
Why is there a difference between the first hexstring feff 0068 0065 006c 006c 006f and the second hexstring ff fe 68 65 6c 6c 6f 00 53 53 48 5f?
I am asking this because I am trying to debug an application that uses libiconv to convert a UTF-16 string to UTF-8, and I keep getting an errno of EILSEQ, which means that libiconv has come across an "invalid multibyte sequence."
UPDATE:
If I run hexdump with -C, I get the following output:
$ echo -n hello | iconv -f ascii -t utf-16 | hexdump -C
00000000 ff fe 68 00 65 00 6c 00 6c 00 6f 00 |..h.e.l.l.o.|
0000000c
This hexstring is still different from the one my C program produces, in that it includes \x00 bytes interspersed between the ASCII characters. When I run the C program, however, there are no \x00 bytes at all; it just has the ff fe header followed by the regular ASCII characters.
The command echo -n hello | iconv -f ascii -t utf-16 | hexdump -C just pipes data directly between programs. Whatever bytes come out of iconv are taken directly as input to hexdump.
With the command ./test $(echo -n hello | iconv -f ascii -t utf-16), the shell takes the output of iconv, effectively pastes it into a new command, parses the new command, and then executes it.
So the bytes coming out of iconv are "ff fe 68 00 65 00 6c 00 6c 00 6f 00", and the shell parses them. It appears that the shell simply drops null bytes during command substitution (a C string argument could not carry them anyway, since a null byte terminates the string), so the argument your program receives consists of only the non-null bytes. Because your string is ASCII, the result is just an ASCII string preceded by a UTF-16 BOM.
We can demonstrate this using a character like U+3300 (㌀), whose UTF-16LE encoding is the bytes 00 33. If we pass this instead of an ASCII character and the above is correct, then the output will include 0x33 (the digit '3') but not the 00 byte.
./test $(echo -n ㌀ | iconv -f utf-8 -t utf-16)
My terminal happens to use UTF-8, which supports the character U+3300, so I have iconv convert from that to UTF-16. I get the output:
The string:
0000 ff fe 33 ..3
By the way, your program includes a hard-coded size for the array:
hexDump("The string", str, 12);
You really shouldn't do that. If the array is smaller than that, reading past its end is undefined behavior, and your post shows garbage printed after the real argument (the garbage appears to be the beginning of the environment variable array). There's no reason for this; just use the actual length:
hexDump("The string", str, strlen(str));