First and foremost: JSON and XML are not an option in this specific case, so please don't suggest them. If that makes it easier to accept, imagine that I intend to reinvent the wheel for self-education.
Back to the point:
I need to design a binary-safe data format to encode some datagrams I send to a particular dumb server that I am writing (in C, if that matters).
To simplify the question, let's say that I'm sending only numbers, strings and arrays.
Important fact: the server does not (and should not) know anything about Unicode. It treats all strings as opaque binary blobs (and never looks inside them).
The format that I originally devised is as follows:
    Datagram:  <Number:size>\n<Value1>...<ValueN>
    Number:    N\n<Value>\n
    String:    S\n<Number:size-in-bytes>\n<bytes>\n
    Array:     A\n<Number:size>\n<Value1>...<ValueN>
Example:
    [ 1, "foo", [] ]
Serializes as follows:
    1    ; number of items in datagram
    A    ; -- array --
    3    ; number of items in array
    N    ; -- number --
    1    ; number value
    S    ; -- string --
    3    ; string size in bytes
    foo  ; string bytes
    A    ; -- array --
    0    ; number of items in array
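For concreteness, here is a minimal sketch of what a serializer for this format might look like (serializeValue and serializeDatagram are hypothetical names, and it builds a JS string for readability; the string case is exactly where my problem shows up):

    // Hypothetical recursive serializer for the format above.
    function serializeValue(v) {
        if (typeof v === "number")
            return "N\n" + v + "\n";
        if (typeof v === "string")
            // BUG for non-ASCII input: .length counts code units, not bytes.
            return "S\n" + v.length + "\n" + v + "\n";
        if (Array.isArray(v))
            return "A\n" + v.length + "\n" + v.map(serializeValue).join("");
        throw new Error("unsupported type");
    }

    function serializeDatagram(values) {
        return values.length + "\n" + values.map(serializeValue).join("");
    }

    serializeDatagram([[1, "foo", []]]);  // "1\nA\n3\nN\n1\nS\n3\nfoo\nA\n0\n"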
The problem is that I cannot reliably get a string's size in bytes in JavaScript: String.prototype.length counts UTF-16 code units, not bytes in any particular encoding.
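To illustrate the mismatch, and two ways around it (TextEncoder assumes a reasonably modern runtime; the encodeURIComponent trick is the classic workaround for older engines):

    const s = "\u20AC";                     // "€": one UTF-16 code unit...
    s.length;                               // 1
                                            // ...but three bytes in UTF-8: E2 82 AC

    // Modern runtimes: encode explicitly and count the bytes.
    new TextEncoder().encode(s).length;     // 3

    // Classic workaround for older engines: encodeURIComponent emits the
    // UTF-8 bytes as %XX escapes, and unescape folds each escape back
    // into a single "byte-like" character.
    unescape(encodeURIComponent(s)).length; // 3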
So, the question is: how do I change the format so that a string can be both saved in JS and loaded in C neatly?
I do not want to add Unicode support to the server.
And I do not quite want to decode strings on the server (say, from Base64, or by unescaping \xNN sequences): the decoded data differs in length from the encoded form, so this would require working with dynamic string buffers, which, given how dumb the server is, is not so desirable...
Any clues?
Update: it seems that reading UTF-8 in plain C is not that scary after all, so I am extending the protocol to handle UTF-8 strings natively. (But I will still appreciate an answer to the question as it stands.)
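On the JS side, the string record could then be built roughly like this (a sketch under the same TextEncoder assumption; serializeString is just an illustrative helper that returns raw bytes instead of a JS string):

    // Hypothetical helper: encode one S record as raw UTF-8 bytes,
    // per the format above: S\n<size-in-bytes>\n<bytes>\n
    function serializeString(s) {
        const utf8 = new TextEncoder().encode(s);      // the string's UTF-8 bytes
        const header = "S\n" + utf8.length + "\n";     // byte count, not s.length
        const out = new Uint8Array(header.length + utf8.length + 1);
        out.set(new TextEncoder().encode(header), 0);  // header is pure ASCII
        out.set(utf8, header.length);
        out[out.length - 1] = 0x0A;                    // trailing \n
        return out;
    }

    serializeString("foo");  // the bytes of "S\n3\nfoo\n"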