c posix iconv

How are you supposed to know the number of non-convertible characters when calling iconv multiple times?


POSIX says that iconv calls may fail with errno = E2BIG if the output buffer is too small to contain the output, although the previous characters have been converted and the shift state is correctly set:

If the output buffer is not large enough to hold the entire converted input, conversion shall stop just prior to the input bytes that would cause the output buffer to overflow. The variable pointed to by inbuf shall be updated to point to the byte following the last byte successfully used in the conversion. The value pointed to by inbytesleft shall be decremented to reflect the number of bytes still not converted in the input buffer. The variable pointed to by outbuf shall be updated to point to the byte following the last byte of converted output data. The value pointed to by outbytesleft shall be decremented to reflect the number of bytes still available in the output buffer. For state-dependent encodings, the conversion descriptor shall be updated to reflect the shift state in effect at the end of the last successfully converted byte sequence.

...

ERRORS

The iconv() function shall fail if:

...

[E2BIG]
Input conversion stopped due to lack of space in the output buffer.

This seems to suggest that the expected way to convert a long input string is to call iconv in a loop and, for as long as you get E2BIG, either reallocate and expand the output buffer or (if you are writing to a stream) flush it and go on.

My question is about the return value: POSIX says:

RETURN VALUE

The iconv() function shall update the variables pointed to by the arguments to reflect the extent of the conversion and return the number of non-identical conversions performed. If the entire string in the input buffer is converted, the value pointed to by inbytesleft shall be 0. If the input conversion is stopped due to any conditions mentioned above, the value pointed to by inbytesleft shall be non-zero and errno shall be set to indicate the condition. If an error occurs, iconv() shall return (size_t)-1 and set errno to indicate the error.

This is not completely clear to me: in case of E2BIG, is the return value going to be (size_t)-1, or am I just supposed to check errno and inbytesleft? Checking actual implementations, e.g. the GNU one, I see that:

The conversion can stop for four reasons:

...

  1. The output buffer has no more room for the next converted character. In this case, it sets errno to E2BIG and returns (size_t) -1.

So I'd say that in a very "normal" conversion loop, it's expected not to be able to use the return value (that is meant to return the number of characters that were converted in a "lossy" way) for its regular purpose, except for the last converted chunk.

Am I correct in my understanding? So there's no way to perform a conversion with iconv that is incremental while also keeping track of non-reversibly converted characters?


Solution

  • POSIX says that iconv calls may fail with errno = E2BIG if the output buffer is too small to contain the output

    Yes, mostly.

    A part of the spec that you did not quote says that iconv() shall fail and set errno to E2BIG when it is called to place a reset sequence (one that returns the output to its initial shift state) in the output buffer and there is not enough space to hold it.
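For reference, the reset call mentioned above is the one made with a null inbuf and a non-null outbuf. A minimal sketch, in which `write_reset_sequence` is a hypothetical helper name rather than anything provided by a library:

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: asks iconv() to emit whatever byte sequence returns
 * the output to its initial shift state. Returns the number of bytes
 * written, or (size_t)-1 with errno set by iconv() (E2BIG when `outcap`
 * is too small for the sequence). */
static size_t write_reset_sequence(iconv_t cd, char *out, size_t outcap)
{
    char *outp = out;
    size_t outleft = outcap;

    /* A null output buffer would mean "reset the state, emit nothing". */
    assert(out != NULL);

    if (iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1)
        return (size_t)-1;
    return outcap - outleft;
}
```

With a stateless target encoding there is nothing to emit, so the call normally succeeds and writes zero bytes.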

    On the other hand, the first part of the specification you quoted says that conversion will stop and errno will be set under those circumstances, but that's not necessarily the same as the iconv() call failing.

    On the third hand, that circumstance is indeed documented as "The iconv() function shall fail if [...]", so on the presumption that the spec is self-consistent, yes, iconv() fails in that case.

    POSIX also says

    The value of errno should only be examined when it is indicated to be valid by a function's return value.

    So we should conclude that it would be pointless for iconv() to set any value for errno without also returning a value that indicates the errno conveys relevant information. Although it is possible for functions to do pointless things, the best and most consistent interpretation is that in those situations where iconv() provides information via errno, its return value will indicate that it has done so. For iconv(), the return value that does so is (size_t)-1.
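Under that reading, a conversion loop tests the return value first and consults errno only after seeing (size_t)-1. The following is a sketch under stated assumptions, not a definitive implementation: `convert_with_flush`, the 8-byte scratch buffer, and the `sink` parameter are all made up for illustration, and "flushing" is modelled as copying into a larger caller-supplied buffer.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a library function): converts `inlen` bytes of
 * `in` through a deliberately tiny scratch buffer and appends every drained
 * piece to `sink`. Returns the number of bytes stored in `sink`, or
 * (size_t)-1 on any failure other than E2BIG. */
static size_t convert_with_flush(iconv_t cd, const char *in, size_t inlen,
                                 char *sink, size_t sinkcap)
{
    char scratch[8];
    char *inp = (char *)in;   /* iconv() wants char **; input is not modified */
    size_t inleft = inlen;
    size_t total = 0;

    assert(in != NULL && sink != NULL);

    while (inleft > 0) {
        char *outp = scratch;
        size_t outleft = sizeof scratch;

        size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
        /* errno is meaningful only once the return value signals an error. */
        int err = (rc == (size_t)-1) ? errno : 0;

        /* "Flush": move whatever was produced out of the scratch buffer. */
        size_t produced = sizeof scratch - outleft;
        if (produced > sinkcap - total)
            return (size_t)-1;
        memcpy(sink + total, scratch, produced);
        total += produced;

        if (rc == (size_t)-1) {
            if (err != E2BIG)
                return (size_t)-1;   /* EILSEQ, EINVAL, ... */
            if (produced == 0)
                return (size_t)-1;   /* scratch cannot hold one character */
            /* E2BIG with progress: loop again with an empty scratch buffer.
             * The non-identical conversion count for this call is lost. */
        }
    }
    return total;
}
```

For a stateful target encoding, a real loop would also need a final call with a null inbuf to emit the reset sequence; that step is left out here.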

    So I'd say that in a very "normal" conversion loop, it's expected not to be able to use the return value (that is meant to return the number of characters that were converted in a "lossy" way) for its regular purpose, except for the last converted chunk.

    Yes and no. Note well that describing the return value as a count of lossy conversions comes from the GNU documentation. POSIX, however, says that the return value indicates the number of non-identical conversions performed. Such conversions are not necessarily lossy. I'm uncertain whether the GNU documentation is just worded poorly, or whether GNU iconv() diverges from POSIX in this regard.

    But whichever kind of conversions iconv()'s return value counts, yes, you get that information only when a complete input sequence is successfully converted in one call.

    So there's no way to perform a conversion with iconv that is incremental while also keeping track of non-reversibly converted characters?

    "No way" would be a little too strong, but certainly it's not easy in the general case, where the source encoding can be variable-width and / or stateful.

    If you knew how to break the input into suitable chunks, then you could feed them to iconv() one at a time, using the same conversion descriptor for each call. You can choose input chunk sizes based on the available output size, allowing for the possibility of the output being (much) larger than the input. As far as I am aware, the expansion ratio will never be greater than 4:1 for any inputs. There is a problem with manual chunking, however: given that there are variable-width encodings, you cannot reliably avoid partial code sequences at the end of your chunks without knowing and using information about the source encoding.

    You could nevertheless do the job by retrying failed conversions, using the information provided by iconv() to choose a suitable chunk size for the next attempt. But to accommodate stateful encodings, that means going all the way back to the beginning of the conversion (and therefore remembering the sizes of all the chunks successfully converted on the previous attempt) and using a fresh conversion descriptor. That may not be practical. If you were content to ignore stateful encodings, however, you could retry individual chunks upon failure without going all the way back to the beginning.
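The per-chunk retry for stateless encodings could look roughly like the sketch below. Everything named here is an assumption made for illustration: `convert_counting` is a hypothetical helper, `window` stands in for the size of whatever output buffer gets flushed between calls, and both encodings are assumed to be stateless with no byte-order mark emitted on the first call. When a call stops with E2BIG, the helper repeats it over exactly the input bytes that were consumed, so that the second call completes and reports its count.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts `in` into `out`, giving iconv() at most
 * `window` bytes of output space per call, and sums the return values of
 * the calls that complete. Assumes stateless source and target encodings.
 * Returns the bytes written, or (size_t)-1 on failure. */
static size_t convert_counting(iconv_t cd, const char *in, size_t inlen,
                               char *out, size_t outcap, size_t window,
                               size_t *nonidentical)
{
    char *inp = (char *)in;   /* iconv() wants char **; input is not modified */
    size_t inleft = inlen;
    size_t total = 0;

    assert(in != NULL && out != NULL && nonidentical != NULL);
    *nonidentical = 0;

    while (inleft > 0) {
        char *instart = inp;
        char *outstart = out + total;
        size_t room = outcap - total < window ? outcap - total : window;

        char *ip = instart, *op = outstart;
        size_t il = inleft, ol = room;
        size_t rc = iconv(cd, &ip, &il, &op, &ol);

        if (rc == (size_t)-1) {
            if (errno != E2BIG)
                return (size_t)-1;
            size_t used = (size_t)(ip - instart);
            if (used == 0)
                return (size_t)-1;   /* no room for even one character */

            /* Retry over exactly the bytes that fit, overwriting the same
             * output region. With no shift state to carry over, this call
             * can run to completion and return its count. */
            ip = instart; op = outstart;
            il = used;    ol = room;
            rc = iconv(cd, &ip, &il, &op, &ol);
            if (rc == (size_t)-1)
                return (size_t)-1;
        }

        *nonidentical += rc;
        total += room - ol;
        inleft -= (size_t)(ip - instart);
        inp = ip;
    }
    return total;
}
```

The cost is that every chunk that hits E2BIG is converted twice. Because iconv() stops just before the character that would overflow, the retried chunk always ends on a character boundary, which sidesteps the partial-sequence problem of manual chunking.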