Note that locale support in C is still evolving.
The following concerns C23, and possibly versions back to C99.
This is about a minor corner case in strtod().
With locales like "turkish", should "\xDD" "NFINITY" or "\xFD" "nfinity" match "INFINITY" or "infinity"?
Unusual tolower() and toupper()
A few locales map a non-ASCII character to an ASCII one when its case is changed.
Example: in locale "turkish" (and others), tolower(0xDD) maps to 'i' and toupper(0xFD) maps to 'I'.
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

#define TURKISH_I_WITH_DOT_ABOVE 0xDD
#define TURKISH_DOTLESS_I 0xFD

int main(void) {
    // setlocale() returns a null pointer if the locale name is not supported
    const char *ptr = setlocale(LC_ALL, "turkish");
    if (ptr) {
        printf("TURKISH_I_WITH_DOT_ABOVE %3d, tolower(TURKISH_I_WITH_DOT_ABOVE):%3d, i: %3d\n",
               TURKISH_I_WITH_DOT_ABOVE, tolower(TURKISH_I_WITH_DOT_ABOVE), 'i');
        printf("TURKISH_DOTLESS_I %3d, toupper(TURKISH_DOTLESS_I) :%3d, I: %3d\n",
               TURKISH_DOTLESS_I, toupper(TURKISH_DOTLESS_I), 'I');
    }
}
Sample output:
TURKISH_I_WITH_DOT_ABOVE 221, tolower(TURKISH_I_WITH_DOT_ABOVE):105, i: 105
TURKISH_DOTLESS_I 253, toupper(TURKISH_DOTLESS_I) : 73, I: 73
Sample strtod()
strtod() and friends are case insensitive in select ways.
Upper case and lower case characters are defined in the language and per locale.
C specifies: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
C23dr (§ 5.2.1 3)
A letter is an uppercase letter or a lowercase letter as defined previously in this subclause; in this document the term does not include other characters that are letters in other alphabets. (§ 5.2.1 4)
7.4.2.7 The islower function
int islower(int c);
The islower function tests for any character that is a lowercase letter or is one of a locale-specific set of characters for which none of iscntrl, isdigit, ispunct, or isspace is true. In the "C" locale, islower returns true only for the lowercase letters (as defined in 5.2.1).
(strtod()) The expected form of the subject sequence is an optional plus or minus sign, then one of the following:
...
— INF or INFINITY, ignoring case
C23 § 7.24.1.5 3
In other than the "C" locale, additional locale-specific subject sequence forms may be accepted. C23 § 7.24.1.5 7
It appears my C library's strtod() does not ignore case per the locale and instead uses case as defined in the 5.2.1 character set specification.
#include <locale.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define TURKISH_I_WITH_DOT_ABOVE 0xDD
#define TURKISH_DOTLESS_I 0xFD

int main(void) {
    // setlocale() returns a null pointer if the locale name is not supported
    const char *ptr = setlocale(LC_ALL, "turkish");
    if (ptr) {
        char *end;
        char infinity[] = "infinity";
        double y = strtod(infinity, &end);
        printf("strtod(%s, &end) --> %g %td\n", infinity, y, end - infinity);

        // Replace the leading 'i' with the dotted capital I
        infinity[0] = (char) TURKISH_I_WITH_DOT_ABOVE;
        y = strtod(infinity, &end);
        printf("strtod(%s, &end) --> %g %td\n", infinity, y, end - infinity);

        // Replace the leading character with the dotless small i
        infinity[0] = (char) TURKISH_DOTLESS_I;
        y = strtod(infinity, &end);
        printf("strtod(%s, &end) --> %g %td\n", infinity, y, end - infinity);
    }
}
Sample output:
strtod(infinity, &end) --> inf 8
strtod(�nfinity, &end) --> 0 0
strtod(�nfinity, &end) --> 0 0
Question
What are compliant behaviors concerning "INF or INFINITY, ignoring case"?
If it should or should not use case per the locale, what citation supports that?
Perhaps either is allowed?
So it appears that if one writes a strtod() replacement, code using tolower() or toupper() risks non-compliance, as it may accept "\xFD" "nfinity".
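One way to sidestep the risk in a replacement is to fold case without consulting the locale at all. The following is only a sketch of that idea, not what any particular C library does: ascii_tolower() and match_inf_ascii() are hypothetical names, and the code assumes an ASCII-compatible execution character set where 'A'..'Z' are contiguous.

```c
#include <stddef.h>

/* Hypothetical helper: lowercase only the 26 Latin uppercase letters,
   ignoring the current locale. Assumes 'A'..'Z' are contiguous (ASCII). */
static int ascii_tolower(int c) {
    return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Hypothetical helper: returns 8 if s starts with INFINITY, 3 if it starts
   with INF only, and 0 otherwise, ignoring case for ASCII letters only. */
size_t match_inf_ascii(const char *s) {
    static const char word[] = "infinity";
    size_t i = 0;
    /* Stops at the first mismatch, so it never reads past the terminator. */
    while (i < 8 && ascii_tolower((unsigned char)s[i]) == word[i])
        i++;
    if (i == 8)
        return 8;
    return i >= 3 ? 3 : 0;
}
```

With this approach "\xFD" "nfinity" and "\xDD" "NFINITY" are rejected in every locale, which matches the behavior observed from the library above. Whether that is the only compliant behavior is exactly the open question.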
The C Standard is quite clear about the form of the string accepted by strtod() and related functions defined in <stdlib.h>:
After an optional sequence of whitespace characters and an optional sign, INF or INFINITY, ignoring case, is one of the accepted sequences.
5 A character sequence INF or INFINITY is interpreted as an infinity, if representable in the return type, else like a floating constant that is too large for the range of the return type. [...] A pointer to the final string is stored in the object pointed to by endptr, provided that endptr is not a null pointer.
6 In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.
It would make sense for a library supporting one of the Turkish locales to accept strings with dotless lowercase i and dotted uppercase I. For example, in the ISO-8859-9 locale, this would cover the strings "\xDDNFINITY", "\xDDNF", "\xFDnfinity" and "\xDDnf", among other combinations such as "\xFDnf\xFDn\xFDty".
It seems paragraph 5 is a more general statement: it does not refer to any particular locale and probably implicitly applies to the "C" locale, where I converts to i in lowercase. This seems a reasonable expectation, but you might want to submit a defect report to the committee to clarify this point.
Let's analyse this ad absurdum:
Ignoring case means one can supply a string that converts to INFINITY using toupper() for each character, which is a standard way to ignore case. In the Turkish locale using the ISO-8859-9 character set, "\xFDnf\xFDn\xFDty" is such a string. Hence we should assume strtod must convert it to INFINITY... Yet this also means that "inf" by the same method would be converted to "\xDDNF", hence match neither INF nor INFINITY, unless paragraph 6 applies and "inf" is handled explicitly as an extension.
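The "convert with toupper() and compare" test used in this argument can be sketched as below. matches_after_toupper() is a hypothetical helper written only for illustration; its result depends on the locale in effect, which is the whole point of the argument.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the first strlen(word) characters of s,
   each converted with toupper() in the current locale, equal word. */
int matches_after_toupper(const char *s, const char *word) {
    size_t n = strlen(word);
    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)s[i];
        /* Stop at the terminator so we never read past the end of s. */
        if (c == '\0' || toupper(c) != (unsigned char)word[i])
            return 0;
    }
    return 1;
}
```

In the "C" locale, matches_after_toupper("inf", "INF") returns 1. In a locale where toupper('i') is not 'I', the same call would return 0, which is the contradiction described above.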
In conclusion, in the Turkish locale, it would be quite advisable for strtod to support dotless i and dotted I, which can be generically written as:
#include <ctype.h>
#include <stddef.h>

// match INF and INFINITY, return 0 if fail, else length of word matched
int match_infinity(const char *p) {
    static const unsigned char ls[] = "infinity";
    static const unsigned char us[] = "INFINITY";
    for (size_t i = 0; i < 8; i++) {
        unsigned char c = (unsigned char)p[i];
        // accept an exact match, or a character whose case mapping in the current locale matches
        if (c != ls[i] && c != us[i] && tolower(c) != ls[i] && toupper(c) != us[i])
            return i < 3 ? 0 : 3;
    }
    return 8;
}