I'm learning C and trying to understand how scanf works. I can't understand some terms: the term "input item", "initial subsequence", "matching sequence". I am reading this https://pubs.opengroup.org/onlinepubs/9699919799/functions/scanf.html. It says:
An input item shall be defined as the longest sequence of input bytes (up to any specified maximum field width, which may be measured in characters or bytes dependent on the conversion specifier) which is an initial subsequence of a matching sequence.
Assume that in format I have %d and in stdin 4.5. What is initial subsequence and what is matching sequence? And what is input item?
I thought that the input item is whatever corresponds to the specifier, i.e. for %d the corresponding symbols are digits (maybe with a + or - sign at the beginning), but then it says:
Except in the case of a % conversion specifier, the input item (or, in the case of a %n conversion specification, the count of input bytes) shall be converted to a type appropriate to the conversion character. If the input item is not a matching sequence, the execution of the conversion specification fails;
i.e. the input item may not correspond, which means it does not consist only of symbols that correspond to the specifier.
So, can you explain these terms to me? And is this the right place to look for documentation for C functions? What sites are the best places to read documentation for C functions?
A key problem this text is dealing with is that sometimes whether a sequence of characters matches the pattern for a conversion depends on characters not yet read. For example, the input text “0x3” matches the pattern for %x, but the input text “0xy” does not. At the point where we have read “0x”, “0x” by itself does not match the pattern, and we do not know whether the next character will form a matching sequence. We must read the next character to find out.
So the rule for when scanf continues reading characters cannot be “As long as the characters match the goal pattern, keep reading.” If that were the rule, we would read “0”, see that it matches a possible valid input for %x, then read “x”, see that “0x” is not a valid input for %x, and stop. That will not work, because it fails to read “0x3”, which is a valid input for %x.
The rule must be that scanf continues reading as long as the input could be a matching sequence if the coming characters complete a match. A way to say this technically is that the characters read so far are an initial subsequence of a matching sequence.
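To make the distinction concrete, here is a minimal sketch that classifies a string against a simplified %x pattern (optional sign, optional “0x” or “0X”, then one or more hexadecimal digits). The names status and classify_hex are made up for this illustration and are not part of any library; the sketch also ignores the leading white space that a real scanf would skip.

```c
#include <ctype.h>
#include <stddef.h>

enum status { NOT_PREFIX, PREFIX_ONLY, MATCH };

/* Hypothetical helper: classify s against a simplified %x pattern.
   MATCH       - s is a matching sequence
   PREFIX_ONLY - s is an initial subsequence of a matching sequence,
                 but not itself a matching sequence
   NOT_PREFIX  - no continuation of s can ever match */
enum status classify_hex(const char *s)
{
    size_t i = 0;

    if (s[i] == '+' || s[i] == '-')
        i++;                           /* optional sign */
    if (s[i] == '\0')
        return PREFIX_ONLY;            /* "" or a lone sign */
    if (s[i] == '0' && (s[i + 1] == 'x' || s[i + 1] == 'X')) {
        i += 2;                        /* optional 0x / 0X prefix */
        if (s[i] == '\0')
            return PREFIX_ONLY;        /* "0x": a hex digit must follow */
    }
    for (; s[i] != '\0'; i++)
        if (!isxdigit((unsigned char)s[i]))
            return NOT_PREFIX;         /* e.g. the 'y' in "0xy" */
    return MATCH;
}
```

With this sketch, classify_hex("0") and classify_hex("0x3") give MATCH, classify_hex("0x") gives PREFIX_ONLY, and classify_hex("0xy") gives NOT_PREFIX, which is why reading must continue past “0x” before anything can be decided.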
Assume that in format I have %d and in stdin 4.5. What is initial subsequence and what is matching sequence? And what is input item?
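A matching sequence for %d is optionally a “+” or “-”, then one or more decimal digits. Consider the matching sequences “123” and “-123”. Initial subsequences of “123” are the empty string, “1”, “12”, and “123”. Initial subsequences of “-123” are the empty string, “-”, “-1”, “-12”, and “-123”. So an initial subsequence of a matching sequence for %d is optionally a sign, then zero or more decimal digits. Note that “-” is an initial subsequence but not a matching sequence. For your input “4.5”, scanf reads “4”, then reads “.” and sees that “4.” is not an initial subsequence of any matching sequence, so it puts the “.” back. The input item is “4”, which is a matching sequence, so it is converted to the int value 4, and “.5” remains in the input.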
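One way to observe where %d stops is to follow it with %n, which stores the number of characters consumed so far. This is only a small sketch; consumed_by_d is a name invented for the illustration, and it uses sscanf on a string rather than scanf on stdin so that it is easy to try.

```c
#include <stdio.h>

/* Hypothetical helper: apply %d to input and report how many characters
   were consumed (including any skipped leading white space).
   Returns -1 if no conversion was performed. */
int consumed_by_d(const char *input, int *value)
{
    int consumed = 0;

    /* %n does not count toward the return value, so a successful
       %d conversion makes sscanf return exactly 1. */
    if (sscanf(input, "%d%n", value, &consumed) != 1)
        return -1;
    return consumed;
}
```

For the input “4.5” this returns 1 and stores 4 in *value: the input item is just “4”, and “.5” is left unread. For “abc” it returns -1, because the input item is empty and no conversion happens.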
It is more interesting to consider %e, where “3.4e-5” is a matching sequence but “3.4,” or “3.4ex” is not. Now scanf has to read two characters where it is unknown whether there will be a match. When scanf has read “3.4”, that is a matching sequence, but it has to continue as long as the characters form an initial subsequence. Next we have “3.4e”, which is no longer a matching sequence but is still an initial subsequence. Then “3.4e-” is also an initial subsequence. With “3.4e-5” we once again have a matching sequence as well as an initial subsequence. scanf must continue reading. If the next character is a space, “3.4e-5 ” is not an initial subsequence, so the space is rejected (and is “put back” into the input stream as if it had never been read), and “3.4e-5” is the input item. This input item is a matching sequence, so it is converted to a float.
Now consider reading “3.4ex”. As above, at “3.4e”, scanf must continue reading. It gets the “x” and sees that “3.4ex” is not an initial subsequence. It rejects the “x” and puts it back into the input stream. Now it is done reading, and “3.4e” is the input item. This is not a matching sequence, so it is a matching failure. scanf does not perform a conversion for this, and it returns.
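The two walk-throughs above can be simulated with a small sketch. It models only a simplified %e pattern (optional sign, decimal digits with an optional “.”, and an optional exponent) and ignores hexadecimal floats, “inf”, “nan”, and leading white space. All names here (status, classify_float, input_item_length, conversion_succeeds) are invented for the illustration; this is not how any particular scanf is implemented.

```c
#include <ctype.h>
#include <stddef.h>

enum status { NOT_PREFIX, PREFIX_ONLY, MATCH };

/* Classify the first n characters of s against the simplified pattern. */
enum status classify_float(const char *s, size_t n)
{
    size_t i = 0, digits = 0, expdigits = 0;

    if (i < n && (s[i] == '+' || s[i] == '-'))
        i++;                                        /* optional sign */
    while (i < n && isdigit((unsigned char)s[i])) { i++; digits++; }
    if (i < n && s[i] == '.') {
        i++;                                        /* optional fraction */
        while (i < n && isdigit((unsigned char)s[i])) { i++; digits++; }
    }
    if (i == n)
        return digits > 0 ? MATCH : PREFIX_ONLY;    /* "3.4" vs. "" or "-" */
    if (digits == 0 || (s[i] != 'e' && s[i] != 'E'))
        return NOT_PREFIX;                          /* e.g. "3.4," */
    i++;                                            /* the 'e' */
    if (i < n && (s[i] == '+' || s[i] == '-'))
        i++;                                        /* optional exponent sign */
    while (i < n && isdigit((unsigned char)s[i])) { i++; expdigits++; }
    if (i == n)
        return expdigits > 0 ? MATCH : PREFIX_ONLY; /* "3.4e-5" vs. "3.4e-" */
    return NOT_PREFIX;                              /* e.g. "3.4ex" */
}

/* Length of the input item: keep accepting characters while what has
   been read is still an initial subsequence of some matching sequence.
   The first character that breaks this is the one that is "put back". */
size_t input_item_length(const char *input)
{
    size_t n = 0;

    while (input[n] != '\0' && classify_float(input, n + 1) != NOT_PREFIX)
        n++;
    return n;
}

/* Nonzero if the input item is itself a matching sequence. */
int conversion_succeeds(const char *input)
{
    return classify_float(input, input_item_length(input)) == MATCH;
}
```

For “3.4e-5 ” the input item has length 6 and the conversion succeeds. For “3.4ex” the input item has length 4 (“3.4e”), which is only an initial subsequence, so the sketch reports a failure, mirroring the matching failure described above.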
Note that with “3.4ex”, if we could put back two characters instead of just one, we could revert to “3.4”, match that for %e, and convert it to float. However, the C standard does not require I/O streams to support more than one character of put-back, and scanf is specified to work with only one character of put-back. This is why scanf is specified to read until a match is impossible, then put back the non-matching character, then see if what it did read is a match. If we had more levels of put-back, scanf could read until a match is impossible, then put back all the characters needed to reduce the input to a matching sequence, and then convert that.
(Note that, as discussed in the comments, some scanf implementations do not conform to this specification of the C standard and may put back multiple characters.)
What sites are the best places to read documentation for C functions?
There is no single site that is the best place, nor any small set of sites. The C standard is the most authoritative place to read about functions in the standard C library, but understanding them is informed by computer science theory and by the history of the development of the C language. With this issue in scanf, theory about parsing, formal languages, and finite-state machines is informative about how a computer has to go about reading characters to interpret them. Those are topics of a computer science education, not something you easily get from narrowly focused web sites.