c language-lawyer c-preprocessor

Why does the C Standard prohibit a partial preprocessing token at the end of a source file?


I'm reading the ISO C draft standard n3096 and noticed the second sentence of the following passage, which describes translation phase 3 (§ 5.1.1.2 p1):

  3. The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.

This statement also appears in C90, and presumably every standard in between.

My question is: why does the Standard explicitly prohibit a source file from ending in a partial preprocessing token? Is this statement not made redundant by other declarations of undefined behavior?


First, I will provide my (mis)understanding.

As a working definition, because the C Standard does not mention "partial preprocessing token" in any other place, I defer to this informative footnote in the C++ Standard (n4928, § 5.2 [lex.phases]):

10) A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. ...

...where header-name is one kind of preprocessing-token, which the C Standard defines as follows (§ 6.4 p1):

preprocessing-token:

  • header-name
  • identifier
  • pp-number
  • character-constant
  • string-literal
  • punctuator
  • each universal-character-name that cannot be one of the above
  • each non-white-space character that cannot be one of the above

From this, I believe the preprocessing tokens that can be "partial" include at least header-name, character-constant, string-literal, and hexadecimal-floating-constant (a kind of pp-number that is terminated by a binary-exponent-part).
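The pp-number guess above can be tested with a quick experiment. The sketch below stringizes two hexadecimal spellings, so they are never converted to floating constants. The names `STR`, `spelled_as`, `no_exponent`, and `odd_suffix` are hypothetical and chosen only for this illustration. Under the pp-number grammar, `0x1.8` is already a complete pp-number. The binary-exponent-part matters only if the pp-number is later converted to a token, so a pp-number does not seem to need a terminating sequence the way a string literal does.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper macro: # turns the spelling of its argument's
   preprocessing tokens into a string literal, so the argument is
   never converted to a floating constant. */
#define STR(x) #x

/* Returns 1 if the stringized spelling matches the expected text. */
static int spelled_as(const char *stringized, const char *expected)
{
    assert(stringized != NULL && expected != NULL);
    return strcmp(stringized, expected) == 0;
}

/* 0x1.8 has no binary-exponent-part, so it is not a valid hexadecimal
   floating constant, yet the preprocessor accepts it. */
static const char *no_exponent(void)
{
    return STR(0x1.8);
}

/* p+ and trailing letters are all allowed inside a pp-number. */
static const char *odd_suffix(void)
{
    return STR(0x1.8p+3abc);
}
```

Calling `spelled_as(no_exponent(), "0x1.8")` from a `main` should return 1, which suggests that the spelling without an exponent is accepted during preprocessing.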

Regarding the statement prohibiting partial preprocessing tokens at the end of a source file: I assume it is not redundant and that it prohibits some case not already prohibited elsewhere. Assume that a source file ending in a partial preprocessing token is non-empty. A non-empty source file is already required to end in a new-line character that is not part of a line splice (§ 5.1.1.2 p1). So I believe the statement describes a source file that ends in such a new-line and also ends in a partial preprocessing token, which means the partial preprocessing token would contain a new-line after translation phase 2.

But preprocessing tokens cannot contain new-lines after translation phase 2 (TP 2). (/* comments can, though, which is why the prohibition of partial comments makes sense to me.)
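To make the phase 2 reasoning above concrete, here is a minimal sketch of line splicing and of the rule about how a non-empty file must end. It works on an in-memory string rather than a real file. The function names `splice_lines` and `ends_properly` are hypothetical, not anything a real compiler exposes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src to dst, deleting every backslash that is
   immediately followed by a new-line (the phase 2 line splice).
   dst must have room for strlen(src) + 1 bytes. Returns the length written. */
static size_t splice_lines(const char *src, char *dst)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == '\n') {
            src += 2;               /* drop the splice, joining the two lines */
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}

/* Returns 1 if the unspliced text is empty, or ends in a new-line that is
   not immediately preceded by a backslash. */
static int ends_properly(const char *src)
{
    size_t len = strlen(src);
    if (len == 0)
        return 1;
    if (src[len - 1] != '\n')
        return 0;
    return len == 1 || src[len - 2] != '\\';
}
```

With these helpers, a string literal split across a splice such as `"ab\` followed by `cd"` is joined into `"abcd"` on one line, so after phase 2 no string literal needs to contain a new-line.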

(If it is relevant, I am also confused about the concept of a partial preprocessing token in the first place. It is apparently not a preprocessing token, but I thought there were no invalid preprocessing tokens because of the fallback in the preprocessing-token rule, "each non-white-space character that cannot be one of the above.")


Sorry for the long question...

Many thanks,


Solution

  • Your question seems to be almost exactly the same as Defect Report 324 (https://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_324.htm), which asks:

    Assuming there is a non-empty source file legally ending with a new-line character, what are examples of such partial preprocessing tokens that could end the file? And, generally, what the partial preprocessing tokens are?


    As a working definition, because the C Standard does not mention "partial preprocessing token" in

    From DR324:

    "Partial preprocessing token" is not itself a technical term; it is merely the English Language word "partial" modifying the technical term "preprocessing token". A preprocessing token is defined by the grammar non-terminal preprocessing-token in Subclause 6.4. A partial preprocessing token is therefore just part of a preprocessing token that is not the entire preprocessing token.


    why does the Standard explicitly prohibit a source file from ending in a partial preprocessing token? Is this statement not made redundant by other declarations of undefined behavior?

    From DR324:

    The statement that "source files shall not end in a partial preprocessing token or in a partial comment" has two implications. First, a preprocessing token may not begin in one file and end in another file. Second, the last preprocessing token in a source file must be well-formed and complete. For example, the last token may not be a string literal missing the close quote.


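    Overall, I think the rule is aimed at cases like the following, where a string literal begins in one file and would have to end in another: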

    // string.h
    "am I a string
    
    // main.c
    #include <string.h>\
    literal?"
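As a rough illustration of the second implication quoted from DR324, the sketch below scans the text of one source file and reports whether it ends inside an unclosed comment or an unclosed string or character literal. It is a simplified assumption-laden scanner, not how any real compiler tokenizes. It ignores header-names, and it treats a quote as staying open across new-lines, which the standard's grammar does not allow. The names `end_state` and `classify_end` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum end_state { END_COMPLETE, END_PARTIAL_COMMENT, END_PARTIAL_LITERAL };

/* Hypothetical helper: reports how the text of a single file ends.
   Assumes line splices have already been removed. */
static enum end_state classify_end(const char *text)
{
    enum { CODE, BLOCK_COMMENT, LINE_COMMENT, LITERAL } state = CODE;
    char quote = '\0';              /* which quote opened the current literal */

    assert(text != NULL);
    for (const char *p = text; *p != '\0'; ++p) {
        switch (state) {
        case CODE:
            if (p[0] == '/' && p[1] == '*') { state = BLOCK_COMMENT; ++p; }
            else if (p[0] == '/' && p[1] == '/') { state = LINE_COMMENT; ++p; }
            else if (*p == '"' || *p == '\'') { state = LITERAL; quote = *p; }
            break;
        case BLOCK_COMMENT:
            if (p[0] == '*' && p[1] == '/') { state = CODE; ++p; }
            break;
        case LINE_COMMENT:
            if (*p == '\n') state = CODE;
            break;
        case LITERAL:
            if (*p == '\\' && p[1] != '\0') ++p;    /* skip escaped character */
            else if (*p == quote) state = CODE;
            break;
        }
    }
    if (state == BLOCK_COMMENT) return END_PARTIAL_COMMENT;
    if (state == LITERAL) return END_PARTIAL_LITERAL;
    return END_COMPLETE;
}
```

Applied to the contents of the `string.h` example above, `classify_end("\"am I a string\n")` returns `END_PARTIAL_LITERAL`. The scanner only ever sees one file at a time, which mirrors the point that a token may not begin in one file and end in another.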