c++ regex comments

Why is commenting out multiline comments in c++ inconsistent?


So we know that

// This doesn't affect anything

/*
This doesn't affect anything either
*/

/*
/* /* /*
This doesn't affect anything
*/
This does because comments aren't recursive

/* /*
This doesn't affect anything
*/ */
This throws an error because the second * / is unmatched since comments aren't recursive

I've heard that the reason they aren't recursive is that nesting would slow down the compiler, and I guess that makes sense. However, nowadays when I'm parsing C++ code in a higher-level language (say Python), I can simply use the regular expression

"\/[\/]+((?![\n])[\s\S])*\r*\n"

to match // single line comments, and use

"\/\*((?!\*\/)[\s\S])*\*\/"

to match /* multiline comments */, then loop through all single line comments, remove them, then loop through all multi-line comments and remove them. Or vice versa. But that's where I'm stuck. It seems that doing one or the other isn't sufficient, because:

// /*
An error is thrown because the /* is ignored
*/

/*
This doesn't affect things because of mysterious reasons
// */

and

/*
This throws an error because the second * / is unmatched
// */ */

What is the reason for this behavior? Is it also an artifact of the way compilers parse things? To be clear, I don't want to change the behavior of C++; I would just like to know the reasoning behind the second set of examples behaving the way they do.
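To make the problem concrete, here is a minimal sketch of the two-pass approach described above. It uses simplified patterns rather than the exact expressions quoted earlier, and the names `LINE`, `BLOCK`, `strip_line_then_block`, and `strip_block_then_line` are made up for this illustration. It shows that neither removal order handles both of the mixed cases.

```python
import re

# Simplified patterns (assumptions for this sketch, not the exact ones above).
LINE = re.compile(r"//[^\n]*")         # "//" up to, but not including, the newline
BLOCK = re.compile(r"/\*[\s\S]*?\*/")  # "/*" up to the first "*/"

def strip_line_then_block(src):
    """Remove all // comments first, then all /* ... */ comments."""
    return BLOCK.sub("", LINE.sub("", src))

def strip_block_then_line(src):
    """Remove all /* ... */ comments first, then all // comments."""
    return LINE.sub("", BLOCK.sub("", src))

# Case 1: the closing */ sits behind a // but is inside a block comment.
hidden_close = "/*\ncomment\n// */\nint x;\n"
# Case 2: an opening /* sits behind a // outside any block comment.
hidden_open = "// /*\nint y;\n/* real */\n"

# Line-first deletes the "*/" in case 1, so the "/*" is left dangling.
print(repr(strip_line_then_block(hidden_close)))
# Block-first treats the "/*" in case 2 as real and swallows "int y;".
print(repr(strip_block_then_line(hidden_open)))
```

Each order gets one case right and the other wrong, which suggests the two kinds of comment cannot be removed in independent passes.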

Edit:

So yes, to be more explicit, my question is why the following three (seemingly reasonable) ways of explaining this behavior don't work:

  1. Simply ignore all characters on a line after //, regardless of whether they are /* or */, even if you are in a multiline comment.

  2. Allow a /* or */ followed by a // to still have effect.

  3. Both of the above.

I understand why nested comments aren't allowed, because they would require a stack and arbitrarily high amounts of memory. But these three cases would not.

Edit again:

If anyone is interested, here is Python code that extracts the comments from a C/C++ file following the comment rules discussed here (it does not account for comment markers that appear inside string literals):

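import re
# Patterns are tried in order at each position, so whichever comment opener appears first in the text wins.
commentScanner = re.Scanner([
  # "//" comment: runs to the end of the line (the newline is included if present).
  (r"\/[\/]+((?![\n])[\s\S])*\r*(\n{1})?", lambda scanner, token: ("//", token)),
  # "/* ... */" comment: runs to the first "*/"; there is no nesting.
  (r"\/\*((?!\*\/)[\s\S])*\*\/", lambda scanner, token: ("/* ... */", token)),
  # Any other character: consumed and discarded (returning None emits no token).
  (r"[\s\S]", lambda scanner, token: None)
])
# scan() returns (list of tokens, unmatched remainder); backslashes are doubled so the test string has no invalid escapes.
tokens, remainder = commentScanner.scan("fds a45fsa//kjl fds4325lkjfa/*jfds/\nk\\lj\\/*4532jlfds5342a  l/*a/*b/*c\n//fdsafa\n\r\n/*jfd//a*/fd// fs54fdsa3\r\r//\r/*\r\n2a\n\n\nois")
print(tokens)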

Solution

  • It's not inconsistent. The existing behaviour is both easy to specify and easy to implement, and your compiler is implementing it correctly. See [lex.comment] in the standard.

    The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates with the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. [ Note: The comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment. — end note ]

    As you can see, // can be used to comment out both /* and */. It's just that comments don't nest, so if the // is already inside a /*, then the // has no effect at all.
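    The rule quoted above amounts to a single left-to-right pass: whichever comment opener is reached first decides how the following characters are read. Below is a minimal Python sketch of that idea. The function name `strip_comments` is made up for this illustration, and the sketch deliberately ignores string and character literals and backslash line continuations, which a real lexer must also handle.

```python
def strip_comments(src):
    """Remove C++ comments in one left-to-right pass (simplified sketch)."""
    out = []
    i, n = 0, len(src)
    while i < n:
        two = src[i:i + 2]
        if two == "//":
            # Everything up to the newline is comment text, including /* and */.
            end = src.find("\n", i)
            i = n if end == -1 else end  # keep the newline itself
        elif two == "/*":
            # Everything up to the first */ is comment text, including // and /*.
            end = src.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated /* comment")
            out.append(" ")  # a comment is replaced by one space
            i = end + 2
        else:
            out.append(src[i])
            i += 1
    return "".join(out)

# The mixed cases from the question:
print(repr(strip_comments("// /*\nint y;\n")))              # the /* is inert
print(repr(strip_comments("/*\ncomment\n// */\nint x;\n")))  # the // is inert
print(repr(strip_comments("/*\nc\n// */ */\n")))             # a stray */ survives
```

    The only state carried between characters is the current position, so no stack or counter is needed. In the last case the leftover `*/` is ordinary source text, which is why the compiler reports an error there.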