javascript regex block-comments

Regex to perform global match on javascript block comments


Note: the goal here is not lexical analysis, so please do not suggest lexing or parsing the code. My apologies for adding to the mess of "regex comments" questions, but the best (most-voted) answer I found is inadequate given how the result would be used in that question (though I was able to start from there), and many of the other answers I've reviewed are simply irrelevant to what I'm trying to do.

I've built a regex which works in principle as expected here.


/(?:\n|^)(?:[^'"])*?(?:'(?:[^\\\r\n]|[\\]{2}|\\')*'|"(?:[^\\\r\n]|[\\]{2}|\\")*")*?(?:[^'"])*?(\/\*(?:[\s\S]*?)\*\/)/g

The final group matches block comments well, as referenced in the SO answer mentioned above:

(\/\*(?:[\s\S]*?)\*\/)

Everything preceding the actual match is discarded, but used for the purpose of matching a valid block comment - i.e. not something found in a string literal.
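To illustrate why the prefix is needed, here is a minimal sketch of the final group used on its own. The sample string and the variable names are made up for illustration; the pattern is the block-comment group from above without the capture parentheses.

```javascript
// The bare block-comment pattern, without the string-literal prefix.
const blockComment = /\/\*[\s\S]*?\*\//g;

// Hypothetical input: one comment-like sequence inside a string literal,
// followed by one real block comment.
const sample = 'var s = "/* not a comment */"; /* real comment */';

// With the g flag, match() returns every full match (no capture groups).
const naive = sample.match(blockComment);

console.log(naive);
// [ '/* not a comment */', '/* real comment */' ]
```

The first hit sits inside a string literal, which is the false positive the prefix is meant to rule out.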

Ignore the case where a regex can look like a block comment.

Assume that the input string is linted, not free-form javascript.


But in practice, I'm getting a duplicate on the first match and no other matches.

Why? And how might it be corrected to work in practice?

Thanks in advance for your help and any trouble the question may put you through. :)

Also (in the comments section) any potential pitfalls are welcome, given the information below.

Extra information irrelevant to the direct question: the ultimate goal, as hinted in the example code, is to replace/collapse any nested or other code structures so as to focus on the variable declarations at the top of the lexical scope for a given patch of code, for the purpose of hoisting variable declarations, to generate a template for a specific use case. I know that sounds like a load, but I believe it is possible and relatively straightforward (NOT ENTIRELY WITH SIMPLE REPLACEMENT, but nonetheless). For reference to what I mean by "possible": I would prefer to collapse only regexes, block comments and inline comments (EDIT: and string literals /EDIT), then recursively collapse only variable scopes (or plain objects) in {blocks} (all of those which do not contain any nested blocks) until they are gone, then see what's left. If it seems like this won't work for any reason, please respond only in comments. Thank you!


Solution

  • This is one of those "ugh, yeah, of course!" moments.

    It is tempting to assume that exec() returns an array with one element, the matched text. It doesn't: the first element is the full match, which is fine unless there are capture groups. If there are, then in addition to result[0] being the full pattern match, result[1] will be the first capture group, result[2] the second, and so on.

    For example:

    1. (/l/g).exec("l") gives us ["l"]
    2. (/(l)/g).exec("l") gives us ["l", "l"]

    Your RE isn't so much the problem (although running the string through a stream filter that takes out block comments is probably easier to work with); the problem is the assumption that you can just call .join() on the exec() results. If you have capture groups and you have a result, join results.slice(1), or call results.shift() (equivalently, results.splice(0, 1)) before joining to get rid of the leading element, so you don't accidentally include the full match.
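As a rough sketch of how this could look in practice, the loop below calls exec() repeatedly and keeps only the first capture group. The pattern is copied verbatim from the question; the sample input and the names `source`, `comments` and `result` are assumptions made for illustration, since the asker's original code is not shown.

```javascript
// The pattern from the question, unchanged.
const pattern = /(?:\n|^)(?:[^'"])*?(?:'(?:[^\\\r\n]|[\\]{2}|\\')*'|"(?:[^\\\r\n]|[\\]{2}|\\")*")*?(?:[^'"])*?(\/\*(?:[\s\S]*?)\*\/)/g;

// Hypothetical sample input: two real block comments and one
// comment-like sequence inside a string literal.
const source = [
  'var a = 1; /* one */',
  'var s = "/* not a comment */";',
  'var b = 2; /* two */'
].join("\n");

const comments = [];
let result;

// With the g flag, each exec() call resumes at pattern.lastIndex and
// returns null once no further match is found.
while ((result = pattern.exec(source)) !== null) {
  // result[0] is the full match (discarded prefix plus comment);
  // result[1] is the block comment alone.
  comments.push(result[1]);
}

console.log(comments);
// [ '/* one */', '/* two */' ]
```

A single exec() call returns only one match, as [fullMatch, group1], so joining that array shows the comment twice and nothing else. On engines that support it, Array.from(source.matchAll(pattern), m => m[1]) should give the same list without the explicit loop.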