Automatic Semicolon Insertion in JavaScript without parsing


I'm writing a JavaScript preprocessor which automatically inserts semicolons in places where they're necessary. Don't ask why.

Now, I know that the general way to tackle this problem is to write a JavaScript parser and add semicolons where necessary according to the rules in the specification. However, I don't want to do so for the following reasons:

  1. I don't want to write a full-fledged parser.
  2. I want to preserve comments and whitespace.

I've already (correctly) implemented the second and third rules for automatic semicolon insertion using a simple scanner.

The first rule, however, is proving to be more of a challenge to implement. So I have three questions:

  1. Is it possible to implement the first rule using a simple scanner with lookaheads and lookbehinds?
  2. If it's possible, has someone already done it?
  3. If not, how should I tackle this problem?

For the sake of completeness here are the three rules:

  • When, as the program is parsed from left to right, a token (called the offending token) is encountered that is not allowed by any production of the grammar, then a semicolon is automatically inserted before the offending token if one or more of the following conditions is true:

    1. The offending token is separated from the previous token by at least one LineTerminator.

    2. The offending token is }.

  • When, as the program is parsed from left to right, the end of the input stream of tokens is encountered and the parser is unable to parse the input token stream as a single complete ECMAScript Program, then a semicolon is automatically inserted at the end of the input stream.

  • When, as the program is parsed from left to right, a token is encountered that is allowed by some production of the grammar, but the production is a restricted production and the token would be the first token for a terminal or nonterminal immediately following the annotation "[no LineTerminator here]" within the restricted production (and therefore such a token is called a restricted token), and the restricted token is separated from the previous token by at least one LineTerminator, then a semicolon is automatically inserted before the restricted token.

However, there is an additional overriding condition on the preceding rules: a semicolon is never inserted automatically if the semicolon would then be parsed as an empty statement or if that semicolon would become one of the two semicolons in the header of a for statement (section 12.6.3).
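
As a small illustration of how these rules play out in practice, here is a sketch with three cases. The function names are made up for this example; the behavior shown follows from the rules quoted above.

```javascript
// Third rule (restricted production): `return` may not be followed by a
// LineTerminator, so a semicolon is inserted right after `return`.
function restricted() {
  return
  42;
}

// First rule does NOT fire after `f`: `(` is allowed there by the grammar,
// so the two lines parse as the single call `f(1 + 2)`.
function noInsertion() {
  const f = (x) => x * 10;
  const result = f
  (1 + 2)
  return result;
}

// First rule fires: `let` and `return` are not allowed after a complete
// expression, and each is separated from it by a LineTerminator.
function inserted() {
  let a = 1
  let b = 2
  return a + b
}

console.log(restricted());  // undefined
console.log(noInsertion()); // 30
console.log(inserted());    // 3
```

The second case is the one that makes the first rule hard for a scanner: whether a token is "offending" depends on what the grammar allows at that point, not on the line break alone.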


Solution

  • There is no way to achieve what you want with a scanner (tokenizer) alone. This is because to answer "do we need a semicolon here?" you need to answer "Is the next token an offending token?" and to answer this, you need a JavaScript grammar because an offending token is defined as something that the grammar doesn't allow at this place.

    I had some success with creating a list of all tokens and then processing that list in a second step (so I would have some context). With this approach, you can fix some places with simple checks on neighboring tokens, for example whether a keyword that starts a line is preceded by a semicolon.

    This approach works because mistakes aren't random. People always make the same mistakes. Most of the time, people forget the ; at the end of a line, and looking for a missing ; before a keyword is a good way to locate those places.

    But this approach will only ever get you so far. If you must find all missing semicolons reliably, you must write a JavaScript parser (or reuse an existing one).
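
A minimal sketch of the two-pass heuristic described above might look like the following. It is not the code from the original answer. The tokenizer, the token shape `{ type, value }`, the keyword lists, and the names `tokenize`, `insertSemicolons`, and `addSemicolons` are all assumptions made for illustration. The tokenizer is deliberately naive (no regex literals, no template literals), and the check is conservative: it skips a preceding `)`, since `if (x)` followed by a line break and `return` must not get a semicolon.

```javascript
// Keywords that can only start a statement (assumed list, not exhaustive).
const STATEMENT_KEYWORDS = new Set([
  "var", "let", "const", "if", "for", "while", "do", "return",
  "break", "continue", "throw", "try", "switch", "function",
]);
// Other keywords, so they are not mistaken for identifiers.
const KEYWORDS = new Set([
  ...STATEMENT_KEYWORDS, "else", "case", "default", "new", "typeof", "delete",
  "void", "in", "of", "instanceof", "yield", "await", "catch", "finally",
]);

// One alternative per token type; the last one matches any single character,
// so joining all token values reproduces the source exactly.
const TOKEN_RE = /(\/\/[^\n]*|\/\*[\s\S]*?\*\/)|(\r\n|\n|\r)|([ \t]+)|("(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*')|(\d+(?:\.\d+)?)|([A-Za-z_$][\w$]*)|([\s\S])/g;
const TYPES = ["comment", "newline", "space", "string", "number", "word", "punct"];
const TRIVIA = new Set(["comment", "newline", "space"]);

// First pass: build the full token list, keeping comments and whitespace.
function tokenize(source) {
  const tokens = [];
  for (const m of source.matchAll(TOKEN_RE)) {
    let type = TYPES[m.slice(1).findIndex((g) => g !== undefined)];
    if (type === "word") type = KEYWORDS.has(m[0]) ? "keyword" : "ident";
    tokens.push({ type, value: m[0] });
  }
  return tokens;
}

// Tokens after which a statement can plausibly end. `)` and `}` are left out
// on purpose because they are ambiguous without a grammar.
function canEndStatement(tok) {
  return ["ident", "number", "string"].includes(tok.type) || tok.value === "]";
}

// Second pass: insert `;` after the previous significant token when a
// statement keyword starts a new line.
function insertSemicolons(tokens) {
  const out = [];
  let lastSignificant = -1; // index in `out` of the last non-trivia token
  let sawNewline = false;
  for (const tok of tokens) {
    if (TRIVIA.has(tok.type)) {
      if (tok.type === "newline" || tok.value.includes("\n")) sawNewline = true;
      out.push(tok);
      continue;
    }
    const prev = lastSignificant >= 0 ? out[lastSignificant] : null;
    if (prev && sawNewline && STATEMENT_KEYWORDS.has(tok.value) &&
        tok.type === "keyword" && canEndStatement(prev)) {
      out.splice(lastSignificant + 1, 0, { type: "punct", value: ";" });
    }
    out.push(tok);
    lastSignificant = out.length - 1;
    sawNewline = false;
  }
  return out;
}

function addSemicolons(source) {
  return insertSemicolons(tokenize(source)).map((t) => t.value).join("");
}

console.log(addSemicolons("var a = 1\nvar b = a // keep me\nif (b) a = 2\n"));
// var a = 1;
// var b = a; // keep me
// if (b) a = 2
```

Note that the last line still lacks its semicolon, and that a case such as `a = b` followed by `(c)` on the next line is not detected at all. That is exactly the limit described above: anything beyond these common patterns needs a real parser.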