regex · compiler-construction · lexical-analysis

Avoiding overlap with similar regex patterns during tokenization


Background

I've made a couple simple compilers before, but I've never properly addressed this issue:

Say I have a token LT which matches the expression < and a token LTEQ which matches <=. LT would then match part of <= in this case, and I don't want that.

Other potential examples are:

Before, what I've done is as follows:

This is inefficient, and it only works with regex libraries that allow a lambda to be passed in to process matches. Naturally, I want to improve on this.

The problem

How do I tokenize a string with token definitions that overlap, as shown with LT and LTEQ?

Possible solutions I've thought of but not tried:

  • Keeping a map of all used spans of characters and ignoring matches within these spans

  • Having only very simple token definitions and building more complicated patterns from the simple ones
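The first idea could be sketched roughly like this (names such as tokenDefs and tokenizeWithSpans are hypothetical, just to illustrate the span-map approach under the assumption that token patterns are tried in priority order):

```javascript
// Sketch of the "used spans" idea: run token patterns in priority order
// and discard any match that overlaps an already claimed span.
const tokenDefs = [
  { type: "LTEQ", re: /<=/g }, // higher priority: run first
  { type: "LT",   re: /</g },
];

function tokenizeWithSpans(input) {
  const claimed = new Array(input.length).fill(false);
  const tokens = [];
  for (const { type, re } of tokenDefs) {
    for (const m of input.matchAll(re)) {
      const start = m.index, end = start + m[0].length;
      // skip matches that overlap an already-used span
      if (claimed.slice(start, end).some(Boolean)) continue;
      for (let i = start; i < end; i++) claimed[i] = true;
      tokens.push({ type, text: m[0], start });
    }
  }
  // restore source order, since tokens were found per-pattern
  return tokens.sort((a, b) => a.start - b.start);
}

console.log(tokenizeWithSpans("a <= b < c")); // LTEQ at 2, then LT at 7
```

This works, but it is quadratic-ish bookkeeping compared with the single-sweep approach in the answer below it.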


Solution

  • You should try to tokenise the input in a single sweep. You would create a "monster" regex that matches any token (without regard to context unless absolutely necessary -- that would come later), giving precedence to longer tokens over shorter ones.

    Here is an example with a very basic regex, just to demonstrate the idea. It is implemented in JavaScript, and it tokenises a tiny Java-like input:

    let input = `
       for (int i = 0; i < 1e+4; i++) {
           int k = i <= n / 2 ? 9.42 : 3.14;
           System.out.println(k);
       }
    `;
    
    const regex = /[+-]?\d+(\.\d*)?(e[-+]?\d+)?|\w+|[<>=]=?|\+[+=]?|-[-=]?|\S/g;
    const tokens = input.match(regex);
    console.log(tokens);

    The final \S is the "anything else" match, ensuring that all non-space text is captured. Following this basic tokenisation, you would then typically build an abstract syntax tree from the token stream.
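    One possible next step (my own sketch, not part of the answer above) is to label each token while matching, by wrapping the same alternatives in named capture groups and driving the regex with an exec() loop; alternation order still makes <= win over <:

```javascript
// Sketch: classify tokens while matching, using named capture groups.
// The group names (num, word, op, other) are illustrative assumptions.
const regex = /(?<num>[+-]?\d+(\.\d*)?(e[-+]?\d+)?)|(?<word>\w+)|(?<op>[<>=]=?|\+[+=]?|-[-=]?)|(?<other>\S)/g;

function tokenize(input) {
  regex.lastIndex = 0; // reset shared global-regex state
  const tokens = [];
  let m;
  while ((m = regex.exec(input)) !== null) {
    // exactly one named group is defined per match: that is the token type
    const type = Object.keys(m.groups).find(k => m.groups[k] !== undefined);
    tokens.push({ type, text: m[0] });
  }
  return tokens;
}

console.log(tokenize("i <= 1e+4"));
// [ { type: 'word', text: 'i' },
//   { type: 'op',   text: '<=' },
//   { type: 'num',  text: '1e+4' } ]
```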