I've made a couple of simple compilers before, but I've never properly addressed this issue: say I have a token LT which matches the expression <, and a token LTEQ which matches <=. An LT would match part of <= in this case, and I don't want this.
Other potential examples are:

=== vs == vs =
=> vs > vs = vs >=
Before, what I've done is to match the larger tokens first and replace each match (recording it via a lambda), so that the smaller tokens can no longer match inside it, e.g. for ||: "true || false" => "true  false". This is inefficient and only works in regex libraries that allow a lambda to be passed in to process matches. Naturally I want to improve this.
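A minimal sketch of that idea, simplified to just LT and LTEQ, using String.replace's callback as the lambda (the structure here is illustrative, not my exact code; blanking with spaces keeps offsets stable):

const defs = [
    ["LTEQ", /<=/g],  // larger tokens first...
    ["LT", /</g],     // ...so smaller ones can't match inside them
];

function tokenize(input) {
    const tokens = [];
    let text = input;
    for (const [type, re] of defs) {
        // Record each match, then blank out its span so later
        // (smaller) patterns cannot re-match part of it.
        text = text.replace(re, (lexeme, pos) => {
            tokens.push({ type, value: lexeme, pos });
            return " ".repeat(lexeme.length);
        });
    }
    // Matches were collected pattern by pattern, so restore input order.
    return tokens.sort((a, b) => a.pos - b.pos);
}

console.log(tokenize("a <= b < c"));
// → [ { type: "LTEQ", value: "<=", pos: 2 }, { type: "LT", value: "<", pos: 7 } ]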
How do I tokenize a string with token definitions that overlap, as shown with LT and LTEQ?
Possible solutions I've thought of but not tried:

- Keeping a map of all used spans of characters and ignoring matches within these spans (a sketch follows after this list)
- Only having very simple token definitions and creating more complicated patterns from the simple ones: LTEQ (<=) would be created from LT (<) and EQ (=)
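For the first idea, a rough sketch of the span map I have in mind (hypothetical code, just to illustrate):

const defs = [
    ["LTEQ", /<=/g],  // largest tokens first
    ["LT", /</g],
    ["EQ", /=/g],
];

function tokenize(input) {
    const claimed = [];  // [start, end) spans already taken by a token
    const tokens = [];
    for (const [type, re] of defs) {
        for (const m of input.matchAll(re)) {
            const start = m.index;
            const end = start + m[0].length;
            // Ignore any match that overlaps a span claimed by a larger token.
            if (claimed.some(([s, e]) => start < e && end > s)) continue;
            claimed.push([start, end]);
            tokens.push({ type, value: m[0], pos: start });
        }
    }
    return tokens.sort((a, b) => a.pos - b.pos);
}

console.log(tokenize("a <= b = c"));
// → LTEQ at 2 and EQ at 7; the LT and EQ inside "<=" are skipped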
You should try to tokenise the input in one single sweep. You would create a "monster" regex that matches any token (without regard to context unless absolutely necessary; that would come later), giving precedence to larger tokens over smaller ones.
Here is an example with a very basic regex just to demonstrate the idea. It is implemented in JavaScript, and it tokenises a tiny Java-like input:
let input = `
for (int i = 0; i < 1e+4; i++) {
    int k = i <= n / 2 ? 9.42 : 3.14;
    System.out.println(k);
}
`;

// Alternatives, most specific first: number literals (e.g. 1e+4, 9.42),
// identifiers/keywords, < <= > >= = ==, + ++ +=, - -- -=, anything else.
const regex = /[+-]?\d+(\.\d*)?(e[-+]?\d+)?|\w+|[<>=]=?|\+[+=]?|-[-=]?|\S/g;
const tokens = input.match(regex);
console.log(tokens);
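// Logs the token stream:
// ["for", "(", "int", "i", "=", "0", ";", "i", "<", "1e+4", ";",
//  "i", "++", ")", "{", "int", "k", "=", "i", "<=", "n", "/", "2",
//  "?", "9.42", ":", "3.14", ";", "System", ".", "out", ".",
//  "println", "(", "k", ")", ";", "}"]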
The final \S is the "anything else" match, so this is guaranteed to capture all non-space text. Following this basic tokenisation you'd then possibly create an abstract syntax tree from it.
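If you need typed tokens (LT, LTEQ, NUM, ...) rather than raw strings, one possible refinement (a sketch with made-up group names, using a trimmed-down version of the regex above) is to give each alternative a named capture group and read the token type off each match:

const regex = /(?<NUM>[+-]?\d+(?:\.\d*)?(?:e[-+]?\d+)?)|(?<WORD>\w+)|(?<CMP>[<>=]=?)|(?<OTHER>\S)/g;

function tokenize(input) {
    const tokens = [];
    for (const m of input.matchAll(regex)) {
        // Exactly one named group is defined per match; its name is the token type.
        const type = Object.keys(m.groups).find(k => m.groups[k] !== undefined);
        tokens.push({ type, value: m[0] });
    }
    return tokens;
}

console.log(tokenize("i <= 1e+4"));
// → [ { type: "WORD", value: "i" },
//     { type: "CMP", value: "<=" },
//     { type: "NUM", value: "1e+4" } ]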