parsing compiler-construction lexer

Why do lexers usually not allow a variable name to start with a digit?


What's the difference between the tokens _123jh and 123jh that makes most lexers reject identifiers that start with a digit? I suppose one reason might be that an all-digit identifier would be confusing, so it's easier to forbid a leading digit entirely than to allow something like:

^(\d+[A-Za-z_][A-Za-z_0-9]*|[A-Za-z_][A-Za-z_0-9]*)$

Or are there other reasons as well (for example, that a lexer could no longer get by with a single character of lookahead under such a rule)?


Solution

  • Because most languages support numeric literals, in particular floating-point numbers in exponential notation, complex numbers, and hexadecimal literals. If a variable name can begin with a digit, why can't it be all digits? And if it can, how do you distinguish it from an actual number? Worse, 9e9 is a perfectly legal number and isn't even all digits, so you can't just say "it has to have at least one non-digit character", because legal numbers can contain non-digit characters too.

    Then add the languages that allow underscores inside (but not at the ends of) numeric literals to aid readability (e.g. Python allows 1_000_000 to mean the same thing as 1000000 while making the three-digit groupings easier to see), or that have suffixes to adjust the type (e.g. 5U/123L in C/C++, 7u8/999i16/etc. in Rust), and you end up in a position where allowing variable names to start with a digit means imposing all sorts of other, far more arbitrary restrictions on variable names to avoid conflicts with numeric literals.

    Basically, it's a lot easier to say "it has to start with a non-digit character" than to arbitrarily say that 9e9 and 0xf are not legal variable names, but 9ee9 and 0xg are.
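To make the 9e9 / 9ee9 point concrete, here is a minimal Python sketch. Both patterns are assumptions made up for illustration: NUMBER is a deliberately simplified numeric-literal grammar (hex, or decimal with an optional fraction and exponent), and DIGIT_LEADING_IDENT is a hypothetical identifier rule that permits a leading digit. No real language's lexer is being reproduced here.

```python
import re

# Assumed, simplified numeric-literal grammar: a hex literal, or a decimal
# literal with an optional fraction and optional exponent.
NUMBER = re.compile(r"0[xX][0-9a-fA-F]+|\d+(\.\d+)?([eE][+-]?\d+)?")

# Hypothetical identifier rule that allows a leading digit.
DIGIT_LEADING_IDENT = re.compile(r"[A-Za-z0-9_]+")


def classify(lexeme):
    """Return a sorted list of every token kind that matches the whole lexeme."""
    kinds = []
    if DIGIT_LEADING_IDENT.fullmatch(lexeme):
        kinds.append("identifier")
    if NUMBER.fullmatch(lexeme):
        kinds.append("number")
    return kinds


if __name__ == "__main__":
    for lexeme in ["9e9", "0xf", "9ee9", "0xg", "123", "123jh", "_123jh"]:
        print(lexeme, classify(lexeme))
```

Under these assumed rules, 9e9, 0xf and 123 come back as both "identifier" and "number", while 9ee9, 0xg, 123jh and _123jh are identifiers only. Requiring a non-digit first character removes every one of the overlapping cases at once, without a list of special exceptions.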