parsing, tokenize, lexical-analysis

How does tokenization relate to formalism, lexical grammar, and regular language?


I am reading Bob Nystrom's Crafting Interpreters and in chapter 5 it says this:

In the last chapter, the formalism we used for defining the lexical grammar— the rules for how characters get grouped into tokens—was called a regular language. That was fine for our scanner, which emits a flat sequence of tokens. But regular languages aren’t powerful enough to handle expressions which can nest arbitrarily deeply.

The thing is, in the last chapter he refers to, a tokenizer was implemented whose core was a switch statement in Java.

You can find the implementation he refers to here

I am failing to see how such an implementation relates to

the formalism we used for defining the lexical grammar— the rules for how characters get grouped into tokens—was called a regular language

I mean, I did not have the feeling a formalism was defined. I also did not have the feeling I defined a lexical grammar. And apparently what I implemented alongside the book also somehow constitutes a regular language?

I guess my core issue is that I am failing to see the connection between the switch-statement-driven implementation I replicated and concepts like formalism, lexical grammar, regular language, etc.

I would appreciate further elucidation of these concepts and how they relate to the code in the book.


Solution

  • What you've implemented is effectively a maximal munch tokenisation algorithm, manually.

    The connection to regular languages is that tokens can often be defined as belonging to regular languages, which can be defined using regular expressions. As a result, many lexer generators take regular expressions as input.

    Regular languages have the desirable property that membership testing, "is the string x a member of regular language L?", is decidable. This can be proved by construction: for any (vanilla) regular expression, we can create a deterministic finite automaton (DFA) that, when simulated on the input, terminates and tells us whether the string is a member of the language (namely, whether we are left in an "accepting" state). I thoroughly recommend Sipser's "Introduction to the Theory of Computation" as an introduction to regular languages and their properties.
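
    To make the membership question concrete: for a plain pattern like [0-9]+ (no backreferences or other non-regular extensions), Java's java.util.regex behaves as exactly such a membership test. A tiny, illustrative sketch (class and variable names are mine):

    import java.util.regex.Pattern;

    public class Membership {
      public static void main(String[] args) {
        // "Is the string x a member of the regular language denoted by [0-9]+?"
        Pattern integerLiteral = Pattern.compile("[0-9]+");
        System.out.println(integerLiteral.matcher("1234").matches()); // true
        System.out.println(integerLiteral.matcher("12a4").matches()); // false
      }
    }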

    So, if we know that some kind of state machine (simulating the idea of a deterministic finite automaton) can be constructed to tell us whether a string belongs to a given regular language, we can go about implementing such a state machine in code. You mentioned the usage of switch statements; these can be used to implement the transition ("delta") function between the states of a DFA, where each case is the current input symbol (typically a single character, or a Unicode codepoint if your input is, say, UTF-8).

    For example, suppose we defined the language of integer literals by the regular expression [0-9]+. We could then go about creating a DFA for this regular expression. It's worth noting that R+ is really sugar for RR*, which models the idea of "at least one R". Without going through the hassle of describing the algorithms one can use for constructing the related DFA, I can say that the state machine below is sufficient:

    [DFA diagram: initial state 0 has a [0-9] transition to state 1; state 1 has a [0-9] transition back to itself and is drawn double circled (accepting).]

    The nodes are states and the labelled edges between them represent valid transitions. The double circled node represents an accepting state. In other words, if we exhaust the input and the final state we were in is 1, we can accept.

    If we wished to implement this DFA explicitly in code, we could do it in many ways. We could explicitly represent the current state using an integer (where the valid states are 0 and 1, as above). Then, to represent the transitions (the labelled edges in the diagram above), we could use a hardcoded table, a switch statement (with a case for each of '0', '1', '2', ..., '9'), or a range check (such as a predicate like isdigit from the C standard library). This transformation is straightforward; a minimal sketch is shown below.
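
    For instance, here is one way the switch-based variant could look (the method names are mine, purely for illustration):

    static int transition(int state, char c) {
      // Each case is an input symbol; both live states (0 and 1) move to the
      // accepting state 1 on a digit, and -1 stands for "no valid transition".
      switch (c) {
        case '0': case '1': case '2': case '3': case '4':
        case '5': case '6': case '7': case '8': case '9':
          return (state == 0 || state == 1) ? 1 : -1;
        default:
          return -1;
      }
    }

    static boolean matchesIntegerLiteral(String input) {
      int state = 0; // start in the initial state
      for (int i = 0; i < input.length(); i++) {
        state = transition(state, input.charAt(i));
        if (state == -1) return false; // no valid transition: reject
      }
      return state == 1; // accept only if we ended in the accepting state
    }

    Simulating this on "1234" ends in state 1 and accepts; on "12a4" (or on the empty string) it rejects.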

    The way it's done in Crafting Interpreters is more typical of handwritten lexers, but effectively models the same kind of idea. If we look at the code, we can find the following predicate test (which captures the idea of [0-9] as a regular expression):

    // ...
    if (isDigit(c)) {
       number();
    // ...
    

    If we imagine that the top-level switch from which this code is taken represents the initial state, then the call to number() would be a transition into the state expecting to see more of the same ([0-9]*).

    If we look at number(), we find that it roughly models what we'd expect of the second state from the diagram:

    private void number() {
      while (isDigit(peek())) advance();
    
      // ... omitted part that handles decimal/fractional point ...
    
      addToken(NUMBER,
          Double.parseDouble(source.substring(start, current)));
    }
    

    The while loop directly models the idea of having a [0-9] transition that stays in the same state. The actual number() in Crafting Interpreters also lexes the fractional component of doubles, which I've omitted above. However, a regular expression (such as [0-9]+(\.[0-9]+)?) could model that just as well.

    I recommend playing around with regular expressions, constructing DFAs (you can refer to the algorithms in the dragon book, "Compilers: Principles, Techniques, and Tools", or to tutorials about Brzozowski derivatives), and then implementing them in code by hand.

    Of course, the algorithm of tokenisation is a bit more involved than simply implementing a DFA for a single regular expression. Programming languages typically consist of more than one type of token, so the DFA needs to model the alternation of many different regular expressions. This is not very difficult; one need only adapt the DFA construction algorithm to work over many regular expressions at the same time. In practice, of course, we also have to worry about switching lexer modes (effectively having many lexers) to deal with comments, strings (with their character escape sequences), etc. Comments can be further complicated by the fact that some languages permit them to be nested, so seeing /* and then waiting for the next */ is not sufficient for recognising a comment; this is equivalent to the balanced parentheses problem, which is not a regular language!
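
    To illustrate that last point, here is a minimal sketch (not taken from the book; Lox's block comments do not nest) of skipping a nested /* ... */ comment. The unbounded depth counter is precisely what a finite automaton cannot keep track of:

    // Skip a nested block comment, assuming source contains "/*" at pos.
    // Returns the index just past the matching "*/", or -1 if the comment
    // is unterminated.
    static int skipNestedComment(String source, int pos) {
      int depth = 0;
      int i = pos;
      while (i + 1 < source.length()) {
        if (source.charAt(i) == '/' && source.charAt(i + 1) == '*') {
          depth++;          // another "/*" opens a nested comment
          i += 2;
        } else if (source.charAt(i) == '*' && source.charAt(i + 1) == '/') {
          depth--;          // a "*/" closes the innermost open comment
          i += 2;
          if (depth == 0) return i;
        } else {
          i++;
        }
      }
      return -1; // unterminated comment
    }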

    The next major problem with the idea of DFAs as the ideal formalisation of tokenisation is that, in the theory, the extent of the input is known: we simulate the DFA over a string of known length. In the tokenisation of programming languages, we don't know where each token ends and, for practical reasons, want to grab the largest possible lexeme each time (this is called "maximal munch tokenisation"). Therefore, the typical approach is to keep simulating the DFA until it fails to yield another state transition (there is no valid transition), at which point we hope that we've just seen a valid token, reset the DFA to its initial state, and continue going. This is a simplification of how maximal munch tokenisation algorithms operate, but it's not far off. I recommend Reps' Maximal Munch Tokenisation in Linear Time paper for pseudocode and related discussion.
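
    As a rough sketch of that skeleton, reusing the transition function from the integer-literal DFA above (a real lexer would use a combined DFA covering every token kind and would remember which kind each accepting state corresponds to):

    import java.util.ArrayList;
    import java.util.List;

    // Maximal munch: repeatedly run the DFA as far as it will go, remember the
    // last position at which it was in an accepting state, emit that lexeme,
    // then reset the DFA and continue from just past it.
    static List<String> lexemes(String source) {
      List<String> result = new ArrayList<>();
      int start = 0;
      while (start < source.length()) {
        int state = 0;        // reset the DFA to its initial state
        int lastAccept = -1;  // end of the longest accepted lexeme so far
        for (int i = start; i < source.length(); i++) {
          state = transition(state, source.charAt(i));
          if (state == -1) break;             // no valid transition: stop extending
          if (state == 1) lastAccept = i + 1; // state 1 is accepting for [0-9]+
        }
        if (lastAccept == -1) {
          start++;            // error handling elided: skip the offending character
        } else {
          result.add(source.substring(start, lastAccept));
          start = lastAccept; // continue scanning right after the lexeme
        }
      }
      return result;
    }

    On input like "123 456" this yields the lexemes "123" and "456", resetting the DFA after each one.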

    What I hope to have explained is the connection between regular languages, their concrete implementation (simulating a DFA constructed from something describing the regular language, such as a regular expression), and the kind of skeletal algorithm we place such a transition function into in order to tokenise programming languages (the tokenisation algorithm). The fact is that it's not the ideal model, but it's approximately suitable. Many people use lexer generators, plus some handwritten, side-effecting code, to deal with the aforementioned problems with lexing real programming languages. Others, such as the author of Crafting Interpreters, prefer to implement it all by hand. Either way, the logic of implementing a state machine for a bunch of regular lexemes, with some other bells and whistles for maximal munch and error handling, shines through.