flex-lexer jflex

JFlex maximum read length


Given a positional language like the old IBM RPG, we can have a line such as

CCCCCDIDENTIFIER     E S             10

where the character positions are:

 1-5:  comment
   6:  specification type
7-21:  identifier name
...And so on

Now, given that JFlex is based on regular expressions, we would have an expression such as:

[a-zA-Z][a-zA-Z0-9]{0,14} {0,14}

for the identifier name token.
However, this expression can match more than the 15 characters available for the identifier name (up to 29, in fact), so the excess has to be returned with yypushback.

Thus, is there a way to limit how many characters JFlex reads for a particular token?


Solution

  • Lexical analysis based on regular expressions is really not the right tool for parsing fixed-field input. You can simply split the input into fields at the known character positions, which is much easier and a lot faster, and it doesn't require fussing with regular expressions.

    Anyway, [a-zA-Z][a-zA-Z0-9]{0,14}[ ]{0,14} wouldn't be the right expression even if it did handle the token length properly, since the token is really just the word at the beginning, without the trailing space characters.

    If a fixed-length field contains something more complicated than a single identifier, you might want to feed just that field into a lexer, using a StringReader or some such.
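
To make the splitting idea concrete, here is a minimal sketch in Java. It only covers the three fields listed in the question (columns 1-5, 6, and 7-21); the class name `RpgLineSplitter`, the method names, and the map keys are made up for illustration, and a real RPG specification has more fields than this.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RpgLineSplitter {

    // Returns the text in columns startCol..endCol (1-based, inclusive).
    // Lines shorter than the requested range yield a shorter (possibly empty) field.
    static String field(String line, int startCol, int endCol) {
        int from = Math.min(startCol - 1, line.length());
        int to = Math.min(endCol, line.length());
        return line.substring(from, to);
    }

    // Splits one line using the column layout given in the question.
    // The key names are hypothetical; only the column ranges come from the question.
    static Map<String, String> split(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("comment", field(line, 1, 5));
        fields.put("specType", field(line, 6, 6));
        // The identifier is padded with spaces up to column 21, so trim the padding.
        fields.put("identifier", field(line, 7, 21).trim());
        return fields;
    }

    public static void main(String[] args) {
        String line = "CCCCCDIDENTIFIER     E S             10";
        System.out.println(split(line));
    }
}
```

If one of the fields needs real tokenizing, it can be wrapped as `new java.io.StringReader(fieldText)`; as far as I know, a JFlex-generated scanner accepts a `java.io.Reader` in its constructor, so the field could be handed to it that way.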


    Although I'm sure it's not useful, here's a regular expression which matches exactly 15 characters, consisting of a word at the start padded out with spaces:

    [a-zA-Z][ ]{14} |
    [a-zA-Z][a-zA-Z0-9][ ]{13} |
    [a-zA-Z][a-zA-Z0-9]{2}[ ]{12} |
    [a-zA-Z][a-zA-Z0-9]{3}[ ]{11} |
    [a-zA-Z][a-zA-Z0-9]{4}[ ]{10} |
    [a-zA-Z][a-zA-Z0-9]{5}[ ]{9} |
    [a-zA-Z][a-zA-Z0-9]{6}[ ]{8} |
    [a-zA-Z][a-zA-Z0-9]{7}[ ]{7} |
    [a-zA-Z][a-zA-Z0-9]{8}[ ]{6} |
    [a-zA-Z][a-zA-Z0-9]{9}[ ]{5} |
    [a-zA-Z][a-zA-Z0-9]{10}[ ]{4} |
    [a-zA-Z][a-zA-Z0-9]{11}[ ]{3} |
    [a-zA-Z][a-zA-Z0-9]{12}[ ]{2} |
    [a-zA-Z][a-zA-Z0-9]{13}[ ] |
    [a-zA-Z][a-zA-Z0-9]{14}
    

    (That might have to be put on one very long line.)
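
Rather than typing those alternatives by hand, the expression can be generated. The following is only a sketch: it builds the same fifteen alternatives as one line and checks them with `java.util.regex`, not with JFlex itself, so whether the output can be pasted unchanged into a JFlex specification (for example the `{0}` repetition in the first alternative) is an assumption I have not verified. The class name `FixedWidthIdentifierRegex` and the method `build` are invented for the example.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

public class FixedWidthIdentifierRegex {

    // Builds one alternative per possible word length: a leading letter,
    // k further letters or digits, then enough spaces to reach `width` characters.
    static String build(int width) {
        StringJoiner alternatives = new StringJoiner("|");
        for (int k = 0; k <= width - 1; k++) {
            int spaces = width - 1 - k;
            alternatives.add("[a-zA-Z][a-zA-Z0-9]{" + k + "}[ ]{" + spaces + "}");
        }
        return alternatives.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(build(15));
        // A 10-character word padded to 15 characters is accepted.
        System.out.println(p.matcher("IDENTIFIER     ").matches());
        // The same word padded to 16 characters is rejected.
        System.out.println(p.matcher("IDENTIFIER      ").matches());
    }
}
```

Because every alternative adds up to exactly 15 characters, a match can never run past the end of the field, which is what the original `{0,14}` version could not guarantee.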