Tags: c++, regex, lex, lexical-analysis

Regular expression set max length for string literal


I am trying to figure out how to set a maximum length in a regular expression. My goal is to limit my regular expression for string literals to a maximum of 80 characters.

Here is my expression if you need it:

["]([^"\\]|\\(.|\n))*["]|[']([^'\\]|\\(.|\n))*['] 

I've tried adding {0,80} at both the front and the end of the expression, but either all the strings break down into smaller identifiers or none do so far.

Thanks in advance for the help!

Edit:

Here's a better explanation of what I am trying to accomplish. Given "This string is over 80 characters long" (assume it really is over 80 characters), when run through flex, instead of being listed like:

line: 1, lexeme: |THIS STRING IS OVER 80 CHARACTERS LONG|, length: 81, token: 4003

I would need it to be broken up like this:

line: 1, lexeme: |THIS|, length: 4, token: 6000

line: 1, lexeme: |STRING|, length: 6, token: 6000

line: 1, lexeme: |IS|, length: 2, token: 6000

line: 1, lexeme: |OVER|, length: 4, token: 6000

line: 1, lexeme: |80|, length: 2, token: 6000

line: 1, lexeme: |CHARACTERS|, length: 10, token: 6000

line: 1, lexeme: |LONG|, length: 4, token: 6000

While string "THIS STRING IS NOT OVER 80 CHARACTERS LONG" would be shown as:

line: 1, lexeme: |THIS STRING IS NOT OVER 80 CHARACTERS LONG|, length: 50, token: 4003


Solution

  • I've tried adding {0,80} at both the front and the end of the expression

    The brace operator is not a length limit; it's a range of repetition counts. It has to go where a repetition operator (*, + or ?) would go: immediately after the subpattern being repeated.

    So in your case you might use the following (I've left out the ' alternative for clarity):

        ["]([^"\\\n]|\\(.|\n)){0,80}["]
    

    Normally, I would advise you not to do this, or at least to do it with some care. (F)lex regular expressions are compiled into state transition tables, and the only way to compile a maximum repetition count is to copy the subpattern states once for each repetition. So the above pattern needs to make 80 copies of the state transitions for ([^"\\\n]|\\(.|\n)). (With a simple subpattern like that, the state blow-up might not be too serious. But with more complex subpatterns, you can end up with enormous transition tables.)

    Edit: Split long strings into tokens as though they weren't quoted.

    An edit to the question suggests that what is expected is to treat strings of length greater than 80 characters as though the quote marks had never been entered; that is, report them as individual word and number tokens without any intervening whitespace. That seems so idiosyncratic to me that I can't convince myself that I'm reading the requirement correctly, but in case I am, here's the outline of a possible approach.

    Let's suppose that the intention is that short strings should be reported as single tokens, while long strings should be reinterpreted as a series of tokens (possibly but not necessarily the same tokens as would be produced by an unquoted input). If that's the case, there are really two lexical analyses which need to be specified, and they will not use the same pattern rules. (For one thing, the rescan needs to recognise a quotation mark as the end of a literal, causing the scanner to revert to normal processing, while the original scan considers a quotation mark to start a string literal.)

    One possibility would be to just collect the entire long string and then use a different lexical scanner to break it into whatever parts seem useful, but that would require some complicated plumbing in order to record the resulting token stream and return it one token at a time to the yylex caller. (This would be reasonably easy if yylex were pushing tokens to a parser, but that's a whole other scenario.) So I'm going to discard this option other than to mention that it is possible.

    So the apparently simpler option is to ensure that the original scan halts on the 81st character, so that it can change the lexical rules and back up the scan to apply the new rules.

    (F)lex provides start conditions as a way of providing different lexical contexts. By using the BEGIN macro in a (f)lex action, it is possible to dynamically switch between start conditions, switching the scanner into a different context. (They're called "start conditions" because they change the scanner's state at the start of the token.)

    Each start condition (except the default start condition, which is called INITIAL) needs to be declared in the flex prologue. In this case, we'll only need one extra start condition, which we'll call SC_LONG_STRING. (By convention, start condition names are written in ALL CAPS since they are translated into either C macros or enumeration values.)

    Flex has two possible mechanisms for backing up a scan, either of which will work here. I'm going to show the explicit back-up because it's safer and more flexible; the alternative is to use the trailing context operator (/) which would work perfectly well in this solution, but not in other very similar contexts.

    So we start by declaring our start condition and then the rules for handling quoted strings in the default (INITIAL) lexical context:

    %x SC_LONG_STRING
    %%
    

    I'm only showing the double-quote rules since the single-quote rules are virtually identical. (Single-quote will require another start condition, because the termination pattern is different.)

    The first rule matches strings where there are at most 80 characters or escape sequences in the literal, using a repetition operator as described above.

    ["]([^"\\\n]|\\(.|\n)){0,80}["]    { yylval.str = strndup(yytext + 1, yyleng - 2);
                                         return TOKEN_STRING; 
                                       }
    

    The second rule matches the opening quote followed by exactly 81 character units, one more than the limit. It does not attempt to find the end of the string; that will be handled within the SC_LONG_STRING rules. The action does two things: it switches to the SC_LONG_STRING start condition, and it calls yyless(1) to back the scan up to just past the opening quote, so that the string's contents are rescanned under the new rules:

    ["]([^"\\\n]|\\(.|\n)){81}         { BEGIN(SC_LONG_STRING); yyless(1); }
    

    The final rule is a fallback for unterminated strings; it will trigger if something that looks like a string was started but neither of the above rules matched it. That can only happen if a newline or end-of-file indicator is encountered before the closing quote:

    ["]([^"\\\n]|\\(.|\n)){0,80}       { yyerror("Unterminated string"); }
    

    Now, we need to specify the rules for SC_LONG_STRING. For simplicity, I'm going to assume that it is only desired to split the string into whitespace separated units; if you want to do a different analysis, you can change the patterns here.

    Start conditions are identified by writing the name of the start condition inside angle brackets. The start condition name is considered part of the pattern, so it should not be followed by a space (space characters aren't allowed in lex patterns unless they are quoted). Flex is more flexible; see the section on start conditions in the flex manual for more details.

    The first rule simply returns to the INITIAL start condition when a double quote terminates the string. The second rule discards whitespace inside the long string, and the third rule passes the whitespace-separated components on to the caller. Finally, we need to consider the possible error of an unterminated long string, which will result in encountering a newline or end-of-file indicator.

    <SC_LONG_STRING>["]                      { BEGIN(INITIAL); }
    <SC_LONG_STRING>[ \t]+                   ;
    <SC_LONG_STRING>([^"\\ \n\t]|\\(.|\n))+  { yylval.str = strdup(yytext);
                                               return TOKEN_IDENTIFIER;
                                             }
    <SC_LONG_STRING>\n                       |
    <SC_LONG_STRING><<EOF>>                  { yyerror("Unterminated string"); }
    

    Original answer: Produce a meaningful error for long strings

    You need to specify what you are planning to do if the user enters a string which is too long. If your scanner doesn't recognise it as a string, it will return some kind of fall-back token which will probably induce a syntax error from the parser; that provides no useful feedback to the user, so they probably won't have a clue where the syntax error comes from. And you certainly cannot restart the lexical analysis in the middle of a string which happens to be too long: that would end up interpreting text which was supposed to be quoted as though it were tokens.

    A much better strategy is to recognise strings of any length, and then check the length in the action associated with the pattern. As a first approximation, you could try this:

    ["]([^"\\]|\\(.|\n))*["]    { if (yyleng <= 82) return STRING;
                                  else {
                                    yyerror("String literal exceeds 80 characters");
                                    return BAD_STRING;
                                  }
                                }
    

    (Note: (F)lex sets the variable yyleng to the length of yytext, so there is never a need to call strlen(yytext). strlen needs to scan its argument to compute the length, so it's quite a bit less efficient. Moreover, even in cases where you need to call strlen, you should not use it to check if a string exceeds a maximum length. Instead, use strnlen, which will limit the length of the scan.)

    But that's just a first approximation, because it counts source characters, rather than the length of a string literal. So, for example, assuming you plan to allow hex escapes, the string literal "\x61" will be counted as though it has four characters, which could easily cause string literals containing escapes to be incorrectly rejected as too long.

    That problem is ameliorated but not solved by the pattern with a limited repeat count, because the pattern itself does not fully parse escape sequences. In the pattern ["]([^"\\]|\\(.|\n)){0,80}["], the \x61 escape sequence will be counted as three repetitions (\x, 6, 1), which is still more than the single character it represents. As another example, splices (\ followed by a newline) will be counted as one repetition, whereas they don't contribute to the length of the literal at all since they are simply removed.

    So if you want to correctly limit the length of a string literal, you will have to parse the source representation more precisely. (Or you would need to reparse it after identifying it, which seems wasteful.) This is usually done with a start condition.