parsing, haskell, lexer, happy

How to distinguish tokens that have similar patterns in the lexer but occur in different contexts in the parser


I have two very similar patterns in Lexer.x: the first for numbers, the second for bytes. Here they are:

$digit=0-9
$byte=[a-f0-9]


    $digit+                       { \s -> TNum  (readRational s) }
    $digit+.$digit+               { \s -> TNum  (readRational s) }
    $digit+.$digit+e$digit+       { \s -> TNum  (readRational s) }
    $digit+e$digit+               { \s -> TNum  (readRational s) }
    $byte$byte                        { \s -> TByte (encodeUtf8(pack s))     }

And this is my Parser.y:

%token

        cnst                            { TNum  $$}
        byte                            { TByte  $$}
        '['                            { TOSB     }    
        ']'                            { TCSB     }

%%

Expr: 
 '[' byte ']' {$1}
| cnst {$1}

When I run the parser, I get:

[ 11 ] parse error
11 ok

But when I put the byte pattern before the number patterns in the lexer:

$digit=0-9
$byte=[a-f0-9]

    $byte$byte                        { \s -> TByte (encodeUtf8(pack s))     }
    $digit+                       { \s -> TNum  (readRational s) }
    $digit+.$digit+               { \s -> TNum  (readRational s) }
    $digit+.$digit+e$digit+       { \s -> TNum  (readRational s) }
    $digit+e$digit+               { \s -> TNum  (readRational s) }

I get:

[ 11 ] ok
11 parse error

I think this happens because the lexer turns the input string into tokens first and only then hands them to the parser. So when the parser expects a byte token but receives a number token, it has no way to reinterpret that value as a different token. What should I do in this situation?
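
The flip between the two orderings matches Alex's documented rule for overlapping patterns: the longest match wins, and a tie goes to the rule listed first. A minimal Haskell sketch of that rule follows. `Rule`, `pick`, `digitsRule` and `byteRule` are hypothetical names made up for illustration, not part of Alex, and only the `$digit+` and `$byte$byte` patterns are modelled.

```haskell
import Data.Char (isDigit)

-- Hypothetical model of a lexer rule: a label plus a matcher that
-- returns the length of the prefix it accepts, if any.
type Rule = (String, String -> Maybe Int)

-- Mirrors $byte = [a-f0-9]
isByteChar :: Char -> Bool
isByteChar c = isDigit c || c `elem` "abcdef"

-- Mirrors $digit+
digitsRule :: String -> Maybe Int
digitsRule s = case length (takeWhile isDigit s) of
  0 -> Nothing
  n -> Just n

-- Mirrors $byte$byte
byteRule :: String -> Maybe Int
byteRule (a:b:_) | isByteChar a && isByteChar b = Just 2
byteRule _ = Nothing

-- Longest match wins; on a tie, the rule listed first is kept.
pick :: [Rule] -> String -> Maybe (String, Int)
pick rules input =
  foldl better Nothing
    [ (name, n) | (name, match) <- rules, Just n <- [match input] ]
  where
    better Nothing c = Just c
    better (Just (bestName, bestLen)) (name, len)
      | len > bestLen = Just (name, len)         -- strictly longer replaces
      | otherwise     = Just (bestName, bestLen) -- a tie keeps the earlier rule
```

Under these assumptions, `pick [("TNum", digitsRule), ("TByte", byteRule)] "11"` evaluates to `Just ("TNum",2)`, swapping the two rules gives `Just ("TByte",2)`, and `"1f"` is labelled `TByte` in either order because the byte rule matches the longer prefix.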


Solution

  • In that case you should postpone the decision. You can, for example, add a TNumByte data constructor that stores the matched text as a String:

    data Token
        = TByte ByteString
        | TNum Rational
        | TNumByte String
        -- …

    For a sequence of exactly two $digits, it is not yet clear whether it should be interpreted as a byte or as a number, so we construct a TNumByte for it:

    $digit=0-9
    $byte=[a-f0-9]
    
    $digit$digit                  { TNumByte }
    $byte$byte                    { \s -> TByte (encodeUtf8(pack s)) }
    $digit+                       { \s -> TNum  (readRational s) }
    -- \. matches a literal dot; an unescaped . matches any character except newline
    $digit+\.$digit+              { \s -> TNum  (readRational s) }
    $digit+\.$digit+e$digit+      { \s -> TNum  (readRational s) }
    $digit+e$digit+               { \s -> TNum  (readRational s) }

    Then the parser can decide based on the context:

    %token
    
      cnst                           { TNum $$ }
      byte                           { TByte $$ }
      numbyte                        { TNumByte $$ }  -- 🖘 can be number or byte
      '['                            { TOSB }
      ']'                            { TCSB }
    
    %%
    
    Expr
      : '[' byte ']' { $2 }
      | '[' numbyte ']' { encodeUtf8(pack $2) }  -- 🖘 interpret as byte
      | cnst { $1 }
      | numbyte { readRational $1 }  -- 🖘 interpret as number
      ;
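
One detail the grammar above leaves open: two alternatives of Expr produce a ByteString and the other two a Rational, while Happy expects every alternative of a nonterminal to have the same type. A possible way around this, sketched below in plain Haskell, is to wrap both cases in a sum type. The `Expr` type, its constructors and `parseExpr` are hypothetical names used only for illustration; `parseExpr` is a hand-written stand-in for the four grammar alternatives, not Happy output. The sketch also swaps in `Data.ByteString.Char8.pack` for `encodeUtf8 (pack s)` (equivalent for ASCII hex digits) and `fromInteger (read s)` for `readRational`, so that it depends only on base and bytestring.

```haskell
import qualified Data.ByteString.Char8 as B

-- Token type as in the answer; TNumByte keeps the raw two-digit text.
data Token
  = TByte B.ByteString
  | TNum Rational
  | TNumByte String
  | TOSB            -- '['
  | TCSB            -- ']'
  deriving (Eq, Show)

-- Hypothetical result type so every alternative returns the same type.
data Expr
  = EByte B.ByteString
  | ENum Rational
  deriving (Eq, Show)

-- Hand-written stand-in for the four Expr alternatives (assumption:
-- the whole input is a single Expr).
parseExpr :: [Token] -> Maybe Expr
parseExpr [TOSB, TByte b, TCSB]    = Just (EByte b)
parseExpr [TOSB, TNumByte s, TCSB] = Just (EByte (B.pack s))            -- context says: byte
parseExpr [TNum n]                 = Just (ENum n)
parseExpr [TNumByte s]             = Just (ENum (fromInteger (read s))) -- context says: number
parseExpr _                        = Nothing
```

With this sketch, the token list for `[ 11 ]`, that is `[TOSB, TNumByte "11", TCSB]`, becomes `EByte "11"`, while the single token `TNumByte "11"` for a bare `11` becomes `ENum 11`. In the Happy file the same idea would amount to wrapping each semantic action in `EByte` or `ENum`.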