parsinggrammarlemon

Lemon Parser REPL


I'm trying to build a Smalltalk REPL based on LanguageKit, which uses a lemon gramma. Currently the parser only supports parsing complete class definitions, but not statements outside of the method syntax.

For example this gets parsed:

methodName [
    NSObject description.
    NSObject debugDescription.
]

But it will fail if I try to parse only the statements:

NSObject description.
NSObject debugDescription.

The following won't accept multiple statements (e.g. Transcript show: 'hello'. Transcript show: 'world'.):

file ::= statement_list(M).
{
    [p setAST:M];
}

Here the minimal gramma:

%include {
#include <assert.h>
#import <Foundation/Foundation.h>
#import <LanguageKit/LanguageKit.h>
#import <LanguageKit/LKToken.h>
#import "SmalltalkParser.h"

}
%name SmalltalkParse
%token_prefix TOKEN_
%token_type {id}
%extra_argument {SmalltalkParser *p}
%left PLUS MINUS STAR SLASH EQ LT GT COMMA.
%left WORD.


file ::= method(M).
{
    [p setAST:M];
}
file ::= statement_list(M).
{
    [p setAST:M];
}
file ::= statement(M).
{
    [p setAST:M];
}
file ::= .



method(M) ::= signature(S) LSQBRACK statement_list(E) RSQBRACK.
{
    M = [LKInstanceMethod methodWithSignature:S locals:nil statements:E];
}

signature(S) ::= WORD(M).
{
    S = [LKMessageSend messageWithSelectorName:M];
}
signature(S) ::= keyword_signature(M).
{
    S = M;
}


statement_list(L) ::= statements(T).
{
    L = T;
}
statement_list(L) ::= statements(T) statement(S).
{
    [T addObject:S];
    L = T;
}

statements(L) ::= statements(T) statement(S) STOP.
{
    [T addObject:S];
    L = T;
}
statements(L) ::= .
{
    L = [NSMutableArray array];
}

statement(S) ::= expression(E).
{
    S = E;
}

%syntax_error 
{
    [NSException raise:@"ParserError" format:@"Parsing failed"];
}

message(M) ::= simple_message(S).
{
    M = S;
}

simple_message(M) ::= WORD(S).
{
    M = [LKMessageSend messageWithSelectorName:S];
}

expression(E) ::= simple_expression(S).
{
    E = S;
}

simple_expression(E) ::= WORD(T) simple_message(M).
{
    [M setTarget:T];
    E = M;
}

The complete gramma can be found here: smalltalk.y. I have been reading other grammas and also searching stackoverflow, but don't see a difference for example with this gramma and don't understand why this isn't working.


Solution

  • Your grammar has parsing conflicts. These must be resolved if you have a hope of getting the grammar to work correctly.

    (The grammar also has an undefined non-terminal, keyword_signature, and an unused non-terminal message. To get it to compile without warnings, I simply removed them. I don't think it makes any difference to the analysis below.)

    Part of the conflict is very simple: you cannot have both

    file ::= statement_list .
    

    and

    file ::= statement .
    

    In fact, it is not clear to me why you would want to? Isn't a statement an example of a statement_list?

    The reason you cannot have both is that you have:

     statement_list ::= statements statement .
    

    and

     statements ::= .
    

    Taken together, that means that starting with statement_list, you can recognise a single statement. So your grammar is ambiguous; if the input is a single statement, it could be directly parsed as a file or it could be parsed as filestatement_liststatements statementstatement, with a different set of actions.

    You might not care; indeed, you might believe that the sequence of actions is identical. You might even be right about that. But the parser cannot know that, and it will not believe it. It sees the two parses as necessarily distinct. So it will report a conflict.

    In short, get rid of file ::= statement .. Then you can start working on the other parsing conflicts.


    The more fundamental problem is also based on the fact that statements can derive the empty sequence.

    Let's take a look at the grammar (simplified by removing all semantics):

    statement_list ::= statements .
    statement_list ::= statements statement .
    statements     ::= statements statement STOP .
    statements     ::= .
    

    If statement_list is not empty, whatever it matches must start with an empty statements followed by a statement. statement, in turn, must start with WORD, so statement_list must match input which starts with WORD. But before it can shift the WORD in order to continue the parse, it needs to first insert the empty statements. So it needs to do a reduction using the last rule quoted above before it can handle a WORD. (If this paragraph is not completely clear, please try rereading it, and if you still have questions, ask. It's important to understand this part.)

    None of that would be a problem if it were not for the fact that file could also be a method, and method also starts with a WORD. But, unlike statement_list, it really starts with WORD. It does not start with an empty statements, so if the parser creates an empty statements and the input is actually a method, then the parse will fail.

    As it happens, this particular conflict doesn't occur if you have file ::= statement instead of file ::= statement_list, because statement doesn't start with an empty statements either. That means that when the parser sees the WORD at the beginning of the input, it does not yet have to decide whether it is about to see a statement or a method. In both cases, the parse action is to shift the WORD and see what comes next.

    To solve this problem, we can observe that statement_list must contain at least one statement, and that all the statements inside a statement_list (except possibly the last one) must be terminated with a STOP (that is, .). If we start with that idea, it's pretty easy to produce an alternative grammar which does not require an empty list at the beginning:

    statement_list ::= statements .
    statement_list ::= statements STOP .
    statements ::= statement .
    statements ::= statements STOP statement .
    

    This differs from your grammar in that it considers a statement_list to be a non-empty list of dot-separated statements optionally terminated with a dot, while your grammar considers a statement_list to be a possibly empty list of dot-terminated statements followed by a single statement.


    Since I have now tested the grammar, I add the complete testable code as an illustration of what we ask for when we ask for a Minimal Complete Verifiable Example. (I used C and flex rather than Objective C but I don't think that makes any difference.)

    File parser.y:

    %include { #include <assert.h> }
    file ::= method.
    file ::= statement_list.
    file ::= .
    method ::= signature OBRAC statement_list CBRAC .
    signature ::= WORD .
    statement_list ::= statements STOP .
    statement_list ::= statements .
    statements ::= statements STOP statement .
    statements ::= statement .
    statement ::= expression .
    expression ::= simple_expression .
    simple_expression ::= WORD simple_message .
    simple_message ::= WORD .
    %extra_argument { int* status }
    %syntax_error { *status = 1; }
    %parse_failure { fprintf(stderr, "Parse failed.\n"); }
    %parse_accept { fprintf(stderr, "Parse succeeded.\n"); }
    

    File main.l:

    %{
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include "parser.h"
    void* ParseAlloc(void* (*allocProc)(size_t));
    void* Parse(void*, int, int, int*);
    void* ParseFree(void*, void(*freeProc)(void*));
    
    void synerr(const char* token) {
      fprintf(stderr, "Syntax error handling '%s'\n", token);
    }
    %}
    %option noinput nounput noyywrap nodefault
    %x FLUSH
    %%
        void* parser = ParseAlloc(malloc);
        int status = 0;
        #define SEND(typ, val) do {                        \
           if (Parse(parser, typ, val, &status), status) { \
             synerr(yytext); BEGIN(FLUSH);                 \
           }                                               \
        } while(0)
    [[:space:]]+ ;
    [[:alnum:]]+ { SEND(WORD, 0); }
    "["          { SEND(OBRAC, 0); }
    "]"          { SEND(CBRAC, 0); }
    "."          { SEND(STOP, 0); }
    .            { synerr(yytext); BEGIN(FLUSH); }
    <FLUSH>.+    ;
    <FLUSH>\n    { status = 0; BEGIN(INITIAL); }
    <<EOF>>      { if (status == 0) {
                     Parse(parser, 0, 0, &status);
                     if (status) synerr("EOF");
                   }
                   ParseFree(parser, free );
                   return 0;
                 }
    %%
    
    int main(int argc, char** argv) {
       return yylex();
    }
    

    Build procedure:

    $ lemon parser.y
    $ flex -o main.c main.l
    $ gcc -std=c11 -Wall -Wno-unused-variable -o catlan -D_XOPEN_SOURCE=800 main.c parser.c
    

    Tests:

    $ ./catlan <<< 'NSObject'
    Parse failed.
    Syntax error handling 'EOF'
    $ ./catlan <<< 'NSObject description'
    Parse succeeded.
    $ ./catlan <<< 'NSObject description.'
    Parse succeeded.
    $ ./catlan <<< 'NSObject description. OtherObject'
    Parse failed.
    Syntax error handling 'EOF'
    $ ./catlan <<< 'NSObject description. OtherObject otherDesc'
    Parse succeeded.
    $ ./catlan <<< 'NSObject description. OtherObject otherDesc.'
    Parse succeeded.
    $ ./catlan <<< 'NSObject description. OtherObject otherDesc extra words'
    Syntax error handling 'extra'
    Parse succeeded.
    $ ./catlan <<< 'method [ NSObject desc]'
    Parse succeeded.
    $ ./catlan <<< 'method [ NSObject desc.]'
    Parse succeeded.
    $ ./catlan <<< 'method [ NSObject desc extra words]'
    Syntax error handling 'extra'
    Parse failed.
    $ ./catlan <<< 'method [ NSObject desc. Second]'
    Syntax error handling ']'
    Parse failed.
    $ ./catlan <<< 'method [ NSObject desc. Second desc]'
    Parse succeeded.