
Lexing token ambiguity in ANTLR4


I have an interesting problem parsing messages that follow the following grammar (Conventional Commits), which is a convention for how git commit messages should be formatted.

<type>[optional scope]: <description>

[optional body]

[optional footer(s)]

Now, regarding my dilemma: what would be the best way to differentiate the body from the footer? According to the spec, they should be separated by two newline characters, so at first I thought this would be a good fit for ANTLR4 island grammars. I came up with something like what I posted here, but after some testing I discovered it is not flexible: it does not work if the body is absent (the body section is optional) but the footer is present.

I can think of a couple of ways to tie the grammar to a particular target language and implement this differentiation with semantic predicates, but ideally I would like to avoid that.

Now, I think the problem boils down to how to properly differentiate between the KEY and SINGLE_LINE tokens, which conflict (in the next iteration of my implementation):

mode Text;
KEY: [a-z][a-z_-]+;
SINGLE_LINE: ~[\n]+;

MULTI_LINE: SINGLE_LINE (NEWLINE SINGLE_LINE)*;

NEXT: NEWLINE NEWLINE;

What would be the best way to differentiate between KEY and SINGLE_LINE?
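
For context, the conflict follows from how an ANTLR4 lexer chooses between rules: the rule that matches the most characters wins, and ties go to the rule defined first. The following is a minimal Python sketch of that selection, not ANTLR itself. The regexes are assumed stand-ins for the two rules above, and `next_token` is a hypothetical helper introduced only for illustration.

```python
import re

# Regex stand-ins for the two lexer rules above (assumed equivalents).
RULES = [
    ("KEY", re.compile(r"[a-z][a-z_-]+")),
    ("SINGLE_LINE", re.compile(r"[^\n]+")),
]

def next_token(text, pos=0):
    """Emulate the lexer's choice: longest match wins, ties go to the earlier rule."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly longer matches replace the current best, so on a tie
        # the rule listed first is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# SINGLE_LINE can consume the whole line, so it beats KEY here.
print(next_token("signed-off: john.doe@some.domain.com"))
# Only when the line consists of nothing but a key do both match equally,
# and KEY wins because it is listed first.
print(next_token("signed-off"))
```

Under these assumptions, KEY is only ever produced for a line that contains nothing but the key, which is why reordering the two rules alone does not resolve the ambiguity.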


Solution

  • I'd do something like this:

    ConventionalCommitsLexer.g4

    lexer grammar ConventionalCommitsLexer;
    
    options {
      caseInsensitive=true;
    }
    
    TYPE : [a-z]+;
    LPAR : '(' -> pushMode(Scope);
    COL  : ':' -> pushMode(Text);
    
    fragment SPACE : [ \t];
    
    mode Scope;
    
     SCOPE : ~[)]+;
     RPAR  : ')' SPACE* -> popMode;
    
    mode Text;
    
     COL2    : ':' -> type(COL);
     SPACES : SPACE+ -> skip;
     WORD   : ~[: \t\r\n]+;
     NL     : SPACE* '\r'? '\n' SPACE*;
    

    ConventionalCommitsParser.g4

    parser grammar ConventionalCommitsParser;
    
    options {
      tokenVocab=ConventionalCommitsLexer;
    }
    
    commit
     : TYPE scope? COL description ( NL NL body )? ( NL NL footer )? EOF
     ;
    
    scope
     : LPAR SCOPE RPAR
     ;
    
    description
     : word+
     ;
    
    // A 'body' cannot start with `WORD COL`, hence: `WORD WORD`
    body
     : WORD WORD word* ( NL word+ )*
     ;
    
    footer
     : key_value ( NL key_value )* NL?
     ;
    
    key_value
     : WORD COL word+
     ;
    
    word
     : WORD
     | COL
     ;
    

    Parsing the input (body + footer):

    fix(some_module): this is a commit description
        
    Some more in-depth description of what was fixed: this
    can be a multi-line text, not only a one-liner.
    
    Signed-off: john.doe@some.domain.com
    Another-Key: another value with : (colon)
    Some-Other-Key: some other value
    

    result:

    [parse tree image, not preserved]

    Parsing the input (only body):

    fix(some_module): this is a commit description
        
    Some more in-depth description of what was fixed: this
    can be a multi-line text, not only a one-liner.
    

    result:

    [parse tree image, not preserved]

    Parsing the input (only footer):

    fix(some_module): this is a commit description
    
    Signed-off: john.doe@some.domain.com
    Another-Key: another value with : (colon)
    Some-Other-Key: some other value
    

    result:

    [parse tree image, not preserved]
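
The way the parser rules above tell a body from a footer can be sketched outside ANTLR. The Python below is only an approximation under stated assumptions: the regex stands in for the `WORD` and `COL` tokens of the Text mode, and `tokens` and `classify` are hypothetical helpers, not part of the grammar or of any ANTLR API.

```python
import re

# Approximation of the Text-mode tokens: COL is ':', WORD is a run of
# characters containing neither ':' nor whitespace.
TOKEN = re.compile(r":|[^:\s]+")

def tokens(line):
    """Return the token kinds ('WORD' or 'COL') for one line."""
    return ["COL" if t == ":" else "WORD" for t in TOKEN.findall(line)]

def classify(paragraph):
    """Mirror the parser rules: a body starts with WORD WORD, while a
    footer consists only of lines shaped like WORD COL word+."""
    lines = [tokens(ln) for ln in paragraph.splitlines() if ln.strip()]
    if not lines:
        return None
    if lines[0][:2] == ["WORD", "WORD"]:
        return "body"
    if all(len(kinds) >= 3 and kinds[:2] == ["WORD", "COL"] for kinds in lines):
        return "footer"
    return None  # matches neither rule

body = ("Some more in-depth description of what was fixed: this\n"
        "can be a multi-line text, not only a one-liner.")
footer = ("Signed-off: john.doe@some.domain.com\n"
          "Another-Key: another value with : (colon)\n"
          "Some-Other-Key: some other value")

print(classify(body))
print(classify(footer))
```

One consequence of this design, assuming the sketch reflects the grammar faithfully: a body whose first line happens to begin with a single word followed by a colon (for example `Note: ...`) is treated as a footer line rather than as body text.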