I have an interesting problem parsing the following grammar (Conventional Commits), which is a convention for how git commit messages should be formatted.
<type>[optional scope]: <description>

[optional body]

[optional footer(s)]
Footers are key-value pairs in the "fobar: this is value" format, with a newline separating them.

Now, regarding my dilemma: what would be the best way to differentiate the body part from the footer part? According to the spec, those should be separated by two newline characters, so at first I thought this would be a good fit for ANTLR4 island grammars. I came up with something like what I posted here, but after some testing I discovered that it is not flexible: it does not work when the body is missing (the body section is optional) but a footer is present.
I can think of a couple of ways to tie the grammar to a specific target language and implement this differentiation with semantic predicates, but ideally I would like to avoid that.
Now, I think the problem boils down to how to properly differentiate between the KEY and SINGLE_LINE tokens, which conflict (in the next iteration of my implementation):
mode Text;
KEY: [a-z][a-z_-]+;
SINGLE_LINE: ~[\n]+;
MULTI_LINE: SINGLE_LINE (NEWLINE SINGLE_LINE)*;
NEXT: NEWLINE NEWLINE;
What would be the best way to differentiate between KEY and SINGLE_LINE?
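For context on why these two rules collide: an ANTLR lexer picks the rule that matches the most characters at the current position and, on a tie, the rule that is defined first. The Python sketch below only imitates that behaviour with regular expressions (the RULES table and the next_token helper are hypothetical stand-ins, not ANTLR API), but it illustrates that SINGLE_LINE swallows a whole footer line before KEY gets a chance.

```python
import re

# Hypothetical stand-ins for the two lexer rules from the question.
RULES = [
    ("KEY", re.compile(r"[a-z][a-z_-]+")),
    ("SINGLE_LINE", re.compile(r"[^\n]+")),
]

def next_token(text, pos=0):
    """Return (rule_name, lexeme) for the longest match at pos.

    Ties go to the rule listed first, imitating ANTLR's lexer.
    Returns None when no rule matches.
    """
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces an earlier rule's match.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# The whole line is longer than the key alone, so SINGLE_LINE wins.
print(next_token("signed-off: john.doe@some.domain.com"))
# Both rules match the same text here, so the earlier rule (KEY) wins.
print(next_token("signed-off"))
```

KEY can only win when the line consists of nothing but key characters, which is why the lexer alone cannot separate keys from free text with these two rules.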
I'd do something like this:
lexer grammar ConventionalCommitsLexer;
options {
caseInsensitive=true;
}
TYPE : [a-z]+;
LPAR : '(' -> pushMode(Scope);
COL : ':' -> pushMode(Text);
fragment SPACE : [ \t];
mode Scope;
SCOPE : ~[)]+;
RPAR : ')' SPACE* -> popMode;
mode Text;
COL2 : ':' -> type(COL);
SPACES : SPACE+ -> skip;
WORD : ~[: \t\r\n]+;
NL : SPACE* '\r'? '\n' SPACE*;
parser grammar ConventionalCommitsParser;
options {
tokenVocab=ConventionalCommitsLexer;
}
commit
: TYPE scope? COL description ( NL NL body )? ( NL NL footer )? EOF
;
scope
: LPAR SCOPE RPAR
;
description
: word+
;
// A 'body' cannot start with `WORD COL`, hence: `WORD WORD`
body
: WORD WORD word* ( NL word+ )*
;
footer
: key_value ( NL key_value )* NL?
;
key_value
: WORD COL word+
;
word
: WORD
| COL
;
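The decision the parser makes can be sketched outside ANTLR as well. The Python snippet below is only an approximation under stated assumptions: split_commit, WORD_COL and BLANK_LINE are hypothetical names, blocks are separated by one blank line (two NL tokens), and a block counts as a footer when it starts with the equivalent of WORD COL. Unlike the grammar, it does not insist that a body starts with two WORD tokens.

```python
import re

# Approximates the lexer: WORD is ~[: \t\r\n]+, spaces are skipped, COL is ':'.
WORD_COL = re.compile(r"[ \t]*[^: \t\r\n]+[ \t]*:")
# A blank line, i.e. two consecutive NL tokens.
BLANK_LINE = re.compile(r"[ \t]*\r?\n[ \t]*\r?\n[ \t]*")

def split_commit(message):
    """Return (header, body, footer); body and footer are None when absent."""
    parts = BLANK_LINE.split(message.strip())
    header, rest = parts[0], parts[1:]
    body = footer = None
    # A block that does not start with WORD COL is taken to be the body.
    if rest and not WORD_COL.match(rest[0]):
        body = rest.pop(0)
    # A block that starts with WORD COL is taken to be the footer.
    if rest and WORD_COL.match(rest[0]):
        footer = rest.pop(0)
    if rest:
        raise ValueError("unexpected block: %r" % rest[0])
    return header, body, footer

print(split_commit("fix: typo\n\nSigned-off: john.doe@some.domain.com"))
```

The trade-off is the same as in the grammar: a body whose first line begins with a word directly followed by a colon (for example "Note: ...") is treated as a footer.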
Parsing the input (body + footer):
fix(some_module): this is a commit description

Some more in-depth description of what was fixed: this
can be a multi-line text, not only a one-liner.

Signed-off: john.doe@some.domain.com
Another-Key: another value with : (colon)
Some-Other-Key: some other value
result:
Parsing the input (only body):
fix(some_module): this is a commit description

Some more in-depth description of what was fixed: this
can be a multi-line text, not only a one-liner.
result:
Parsing the input (only footer):
fix(some_module): this is a commit description

Signed-off: john.doe@some.domain.com
Another-Key: another value with : (colon)
Some-Other-Key: some other value
result: