I have the thankless task of fixing a bug in an old ANTLR 2 parser that is used to parse EDIFACT files. Unfortunately, I'm not very familiar with ANTLR 2, or with parsers in general, and I cannot get it to work.
The EDIFACT files look like this:
ABC+Name+Surname+zip+city+street+country+1961219++0037141008'
XYZ+Company+++XYZ+zip+street'
LMN+20081010+1100'
There are several different segments, each starting with a keyword such as XYZ or ABC. The keyword is followed by attribute values, all separated by a '+'. An attribute value may be empty. Each segment ends with a '.
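To make the format concrete, here is a rough sketch in plain Java that splits one segment into its keyword and attribute values. It is not part of the parser: the class name SegmentSketch and the method split are made up for illustration, and I am assuming that '?' escapes the following character, as the ESCAPED rule in the lexer suggests.

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentSketch {
    // Splits one segment into its parts: index 0 is the keyword,
    // the remaining entries are the attribute values (possibly empty).
    public static List<String> split(String segment) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < segment.length(); i++) {
            char c = segment.charAt(i);
            if (c == '?' && i + 1 < segment.length()) {
                // Assumed escape handling: take the next character literally.
                current.append(segment.charAt(++i));
            } else if (c == '+') {
                // Element separator: close the current part.
                parts.add(current.toString());
                current.setLength(0);
            } else if (c == '\'') {
                // Segment terminator: close the last part and stop.
                parts.add(current.toString());
                return parts;
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString()); // segment without a terminator
        return parts;
    }

    public static void main(String[] args) {
        // prints [XYZ, Company, , , XYZ, zip, street]
        System.out.println(split("XYZ+Company+++XYZ+zip+street'"));
    }
}
```

Note that the fifth entry is the plain value XYZ, which has the same text as the segment keyword at index 0.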
The problem is that whenever an attribute value contains a keyword (in the example below, a value that is exactly XYZ), the parser throws an error:
unexpected token: XYZ
XYZ+Company+++XYZ+zip+street'
This is an excerpt from the grammar file:
// $ANTLR 2.7.6
xyz: "XYZ" ELT_SEP!
(xyz1_1a:ANUM|xyz1_1b:NUM) {lq(90,xyz1_1a,xyz1_1b,"XYZ1-1"+LQ90)}? ELT_SEP!
(xyz1_2a:ANUM|xyz1_2b:NUM)? {lq_(90,xyz1_2a,xyz1_2b,"XYZ1-2"+LQ90)}? ELT_SEP!
(xyz1_3a:ANUM|xyz1_3b:NUM)? {lq_(90,xyz1_3a,xyz1_3b,"XYZ1-3"+LQ90)}? ELT_SEP!
(xyz2a:ANUM|xyz2b:NUM)? {lq_(3,xyz2a,xyz2b,"XYZ2"+LQ3)}? ELT_SEP!
(xyz3a:ANUM|xyz3b:NUM)? {lq_(6,xyz3a,xyz3b,"XYZ3"+LQ6)}? ELT_SEP!
(xyz4a:ANUM|xyz4b:NUM) {lq(30,xyz4a,xyz4b,"XYZ4"+LQ30)}?
(ELT_SEP! (xyz5a:ANUM|xyz5b:NUM)?)? {lq_(46,xyz5a,xyz5b,"XYZ5"+LQ46)}? SEG_TERM!
{
if (skipNachricht()) return;
Xyz xyz = new Xyz();
xyz.xyz1_1 = getText(nn(xyz1_1a, xyz1_1b));
xyz.xyz1_2 = getText(nn(xyz1_2a, xyz1_2b));
xyz.xyz1_3 = getText(nn(xyz1_3a, xyz1_3b));
xyz.xyz2 = getText(nn(xyz2a, xyz2b));
xyz.xyz3 = getText(nn(xyz3a, xyz3b));
xyz.xyz4 = getText(nn(xyz4a, xyz4b));
xyz.xyz5 = getText(nn(xyz5a, xyz5b));
handleXyz(xyz);
}
;
/*
* Lexer
*/
class EdifactLexer extends Lexer;
options {
k=2;
filter=true;
charVocabulary = '\3'..'\377'; // Latin
}
DEZ_SEP: ','
{
//System.out.println("Found dez_sep: " + getText());
}
;
ELT_SEP: '+'
{
//System.out.println("Found elt_sep: " + getText());
}
;
SEG_TERM: '\''
{
// System.out.println("Found seg_term: " + getText());
}
;
NUM: (('0'..'9')+ (',' ('0'..'9')+)? ('+' | '\''))
=> ('0'..'9')+ (',' ('0'..'9')+)?
{
//System.out.println("num_: " + getText());
}
|
((ESCAPED | ~('?' | '+' | '\'' | ',' | '\r' | '\n'))+ )
=> ( ESCAPED | ~('?' | '+' | '\'' | ',' | '\r' | '\n'))+
{
$setType(ANUM);
//System.out.println("anum: " + getText());
}
|
(WRONGLY_ESCAPED) => WRONGLY_ESCAPED
{$setType(WRONGLY_ESCAPED); }
;
protected
WRONGLY_ESCAPED: '?' ~('?' | ':' | '+' | '\'' | ',')
{
//System.out.println("Found wrong_escaped: " + getText());
}
;
protected
ESCAPED: '?'
( ',' {$setText(","); }
| '?' {$setText("?"); }
| '\'' {$setText("'"); }
| ':' {$setText(":"); }
| '+' {$setText("+"); }
)
{
//System.out.println("Found escaped: " + getText());
}
;
NEWLINE : ( "\r\n" // DOS
| '\r' // MAC
| '\n' // Unix
)
{ newline();
$setType(Token.SKIP);
}
;
Any help is really appreciated :).
It might not be the best solution, but I finally found a way to solve my problem. So, if anybody stumbles upon a similar issue, this is my solution:
I wrote a method that changes the token type to ANUM if the type of the current token matches any of my keywords:
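// "Check keyword token": call this right before the parser expects an attribute value.
void ckt() throws TokenStreamException, SemanticException {
// mKeywordList holds the token types of all segment keywords.
if (mKeywordList.contains(LT(1).getType())) {
LT(1).setType(ANUM);
}
}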
The method is called in my parser rule before each attempt to match an ANUM token:
xyz: "XYZ" ELT_SEP!
{ckt();}(xyz1_1a:ANUM|xyz1_1b:NUM) {lq(90,xyz1_1a,xyz1_1b,"XYZ1-1"+LQ90)}? ELT_SEP!
{ckt();}(xyz1_2a:ANUM|xyz1_2b:NUM)? {lq_(90,xyz1_2a,xyz1_2b,"XYZ1-2"+LQ90)}? ELT_SEP!
{ckt();}(xyz1_3a:ANUM|xyz1_3b:NUM)? {lq_(90,xyz1_3a,xyz1_3b,"XYZ1-3"+LQ90)}? ELT_SEP!
{ckt();}(xyz2a:ANUM|xyz2b:NUM)? {lq_(3,xyz2a,xyz2b,"XYZ2"+LQ3)}? ELT_SEP!
{ckt();}(xyz3a:ANUM|xyz3b:NUM)? {lq_(6,xyz3a,xyz3b,"XYZ3"+LQ6)}? ELT_SEP!
{ckt();}(xyz4a:ANUM|xyz4b:NUM) {lq(30,xyz4a,xyz4b,"XYZ4"+LQ30)}?
(ELT_SEP! {ckt();}(xyz5a:ANUM|xyz5b:NUM)?)? {lq_(46,xyz5a,xyz5b,"XYZ5"+LQ46)}? SEG_TERM!
{
if (skipNachricht()) return;
Xyz xyz = new Xyz();
xyz.xyz1_1 = getText(nn(xyz1_1a, xyz1_1b));
xyz.xyz1_2 = getText(nn(xyz1_2a, xyz1_2b));
xyz.xyz1_3 = getText(nn(xyz1_3a, xyz1_3b));
xyz.xyz2 = getText(nn(xyz2a, xyz2b));
xyz.xyz3 = getText(nn(xyz3a, xyz3b));
xyz.xyz4 = getText(nn(xyz4a, xyz4b));
xyz.xyz5 = getText(nn(xyz5a, xyz5b));
handleXyz(xyz);
}
;
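To show the effect of ckt() in isolation, here is a small stand-alone Java sketch that does not depend on ANTLR. Everything in it is an assumption made for illustration: the token type numbers, the Tok class and the helpers retypeValues and sample are hypothetical, and in the real parser the types come from the generated token types and the tokens from LT(1). It only demonstrates the idea that a keyword-typed token directly after an element separator is turned back into an ANUM, while the keyword at the start of the segment is left alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeywordRetypeSketch {
    // Hypothetical token type numbers, not the ones ANTLR generates.
    public static final int ELT_SEP = 1, SEG_TERM = 2, ANUM = 3, KW_XYZ = 10, KW_ABC = 11;

    // Plays the role of mKeywordList: all token types that are segment keywords.
    public static final Set<Integer> KEYWORD_TYPES =
            new HashSet<>(Arrays.asList(KW_XYZ, KW_ABC));

    // Minimal stand-in for an ANTLR token: a mutable type plus the matched text.
    public static final class Tok {
        public int type;
        public final String text;
        public Tok(int type, String text) { this.type = type; this.text = text; }
    }

    // Same idea as ckt(): a keyword-typed token becomes an ordinary ANUM.
    public static void ckt(Tok t) {
        if (KEYWORD_TYPES.contains(t.type)) {
            t.type = ANUM;
        }
    }

    // Applies ckt() to every token in value position, i.e. right after an ELT_SEP.
    public static void retypeValues(List<Tok> tokens) {
        for (int i = 1; i < tokens.size(); i++) {
            if (tokens.get(i - 1).type == ELT_SEP) {
                ckt(tokens.get(i));
            }
        }
    }

    // Token stream for XYZ+Company+++XYZ+zip+street' with the value XYZ typed as a keyword.
    public static List<Tok> sample() {
        return new ArrayList<>(Arrays.asList(
                new Tok(KW_XYZ, "XYZ"), new Tok(ELT_SEP, "+"),
                new Tok(ANUM, "Company"), new Tok(ELT_SEP, "+"),
                new Tok(ELT_SEP, "+"), new Tok(ELT_SEP, "+"),
                new Tok(KW_XYZ, "XYZ"), new Tok(ELT_SEP, "+"),
                new Tok(ANUM, "zip"), new Tok(ELT_SEP, "+"),
                new Tok(ANUM, "street"), new Tok(SEG_TERM, "'")));
    }

    public static void main(String[] args) {
        List<Tok> tokens = sample();
        retypeValues(tokens);
        for (Tok t : tokens) {
            System.out.println(t.type + " " + t.text);
        }
    }
}
```

In the grammar above the same effect is achieved by placing {ckt();} in front of every (ANUM|NUM) alternative, so only tokens in value position are retyped.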