I'm working on writing a parser for an existing file format, and the trouble is that the syntax has no way of marking a comment: everything that is not a record is treated as a comment. Records start with an ampersand & (which must be the first character on a new line) and end with a slash /. The start of the record contains a 4-character string that indicates the type of record (e.g. &HEAD or &TIME), followed by parameters and their values. Records can span multiple lines, as demonstrated by the &HEAD record below.
The following code and grammar work fine for comments (COM) at the end of records and between records, and even correctly ignore the END_OF_FILE_COMMENTS thanks to \Z matching the end of the file.
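To make the behaviour of those two patterns concrete, here is a small sketch that uses only Python's re module, outside of textX. The sample strings and variable names are made up for illustration; only the two regular expressions are taken from the grammar.

```python
import re

# Same patterns as the COM and END_OF_FILE_COMMENTS rules in the grammar.
COM = re.compile(r"(?ms).*?[\n\r](?=&)")
END_OF_FILE_COMMENTS = re.compile(r"(?ms).*?[\n\r]\Z")

# Hypothetical text that follows the closing "/" of a record.
after_record = "Comments & Stuff\nMore & comments\n&TIME T_END=60/"

# COM stops only at a newline that is directly followed by "&",
# so an ampersand in the middle of a line stays part of the comment.
m = COM.match(after_record)
comment_text = m.group(0)
rest = after_record[m.end():]
print(repr(comment_text))
print(repr(rest))

# After the last record there is no line starting with "&", so COM fails
# and the \Z-anchored pattern takes the remainder of the input instead.
trailing = "Comments & Stuff\nLast line\n"
no_more_records = COM.match(trailing)
eof = END_OF_FILE_COMMENTS.match(trailing)
print(no_more_records)
print(repr(eof.group(0)))
```

The lookahead (?=&) is what keeps the next record's ampersand out of the comment, so the following record rule can still match it.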
I tried using \A to tell textX to ignore comments at the beginning of the file until it hits the first record. As demonstrated below, everything works fine if there is nothing before the first record.
import yaml
from textx import metamodel_from_str


def get_dict(elem) -> dict | list:
    """Convert textX model to lists and dicts for easy viewing."""
    if isinstance(elem, list):
        return [get_dict(i) for i in elem]
    if '_tx_attrs' not in elem.__dir__():
        return elem
    d = {
        attr: get_dict(getattr(elem, attr))
        for attr in elem._tx_attrs
        if getattr(elem, attr) is not None
    }
    if len(d) == 1 and list(d.keys())[0] == 'value':
        return list(d.values())[0]
    return d

grammar = """
Model: BEG_OF_FILE_COMMENTS? (
    head=HEAD?
    tail=TAIL?
    time=TIME?
)# END_OF_FILE_COMMENTS? ;
/*** Namelists ***/
HEAD: "&HEAD" (
    chid=CHID?
    title=TITLE?
)# "/" COM?;
TAIL: "&TAIL" "/" COM?;
TIME: "&TIME" (
    t_begin=T_BEGIN?
    t_end=T_END?
)# "/" COM?;
/*** Namelist Parameters ***/
CHID: 'CHID' '=' value=QUOTED_STR SEP; // HEAD
TITLE: 'TITLE' '=' value=QUOTED_STR SEP; // HEAD
T_BEGIN: 'T_BEGIN' '=' value=NUMBER SEP; // TIME
T_END: 'T_END' '=' value=NUMBER SEP; // TIME
/*** Generic ***/
SEP: ","?;
COM: /(?ms).*?[\n\r](?=&)/;
BEG_OF_FILE_COMMENTS: /(?ms)\A.*?[\n\r](?=&)/;
END_OF_FILE_COMMENTS: /(?ms).*?[\n\r]\Z/;
QUOTED_STR: /[\'\"](.*?)[\'\"]/;
"""
model_str = """
&HEAD CHID='PBD',
TITLE='Tools'/Comments & Stuff
Comments & Stuff
Comments & Stuff
&TIME T_BEGIN=0, T_END=60/Comments & Stuff
Comments & Stuff
Comments & Stuff
"""
meta = metamodel_from_str(grammar, use_regexp_group=True)
model = meta.model_from_str(model_str)
print(yaml.dump(get_dict(model)))
Output:
head:
  chid: PBD
  title: Tools
time:
  t_begin: 0
  t_end: 60
However, when comments are put before the first record, the model is entirely empty, despite the regex pattern working on regex101:
model_str = '''
Comments & Stuff
&HEAD CHID='PBD',
TITLE='Tools'/Comments & Stuff
Comments & Stuff
Comments & Stuff
&TIME T_BEGIN=0,
T_END=60/Comments & Stuff
Comments & Stuff
Comments & Stuff
'''
Output:
{}
Is something wrong with my Model or BEG_OF_FILE_COMMENTS syntax?
The Comment rule from language comments is tried between every two consecutive tokens, so it can't be used in this case: it would be tried inside records, and comments can be anything (whereas in this language a comment can't be part of a record).
BEG_OF_FILE_COMMENTS can't match \A because whitespace skipping is applied before the match, so you are already past the beginning of the input when the rule gets the chance to match. You can either use the noskipws modifier for the rule or remove that constraint altogether. However, removing the constraint lets the rule consume records; to prevent that, add [^&] at the beginning of the rule, like this:
BEG_OF_FILE_COMMENTS: /(?ms)[^&].*?[\n\r](?=&)/;
I've tested it both with and without a comment at the beginning. Also, note that you can always pass debug=True to (meta)model_from_str to see how matching is done during parsing.
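If you want to see the \A problem in isolation, here is a rough sketch using only Python's re module. It assumes the parser applies each regex at its current position through a compiled pattern's match(text, pos); that is my assumption about the internals, not something taken from the textX documentation. The sample text and variable names are made up.

```python
import re

# Hypothetical input: a leading newline, one comment line, then a record.
text = "\nComments & Stuff\n&HEAD CHID='PBD'/\n"

anchored = re.compile(r"(?ms)\A.*?[\n\r](?=&)")    # original rule
relaxed = re.compile(r"(?ms)[^&].*?[\n\r](?=&)")   # rule without \A

# Simulate whitespace skipping: move past the leading newline.
pos = len(text) - len(text.lstrip())

# \A only matches at the real start of the string, not at pos.
anchored_at_start = anchored.match(text, 0)
anchored_after_skip = anchored.match(text, pos)
print(anchored_at_start is not None)   # True
print(anchored_after_skip)             # None

# Without \A the comment is consumed up to the first record.
relaxed_after_skip = relaxed.match(text, pos)
print(repr(text[relaxed_after_skip.end():]))

# The leading [^&] stops the rule from swallowing a record
# when the input starts directly with one.
relaxed_on_record = relaxed.match("&HEAD CHID='PBD'/\nComment\n&TIME T_END=60/")
print(relaxed_on_record)               # None
```

This matches what happens in the grammar: the model string starts with a newline, so by the time the comment rule is tried the position is no longer 0 and the \A version can never succeed.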