python regex textx

textX Ignore Text Before First Record


I'm writing a parser for an existing file format. The trouble is that the syntax has no explicit comment marker: everything that is not a record is treated as a comment. Records start with an ampersand & (which must be the first character on a new line) and end with a slash /. The start of a record contains a 4-character string that indicates the record type (e.g. &HEAD or &TIME), followed by parameters and their values. Records can span multiple lines, as demonstrated by the &HEAD record below.
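To make the format concrete, here is a rough sketch using only Python's `re` module (not textX) of how record boundaries could be located. The pattern and the helper name `find_records` are illustrative assumptions, not part of the format's specification, and the sketch would break on a `/` inside a quoted value.

```python
import re

# Hypothetical pattern: '&' at the start of a line, a 4-character record
# type, then everything (possibly across lines) up to the first '/'.
RECORD = re.compile(r"^&(\w{4})(.*?)/", re.MULTILINE | re.DOTALL)


def find_records(text: str) -> list[tuple[str, str]]:
    """Return (record type, raw parameter text) pairs; other text is ignored."""
    return [(m.group(1), m.group(2).strip()) for m in RECORD.finditer(text)]


sample = """Comments & Stuff
&HEAD CHID='PBD',
      TITLE='Tools'/Comments & Stuff
&TIME T_BEGIN=0, T_END=60/Comments & Stuff
"""

records = find_records(sample)
print(records)
```

An ampersand in the middle of a comment line is skipped because the pattern only accepts `&` at the start of a line.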

The Good

The following code and grammar work fine for comments (COM) at the end of records and between records, and they even correctly ignore the END_OF_FILE_COMMENTS thanks to \Z matching the end of the file.

I tried using \A to tell textX to ignore comments at the beginning of the file until it hits the first record. As demonstrated below, everything works fine if there is nothing before the first record.

import yaml
from textx import metamodel_from_str


def get_dict(elem) -> dict | list:
    """Convert textX model to lists and dicts for easy viewing."""
    if isinstance(elem, list):
        return [get_dict(i) for i in elem]
    if '_tx_attrs' not in elem.__dir__():
        return elem

    d = {
        attr: get_dict(getattr(elem, attr))
        for attr in elem._tx_attrs
        if getattr(elem, attr) is not None
    }
    if len(d) == 1 and list(d.keys())[0] == 'value':
        return list(d.values())[0]
    return d


# Raw string so that regex escapes such as \A and \Z reach textX unchanged.
grammar = r"""
Model: BEG_OF_FILE_COMMENTS? (
    head=HEAD?
    tail=TAIL?
    time=TIME?
)# END_OF_FILE_COMMENTS? ;

/*** Namelists ***/
HEAD: "&HEAD" (
    chid=CHID?
    title=TITLE?
)# "/" COM?;

TAIL: "&TAIL" "/" COM?;

TIME: "&TIME" (
    t_begin=T_BEGIN?
    t_end=T_END?
)# "/" COM?;

/*** Namelist Parameters ***/
CHID: 'CHID' '=' value=QUOTED_STR SEP; // HEAD
TITLE: 'TITLE' '=' value=QUOTED_STR SEP; // HEAD

T_BEGIN: 'T_BEGIN' '=' value=NUMBER SEP; // TIME
T_END: 'T_END' '=' value=NUMBER SEP; // TIME

/*** Generic ***/
SEP: ","?;
COM: /(?ms).*?[\n\r](?=&)/;
BEG_OF_FILE_COMMENTS: /(?ms)\A.*?[\n\r](?=&)/;
END_OF_FILE_COMMENTS: /(?ms).*?[\n\r]\Z/;
QUOTED_STR: /[\'\"](.*?)[\'\"]/;
"""

model_str = """
&HEAD CHID='PBD', 
      TITLE='Tools'/Comments & Stuff
Comments & Stuff
Comments & Stuff
&TIME T_BEGIN=0, T_END=60/Comments & Stuff
Comments & Stuff
Comments & Stuff
"""

meta = metamodel_from_str(grammar, use_regexp_group=True)
model = meta.model_from_str(model_str)

print(yaml.dump(get_dict(model)))

Output:

head:
  chid: PBD
  title: Tools
time:
  t_begin: 0
  t_end: 60

The Bad

However, when comments are placed before the first record, the resulting model is completely empty, even though the regex pattern works on regex101:

model_str = '''
Comments & Stuff
&HEAD CHID='PBD', 
      TITLE='Tools'/Comments & Stuff
Comments & Stuff
Comments & Stuff
&TIME T_BEGIN=0, 
      T_END=60/Comments & Stuff
Comments & Stuff
Comments & Stuff
'''

Output:

{}

The Questions

  1. Is there a way to make this work with language comments?
  2. If the answer to the first question is "no", what am I missing in my Model or BEG_OF_FILE_COMMENTS syntax?

Solution

    1. The Comment rule used for language comments is tried between every two consecutive tokens, so it can't be used in this case: it would also be tried inside records, and a comment in this format can be anything (whereas in this language a comment can't be part of a record).
    2. The problem is that BEG_OF_FILE_COMMENTS can't match \A here: whitespace skipping is applied before the match is attempted, so the parser is already past the beginning of the input when the rule gets the chance to match. You can either use the noskipws modifier for the rule or remove the \A constraint altogether. Without that constraint, however, the rule would also consume records. To prevent that, add [^&] at the beginning of the pattern, like this:
    BEG_OF_FILE_COMMENTS: /(?ms)[^&].*?[\n\r](?=&)/;
    

    I've tested it both with and without a comment at the beginning. Also, note that you can always pass debug=True to (meta)model_from_str to see how matching is done during parsing.
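
The effect of whitespace skipping on \A can be reproduced with Python's `re` module alone. This is only a sketch, not textX itself: it assumes that a regex match is attempted with `pattern.match(text, pos)` at the current parse position, and that the leading newline of the input has already been skipped (so `pos` is 1). The shortened `text` and the variable names are made up for illustration.

```python
import re

# Shortened version of the failing input: a comment precedes the first record.
text = (
    "\nComments & Stuff\n"
    "&HEAD CHID='PBD'/Comments & Stuff\n"
    "&TIME T_END=60/\n"
)

# Assumption: the leading newline was skipped as whitespace before the
# rule is tried, so matching starts at offset 1 instead of 0.
pos = 1

anchored = re.compile(r"(?ms)\A.*?[\n\r](?=&)")    # original rule
unguarded = re.compile(r"(?ms).*?[\n\r](?=&)")     # \A simply removed
guarded = re.compile(r"(?ms)[^&].*?[\n\r](?=&)")   # rule from the answer

# \A only matches at the true start of the string, so this fails at pos 1.
anchored_match = anchored.match(text, pos)

# The guarded rule consumes the comment up to the first record.
guarded_match = guarded.match(text, pos)

# At the start of a record, the unguarded rule swallows the record itself,
# while the [^&] guard makes the rule fail and leaves the record alone.
record_pos = text.index("&HEAD")
unguarded_at_record = unguarded.match(text, record_pos)
guarded_at_record = guarded.match(text, record_pos)

print(anchored_match)
print(repr(guarded_match.group()))
print(repr(unguarded_at_record.group()))
print(guarded_at_record)
```

Under these assumptions, the first and last matches are `None`, which mirrors why the original rule never fires and why the guarded rule leaves records for HEAD and TIME to parse.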