python regex grammar ebnf lark-parser

LALR Grammar for transforming text to csv


I have a processor trace output that has the following format:

Time    Cycle   PC  Instr   Decoded instruction Register and memory contents
    905ns              86 00000e36 00a005b3 c.add            x11,  x0, x10       x11=00000e5c x10:00000e5c
    915ns              87 00000e38 00000693 c.addi           x13,  x0, 0         x13=00000000
    925ns              88 00000e3a 00000613 c.addi           x12,  x0, 0         x12=00000000
    935ns              89 00000e3c 00000513 c.addi           x10,  x0, 0         x10=00000000
    945ns              90 00000e3e 2b40006f c.jal             x0, 692           
    975ns              93 000010f2 0d01a703 lw               x14, 208(x3)        x14=00002b20  x3:00003288  PA:00003358
    985ns              94 000010f6 00a00333 c.add             x6,  x0, x10        x6=00000000 x10:00000000
    995ns              95 000010f8 14872783 lw               x15, 328(x14)       x15=00000000 x14:00002b20  PA:00002c68
   1015ns              97 000010fc 00079563 c.bne            x15,  x0, 10        x15:00000000

Allegedly, this is tab-separated, but that is not the case: inline spaces appear here and there. I want to transform it into .csv format, with a header row followed by the entries. For example:

Time,Cycle,PC,Instr,Decoded instruction,Register and memory contents
905ns,86,00000e36,00a005b3,"c.add x11, x0, x10", x11=00000e5c x10:00000e5c
915ns,87,00000e38,00000693,"c.addi x13, x0, 0", x13=00000000
...
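
The target layout above can be produced with Python's standard csv module, which quotes a field only when it contains the separator. A minimal sketch, assuming each entry has already been split into its six columns; rows_to_csv and example_rows are hypothetical names used only for illustration:

```python
import csv
import io

HEADER = ["Time", "Cycle", "PC", "Instr",
          "Decoded instruction", "Register and memory contents"]


def rows_to_csv(rows):
    """Serialize the header plus already-split rows into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical, hand-split entries taken from the trace above.
example_rows = [
    ["905ns", "86", "00000e36", "00a005b3",
     "c.add x11, x0, x10", "x11=00000e5c x10:00000e5c"],
    ["945ns", "90", "00000e3e", "2b40006f", "c.jal x0, 692", ""],
]

csv_text = rows_to_csv(example_rows)
print(csv_text)
```

Only the decoded-instruction column gets quoted here, because it is the only one containing commas; an empty last column simply leaves a trailing separator.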

To do that, I am using Lark in Python 3 (>=3.10), and I came up with the following grammar for the source format:

Lark Grammar

start: header NEWLINE entries+

# Header is expected to be 
# Time\tCycle\tPC\tInstr\tDecoded instruction\tRegister and memory contents
header: HEADER_FIELD+
         

# Entries are expected to be e.g.,
#     85ns               4 00000180 00003197 auipc             x3, 0x3000          x3=00003180
entries: TIME                \
         CYCLE               \
         PC                  \
         INSTR               \
         DECODED_INSTRUCTION \
         reg_and_mem? NEWLINE

reg_and_mem: REG_AND_MEM+ 

///////////////
// TERMINALS //
///////////////

HEADER_FIELD: /
    [a-z ]+  # Characters that are optionally separated by a single space
/xi          

TIME: /
    [\d\.]+    # One or more digits (a decimal point is allowed)
    [smunp]s   # Time unit
/x

CYCLE: INT

PC: HEXDIGIT+

INSTR: HEXDIGIT+

DECODED_INSTRUCTION: /
    [a-z\.]+             # Instruction mnemonic
    ([-a-z0-9, ()]+)?    # Optional operand part (rd,rs1,rs2, etc.)      
    (?=                  # Stop when 
        x[0-9]{1,2}[=:]  # Either you hit an xN= or xN:
        |PA:             # or you meet PA:
        |\s+$            # or there is no REG_AND_MEM and you meet a \n
    )
/xi


REG_AND_MEM: /
    (?:x[0-9]+|PA)   # Register name (xN) or PA
    [=:]             # Separator: = or :
    [0-9a-f]+        # Hexadecimal value
/xi

///////////////
// IMPORTS   //
///////////////

%import common.HEXDIGIT
%import common.NUMBER
%import common.INT
%import common.UCASE_LETTER
%import common.CNAME
%import common.WS_INLINE
%import common.WS
%import common.NEWLINE

///////////////
// IGNORE    //
///////////////

%ignore WS_INLINE

Here is my simple driver code:

import lark


class TraceTransformer(lark.Transformer):

    def start(self, args):
        return lark.Discard

    def header(self, fields):

        return [str(field) for field in fields]

    def entries(self, args):
        print(args)
        ...

# The grammar provided above, stored as grammar.lark
# in the same directory as this file.
with open("grammar.lark") as grammar_file:
    parser = lark.Lark(grammar=grammar_file.read(),
                       start="start",
                       parser="lalr",
                       transformer=TraceTransformer())

# This is parsed by the grammar without problems.
# Note that I omit the operand part of the third
# entry (c.addi) and it is still parsed. This is OK,
# as some mnemonics have no operands (e.g., fence).
dummy_text_ok1 = r"""Time    Cycle   PC  Instr   Decoded instruction Register and memory contents
    905ns              86 00000e36 00a005b3 c.add            x11,  x0, x10       x11=00000e5c x10:00000e5c
    915ns              87 00000e38 00000693 c.addi           x13,  x0, 0         x13=00000000
    925ns              88 00000e3a 00000613 c.addi                  x12=00000000
    935ns              89 00000e3c 00000513 c.addi           x10,  x0, 0         x10=00000000"""

# Here the trouble starts. Note that the jump
# instruction has no REG_AND_MEM part.
# However, this is still parsed with no errors.
dummy_text_ok2 = r"""Time    Cycle   PC  Instr   Decoded instruction Register and memory
945ns              90 00000e3e 2b40006f c.jal             x0, 692
"""

# But here, when the parser meets the c.jal line,
# which has no REG_AND_MEM part and is followed by
# another entry, we have an issue.
dummy_text_problematic = r"""Time    Cycle   PC  Instr   Decoded instruction Register and memory contents
    905ns              86 00000e36 00a005b3 c.add            x11,  x0, x10       x11=00000e5c x10:00000e5c
    915ns              87 00000e38 00000693 c.addi           x13,  x0, 0         x13=00000000
    925ns              88 00000e3a 00000613 c.addi           x12,  x0, 0         x12=00000000
    935ns              89 00000e3c 00000513 c.addi           x10,  x0, 0         x10=00000000
    945ns              90 00000e3e 2b40006f c.jal             x0, 692           
    975ns              93 000010f2 0d01a703 lw               x14, 208(x3)        x14=00002b20  x3:00003288  PA:00003358
    985ns              94 000010f6 00a00333 c.add             x6,  x0, x10        x6=00000000 x10:00000000
    995ns              95 000010f8 14872783 lw               x15, 328(x14)       x15=00000000 x14:00002b20  PA:00002c68
   1015ns              97 000010fc 00079563 c.bne            x15,  x0, 10        x15:00000000
"""

parser.parse(dummy_text_ok1) 
parser.parse(dummy_text_ok2)
parser.parse(dummy_text_problematic) 

The Runtime Error

No terminal matches 'c' in the current parser context, at line 6 col 45

945ns              90 00000e3e 2b40006f c.jal             x0, 692                                        
                                         ^
Expected one of:
        * DECODED_INSTRUCTION

So this indicates that the DECODED_INSTRUCTION rule is not behaving as expected.

The Rule

DECODED_INSTRUCTION: /
    [a-z\.]+             # Instruction mnemonic
    ([-a-z0-9, ()]+)?    # Optional operand part (rd,rs1,rs2, etc.)      
    (?=                  # Stop when 
        x[0-9]{1,2}[=:]  # Either you hit an xN= or xN:
        |PA:             # or you meet PA:
        |\s+$            # or there is no REG_AND_MEM and you meet a \n
    )
/xi

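This rule is really heavy: it has to match the whole ISA of the processor, which is RISC-V, by the way. So, step by step, I have:

  • [a-z\.]+ for the instruction mnemonic
  • ([-a-z0-9, ()]+)? for the optional operand part (rd, rs1, rs2, etc.)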

Now, this was tricky. Instead of accounting for every possible instruction variation in my rules above, I thought to leverage the fact that the following column (Register and memory contents) contains characters that do not exist in any instruction variation of the ISA. This is where the look-ahead part of the regex comes into play. I stop when:

  • I hit an xN= or xN: token, or
  • I meet PA:, or
  • there is no REG_AND_MEM part and I meet a \n

However, the last case does not seem to work as intended, as the example above shows. As I see it, it should be fine to stop either when one of the first two criteria is met or when a new line is encountered (implying that the following column is omitted for the current entry). Did I blunder something in the regex part?
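
The look-ahead idea can be tried outside Lark with Python's re module on one trace line at a time. This is a minimal sketch, assuming the same pattern as DECODED_INSTRUCTION; split_decoded is a hypothetical helper, and the single-line inputs are an assumption that does not reproduce the multi-line input the parser sees:

```python
import re

DECODED = re.compile(
    r"""
    [a-z.]+              # instruction mnemonic
    ([-a-z0-9, ()]+)?    # optional operand part
    (?=                  # stop right before
        x[0-9]{1,2}[=:]  #   an xN= or xN: token
        |PA:             #   or a PA: token
        |\s+$            #   or trailing whitespace at the end
    )
    """,
    re.VERBOSE | re.IGNORECASE,
)


def split_decoded(line):
    """Return (decoded instruction, remaining text) for one trace column pair."""
    match = DECODED.match(line)
    if match is None:
        raise ValueError(f"no decoded instruction in {line!r}")
    # Collapse the column padding into single spaces.
    decoded = " ".join(match.group(0).split())
    return decoded, line[match.end():].strip()


samples = [
    "c.add            x11,  x0, x10       x11=00000e5c x10:00000e5c",
    "lw               x14, 208(x3)        x14=00002b20  x3:00003288  PA:00003358",
    "c.jal             x0, 692           ",
]
results = [split_decoded(sample) for sample in samples]
for decoded, rest in results:
    print(repr(decoded), repr(rest))
```

The operand group is greedy, so the engine first consumes as much as the character class allows and then backtracks until one of the look-ahead alternatives succeeds.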


Solution

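  • For $ to mean end-of-line, you need to add the m flag, i.e. MULTILINE. Without it, $ matches only at the very end of the input (or just before a final newline), which is why the c.jal entry parses when it is the last line but not when another entry follows it.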

    DECODED_INSTRUCTION: /
        ...
    /xim
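
The difference can be reproduced with Python's re module alone. This is a small sketch, assuming a shortened version of the pattern that keeps only the \s+$ look-ahead alternative; the two input strings are hypothetical reductions of the trace lines:

```python
import re

# Mnemonic, optional operands, then the "end of line" look-ahead only.
PATTERN = r"[a-z.]+([-a-z0-9, ()]+)?(?=\s+$)"

last_line = "c.jal x0, 692\n"
followed = "c.jal x0, 692   \n975ns 93 000010f2 0d01a703 lw x14, 208(x3)\n"

# Default mode: $ matches only at the end of the whole string
# (or just before a final newline).
default_last = re.match(PATTERN, last_line, re.I)
default_followed = re.match(PATTERN, followed, re.I)

# MULTILINE mode: $ also matches before every newline.
multiline_followed = re.match(PATTERN, followed, re.I | re.M)

print(default_last.group(0))      # the entry is the last line: match
print(default_followed)           # another entry follows: None
print(multiline_followed.group(0).rstrip())
```

The second call fails for the same reason the parser reports that no terminal matches: the look-ahead can never reach the end of the input from the middle of the text.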