I have a processor trace output that has the following format:
Time Cycle PC Instr Decoded instruction Register and memory contents
905ns 86 00000e36 00a005b3 c.add x11, x0, x10 x11=00000e5c x10:00000e5c
915ns 87 00000e38 00000693 c.addi x13, x0, 0 x13=00000000
925ns 88 00000e3a 00000613 c.addi x12, x0, 0 x12=00000000
935ns 89 00000e3c 00000513 c.addi x10, x0, 0 x10=00000000
945ns 90 00000e3e 2b40006f c.jal x0, 692
975ns 93 000010f2 0d01a703 lw x14, 208(x3) x14=00002b20 x3:00003288 PA:00003358
985ns 94 000010f6 00a00333 c.add x6, x0, x10 x6=00000000 x10:00000000
995ns 95 000010f8 14872783 lw x15, 328(x14) x15=00000000 x14:00002b20 PA:00002c68
1015ns 97 000010fc 00079563 c.bne x15, x0, 10 x15:00000000
Allegedly, this is \t-separated; however, this is not the case, as inline spaces are found here and there. I want to transform this into a .csv format with a header row followed by the entries. For example:
Time,Cycle,PC,Instr,Decoded instruction,Register and memory contents
905ns,86,00000e36,00a005b3,"c.add x11, x0, x10", x11=00000e5c x10:00000e5c
915ns,87,00000e38,00000693,"c.addi x13, x0, 0", x13=00000000
...
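As a minimal sketch of that target format, assuming each entry has already been split into its six fields (the rows list below is hypothetical sample data, not parser output), Python's csv.writer quotes only the fields that contain a comma:

```python
import csv
import io

# Hypothetical, already-split rows: six fields per trace entry.
rows = [
    ["Time", "Cycle", "PC", "Instr", "Decoded instruction",
     "Register and memory contents"],
    ["905ns", "86", "00000e36", "00a005b3", "c.add x11, x0, x10",
     "x11=00000e5c x10:00000e5c"],
    # An entry without register/memory contents gets an empty last field.
    ["945ns", "90", "00000e3e", "2b40006f", "c.jal x0, 692", ""],
]

buffer = io.StringIO()
# The default dialect uses minimal quoting, so only fields containing
# the delimiter (here: the decoded instruction) are wrapped in quotes.
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```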
To do that, I am using Lark in Python 3 (>=3.10), and I came up with the following grammar for the source format:
start: header NEWLINE entries+
# Header is expected to be
# Time\tCycle\tPC\tInstr\tDecoded instruction\tRegister and memory contents
header: HEADER_FIELD+
# Entries are expected to be e.g.,
# 85ns 4 00000180 00003197 auipc x3, 0x3000 x3=00003180
entries: TIME \
CYCLE \
PC \
INSTR \
DECODED_INSTRUCTION \
reg_and_mem? NEWLINE
reg_and_mem: REG_AND_MEM+
///////////////
// TERMINALS //
///////////////
HEADER_FIELD: /
[a-z ]+ # Characters that are optionally separated by a single space
/xi
TIME: /
[\d\.]+ # One or more digits or dots
[smunp]s # Time unit
/x
CYCLE: INT
PC: HEXDIGIT+
INSTR: HEXDIGIT+
DECODED_INSTRUCTION: /
[a-z\.]+ # Instruction mnemonic
([-a-z0-9, ()]+)? # Optional operand part (rd,rs1,rs2, etc.)
(?= # Stop when
x[0-9]{1,2}[=:] # Either you hit an xN= or xN:
|PA: # or you meet PA:
|\s+$ # or there is no REG_AND_MEM and you meet a \n
)
/xi
REG_AND_MEM: /
(?:x[0-9]+|PA)
[=:]
[0-9a-f]+
/xi
///////////////
// IMPORTS //
///////////////
%import common.HEXDIGIT
%import common.NUMBER
%import common.INT
%import common.UCASE_LETTER
%import common.CNAME
%import common.WS_INLINE
%import common.WS
%import common.NEWLINE
///////////////
// IGNORE //
///////////////
%ignore WS_INLINE
Here is my simple driver code:
import lark
class TraceTransformer(lark.Transformer):
    def start(self, args):
        return lark.Discard

    def header(self, fields):
        return [str(field) for field in fields]

    def entries(self, args):
        print(args)
        ...

# the grammar provided above
# stored in the same directory
# as this file
parser = lark.Lark(grammar=open("grammar.lark").read(),
                   start="start",
                   parser="lalr",
                   transformer=TraceTransformer())
# This is parsed by the grammar without problems.
# Note that I omit the operand part from one c.addi
# and it's still parsed. This is OK, as some
# mnemonics do not have operands (e.g., fence).
dummy_text_ok1 = r"""Time Cycle PC Instr Decoded instruction Register and memory contents
905ns 86 00000e36 00a005b3 c.add x11, x0, x10 x11=00000e5c x10:00000e5c
915ns 87 00000e38 00000693 c.addi x13, x0, 0 x13=00000000
925ns 88 00000e3a 00000613 c.addi x12=00000000
935ns 89 00000e3c 00000513 c.addi x10, x0, 0 x10=00000000"""
# Now here starts trouble. Note that here we don't
# have a REG_AND_MEM part on the jump instruction.
# However this is still parsed with no errors.
dummy_text_ok2 = r"""Time Cycle PC Instr Decoded instruction Register and memory
945ns 90 00000e3e 2b40006f c.jal x0, 692
"""
# But here, when the parser meets the c.jal line,
# where there is no REG_AND_MEM part and a follow-up
# entry exists, we have an issue.
dummy_text_problematic = r"""Time Cycle PC Instr Decoded instruction Register and memory contents
905ns 86 00000e36 00a005b3 c.add x11, x0, x10 x11=00000e5c x10:00000e5c
915ns 87 00000e38 00000693 c.addi x13, x0, 0 x13=00000000
925ns 88 00000e3a 00000613 c.addi x12, x0, 0 x12=00000000
935ns 89 00000e3c 00000513 c.addi x10, x0, 0 x10=00000000
945ns 90 00000e3e 2b40006f c.jal x0, 692
975ns 93 000010f2 0d01a703 lw x14, 208(x3) x14=00002b20 x3:00003288 PA:00003358
985ns 94 000010f6 00a00333 c.add x6, x0, x10 x6=00000000 x10:00000000
995ns 95 000010f8 14872783 lw x15, 328(x14) x15=00000000 x14:00002b20 PA:00002c68
1015ns 97 000010fc 00079563 c.bne x15, x0, 10 x15:00000000
"""
parser.parse(dummy_text_ok1)
parser.parse(dummy_text_ok2)
parser.parse(dummy_text_problematic)
No terminal matches 'c' in the current parser context, at line 6 col 45
945ns 90 00000e3e 2b40006f c.jal x0, 692
^
Expected one of:
* DECODED_INSTRUCTION
So this indicates that the DECODED_INSTRUCTION rule is not behaving as expected.
DECODED_INSTRUCTION: /
[a-z\.]+ # Instruction mnemonic
([-a-z0-9, ()]+)? # Optional operand part (rd,rs1,rs2, etc.)
(?= # Stop when
x[0-9]{1,2}[=:] # Either you hit an xN= or xN:
|PA: # or you meet PA:
|\s+$ # or there is no REG_AND_MEM and you meet a \n
)
/xi
This rule is really heavy: it has to match the whole ISA of the processor, which is RISC-V, by the way. So here, step by step, I have the instruction mnemonic, then an optional operand part, then a look-ahead.
Now, this was tricky. Instead of accounting for every possible instruction variation in my rules above, I thought to leverage the fact that there exist characters in the following column (Register and memory contents) which do not exist in any instruction variation of the ISA. This is where the look-ahead part of the regex comes into play. I stop when I hit an xN= or xN:, when I meet PA:, or when I reach the end of the line ($), as the field does not exist.
However, the last case does not seem to work as intended, as shown in the above example. The way I see it, it seems OK to stop either when one of the first two criteria is met, or when a new line is encountered (implying that the following part is omitted for the current entry). Did I blunder something in the regex part?
For $ to mean end-of-line, you need to add the m (i.e. MULTILINE) flag:
DECODED_INSTRUCTION: /
...
/xim
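To see the difference outside of Lark, here is a minimal sketch using Python's re module directly. The pattern is the one from the question; the two sample strings, and in particular the trailing space before the newline in the first one, are assumptions about what the real trace looks like.

```python
import re

# DECODED_INSTRUCTION pattern from the question, without the slashes.
PATTERN = r"""
    [a-z\.]+               # Instruction mnemonic
    ([-a-z0-9, ()]+)?      # Optional operand part
    (?=                    # Stop when
        x[0-9]{1,2}[=:]    # you hit an xN= or xN:
        |PA:               # or you meet PA:
        |\s+$              # or only whitespace is left on the line
    )
"""

# Assumed sample input: a c.jal entry followed by another entry.
TRAILING_SPACE = "c.jal x0, 692 \nlw x14, 208(x3) x14=00002b20\n"
NO_TRAILING_SPACE = "c.jal x0, 692\nlw x14, 208(x3) x14=00002b20\n"

without_m = re.compile(PATTERN, re.X | re.I)
with_m = re.compile(PATTERN, re.X | re.I | re.M)
# Hypothetical variation: allow zero whitespace characters before $.
relaxed = re.compile(PATTERN.replace(r"\s+$", r"\s*$"), re.X | re.I | re.M)

# Without MULTILINE, $ only matches at the very end of the string
# (or just before a final newline), so the first entry is rejected.
print(without_m.match(TRAILING_SPACE))            # None

# With MULTILINE, $ also matches before every newline.
print(with_m.match(TRAILING_SPACE).group(0))      # c.jal x0, 692

# \s+ still needs at least one whitespace character before the line end.
print(with_m.match(NO_TRAILING_SPACE))            # None
print(relaxed.match(NO_TRAILING_SPACE).group(0))  # c.jal x0, 692
```

Note that \s+$ still requires at least one whitespace character before the end of the line, so an entry that ends directly after its last operand is not matched even with the m flag. If the trace contains such lines, relaxing the last alternative to \s*$ is one possible way to cover them.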