I want to parse files in the Textile markup language (https://textile-lang.com/) in order to convert them to LaTeX. The file I have is a slight extension of Textile, since it adds fields and footnotes. An example file is given below.
test.textile
#[contents]#
p. This is a paragraph.
With some bullet points:
* Bullet 1
* Bullet 2
* Bullet 3
And a code block:
bc.. # Program to display the Fibonacci sequence up to n-th term
"""\
* Program to display the Fibonacci sequence up to n-th term
"""
search_string = r"<5>"
nterms = int(input("How many terms? "))
# first two terms
n1, n2 = 0, 1
count = 0
p. And after the block of code another paragraph, with a footnote<1>.
bc. fn1. This is the footnote contents
p. And here is another paragraph
#[second_field]#
Some more contents
To parse the file I have the following parser.
parser.py
from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements')

textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree)
And the following grammar. The grammar does not handle bullet points and footnotes yet, because I run into another problem before getting that far.
grammar.lark
elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: FIELD_NAME
content: contents*
?contents: paragraph
| code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? STR
FIELD_START: /(\A|[\r\n]{2,})#\[/
FIELD_NAME: /[^\]]+/
FIELD_END: /\]#[\r\n]/
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
STR: /.+/s
When I run the parser, I get the following output.
output
Tree(Token('RULE', 'elements'), [
Tree(Token('RULE', 'element'), [
Tree(Token('RULE', 'field'), [
Token('FIELD_START', '#['),
Tree(Token('RULE', 'field_name'), [
Token('FIELD_NAME', 'contents')]),
Token('FIELD_END', ']#\n')]),
Tree(Token('RULE', 'content'), [
Tree(Token('RULE', 'paragraph'), [
Token('PARAGRAPH_START', 'p. '),
Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\nbc.. # Program to display the Fibonacci sequence up to n-th term\n\n"""\\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"<5>"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\np. And after the block of code another paragraph, with a footnote<1>.\n\nbc. fn1. This is the footnote contents\n\np. And here is another paragraph\n\n#[second_field]#\nSome more contents\n')])])])])
The whole rest of the file is parsed as a single paragraph, which is sort of correct, since /.+/s can match anything. So I change the definition of STR to /.+?/s to make it non-greedy, but now the output is as follows (pretty printed):
output
elements
element
field
#[
field_name contents
]#
content
paragraph
p.
T
paragraph h
paragraph i
paragraph s
paragraph
paragraph i
paragraph s
paragraph
paragraph a
paragraph
--snip--
paragraph
paragraph #
paragraph [
paragraph s
paragraph e
paragraph c
paragraph o
paragraph n
paragraph d
paragraph _
paragraph f
paragraph i
paragraph e
paragraph l
paragraph d
paragraph ]
paragraph #
It parses each character as its own paragraph, and it still parses the whole file as paragraph elements.
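The behaviour of the two regexes is easy to reproduce with plain re, outside of Lark (a minimal sketch):

import re

text = "p. This is a paragraph.\n\nWith some bullet points:"
# Greedy: the match extends to the end of the input.
print(re.match(r"(?s).+", text).group())   # the whole string
# Non-greedy: the match is as short as possible -- one character.
print(re.match(r"(?s).+?", text).group())  # 'p'

Since paragraph may consist of a bare STR, every one-character match can be completed as its own paragraph, which is exactly the tree above.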
My first solution to this problem was to write a custom lexer that emits tokens for FIELD_START, FIELD_END, PARAGRAPH_START, CODE_BLOCK_START and the footnote-related terminals.
My lexer looks as follows:
from lark.lexer import Lexer, Token
import re

class MyLexer(Lexer):
    def __init__(self, *args, **kwargs):
        pass

    def lex(self, data):
        tokens = {
            "FIELD_START": r"(?:\A|[\r\n]{2,})#\[",
            "FIELD_END": r"\]#[\r\n]",
            "FOOTNOTE_ANCHOR": r"<\d>",
            "FOOTNOTE_START": r"bc\. fn\d\. ",  # dots escaped so they match literally
            "PARAGRAPH_START": r"p\. ",
            "CODE_BLOCK_START": r"bc\.\.? ",
        }
        regex = '|'.join([f"({r})" for r in tokens.values()])
        for x in re.split(regex, data):
            if not x:
                continue
            for token_type, token_regex in tokens.items():
                if re.match(token_regex, x):
                    yield Token(token_type, x)
                    break
            else:
                yield Token("STR", x)

parser = Lark(grammar, lexer=MyLexer, start='elements')
It builds one regex from the given token patterns, splits the whole string on that regex, and yields everything as a token: either one of the defined token types, or STR.
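To sanity-check the lexer in isolation, it can be driven directly without a parser (a minimal sketch; the 40-character truncation is only for readability):

for token in MyLexer().lex(read_file('test.textile')):
    print(token.type, repr(token.value[:40]))

The new grammar looks as follows: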
elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: STR
content: contents*
?contents: STR
| paragraph
| code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? paragraph_contents+
?paragraph_contents: STR
| FOOTNOTE_ANCHOR
| footnote
footnote: FOOTNOTE_START STR
%declare FIELD_START FIELD_END FOOTNOTE_ANCHOR FOOTNOTE_START PARAGRAPH_START CODE_BLOCK_START STR
Since the custom lexer produces all terminals, the grammar only %declares them instead of giving them patterns. The output of the parser is as follows:
Tree(Token('RULE', 'elements'), [
Tree(Token('RULE', 'element'), [
Tree(Token('RULE', 'field'), [
Token('FIELD_START', '#['),
Tree(Token('RULE', 'field_name'), [
Token('STR', 'contents')]),
Token('FIELD_END', ']#\n')]),
Tree(Token('RULE', 'content'), [
Tree(Token('RULE', 'paragraph'), [
Token('PARAGRAPH_START', 'p. '),
Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\n')]),
Tree(Token('RULE', 'code_block'), [
Token('CODE_BLOCK_START', 'bc.. '),
Token('STR', '# Program to display the Fibonacci sequence up to n-th term\n\n"""\\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"')]),
Tree(Token('RULE', 'paragraph'), [
Token('FOOTNOTE_ANCHOR', '<5>'),
Token('STR', '"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\n')]),
Tree(Token('RULE', 'paragraph'), [
Token('PARAGRAPH_START', 'p. '),
Token('STR', 'And after the block of code another paragraph, with a footnote'),
Token('FOOTNOTE_ANCHOR', '<1>'),
Token('STR', '.\n\n'),
Tree(Token('RULE', 'footnote'), [
Token('FOOTNOTE_START', 'bc. fn1. '),
Token('STR', 'This is the footnote contents\n\n')])]),
Tree(Token('RULE', 'paragraph'), [
Token('PARAGRAPH_START', 'p. '),
Token('STR', 'And here is another paragraph')])])]),
Tree(Token('RULE', 'element'), [
Tree(Token('RULE', 'field'), [
Token('FIELD_START', '\n\n#['),
Tree(Token('RULE', 'field_name'), [
Token('STR', 'second_field')]),
Token('FIELD_END', ']#\n')]),
Tree(Token('RULE', 'content'), [
Token('STR', 'Some more contents\n')])])])
This correctly parses the different fields and footnotes; however, the code block is interrupted by a spurious FOOTNOTE_ANCHOR. Because the lexer has no context, it matches footnote anchors inside code as well. The same problem would occur when trying to parse bullet points.
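The root cause is visible with plain re: the anchor pattern fires anywhere, including inside the code block's string literal (a minimal sketch, using the line from test.textile above):

import re

code_line = 'search_string = r"<5>"'
# The lexer has no notion of "inside a code block", so the
# footnote-anchor pattern also matches the <5> in the string literal.
print(re.findall(r"<\d>", code_line))  # ['<5>']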
What is the best solution to the problem? Do I really need a lexer? Is my lexer implemented correctly? (I can find very few examples of how to use a custom lexer for text.) Can I maybe lex only some tokens and leave the rest to a "parent" lexer?
Based on "recognizing multi-line sections with lark grammar" I was able to find a solution.
The important part is to not use /.+/s to match multiple lines, because the parser then never gets the opportunity to match other tokens. It is better to match line by line, so that the parser has the opportunity to match a new rule for each line. I also switched the parser to "lalr"; it did not work with the default Earley parser. (This is plausibly because LALR mode uses Lark's contextual lexer, which only tries the terminals that are valid in the current parser state.)
parser.py
from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements', parser="lalr")

textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree.pretty())
grammar.lark
elements: element+
?element: field content
?field: NEWLINE* FIELD_START field_name FIELD_END NEWLINE
field_name: FIELD_NAME
content: contents*
?contents: paragraph
| code_block
code_block: CODE_BLOCK_START (LINE NEWLINE)+
paragraph: PARAGRAPH_START? (paragraph_line | bullets | footnote)+
bullets: (BULLET paragraph_line)+
footnote: FOOTNOTE_START LINE NEWLINE
paragraph_line: (PARAGRAPH_LINE | FOOTNOTE_ANCHOR)+ NEWLINE
FIELD_START: "#["
FIELD_NAME: /[^\]]+/
FIELD_END: "]#"
FOOTNOTE_ANCHOR: /<\d>/
FOOTNOTE_START: /bc\. fn\d\. /
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
LINE.-1: /.+/
BULLET.-2: "*"
PARAGRAPH_LINE.-3: /.+?(?=(<\d>|\r|\n))/
%import common.NEWLINE
The output:
elements
element
field
#[
field_name contents
]#
content
paragraph
p.
paragraph_line
This is a paragraph.
paragraph_line
With some bullet points:
bullets
*
paragraph_line
Bullet 1
*
paragraph_line
Bullet 2
*
paragraph_line
Bullet 3
paragraph_line
And a code block:
code_block
bc..
# Program to display the Fibonacci sequence up to n-th term
"""\
* Program to display the Fibonacci sequence up to n-th term
"""
search_string = r"<5>"
nterms = int(input("How many terms? "))
# first two terms
n1, n2 = 0, 1
count = 0
paragraph
p.
paragraph_line
And after the block of code another paragraph, with a footnote
<1>
.
footnote
bc. fn1.
This is the footnote contents
paragraph
p.
paragraph_line
And here is another paragraph
element
field
#[
field_name second_field
]#
content
paragraph
paragraph_line
Some more contents
Note that the parser now also correctly parses bullets and footnotes. In order to parse footnote anchors within a line, I have made a special PARAGRAPH_LINE terminal, which stops at the first footnote anchor it encounters or at the end of the line. Also note that LINE has a higher priority than BULLET, so bullets will not be matched inside a code block (which looks for a normal line), only in a paragraph.
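The effect of the lookahead in PARAGRAPH_LINE can be checked with plain re as well (a minimal sketch, using a line from test.textile):

import re

line = "And after the block of code another paragraph, with a footnote<1>.\n"
# Non-greedy match that stops right before the first footnote anchor
# (or before the end of the line).
print(re.match(r".+?(?=(<\d>|\r|\n))", line).group())
# -> 'And after the block of code another paragraph, with a footnote'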