After giving up on Parsimonious I tried PyPEG. I've had much more success, in that I've achieved my initial goal, but I can't seem to handle comments properly.
I've distilled the issue into the following code.
You can see that not all the test cases work: if a block comment has code before it (test cases 4 and 5), a Line is generated rather than a BlockComment.
Is there a way to get PyPEG to do this itself, or do I need to postprocess Lines to find BlockComments that span multiple Lines?
import pypeg2 as pp
import re
import pprint
nl = pp.RegEx(r"[\r\n]+")
symbols = "\"\-\[\]\\!#$%&'()¬*+£,./:;<=>?@^_‘{|}~"
text = re.compile(r"[\w" + symbols + "]+", re.UNICODE)
# Partial definition as we use it before it's fully defined
class Code(pp.List):
    pass

class Text(str):
    grammar = text

class Line(pp.List):
    grammar = pp.maybe_some(Text), nl

class LineComment(Line):
    grammar = re.compile(r".*?//.*"), nl

class BlockComment(pp.Literal):
    grammar = pp.comment_c, pp.maybe_some(Text)

Code.grammar = pp.maybe_some([BlockComment, LineComment, Line])
comments = """
/*
Block comment 1
*/
// Line Comment1
Test2 // EOL Comment2
/*
Block comment 2*/
/* Block
comment 3 */
Test4 start /*
Block comment 4
*/ Test4 end
Test5 start /* Block comment 5 */ Test5 end
/* Block comment 6 */
"""
parsed = pp.parse(comments, Code, whitespace=pp.RegEx(r"[ \t]"))
pprint.pprint(list(parsed))
Your pattern for text will also match comments; since it's applied greedily, it's impossible for a comment to be matched unless it happens to be at the beginning of a line. So you need to make sure that the match stops when a comment delimiter is encountered.
You could try something like the following:
# I removed / from the list.
symbols = "\"\-\[\]\\!#$%&'()¬*+£,.:;<=>?@^_‘{|}~"
text = re.compile(r"([\w" + symbols + "]|/(?![/*]))+", re.UNICODE)
Although I have to say that the list of symbols seems somewhat arbitrary to me. I would have just used:
text = re.compile(r"([^/\r\n]|/(?![/*]))+", re.UNICODE)
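To see the difference outside of PyPEG, here is a minimal sketch using only the standard re module. The greedy_text pattern is an abbreviated stand-in for the question's pattern (a shortened symbol list that still contains / and *), and first_match is a hypothetical helper added just for this illustration; safe_text is the last pattern above.

```python
import re

# Abbreviated stand-in for the question's pattern (assumption: shortened
# symbol list). '/' and '*' are ordinary symbols here, so the pattern
# happily consumes a comment delimiter as text.
greedy_text = re.compile(r"[\w/*.,:;!?-]+", re.UNICODE)

# Pattern suggested above: a '/' is only consumed when it is not
# followed by '/' or '*', so the match stops before '//' and '/*'.
safe_text = re.compile(r"([^/\r\n]|/(?![/*]))+", re.UNICODE)


def first_match(pattern, s):
    """Hypothetical helper: text matched at the start of s, or None."""
    m = pattern.match(s)
    return m.group(0) if m else None


samples = ["/*", "// Line Comment1", "a/b /* note */", "Test2 // EOL Comment2"]
for s in samples:
    print(repr(s), "->",
          repr(first_match(greedy_text, s)),
          "vs",
          repr(first_match(safe_text, s)))
```

With the stand-in pattern, "/*" and "//" are matched as text, which is why a comment that follows other code never reaches the comment rules. With the lookahead version there is no match at a delimiter, while a lone slash as in "a/b" is still accepted. Note that this last pattern also consumes spaces and tabs, so a single match runs all the way up to the delimiter or the end of the line.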