
How to handle all possible C-like block comment styles in PyPEG


After giving up on parsimonious I tried PyPEG. I've had much more success, in that I've achieved my initial goal, but I can't seem to handle comments properly.

I've distilled the issue into the following code.

As you can see, not all the test cases work: if a block comment has code before it (test cases 4 and 5), a Line is generated rather than a BlockComment.

Is there a way to get PyPEG to do this itself, or do I need to post-process the Lines to find BlockComments that span multiple Lines?

import pypeg2 as pp
import re
import pprint

nl = pp.RegEx(r"[\r\n]+")
symbols = "\"\-\[\]\\!#$%&'()¬*+£,./:;<=>?@^_‘{|}~"

text = re.compile(r"[\w" + symbols + "]+", re.UNICODE)


# Partial definition as we use it before it's fully defined
class Code(pp.List):
    pass


class Text(str):
    grammar = text


class Line(pp.List):
    grammar = pp.maybe_some(Text), nl


class LineComment(Line):
    grammar = re.compile(r".*?//.*"), nl


class BlockComment(pp.Literal):
    grammar = pp.comment_c, pp.maybe_some(Text)


Code.grammar = pp.maybe_some([BlockComment, LineComment, Line])


comments = """
/*
Block comment 1
*/

// Line Comment1

Test2 // EOL Comment2

/*
Block comment 2*/

/* Block
comment 3 */

Test4 start /*
Block comment 4
*/ Test4 end

Test5 start /* Block comment 5 */ Test5 end

      /* Block comment 6 */

"""

parsed = pp.parse(comments, Code, whitespace=pp.RegEx(r"[ \t]"))
pprint.pprint(list(parsed))




Solution

  • Your pattern for text also matches comment delimiters. Because it is applied greedily, a comment can only be matched if it happens to be at the beginning of a line. So you need to make sure that the text match stops when a comment delimiter is encountered.

    You could try something like the following:

    # I removed / from the list.
    symbols = "\"\-\[\]\\!#$%&'()¬*+£,.:;<=>?@^_‘{|}~"
    
    text = re.compile(r"([\w" + symbols + "]|/(?![/*]))+", re.UNICODE)
    

    Although I have to say that the list of symbols seems somewhat arbitrary to me. I would have just used:

    text = re.compile(r"([^/\r\n]|/(?![/*]))+", re.UNICODE)
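
To see the difference without involving PyPEG at all, here is a minimal sketch that uses only the standard `re` module. The names `QUESTION_SYMBOLS`, `original_text`, `revised_text` and `matched_prefix` are made up for this illustration, and the symbol set is a raw-string transcription of the one in the question, so its escaping may differ slightly from the original. It only checks where each pattern stops; it is not a complete grammar fix.

```python
import re

# Raw-string transcription of the question's symbol set (an assumption for
# this sketch). Note that it still contains both '/' and '*'.
QUESTION_SYMBOLS = r"""\"\-\[\]\\!#$%&'()¬*+£,./:;<=>?@^_‘{|}~"""

# Pattern in the style of the question: happily matches "/*" and "//".
original_text = re.compile(r"[\w" + QUESTION_SYMBOLS + r"]+", re.UNICODE)

# Pattern from the answer: anything except '/' and newlines, or a '/' that
# is not followed by '/' or '*' (i.e. a '/' that does not open a comment).
revised_text = re.compile(r"([^/\r\n]|/(?![/*]))+", re.UNICODE)


def matched_prefix(pattern, s):
    """Return the prefix of s that pattern matches, or '' if there is no match."""
    m = pattern.match(s)
    return m.group(0) if m else ""


if __name__ == "__main__":
    samples = (
        "/* Block comment 5 */ Test5 end",
        "Test5 start /* Block comment 5 */ Test5 end",
        "Test2 // EOL Comment2",
        "a / b",
    )
    for sample in samples:
        print(repr(matched_prefix(original_text, sample)),
              repr(matched_prefix(revised_text, sample)))
```

Note that the broader pattern also consumes spaces and tabs, so a match such as `'Test5 start '` keeps its trailing space; if that matters, the whitespace would have to be excluded from the character class or stripped afterwards.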