Taking the code example from Parslet's own creator (available in this link) as a starting point, I need to extend it to retrieve all the non-commented text from a file written in a C-like syntax.
The provided example successfully parses C-style comments, treating them as regular whitespace. However, it only expects 'a' characters in the non-commented areas of the file, as in this input example:
a
// line comment
a a a // line comment
a /* inline comment */ a
/* multiline
comment */
The rule used to detect the non-commented text is simply:
rule(:expression) { (str('a').as(:a) >> spaces).as(:exp) }
Therefore, what I need is to generalize the previous rule to get all the other (non-commented) text from a more generic file such as:
word0
// line comment
word1 // line comment
phrase /* inline comment */ something
/* multiline
comment */
I am new to Parsing Expression Grammars, and none of my previous attempts succeeded.
The general idea is that everything is code (aka non-comment) until one of the sequences // or /* appears. You can reflect this with a rule like this:
rule(:code) {
  (str('/*').absent? >> str('//').absent? >> any).repeat(1).as(:code)
}
As mentioned in my comment, there is a small problem with strings, though. When a comment marker occurs inside a string literal, it is part of the string, so removing it as a comment would alter the meaning of the code. Therefore, we have to let the parser know what a string is, and that every character inside it belongs to it. Escape sequences are another issue. For example, the string "foo \" bar /*baz*/", which contains an escaped double quote, would otherwise be parsed as "foo \", followed by some code again. This needs to be addressed as well. I have written a complete parser that handles all of the above cases:
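require 'parslet'

class CommentParser < Parslet::Parser
  # Succeeds only at the end of the input.
  rule(:eof) {
    any.absent?
  }
  # Everything up to, but not including, the closing "*/".
  rule(:block_comment_text) {
    (str('*/').absent? >> any).repeat.as(:comment)
  }
  rule(:block_comment) {
    str('/*') >> block_comment_text >> str('*/')
  }
  # Everything up to, but not including, the next newline.
  rule(:line_comment_text) {
    (str("\n").absent? >> any).repeat.as(:comment)
  }
  # The newline itself is not consumed, so it ends up in the following code.
  rule(:line_comment) {
    str('//') >> line_comment_text >> (str("\n").present? | eof)
  }
  # Any character except an unescaped double quote; a backslash takes the
  # character that follows it along, so \" does not end the string.
  rule(:string_text) {
    (str('"').absent? >> str('\\').maybe >> any).repeat
  }
  rule(:string) {
    str('"') >> string_text >> str('"')
  }
  # Code up to the next string or comment start.
  rule(:code_without_strings) {
    (str('"').absent? >> str('/*').absent? >> str('//').absent? >> any).repeat(1)
  }
  # Strings count as code, including any comment markers inside them.
  rule(:code) {
    (code_without_strings | string).repeat(1).as(:code)
  }
  rule(:code_with_comments) {
    (code | block_comment | line_comment).repeat
  }
  root(:code_with_comments)
end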
It will parse your input
word0
// line comment
word1 // line comment
phrase /* inline comment */ something
/* multiline
comment */
to this AST
[{:code=>"\n word0\n "@0},
{:comment=>" line comment"@13},
{:code=>"\n word1 "@26},
{:comment=>" line comment"@37},
{:code=>"\n phrase "@50},
{:comment=>" inline comment "@61},
{:code=>" something \n "@79},
{:comment=>" multiline\n comment "@94},
{:code=>"\n"@116}]
To extract everything except the comments you can do:
input = <<-CODE
word0
// line comment
word1 // line comment
phrase /* inline comment */ something
/* multiline
comment */
CODE
ast = CommentParser.new.parse(input)
puts ast.map { |node| node[:code] }.join
which will produce
word0
word1
phrase something
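The same kind of post-processing works for the comments as well. The sketch below uses a hand-written stand-in for the parser output, with plain strings in place of the Parslet::Slice values and made-up fragments, so the exact whitespace is an assumption and will differ from a real parse:

```ruby
# Hand-written stand-in for the AST (assumption: real nodes hold
# Parslet::Slice values, which also respond to to_s).
ast = [
  { code: "word0\n" },
  { comment: " line comment" },
  { code: "\nword1 " },
  { comment: " line comment" },
  { code: "\nphrase " },
  { comment: " inline comment " },
  { code: " something\n" }
]

# Keep only the :code fragments; nodes without that key are dropped.
code_only = ast.map { |node| node[:code] }.compact.map(&:to_s).join

# Collect the comment texts instead, trimmed of surrounding whitespace.
comments = ast.map { |node| node[:comment] }.compact.map { |c| c.to_s.strip }

puts code_only
puts comments.inspect
```

Calling compact before join makes the intent explicit, although join alone would also turn the nil entries into empty strings.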