
How do I handle C-style comments in Ruby using Parslet?


Taking as a starting point the code example from Parslet's creator (available in this link), I need to extend it to retrieve all the non-commented text from a file written in a C-like syntax.

The provided example successfully parses C-style comments, treating these areas as regular whitespace. However, it expects only 'a' characters in the non-commented areas of the file, as in this sample input:

         a
      // line comment
      a a a // line comment
      a /* inline comment */ a 
      /* multiline
      comment */

The rule used to detect the non-commented text is simply:

   rule(:expression) { (str('a').as(:a) >> spaces).as(:exp) }

Therefore, I need to generalize the previous rule so that it captures all the non-commented text from a more general file such as:

     word0
  // line comment
   word1 // line comment
  phrase /* inline comment */ something 
  /* multiline
  comment */

I am new to Parsing Expression Grammars, and none of my previous attempts succeeded.


Solution

  • The general idea is that everything is code (that is, non-comment) until one of the sequences // or /* appears. You can express this with a rule like this:

    rule(:code) {
      (str('/*').absent? >> str('//').absent? >> any).repeat(1).as(:code)
    }
    

    As mentioned in my comment, there is a small problem with strings, though. When a comment marker occurs inside a string, it is part of the string, so stripping it as a comment would change the meaning of the code. Therefore, the parser has to know what a string is and that every character inside it belongs to it. Escape sequences are a second issue. For example, the string "foo \" bar /*baz*/" contains an escaped double quote; a naive string rule would end the string at "foo \" and treat the rest as code again, so escapes have to be handled as well. I have written a complete parser that handles all of the above cases:

    require 'parslet'
    
    class CommentParser < Parslet::Parser
      rule(:eof) { 
        any.absent? 
      }
    
      rule(:block_comment_text) {
        (str('*/').absent? >> any).repeat.as(:comment)
      }
    
      rule(:block_comment) {
        str('/*') >> block_comment_text >> str('*/')
      }
    
      rule(:line_comment_text) {
        (str("\n").absent? >> any).repeat.as(:comment)
      }
    
      rule(:line_comment) {
        str('//') >> line_comment_text >> (str("\n").present? | eof)
      }
    
      rule(:string_text) {
        # An optional backslash before each character, so that \" does not end the string.
        (str('"').absent? >> str('\\').maybe >> any).repeat
      }
    
      rule(:string) {
        str('"') >> string_text >> str('"')
      }
    
      rule(:code_without_strings) {
        (str('"').absent? >> str('/*').absent? >> str('//').absent? >> any).repeat(1)
      }
    
      rule(:code) {
        (code_without_strings | string).repeat(1).as(:code)
      }
    
      rule(:code_with_comments) {
        (code | block_comment | line_comment).repeat
      }
    
      root(:code_with_comments)
    end
    

    It will parse your input

         word0
      // line comment
       word1 // line comment
      phrase /* inline comment */ something 
      /* multiline
      comment */
    

    to this AST

    [{:code=>"\n   word0\n "@0},
     {:comment=>" line comment"@13},
     {:code=>"\n  word1 "@26},
     {:comment=>" line comment"@37},
     {:code=>"\n phrase "@50},
     {:comment=>" inline comment "@61},
     {:code=>" something \n "@79},
     {:comment=>" multiline\n comment "@94},
     {:code=>"\n"@116}]
    

    To extract everything except the comments you can do:

    input = <<-CODE
         word0
      // line comment
       word1 // line comment
      phrase /* inline comment */ something 
      /* multiline
      comment */
    CODE
    
    ast = CommentParser.new.parse(input)
    # Comment nodes have no :code key, so they map to nil, which join renders as an empty string.
    puts ast.map { |node| node[:code] }.join
    

    which will produce

       word0
    
      word1
     phrase  something