Tags: scala, comments, preprocessor, scalac

How to remove comments from Scala code


Any ideas on how to remove comments from Scala code?

Here is example code with comments:

object TestCode {

  val a = "A" // a = "AA"
  val b = "B" /* b = "BB" */
  val c = "C" /* multi line comment
  /* c = "CC" nested */ // FOO
  */ // c = "CCC"
  val d = """D""" // d = """DD /* """
  val e = '"' // e = '"' = char literal
  val f = '\"' // f = '\"' = char literal
  val codeStr = " \"  \"\" \"\"\"/* This is literal */\"\"\" val x = \"\"\"5\"\"\" \" "
  "/* This is a literal */" // This is a comment 3
  "// This is a literal with extra comment end string */" // This is a comment 4
  "/* This is a litral with extra comment begin string" // This is a comment 5

}

This code compiles (with warnings about pure expressions).

The C preprocessor gets quite close but fails on nested comments, leaving a stray */ in the output:

object TestCode {
  val a = "A"
  val b = "B"
  val c = "C"
  */
  val d = """D"""
  val e = '"'
  val f = '\"'
  val codeStr = " \"  \"\" \"\"\"/* This is literal */\"\"\" val x = \"\"\"5\"\"\" \" "
  "/* This is a literal */"
  "// This is a literal with extra comment end string */"
  "/* This is a litral with extra comment begin string"
}

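I also tried this regex solution, but it fails on char literals that contain a double quote (the quote is treated as the start of a string literal, so the comment after it is kept) and on nested comments (the match ends at the first */), as you can see: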

str.replaceAll("//.*|/\\*(?s:.*?)\\*/|(\"(?:(?<!\\\\)(?:\\\\\\\\)*\\\\\"|[^\r\n\"])*\")", "$1")
res1: String = """object TestCode {

  val a = "A" 
  val b = "B" 
  val c = "C"  
  */ 
  val d = """D""" 
  val e = '"' // e = '"' = char literal
  val f = '\"' // f = '\"' = char literal
  val codeStr = " \"  \"\" \"\"\"/* This is literal */\"\"\" val x = \"\"\"5\"\"\" \" "
  "/* This is a literal */" 
  "// This is a literal with extra comment end string */" 
  "/* This is a literal with extra comment begin string" 

}"""

The Scala compiler can do the job, but as far as I know there is no compiler option that performs only the comment removal.
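
To illustrate why nested comments are the sticking point: Scala block comments nest, so finding the end of one means counting depth rather than stopping at the first */. Below is a minimal sketch of that counting. The names NestedComments and endOfBlockComment are made up for illustration, and the function only handles the inside of a block comment; deciding where a comment starts still requires knowing about string and char literals, so this is not a comment remover by itself.

```scala
import scala.annotation.tailrec

object NestedComments {
  // Returns the index just past the delimiter that closes the block comment
  // opening at `start`, or code.length if the comment is never closed.
  // Assumes that a block comment opener sits at index `start`.
  def endOfBlockComment(code: String, start: Int): Int = {
    @tailrec
    def loop(i: Int, depth: Int): Int =
      if (i >= code.length) code.length                      // unterminated comment
      else if (code.startsWith("/*", i)) loop(i + 2, depth + 1) // one level deeper
      else if (code.startsWith("*/", i)) {
        if (depth == 1) i + 2                                 // outermost comment closed
        else loop(i + 2, depth - 1)                           // inner comment closed
      } else loop(i + 1, depth)

    loop(start, 0)
  }
}
```

Applied to the comment that follows val c = "C" in the example, this skips past the second */ and leaves only the trailing line comment, whereas a non-greedy pattern such as /\*.*?\*/ stops at the first */.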


Solution

  • I followed Mateusz Kubuszok's suggestion and implemented this with Scalameta.

    This is the scala-cli script file: SourceCodeCommentRemover.scala

    //> using scala "2.13.5"
    //> using lib "org.scalameta::scalameta:4.9.7"
    
    import scala.meta._
    import java.io.{File, PrintWriter}
    
    object CommentRemover {
      def main(args: Array[String]): Unit = {
        if (args.length != 2) {
          println("Usage: CommentRemover <input file> <output file>")
          sys.exit(1)
        }
    
        val inputFile = new File(args(0))
        val outputFile = new File(args(1))
    
        if (!inputFile.exists()) {
          println(s"Input file ${inputFile.getAbsolutePath} does not exist.")
          sys.exit(1)
        }
    
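        val sourceCode = {
          // Fully qualified so it is not confused with scala.meta.Source;
          // the file handle is closed once the content has been read.
          val in = scala.io.Source.fromFile(inputFile)
          try in.mkString finally in.close()
        }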
    
        println(s"Original source: BEGIN\n${sourceCode}\nEND")
    
        val tree = sourceCode.parse[Source] match {
          case parsers.Parsed.Success(tree) => tree
          case parsers.Parsed.Error(_, msg, _) =>
            println(s"Failed to parse the input file: $msg")
            sys.exit(1)
        }
    
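        // Keep the text of every token except comments. Whitespace tokens are
        // kept too, so the original layout (including trailing spaces) survives.
        val codeWithoutComments = tree.tokens.collect {
          case token if !token.is[Token.Comment] => token.text
        }.mkString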
    
        println(s"Comments removed: BEGIN\n${codeWithoutComments}\nEND")
    
        val writer = new PrintWriter(outputFile)
        try {
          writer.write(codeWithoutComments)
        } finally {
          writer.close()
        }
    
        println(s"Comments removed. Output written to ${outputFile.getAbsolutePath}.")
      }
    }
    

    This is the test input file: StackOverflowTestCode.scala

    object TestCode {
    
      val a = "A" // a = "AA"
      val b = "B" /* b = "BB" */
      val c = "C" /* multi line comment
      /* c = "CC" nested */ // FOO
      */ // c = "CCC"
      val d = """D""" // d = """DD /* """
      val e = '"' // e = '"' = char literal
      val f = '\"' // f = '\"' = char literal
      val codeStr = " \"  \"\" \"\"\"/* This is literal */\"\"\" val x = \"\"\"5\"\"\" \" "
      "/* This is a literal */" // This is a comment 3
      "// This is a literal with extra comment end string */" // This is a comment 4
      "/* This is a litral with extra comment begin string" // This is a comment 5
    
    }
    

    Run the script:

    scala-cli run SourceCodeCommentRemover.scala -- StackOverflowTestCode.scala out.scala
    
    cat out.scala
    
    object TestCode {
    
      val a = "A" 
      val b = "B" 
      val c = "C"  
      val d = """D""" 
      val e = '"' 
      val f = '\"' 
      val codeStr = " \"  \"\" \"\"\"/* This is literal */\"\"\" val x = \"\"\"5\"\"\" \" "
      "/* This is a literal */" 
      "// This is a literal with extra comment end string */" 
      "/* This is a litral with extra comment begin string" 
    
    }
    

    Scala-cli version:

    scala-cli --version
    Scala CLI version: 1.4.1
    Scala version (default): 3.4.2
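
The token filter in the script does not really depend on the parse step: the tokens can also be taken from the tokenizer directly, which may be handy for fragments that are not a complete source file. Here is a minimal sketch, assuming the same Scala and Scalameta versions as above. CommentStripper and stripComments are names made up here, and this is an illustration rather than a tested replacement for the script.

```scala
//> using scala "2.13.5"
//> using lib "org.scalameta::scalameta:4.9.7"

import scala.meta._

object CommentStripper {
  // Tokenizes the input and joins the text of every token that is not a comment.
  // Returns Left with the tokenizer's message if the input cannot be tokenized.
  def stripComments(code: String): Either[String, String] =
    code.tokenize match {
      case tokenizers.Tokenized.Success(tokens) =>
        Right(tokens.collect {
          case token if !token.is[Token.Comment] => token.text
        }.mkString)
      case tokenizers.Tokenized.Error(_, msg, _) =>
        Left(msg)
    }
}
```

Returning an Either keeps the failure case in the caller's hands instead of calling sys.exit inside the helper. As in the script, whitespace tokens are kept, so a line that ended in a comment keeps the space that preceded it.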