python json regex bash json-c

Convert JSONC to JSON via regex


I have JSONC as input (a superset of JSON that supports comments like //this one and /* this one */) and want to transform it into standard JSON using a Python regex, but I'm not sure this can be solved with regex alone. I know it can be done via semantic processing, maybe with something like tree-sitter, but I'm looking for a regex-based solution. Since we don't use /* */, a regex that only removes // comments is fine.

Note that:

  1. It is guaranteed that when you properly remove everything that's in the comments, you get valid JSON.
  2. The input is always pretty-formatted.

Here is an example input with a failing sed attempt at the top:

  //tried this sed -r 's#\s//[^}]*##'
//  also tried this '%^*3//s39()'
[
  {
    "test1" : "http://test.com",
    "test2" : "http://test.com",//test
    // any thing
    "ok" : 3,  //here 2
    "//networkpath1" : true, //whynot
    "//networkpath2" : true 
// ok

  },//eof
  {
    "statement" : "I like test cases"
}//eof
]

Here is another failing attempt:

import re

comment_re = re.compile(r'\s//[^}]*')
cleaned = comment_re.sub('', jsonStr)

This removes too much when // occurs in a string literal.
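
A minimal reproduction of the problem, assuming a made-up one-line input (the `path` and `n` keys and the `json_str` name are hypothetical, not taken from the real data):

```python
import re

comment_re = re.compile(r'\s//[^}]*')

# Hypothetical input: the string value contains " //", which is not a comment.
json_str = '{"path": "a //b", "n": 1}'

# [^}]* also spans the closing quote and the next key, up to the first "}".
cleaned = comment_re.sub('', json_str)
print(cleaned)  # {"path": "a}
```

The result is no longer valid JSON, because the pattern has no notion of being inside a string literal.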

How can I make this work also for such inputs?

NB: A solution is already helpful even if it doesn't handle /* this type of comment */, so there is no need to cover that.


Solution

  • You could match quoted strings with a capture group and re-inject them into the result. Because a string literal is then consumed as a whole, any comment delimiters inside it are never matched as comments:

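    import re

    # At each position the alternatives are tried in order: // comment, /* */ comment, quoted string.
    comment_re = re.compile(
        r'//.*|/\*[\s\S]*?\*/|("(?:\\.|.)*?")',  # capture group 1: quoted strings
    )
    
    # Comments match without group 1, so \1 is empty for them (Python 3.5+).
    cleaned = comment_re.sub(r'\1', jsonStr)  # re-inject quoted strings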
    

    This approach does not require the JSONC input to be formatted with specific indentation or line separators. The pattern also removes /* */ comments, although the question does not need that.
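
As a rough end-to-end check, the sketch below applies the pattern to a shortened variant of the question's input and parses the result with `json.loads`. The helper name `strip_jsonc_comments`, the `sample` text, and the `escaped` key are assumptions made for illustration. The inner group is written as non-capturing, which does not change what matches, and a replacement function is used instead of `r'\1'` so the code does not rely on how unmatched groups are substituted.

```python
import json
import re

# Alternatives: // comment, /* */ comment, or a quoted string (group 1).
comment_re = re.compile(r'//.*|/\*[\s\S]*?\*/|("(?:\\.|.)*?")')


def strip_jsonc_comments(text):
    """Hypothetical helper: drop comments, keep string literals untouched."""
    # group(1) is None when a comment matched, so fall back to ''.
    return comment_re.sub(lambda m: m.group(1) or '', text)


# Made-up sample covering the tricky cases: "//" inside strings,
# escaped quotes, and both comment styles.
sample = r'''
// leading comment with a "quote"
[
  {
    "test1" : "http://test.com",//test
    "//networkpath1" : true, //whynot
    "escaped" : "say \"hi\" // not a comment" /* block */
  }//eof
]
'''

data = json.loads(strip_jsonc_comments(sample))
print(data)
```

The cleaned text keeps the whitespace that surrounded the comments, which is harmless because JSON parsers ignore whitespace between tokens.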