regex scala parsing curly-braces wml

Scala: Regular Expression pattern match with curly braces?


I am creating a WML-like language for my assignment, and as a first step I am supposed to create regular expressions to recognize the following:

//single = "{"
//double = "{{"
//triple = "{{{"

Here is my code for the second one:

val double = "\\{\\{\\b".r

And my test is:

println(double.findAllIn("{{ s{{ { {{{ {{ {{x").toArray.mkString(" "))

But it doesn't print anything! It's supposed to print the first, second, fifth and sixth tokens. I have tried every combination of \b and \B, and even \{{2,2} instead of \{\{, but it's still not working. Any help?

As a side question, if I wanted it to match just the first and fifth tokens, what would I need to do?


Solution

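  • I tested your code (Scala 2.12.2 REPL), and contrary to your statement that it doesn't print anything, it actually prints the "{{" occurrence from the "{{x" substring.

    This is because x is a word character, so \b matches the position between the second { and x. Keep in mind that { isn't a word character, unlike x. Everywhere else in the test string, {{ is followed by a space or another {, neither of which is a word character, so there is no word boundary and no match.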
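To make the word-boundary behaviour concrete, here is a minimal sketch that runs the pattern from the question against the test string and records where each match starts. The object name `WordBoundaryDemo` and the `starts` value are only illustrative names, not part of the original code.

```scala
object WordBoundaryDemo {
  // Test string from the question; the last token "{{x" begins at offset 16.
  val input = "{{ s{{ { {{{ {{ {{x"

  // Pattern from the question: two braces followed by a word boundary.
  val double = "\\{\\{\\b".r

  // Start offset of every match found in the input.
  val starts: List[Int] = double.findAllMatchIn(input).map(_.start).toList

  def main(args: Array[String]): Unit = {
    println(starts) // List(16): only the "{{" directly in front of "x"
  }
}
```

The single match at offset 16 is the one place where a brace (non-word character) is immediately followed by a word character.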

    As per this tutorial

    It matches at a position that is called a "word boundary". This match is zero-length

    There are three different positions that qualify as word boundaries:

    1) Before the first character in the string, if the first character is a word character

    ...

    As for a solution, it depends on the precise definition, but lookarounds worked for me:

    "(?<!\\{)\\{{2}(?!\\{)".r
    

    It matched the first, second, fifth and sixth tokens. The expression says: match "{{" that is neither preceded nor followed by "{".
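As a quick check of that claim, the following sketch applies the lookaround pattern to the same test string and lists the match offsets. `LookaroundDemo`, `exactlyTwo` and `starts` are illustrative names chosen here.

```scala
object LookaroundDemo {
  // Token offsets: "{{"=0, "s{{"=3, "{"=7, "{{{"=9, "{{"=13, "{{x"=16
  val input = "{{ s{{ { {{{ {{ {{x"

  // Exactly two braces: not preceded by "{" and not followed by "{".
  val exactlyTwo = "(?<!\\{)\\{{2}(?!\\{)".r

  // Start offset of every match.
  val starts: List[Int] = exactlyTwo.findAllMatchIn(input).map(_.start).toList

  def main(args: Array[String]): Unit = {
    // Offsets 0, 4, 13 and 16 fall inside tokens 1, 2, 5 and 6.
    println(starts) // List(0, 4, 13, 16)
  }
}
```

The lone "{" at offset 7 and the "{{{" run at offsets 9 to 11 are both skipped, because each candidate there has a brace on one side or is too short.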

    For side-question:

    "(?<![^ ])\\{\\{(?![^ ])".r // match "{{" surrounded by spaces or the start/end of the string
    

    Or, depending on your interpretation of "space":

    "(?<!\\S)\\{\\{(?!\\S)".r
    

    Both matched the 1st and 5th tokens. I couldn't use plain positive lookarounds for a space because I wanted the beginning and end of the input to count as boundaries automatically. The double negation (the ! together with [^ ] or \S) has the effect of implicitly including ^ and $. Alternatively, you could spell the anchors out explicitly:

    "(?<=^|\\s)\\{\\{(?=\\s|$)".r
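The three side-question patterns can be compared side by side. This is a small sketch with illustrative names (`WhitespaceDelimitedDemo`, `startsOf`, and the three pattern values), run against the test string from the question.

```scala
import scala.util.matching.Regex

object WhitespaceDelimitedDemo {
  val input = "{{ s{{ { {{{ {{ {{x"

  // Negative lookarounds with "not a space".
  val notNonSpace: Regex = "(?<![^ ])\\{\\{(?![^ ])".r
  // Negative lookarounds with "not a non-whitespace character".
  val notNonWhitespace: Regex = "(?<!\\S)\\{\\{(?!\\S)".r
  // Positive lookarounds with explicit anchors.
  val anchored: Regex = "(?<=^|\\s)\\{\\{(?=\\s|$)".r

  // Helper: start offsets of all matches of a pattern in the input.
  def startsOf(pattern: Regex): List[Int] =
    pattern.findAllMatchIn(input).map(_.start).toList

  def main(args: Array[String]): Unit = {
    // Each pattern finds only the standalone "{{" tokens (1st and 5th).
    println(startsOf(notNonSpace))      // List(0, 13)
    println(startsOf(notNonWhitespace)) // List(0, 13)
    println(startsOf(anchored))         // List(0, 13)
  }
}
```

On this input the three agree; they would differ only on inputs containing tabs or newlines, where `[^ ]` treats those characters as non-spaces while `\s` treats them as whitespace.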
    

    You can read about lookarounds here. Basically, they assert that a symbol or expression is present (or absent) at the boundary of the match; simply put, they check the surrounding text but don't include it in the matched string itself.
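The difference between consuming the surrounding characters and merely looking at them can be seen in a small sketch. The input `"a {{ b"` and the names `ZeroWidthDemo`, `consuming` and `lookaround` are made up for illustration.

```scala
object ZeroWidthDemo {
  // Hypothetical input: "{{" with one space on each side.
  val input = "a {{ b"

  // Consuming version: the spaces are part of the match.
  val consuming = "\\s\\{\\{\\s".r
  // Lookaround version: the spaces are checked but not consumed.
  val lookaround = "(?<=\\s)\\{\\{(?=\\s)".r

  val consumed: Option[String] = consuming.findFirstIn(input)
  val lookedAt: Option[String] = lookaround.findFirstIn(input)

  def main(args: Array[String]): Unit = {
    println(consumed) // Some( {{ )  (the match includes both spaces)
    println(lookedAt) // Some({{)    (the match is just the braces)
  }
}
```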

    Some examples of lookarounds


    P.S. Just to make your life easier, Scala has triple-quoted strings (""") in which backslashes don't need escaping, so instead of:

    "(?<!\\S)\\{\\{(?!\\S)".r
    

    you can just write:

    """(?<!\S)\{\{(?!\S)""".r
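A short sketch confirming that the two spellings produce the same pattern text, and therefore the same matches. `RawStringDemo`, `escaped` and `raw` are illustrative names.

```scala
object RawStringDemo {
  // Ordinary string literal: every backslash must be doubled.
  val escaped = "(?<!\\S)\\{\\{(?!\\S)"
  // Triple-quoted literal: backslashes are taken literally.
  val raw = """(?<!\S)\{\{(?!\S)"""

  val input = "{{ s{{ { {{{ {{ {{x"

  def main(args: Array[String]): Unit = {
    println(escaped == raw) // true
    // Both compile to the same regex and find the same standalone tokens.
    println(raw.r.findAllMatchIn(input).map(_.start).toList) // List(0, 13)
  }
}
```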