java regex string

Getting data between single and double quotes (special case)


I am writing a string parser to extract every string literal from a text file. The strings can be enclosed in single or double quotes. Pretty simple, right? Well, not really. I wrote a regex that matches the strings the way I want, but it throws a StackOverflowError on long strings (I am aware that Java's regex engine doesn't cope well with long inputs). This is the pattern: (['"])(?:(?!\1|\\).|\\.)*\1

This works well for all the inputs I need, but as soon as there is a long string it throws a StackOverflowError. I have read similar questions, such as this one, which suggests using StringUtils.substringsBetween, but that fails on strings like '""' and "\\\"".

So my question is: what should I do to solve this? I can provide more context if needed, just comment.

Edit: After testing the answer

Code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteTest {
    public static void main(String[] args) {

        // Regex under test: the double-quote branch uses a greedy (.*)
        final String regex = "'([^']*)'|\"(.*)\"";
        final String string = "local b = { [\"\\\\\"] = \"\\\\\\\\\", [\"\\\"\"] = \"\\\\\\\"\", [\"\\b\"] = \"\\\\b\", [\"\\f\"] = \"\\\\f\", [\"\\n\"] = \"\\\\n\", [\"\\r\"] = \"\\\\r\", [\"\\t\"] = \"\\\\t\" }\n" +
                "local c = { [\"\\\\/\"] = \"/\" }";

        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);

        // Print every match together with its capture groups
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}

Output:

Full match: "\\"] = "\\\\", ["\""] = "\\\"", ["\b"] = "\\b", ["\f"] = "\\f", ["\n"] = "\\n", ["\r"] = "\\r", ["\t"] = "\\t"
Group 1: null
Group 2: \\"] = "\\\\", ["\""] = "\\\"", ["\b"] = "\\b", ["\f"] = "\\f", ["\n"] = "\\n", ["\r"] = "\\r", ["\t"] = "\\t
Full match: "\\/"] = "/"
Group 1: null
Group 2: \\/"] = "/

It's not handling the escaped quotes correctly: the greedy (.*) runs from the first double quote to the last one on each line.


Solution

  • I would try dropping the captured quote type, the lookahead, and the backreference to improve performance. See this question for escaped characters in quoted strings; it contains a nice answer that uses an unrolled pattern. Try something like

    '[^\\']*(?:\\.[^\\']*)*'|"[^\\"]*(?:\\.[^\\"]*)*"
    

    As a Java String:

    String regex = "'[^\\\\']*(?:\\\\.[^\\\\']*)*'|\"[^\\\\\"]*(?:\\\\.[^\\\\\"]*)*\"";
    

    The left side handles single-quoted strings, the right side double-quoted ones. If one kind clearly outnumbers the other in your source, put it on the left side of the pipe.

    See this demo at regex101 (if you need to capture what's inside the quotes, use groups).
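
For what it's worth, here is a minimal sketch of the capturing variant, assuming you only need the raw (still escaped) text between the quotes. The class name QuotedStrings and the method extract are made up for illustration. The Pattern.DOTALL flag is my own addition so that a backslash followed by a line break also counts as an escape; drop it if that is not what you want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedStrings {
    // Same unrolled pattern as above, with one capture group per quote type:
    // group 1 = contents of a single-quoted string, group 2 = contents of a double-quoted one.
    // DOTALL is an assumption: it lets "\\." also match a backslash followed by a newline.
    private static final Pattern QUOTED = Pattern.compile(
            "'([^\\\\']*(?:\\\\.[^\\\\']*)*)'|\"([^\\\\\"]*(?:\\\\.[^\\\\\"]*)*)\"",
            Pattern.DOTALL);

    /** Returns the raw contents (escapes left untouched) of every quoted string in the input. */
    public static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            // Only one of the two groups participates in any given match.
            result.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Actual text: local c = { ["\\/"] = "/", ['""'] = "\\\"" }
        String input = "local c = { [\"\\\\/\"] = \"/\", ['\"\"'] = \"\\\\\\\"\" }";
        for (String s : extract(input)) {
            System.out.println(s);
        }
    }
}
```

On that sample line this should print \\/, /, "" and \\\" on separate lines, so the two cases from the question ('""' and "\\\"") come out as single strings. Unescaping the contents afterwards is a separate step that this sketch does not attempt.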